2019-03-07
Exploring the mild oddity that Unix pipes are buffered
One of the things that blogging is good for is teaching me that what I think is common knowledge actually isn't. Specifically, when I wrote about a surprisingly arcane little Unix shell pipeline example, I assumed that it was common knowledge that Unix pipes are buffered by the kernel, in addition to any buffering that programs writing to pipes may do. In fact the buffering is somewhat interesting, and in a way it's interesting that pipes are buffered at all.
How much kernel buffering there is varies from Unix to Unix. 4 KB
used to be the traditional size (it was the size on V7, for example,
per the V7 pipe(2) manpage), but modern Unixes often have much bigger
limits, and if I'm reading it right POSIX only requires a minimum of
512 bytes. But this isn't just a simple buffer, because the kernel
also guarantees that if you write PIPE_BUF bytes or less to a pipe,
your write is atomic and will never be interleaved with other writes
from other processes.
(The normal situation on modern Linux is a 64 KB buffer; see the
discussion in the Linux pipe(7) manpage. The atomicity of pipe writes
goes back to early Unix and is required by POSIX, and I think POSIX
also requires that there be an actual kernel buffer if you read the
write() specification very carefully.)
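To see both numbers on a particular system, here is a minimal sketch in Python. It asks for PIPE_BUF through fpathconf() and then estimates the kernel buffer size by making the write end non-blocking and writing until the kernel refuses more. The function name pipe_limits is my own invention, and the measured capacity is only what this one pipe accepted at the moment of the test, not a guaranteed limit.

```python
import os


def pipe_limits():
    """Return (PIPE_BUF, bytes accepted before the pipe filled up)."""
    r, w = os.pipe()
    try:
        # The atomic write limit for this particular pipe.
        atomic = os.fpathconf(w, "PC_PIPE_BUF")

        # With nobody reading, keep writing PIPE_BUF-sized chunks in
        # non-blocking mode until the kernel says the buffer is full.
        os.set_blocking(w, False)
        chunk = b"x" * atomic
        total = 0
        while True:
            try:
                total += os.write(w, chunk)
            except BlockingIOError:
                break
        return atomic, total
    finally:
        os.close(r)
        os.close(w)


if __name__ == "__main__":
    atomic, capacity = pipe_limits()
    print(f"PIPE_BUF: {atomic} bytes, buffer accepted: {capacity} bytes")
```

Going by pipe(7), a typical modern Linux should report a 4096 byte PIPE_BUF and a 65536 byte buffer, but other Unixes (and Linux systems with non-default pipe settings) can report different values.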
On the one hand this kernel buffering and the buffering behavior make perfect sense and they're definitely useful. On the other hand it's also at least a little bit unusual. Pipes are a unidirectional communication channel and it's pretty common to have unbuffered channels where a writer blocks until there's a reader (Go channels work this way by default, for example). In addition, having pipes buffered in the kernel commits the kernel to providing a certain amount of kernel memory once a pipe is created, even if it's never read from. As long as the read end of the pipe is open, the kernel has to hold on to anything it allowed to be written into the pipe buffer.
(However, if you write() more than PIPE_BUF bytes to a pipe at once,
I believe that the kernel is free to pause your process without
accepting any data into its internal buffer at all, as opposed to
having to copy PIPE_BUF worth of it in. Note that blocking large pipe
writes by default is a sensible decision.)
Part of pipes being buffered is likely to be due to how Unix evolved
and what early Unix machines looked like. Specifically, V7 and
earlier Unixes ran on single processor machines with relatively
little memory and without complex and capable MMUs (Unix support
for paged virtual memory post-dates V7, and I think wasn't really
available on the PDP-11 line anyway). On top of making the
implementation simpler, using a kernel buffer and allowing processes
to write to it before there is a reader means that a process that
only needs to write a small amount of data to a pipe may be able
to exit entirely before the next process runs, freeing up system
RAM. If writer processes always blocked until someone did a read(),
you'd have to keep them around until that happened.
(In fact, a waiting process might use more than 4 KB of kernel memory just for various data structures associated with it. Just from a kernel memory perspective you're better off accepting a small write buffer and letting the process go on to exit.)
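The 'write a little and exit' behavior is easy to demonstrate on a modern system. In this sketch (Python again, with the hypothetical function name write_then_exit), a child process writes a short message into a pipe and exits; the parent deliberately waits for the child to be completely gone before it reads anything. This only shows what a current kernel does with a small write, and says nothing about how V7 scheduled things.

```python
import os


def write_then_exit(message=b"written by a process that has already exited\n"):
    """Fork a writer that exits before the reader ever calls read()."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: the writer. A small write goes into the kernel buffer
        # even though nobody is reading yet, so we can exit immediately.
        os.close(r)
        os.write(w, message)
        os._exit(0)

    # Parent: the reader. Close our copy of the write end, then wait
    # until the writer has fully exited before reading anything.
    os.close(w)
    _, status = os.waitpid(pid, 0)
    data = os.read(r, len(message))
    os.close(r)
    return os.WEXITSTATUS(status), data


if __name__ == "__main__":
    status, data = write_then_exit()
    print(f"writer exit status {status}, read back {data!r}")
```

The message has to be small enough to fit in the pipe buffer; a write larger than the buffer would block the child until the parent started reading, and with the parent stuck in waitpid() the two would deadlock.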
PS: This may be a bit of a just-so story. I haven't inspected the
V7 kernel scheduler to see if it actually let processes that did a
write() into a pipe with a waiting reader go on to potentially exit,
or if it immediately suspended them to switch to the reader (or just to
another ready to run process, if any).
Our problem with Netplan and routes on Ubuntu 18.04
Today I've wound up getting back to our netplan dysfunction, so I think it's time to write a blog entry. Spoiler: highly specific network device names and configurations that can only be attached to or specified for a named network device interact very badly at scale.
We have a bunch of internal 'sandbox' networks, which connect to our main server subnet through various routing firewalls. Obviously the core router on our server subnet knows how to route to every sandbox; however, we also like the actual servers to have specific sandbox subnet routes to the appropriate routing firewall. For some servers this is merely vaguely nice; for high traffic servers (such as some NFS fileservers) it may be pretty important to avoid a pointless traffic bottleneck on the router. So many years ago we built a too-smart system to automatically generate the appropriate routes for any given host from a central set of information about what subnets were behind which gateway, and we ran it on boot to set things up. The result is a bunch of routing commands:
ip route add 10.63.0.0/16 via 128.100.3.5
ip route add 10.70.0.0/16 via 128.100.3.4
ip route add 172.31.0.0/16 via 128.100.3.6
[...]
This system is completely indifferent to what the local system's
network interface is called, which is good because in our environment
there is a huge assortment of interface names. We have eno1,
enp3s0f0, enp4s0f0, enp4s0, enp11s0f0, enp7s0, enp1s0f0,
and on and on.
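To make the shape of this concrete, here is a rough sketch of the idea in Python. It is not our actual system; the SANDBOX_GATEWAYS table and the route_commands function are hypothetical stand-ins for the central data file and the generator, using the three routes shown above. The important thing is what's missing: nothing in it ever needs to know an interface name.

```python
import ipaddress

# Hypothetical central data: sandbox subnet -> routing firewall address
# on the server subnet. The real data lives in a file with more entries.
SANDBOX_GATEWAYS = {
    "10.63.0.0/16": "128.100.3.5",
    "10.70.0.0/16": "128.100.3.4",
    "172.31.0.0/16": "128.100.3.6",
}


def route_commands(gateways):
    """Turn a subnet -> gateway table into 'ip route add' command lines."""
    commands = []
    for subnet, gateway in gateways.items():
        # Parsing both values catches typos (ValueError) before we ever
        # hand anything to 'ip'.
        net = ipaddress.ip_network(subnet)
        gw = ipaddress.ip_address(gateway)
        commands.append(f"ip route add {net} via {gw}")
    return commands


if __name__ == "__main__":
    for cmd in route_commands(SANDBOX_GATEWAYS):
        print(cmd)
```

Because each route only names a gateway, the kernel works out for itself which interface the gateway is reachable through.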
All of this worked great for the better part of a decade, until Ubuntu 18.04 came along with netplan. Netplan has two things that together combine to be quietly nearly fatal to what we want to do. First, the netplan setup on Ubuntu 18.04 will wipe out any 'foreign' routes it finds if and when it is re-run, which happens every so often during things like package upgrades. Second, the 18.04 version of netplan has no way to specify routes that are attached to a subnet instead of a specific named interface. If you want netplan to add extra routes to an interface, you cannot say 'associate the routes with whatever interface is on subnet <X>'; instead, you must associate the routes with an interface called <Y>, for whatever specific <Y> is in use on this system. As mentioned, <Y> is what you could call highly variable across our systems.
(Netplan claims to have some support for wildcards, but I couldn't get it to work and I don't think it ever would because it is wildcarding network interface names alone. Many of our machines have more than one network interface, and obviously only one of them is on the relevant subnet (and most of the others aren't connected to anything).)
The result is that there appears to be no good way for our perfectly
sensible desire for generic routing to interact well with netplan.
In a netplan world it appears that we should be writing and re-writing
a /etc/netplan/02-cslab-routes.yaml file, but that file has to
have the name of the system's current network interface burned into
it instead of being generic. We do shuffle network interfaces around
every so often (for instance to move a system from 1G to 10G-T),
which would require us to remember that there is an additional magic
step to regenerate this file.
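For illustration, here is roughly what generating such a file would involve, as a Python sketch under several assumptions. It assumes the server subnet is 128.100.3.0/24 (the prefix length is my guess for the example), that the input is text in the one-line-per-address format of 'ip -o -4 addr show', and that a per-interface 'routes:' list with 'to:' and 'via:' keys is the right netplan stanza. The function names and the sample output are made up.

```python
import ipaddress


def interface_on_subnet(ip_addr_output, subnet):
    """Return the first interface with an IPv4 address inside subnet.

    ip_addr_output is assumed to look like 'ip -o -4 addr show' output,
    e.g. '2: eno1    inet 128.100.3.20/24 brd 128.100.3.255 ...'.
    """
    net = ipaddress.ip_network(subnet)
    for line in ip_addr_output.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[2] == "inet":
            if ipaddress.ip_interface(fields[3]).ip in net:
                return fields[1]
    return None


def netplan_routes_yaml(ifname, gateways):
    """Render a netplan fragment with the interface name burned in."""
    lines = [
        "network:",
        "  version: 2",
        "  ethernets:",
        f"    {ifname}:",
        "      routes:",
    ]
    for subnet, gateway in gateways.items():
        lines.append(f"        - to: {subnet}")
        lines.append(f"          via: {gateway}")
    return "\n".join(lines) + "\n"


# Made-up sample input for a machine with one connected interface.
SAMPLE_ADDRS = (
    "1: lo    inet 127.0.0.1/8 scope host lo\n"
    "2: enp4s0f0    inet 128.100.3.20/24 brd 128.100.3.255 scope global enp4s0f0\n"
)

if __name__ == "__main__":
    name = interface_on_subnet(SAMPLE_ADDRS, "128.100.3.0/24")
    print(netplan_routes_yaml(name, {"10.63.0.0/16": "128.100.3.5"}), end="")
```

The interface lookup is exactly the part that the plain 'ip route add' approach never needed, and its result goes stale the moment the system's network interface changes.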
There are various additional problems here too, of course. First,
there appears to be no way to get netplan to redo just your routes
without touching anything else about interfaces, and we very much
want that. Second, on most systems we establish these additional
sandbox routes only after basic networking has come up and we've
NFS mounted our central administrative filesystem that has the data
file on it, which is far too late for normal netplan. I guess we'd
have to rewrite this file and then run 'netplan apply'.
(Ubuntu may love netplan a whole lot but I certainly hope no one else does.)