Exploring the mild oddity that Unix pipes are buffered
One of the things that blogging is good for is teaching me that what I think is common knowledge actually isn't. Specifically, when I wrote about a surprisingly arcane little Unix shell pipeline example, I assumed that it was common knowledge that Unix pipes are buffered by the kernel, in addition to any buffering that programs writing to pipes may do. In fact the buffering is somewhat interesting, and in a way it's interesting that pipes are buffered at all.
How much kernel buffering there is varies from Unix to Unix. 4 KB
used to be the traditional size (it was the size on V7, for example,
per the V7 pipe(2) manpage),
but modern Unixes often have much bigger limits, and if I'm reading
it right POSIX only requires a minimum of 512 bytes. But this isn't
just a simple buffer, because the kernel also guarantees that if
you write PIPE_BUF bytes or less to a pipe, your write is atomic
and will never be interleaved with other writes from other processes.
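As a rough illustration of the atomicity guarantee, here is a minimal Python sketch in which several forked writers share one pipe. The record size of 512 bytes is chosen because it is the smallest PIPE_BUF that POSIX allows; the function name, the writer count, and the record count are all made up for this example. A clean run only shows that no interleaving was observed, not that it can never happen.

```python
import os

RECORD = 512               # POSIX guarantees PIPE_BUF is at least this large
WRITERS = 4                # hypothetical number of competing writer processes
RECORDS_PER_WRITER = 200   # hypothetical; enough to fill the buffer several times


def records_stay_intact() -> bool:
    """Have several processes write fixed-size records to one pipe and
    check that every record read back came from a single writer."""
    r, w = os.pipe()
    pids = []
    for i in range(WRITERS):
        pid = os.fork()
        if pid == 0:
            # Child: write records made of one repeated, writer-specific byte.
            os.close(r)
            record = bytes([ord("A") + i]) * RECORD
            for _ in range(RECORDS_PER_WRITER):
                os.write(w, record)  # RECORD <= PIPE_BUF, so this should be atomic
            os._exit(0)
        pids.append(pid)
    os.close(w)  # the parent only reads; EOF arrives once every child has exited

    intact = True
    total = 0
    with os.fdopen(r, "rb") as reader:
        while True:
            record = reader.read(RECORD)  # buffered read: RECORD bytes or EOF
            if not record:
                break
            total += 1
            # A torn or interleaved record would be short or contain mixed bytes.
            if len(record) != RECORD or len(set(record)) != 1:
                intact = False

    for pid in pids:
        os.waitpid(pid, 0)
    return intact and total == WRITERS * RECORDS_PER_WRITER


if __name__ == "__main__":
    print("all records intact:", records_stay_intact())
```

Raising RECORD above the system's PIPE_BUF removes the guarantee, and mixed records may then start to show up.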
(The normal situation on modern Linux is a 64 KB buffer; see the
discussion in the Linux pipe(7) manpage. The atomicity
of pipe writes goes back to early Unix and is required by POSIX,
and I think POSIX also requires that there be an actual kernel
buffer if you read the write() specification very
carefully.)
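If you want to see what your own system does, one way is to ask for PIPE_BUF and then fill a fresh pipe until the kernel refuses more data. This is a minimal sketch; the function name is invented for the example, and it assumes a Unix where os.fpathconf() knows the "PC_PIPE_BUF" name. The numbers it reports will vary from Unix to Unix.

```python
import os


def pipe_limits() -> tuple[int, int]:
    """Return (PIPE_BUF, measured kernel buffer capacity) for a fresh pipe."""
    r, w = os.pipe()
    try:
        pipe_buf = os.fpathconf(w, "PC_PIPE_BUF")
        # With O_NONBLOCK set, a write of PIPE_BUF bytes or less either
        # succeeds completely or fails with EAGAIN (BlockingIOError).
        os.set_blocking(w, False)
        chunk = b"x" * pipe_buf
        capacity = 0
        while True:
            try:
                capacity += os.write(w, chunk)
            except BlockingIOError:
                break  # the kernel buffer has no room for another chunk
        return pipe_buf, capacity
    finally:
        os.close(r)
        os.close(w)


if __name__ == "__main__":
    pipe_buf, capacity = pipe_limits()
    print(f"PIPE_BUF is {pipe_buf} bytes; the pipe accepted {capacity} bytes")
```

Because it writes in PIPE_BUF-sized chunks, the measured capacity is rounded down to a multiple of PIPE_BUF.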
On the one hand, this kernel buffering and its behavior make perfect sense and are definitely useful. On the other hand, it's also at least a little bit unusual. Pipes are a unidirectional communication channel, and it's pretty common for such channels to be unbuffered, so that a writer blocks until there's a reader (Go channels work this way by default, for example). In addition, having pipes buffered in the kernel commits the kernel to providing a certain amount of kernel memory once a pipe is created, even if it's never read from. As long as the read end of the pipe is open, the kernel has to hold on to anything it allowed to be written into the pipe buffer.
(However, if you write() more than PIPE_BUF bytes to a pipe
at once, I believe that the kernel is free to pause your process
without accepting any data into its internal buffer at all, as
opposed to having to copy PIPE_BUF worth of it in. Note that
blocking large pipe writes by default is a sensible decision.)
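The blocking itself is easy to observe, even if how much data the kernel accepts before pausing the writer is not. The sketch below starts a writer thread with a payload assumed to be larger than the pipe buffer (1 MiB by default, an arbitrary choice for this example), checks that the writer is still stuck after half a second with no reader, and only then drains the pipe. The function name and the timeout are invented here.

```python
import os
import threading


def large_write_blocks(payload_size: int = 1 << 20) -> bool:
    """Return True if a big blocking write stalls until the pipe is drained.

    Assumes payload_size is larger than the kernel's pipe buffer.
    """
    r, w = os.pipe()
    payload = b"x" * payload_size

    def writer() -> None:
        view = memoryview(payload)
        sent = 0
        while sent < len(view):
            # Blocks whenever the kernel buffer is full and nobody is reading.
            sent += os.write(w, view[sent:])
        os.close(w)  # lets the reader see EOF

    t = threading.Thread(target=writer)
    t.start()
    t.join(timeout=0.5)       # no reader yet; the writer finishes only if it all fit
    stalled = t.is_alive()

    received = 0
    while True:
        data = os.read(r, 65536)
        if not data:
            break             # EOF: the writer has closed its end
        received += len(data)
    t.join()
    os.close(r)
    return stalled and received == payload_size


if __name__ == "__main__":
    print("writer stalled until drained:", large_write_blocks())
```

This shows only that the write does not complete without a reader. It says nothing about whether the kernel copied part of the data in before suspending the writer.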
Part of the reason pipes are buffered is likely how Unix evolved
and what early Unix machines looked like. Specifically, V7 and
earlier Unixes ran on single processor machines with relatively
little memory and without complex and capable MMUs (Unix support
for paged virtual memory post-dates V7, and I think wasn't really
available on the PDP-11 line anyway). On top of making the
implementation simpler, using a kernel buffer and allowing processes
to write to it before there is a reader means that a process that
only needs to write a small amount of data to a pipe may be able
to exit entirely before the next process runs, freeing up system
RAM. If writer processes always blocked until someone did a read(),
you'd have to keep them around until that happened.
(In fact, a waiting process might use more than 4 KB of kernel memory just for various data structures associated with it. Just from a kernel memory perspective you're better off accepting a small write buffer and letting the process go on to exit.)
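The "writer exits before the reader ever runs" scenario can be sketched in a few lines of Python. The function name is made up for this example, and the message is assumed to be small enough to fit in the kernel buffer; if it were not, the child would block in write() while the parent blocked in waitpid(), and the two would deadlock.

```python
import os


def read_after_writer_exit(message: bytes) -> bytes:
    """Fork a writer, wait until it has fully exited, then read its data.

    Assumes len(message) fits in the kernel pipe buffer.
    """
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: write into the pipe and exit without waiting for any reader.
        os.close(r)
        os.write(w, message)
        os._exit(0)

    os.close(w)
    _, status = os.waitpid(pid, 0)  # after this, the writer process is gone
    if not (os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0):
        os.close(r)
        raise RuntimeError("writer did not exit cleanly")

    # Only now does anyone read; the kernel has been holding the data.
    data = os.read(r, len(message))
    os.close(r)
    return data


if __name__ == "__main__":
    print(read_after_writer_exit(b"written before anyone was reading\n"))
```

With an unbuffered channel, the waitpid() call here could never return, because the child would still be waiting for its reader.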
PS: This may be a bit of a just-so story. I haven't inspected the
V7 kernel scheduler to see if it actually let processes that did a
write() into a pipe with a waiting reader go on to potentially exit,
or if it immediately suspended them to switch to the reader (or just to
another ready to run process, if any).