Exploring the mild oddity that Unix pipes are buffered

March 7, 2019

One of the things that blogging is good for is teaching me that what I think is common knowledge actually isn't. Specifically, when I wrote about a surprisingly arcane little Unix shell pipeline example, I assumed that it was common knowledge that Unix pipes are buffered by the kernel, in addition to any buffering that programs writing to pipes may do. In fact the buffering is somewhat interesting, and in a way it's interesting that pipes are buffered at all.

How much kernel buffering there is varies from Unix to Unix. 4 KB used to be the traditional size (it was the size on V7, for example, per the V7 pipe(2) manpage), but modern Unixes often have much bigger limits, and if I'm reading it right, POSIX only requires a minimum of 512 bytes. But this isn't just a simple buffer, because the kernel also guarantees that if you write PIPE_BUF bytes or less to a pipe, your write is atomic and will never be interleaved with writes from other processes.
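
As a concrete illustration, here's a minimal C sketch that asks the system about this limit; PIPE_BUF from <limits.h> is the compile-time guarantee, while fpathconf() reports the value for a specific pipe:

    #include <limits.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        int fds[2];

        if (pipe(fds) != 0) {
            perror("pipe");
            return 1;
        }
    #ifdef PIPE_BUF
        /* The compile-time constant from <limits.h>. */
        printf("PIPE_BUF: %d\n", PIPE_BUF);
    #endif
        /* The runtime value for this particular pipe. */
        printf("_PC_PIPE_BUF: %ld\n", fpathconf(fds[0], _PC_PIPE_BUF));
        return 0;
    }

On Linux, both of these report 4096 bytes; note that this atomicity limit is separate from (and smaller than) the pipe's total buffer capacity.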

(The normal situation on modern Linux is a 64 KB buffer; see the discussion in the Linux pipe(7) manpage. The atomicity of pipe writes goes back to early Unix and is required by POSIX, and I think POSIX also requires that there be an actual kernel buffer if you read the write() specification very carefully.)
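
If you're specifically on Linux, you can inspect and even change a pipe's total capacity with the F_GETPIPE_SZ and F_SETPIPE_SZ fcntl()s described in the pipe(7) and fcntl(2) manpages. A small sketch (this is Linux-only, not portable POSIX):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        int fds[2];

        if (pipe(fds) != 0) {
            perror("pipe");
            return 1;
        }
        /* F_GETPIPE_SZ reports the pipe's total buffer capacity. */
        printf("capacity: %d bytes\n", fcntl(fds[0], F_GETPIPE_SZ));

        /* F_SETPIPE_SZ can raise it, up to /proc/sys/fs/pipe-max-size
           for unprivileged processes. */
        if (fcntl(fds[0], F_SETPIPE_SZ, 1048576) == -1)
            perror("F_SETPIPE_SZ");
        printf("now: %d bytes\n", fcntl(fds[0], F_GETPIPE_SZ));
        return 0;
    }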

On the one hand, this kernel buffering and its behavior make perfect sense and are definitely useful. On the other hand, it's also at least a little bit unusual. Pipes are a unidirectional communication channel, and it's pretty common to have unbuffered channels where a writer blocks until there's a reader (Go channels work this way by default, for example). In addition, having pipes buffered in the kernel commits the kernel to providing a certain amount of kernel memory once a pipe is created, even if it's never read from. As long as the read end of the pipe is open, the kernel has to hold on to anything it allowed to be written into the pipe buffer.

(However, if you write() more than PIPE_BUF bytes to a pipe at once, I believe that the kernel is free to pause your process without accepting any data into its internal buffer at all, as opposed to having to copy PIPE_BUF bytes' worth of it in. Note that blocking large pipe writes by default is a sensible decision.)
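
The blocking case is hard to observe directly, but my understanding of the POSIX write() specification is that with O_NONBLOCK set, a pipe write of PIPE_BUF bytes or less is all-or-nothing (it either succeeds completely or fails with EAGAIN), while a larger write may be partial. Here's a sketch that shows the boundary:

    #include <fcntl.h>
    #include <limits.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        int fds[2];
        static char buf[PIPE_BUF * 32];    /* much larger than PIPE_BUF */

        if (pipe(fds) != 0) {
            perror("pipe");
            return 1;
        }
        fcntl(fds[1], F_SETFL, fcntl(fds[1], F_GETFL) | O_NONBLOCK);

        /* Write until the kernel buffer fills up. */
        for (;;) {
            ssize_t n = write(fds[1], buf, sizeof(buf));
            if (n == -1) {
                /* EAGAIN: the kernel accepted nothing at all. */
                perror("write");
                break;
            }
            /* A short count here is a partial write, which the kernel
               may do for writes larger than PIPE_BUF. */
            printf("wrote %zd of %zu bytes\n", n, sizeof(buf));
        }
        return 0;
    }

On my understanding of Linux behavior, the first (oversized) write partially succeeds, filling the 64 KB buffer, and the next one fails with EAGAIN.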

Part of pipes being buffered is likely due to how Unix evolved and what early Unix machines looked like. Specifically, V7 and earlier Unixes ran on single-processor machines with relatively little memory and without complex and capable MMUs (Unix support for paged virtual memory post-dates V7, and I think it wasn't really available on the PDP-11 line anyway). On top of making the implementation simpler, using a kernel buffer and allowing processes to write to it before there is a reader means that a process that only needs to write a small amount of data to a pipe may be able to exit entirely before the next process runs, freeing up system RAM. If writer processes always blocked until someone did a read(), you'd have to keep them around until that happened.
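
Here's a little C sketch of this 'writer exits before the reader runs' pattern; the child writes a small message and exits immediately, and the parent can still read the message afterward because the kernel kept it in the pipe buffer:

    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        int fds[2];
        char buf[64];
        ssize_t n;

        if (pipe(fds) != 0) {
            perror("pipe");
            return 1;
        }
        if (fork() == 0) {
            /* Child: write a little data and exit immediately. */
            close(fds[0]);
            write(fds[1], "hello from the past\n", 20);
            _exit(0);
        }
        close(fds[1]);
        wait(NULL);    /* the writer is completely gone by this point */
        /* ... yet its data is still sitting in the kernel's buffer. */
        n = read(fds[0], buf, sizeof(buf));
        if (n > 0)
            fwrite(buf, 1, n, stdout);
        return 0;
    }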

(In fact, a waiting process might use more than 4 KB of kernel memory just for various data structures associated with it. Just from a kernel memory perspective you're better off accepting a small write buffer and letting the process go on to exit.)

PS: This may be a bit of a just-so story. I haven't inspected the V7 kernel scheduler to see if it actually let processes that did a write() into a pipe with a waiting reader go on to potentially exit, or if it immediately suspended them to switch to the reader (or just to another ready-to-run process, if any).
