Exploring the mild oddity that Unix pipes are buffered

March 7, 2019

One of the things that blogging is good for is teaching me that what I think is common knowledge actually isn't. Specifically, when I wrote about a surprisingly arcane little Unix shell pipeline example, I assumed that it was common knowledge that Unix pipes are buffered by the kernel, in addition to any buffering that programs writing to pipes may do. In fact the buffering is somewhat interesting, and in a way it's interesting that pipes are buffered at all.

How much kernel buffering there is varies from Unix to Unix. 4 KB used to be the traditional size (it was the size on V7, for example, per the V7 pipe(2) manpage), but modern Unixes often have much bigger limits, and if I'm reading it right POSIX only requires a minimum of 512 bytes (that being the smallest value it permits for PIPE_BUF). But this isn't just a simple buffer, because the kernel also guarantees that if you write PIPE_BUF bytes or less to a pipe, your write is atomic and will never be interleaved with writes from other processes.
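To make the atomicity guarantee concrete, here is a rough Python sketch of how you might check it on a Unix system. Several child processes each write records of exactly PIPE_BUF bytes to one shared pipe, and the parent then checks that no record contains bytes from two writers. The function name and the writer and record counts are my own illustrative choices, and a clean run only fails to find interleaving; it doesn't prove that it can never happen.

```python
import os
import select


def interleave_check(writers=4, records=200):
    """Let several processes write PIPE_BUF-sized records to one pipe.

    Returns (total_bytes_read, number_of_mixed_records).
    """
    size = select.PIPE_BUF  # the platform's atomic write limit for pipes
    r, w = os.pipe()
    pids = []
    for i in range(writers):
        pid = os.fork()
        if pid == 0:
            # Child: every record is `size` copies of this writer's letter.
            try:
                os.close(r)
                record = bytes([ord("A") + i]) * size
                for _ in range(records):
                    os.write(w, record)  # blocking write of exactly PIPE_BUF bytes
            finally:
                os._exit(0)
        pids.append(pid)

    # The parent keeps only the read end, so it sees EOF once every child exits.
    os.close(w)
    data = bytearray()
    while True:
        chunk = os.read(r, 65536)
        if not chunk:
            break
        data += chunk
    os.close(r)
    for pid in pids:
        os.waitpid(pid, 0)

    # If every write was atomic, each PIPE_BUF-sized slice holds a single letter.
    mixed = 0
    for off in range(0, len(data), size):
        if len(set(data[off:off + size])) != 1:
            mixed += 1
    return len(data), mixed


if __name__ == "__main__":
    total, mixed = interleave_check()
    print("bytes read:", total, "mixed records:", mixed)
```

Making the records larger than PIPE_BUF removes the guarantee, although whether you then actually see mixed records depends on the kernel and on timing.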

(The normal situation on modern Linux is a 64 KB buffer; see the discussion in the Linux pipe(7) manpage. The atomicity of pipe writes goes back to early Unix and is required by POSIX, and I think POSIX also requires that there be an actual kernel buffer if you read the write() specification very carefully.)
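If you want to see what your own system does, the following is a minimal sketch in Python that asks for PIPE_BUF and then measures the buffer by filling a fresh pipe until a non-blocking write would block. The `pipe_capacity` name and the 512-byte chunk size are my choices; I picked 512 because it's the POSIX minimum for PIPE_BUF, so each write should be all or nothing.

```python
import os
import select


def pipe_capacity(chunk_size=512):
    """Return how many bytes a fresh pipe accepts before a write would block."""
    r, w = os.pipe()
    os.set_blocking(w, False)  # make a full pipe raise instead of hanging
    chunk = b"x" * chunk_size
    total = 0
    try:
        while True:
            try:
                total += os.write(w, chunk)
            except BlockingIOError:
                # The kernel buffer is full; nobody has read anything.
                return total
    finally:
        os.close(r)
        os.close(w)


if __name__ == "__main__":
    print("PIPE_BUF:", select.PIPE_BUF)
    print("pipe capacity:", pipe_capacity())
```

On a modern Linux I'd expect the capacity to come out at the 64 KB mentioned above, but the number is whatever your kernel and its configuration decide.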

On the one hand this kernel buffering and the way it behaves make perfect sense, and it's definitely useful. On the other hand it's also at least a little bit unusual. Pipes are a unidirectional communication channel and it's pretty common to have unbuffered channels where a writer blocks until there's a reader (Go channels work this way by default, for example). In addition, having pipes buffered in the kernel commits the kernel to providing a certain amount of kernel memory once a pipe is created, even if it's never read from. As long as the read end of the pipe is open, the kernel has to hold on to anything it allowed to be written into the pipe buffer.

(However, if you write() more than PIPE_BUF bytes to a pipe at once, I believe that the kernel is free to pause your process without accepting any data into its internal buffer at all, as opposed to having to copy PIPE_BUF worth of it in. Note that blocking large pipe writes by default is a sensible decision.)

Part of pipes being buffered is likely to be due to how Unix evolved and what early Unix machines looked like. Specifically, V7 and earlier Unixes ran on single processor machines with relatively little memory and without complex and capable MMUs (Unix support for paged virtual memory post-dates V7, and I think wasn't really available on the PDP-11 line anyway). On top of making the implementation simpler, using a kernel buffer and allowing processes to write to it before there is a reader means that a process that only needs to write a small amount of data to a pipe may be able to exit entirely before the next process runs, freeing up system RAM. If writer processes always blocked until someone did a read(), you'd have to keep them around until that happened.
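The 'write a little and exit before the reader runs' case is easy to reproduce on a modern Unix. In this Python sketch (the function name and message are mine, purely for illustration), a child process writes a small message into a pipe and exits, and the parent deliberately waits for that exit before it reads anything. The data has to be sitting in the kernel's buffer in the meantime.

```python
import os


def write_then_exit(message=b"hello from a process that is already gone\n"):
    """Fork a writer that exits before the reader ever calls read().

    Returns (writer_exited_cleanly, data_read_from_pipe).
    The message must be small enough to fit in the pipe buffer, or the
    child would block in write() and the waitpid() below would hang.
    """
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: write into the kernel buffer and go away immediately.
        try:
            os.close(r)
            os.write(w, message)
        finally:
            os._exit(0)

    os.close(w)
    # Wait for the writer to be completely gone *before* reading.
    _, status = os.waitpid(pid, 0)
    exited_cleanly = os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
    data = os.read(r, len(message))
    os.close(r)
    return exited_cleanly, data


if __name__ == "__main__":
    ok, data = write_then_exit()
    print("writer exited cleanly:", ok)
    print("read after exit:", data)
```

With an unbuffered channel the child would have had to stay around, blocked in its write, until the parent got to its read().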

(In fact, a waiting process might use more than 4 KB of kernel memory just for various data structures associated with it. Just from a kernel memory perspective you're better off accepting a small write buffer and letting the process go on to exit.)

PS: This may be a bit of a just-so story. I haven't inspected the V7 kernel scheduler to see if it actually let processes that did a write() into a pipe with a waiting reader go on to potentially exit, or if it immediately suspended them to switch to the reader (or just to another ready-to-run process, if any).

Comments on this page:

If a pipe write doesn’t trigger a task switch I’d assume it’s not for memory savings from allowing the process to exit but for overhead reduction from, well, reduced task switching. (Esp. the older the architecture is for which the kernel is written.) Go channels are in-process so it’s a different situation. Intuitively I’d be more surprised to see a kernel IPC stream interface without buffering than one with, though I can’t articulate the why very well beyond a hand-waved “overhead reduction”. I wonder if the actual V7 code bears this out.

Huh. That is an interesting question...

Linux used per-process virtual addresses from the start. The kernel could not just memcpy() from one process to another; the primitives it has are copy_from_user and copy_to_user. From that point of view, you had to use a kernel buffer. (Otherwise, I guess you would have to implement some special-case cross-process mapping code that is not used anywhere else?) And then enforcing the unbuffered semantics would have been less natural: extra complexity without obvious gain.

Did MMU-less UNIX implement, or consider implementing, swapping processes to disk? That would be an equivalent reason, that you should avoid directly reading or writing the peer process's memory.

By cks at 2019-03-08 11:25:06:

To be clear, the PDP-11 series had virtual memory from the start (with memory mapping and memory protection); it just didn't have paged virtual memory. Research Unix swapped whole programs in and out as needed, and you're right, a swapped out process would complicate trying to copy data directly to it. Clearly you're going to need a kernel buffer some of the time so you might as well use it all of the time, especially since Research Unix very much believed in simple solutions.

(According to Dennis Ritchie's information, very early versions of Unix did run on a PDP-11/20 without any memory protection. It was apparently about as much of a pain for a multi-user machine as you'd expect.)

Written on 07 March 2019.
