Why "process substitution" is a late feature in Unix shells

January 2, 2022

A while ago, I read Julia Evans' "Teaching by filling in knowledge gaps" and hit the section that uses her shell brackets cheat sheet as an example. One of the uses of brackets in Bash and other shells is "process substitution" (see also Wikipedia), where a command's output is made available under a file name, so that you can pass a process instead of a file as an argument to other commands:

diff <(rpm -qa) <(ssh server2 "rpm -qa")

Process substitution is a great little feature and it feels very Unixy, but it took a surprisingly long time to appear in Unix and in shells. This is because it needed a crucial innovation, namely names in the filesystem for file descriptors, names that you can open() to be connected to the file descriptor.

Standard input, standard output, and so on are file descriptors, which (from the view of Unix processes) are small integers that refer to open files, pipes, network connections, and other things that fall inside the Unix IO model. File descriptors are specific to each process and are an API between processes and the kernel, where the process tells the kernel that it wants to read from (eg) file descriptor zero and the kernel provides it whatever is there. Conventionally, Unix processes are started with three file descriptors already open, those being standard input (fd 0), standard output (fd 1), and standard error (fd 2). However, you can start processes with more file descriptors already open and connected to something if you want to.
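As a minimal sketch of that last point (my illustration, with a hypothetical helper name read_fd3), an ordinary shell redirection can start a child process with file descriptor 3 already open. The child never calls open() itself; it simply reads from the descriptor it inherited:

```shell
# Create a small input file to attach to an extra file descriptor.
tmpfile=$(mktemp)
printf 'hello from fd 3\n' >"$tmpfile"

# read_fd3 FILE (hypothetical helper): start a child shell with fd 3
# already open on FILE. The child reads from the inherited descriptor.
read_fd3() {
    sh -c 'cat <&3' 3<"$1"
}

read_fd3 "$tmpfile"    # prints: hello from fd 3
rm -f "$tmpfile"
```

The child here only knows to look at fd 3 because the command line was written that way, which is exactly the problem with handing extra file descriptors to ordinary programs.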

Normal Unix programs don't expect to be passed any extra file descriptors and there's no standard approach in Unix for telling them that they have been given extra file descriptors and they should read or write to them for some purpose. Instead, famously, Unix programs like diff expect to be provided file names as arguments, and then they open the file names themselves. Some programs accept a special file name (often '-', a single dash) to mean that they should read from standard input or write to standard output, but this is only a convention; there's no actual '-' filename that you can open yourself.
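To illustrate the '-' convention, here is a small sketch (the helper name compare_with_stdin is made up for this example). It relies on diff accepting '-' for one of its two inputs; since a process has only one standard input, this trick can cover only one side of the comparison:

```shell
# compare_with_stdin FILE (hypothetical helper): diff standard input,
# named by the '-' convention, against FILE.
compare_with_stdin() {
    diff - "$1"
}

f=$(mktemp)
printf 'one\ntwo\n' >"$f"
printf 'one\ntwo\n' | compare_with_stdin "$f" && echo "same"    # prints: same
rm -f "$f"
```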

To implement process substitution, the shell needs to bridge these two different worlds. The process substitution commands will write to their standard output, but the overall command must be given file names as input. There are two ways to implement this, the inefficient one that's been possible since the beginning of Unix, and the efficient one that became possible later. The inefficient way is to write the output of the commands to a file, turning the whole thing into something like this:

rpm -qa >/tmp/file-a.$$
ssh server2 "rpm -qa" >/tmp/file-b.$$
diff /tmp/file-a.$$ /tmp/file-b.$$
rm /tmp/file-a.$$ /tmp/file-b.$$

I believe that some Unix shells may have implemented this, but it was never very popular for various reasons (especially since this was back in the days when /tmp was generally on a slow hard disk). Once named FIFOs were available on Unixes, you could use them instead of actual files, which improved the efficiency but still had some issues.
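The named FIFO approach can be sketched by hand roughly like this. This is my own illustration rather than what any particular shell does internally; diff_via_fifos is a hypothetical helper, and it assumes mktemp -d is available (it's common but not in POSIX):

```shell
# diff_via_fifos CMD_A CMD_B (hypothetical helper): run two commands and
# diff their output through named FIFOs instead of regular temporary files.
diff_via_fifos() {
    dir=$(mktemp -d) || return 2
    mkfifo "$dir/a" "$dir/b" || { rm -rf "$dir"; return 2; }

    # The writers run in the background. Each one blocks in open() until
    # diff opens the other end of the same FIFO for reading.
    sh -c "$1" >"$dir/a" &
    sh -c "$2" >"$dir/b" &

    # Remember diff's exit status (0 same, 1 different) for the caller.
    status=0
    diff "$dir/a" "$dir/b" || status=$?

    # Unlike /dev/fd names, the FIFOs must be cleaned up explicitly.
    wait
    rm -rf "$dir"
    return "$status"
}

diff_via_fifos "printf 'a\nb\n'" "printf 'a\nc\n'"
```

This also hints at some of the issues: the FIFOs need a writable directory and explicit cleanup, and if the consuming command never opens one of them, the corresponding writer stays blocked.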

The best way is to have filesystem names for file descriptors, so that when you open the filename, you're connected to the file descriptor (you may or may not get that file descriptor returned by the kernel from open()). Then the shell can start the diff process with some extra file descriptors open that are the read ends of the pipes that the two process substitution commands are writing their output to, and it can provide the filesystem names for these file descriptors as command line arguments to diff. Diff thinks it's operating on files (although odd ones, since they're not seekable among other issues), and generally it will be happy. Everything is automatically cleaned up when things exit and it's about as efficient as you could ask for. The conventional modern filesystem name for file descriptors is /dev/fd/N (for file descriptor N).
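You can imitate this plumbing by hand in a plain POSIX shell, assuming a system with a working /dev/fd (such as Linux). The sketch below is only an approximation of what a shell might do for process substitution, and diff_via_devfd is a hypothetical helper name. It leaves one pipe on fd 3 and the other on fd 0, then gives diff their /dev/fd names as if they were ordinary files:

```shell
# diff_via_devfd CMD_A CMD_B (hypothetical helper): diff the output of two
# commands through pipes named via /dev/fd, with no temporary files or FIFOs.
diff_via_devfd() {
    # CMD_A's pipe arrives as standard input of the brace group, and 3<&0
    # duplicates it onto fd 3. Inside the group, CMD_B's pipe becomes
    # diff's fd 0. diff then open()s both pipes by their /dev/fd names.
    sh -c "$1" | { sh -c "$2" | diff /dev/fd/3 /dev/fd/0; } 3<&0
}

diff_via_devfd "printf 'a\nb\n'" "printf 'a\nc\n'"
```

There is nothing to clean up afterward; the pipes go away when the processes exit. In Bash on such systems, running echo <(true) typically shows the same kind of /dev/fd name being handed to the command.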

I think every modern Unix has a /dev/fd of some sort (although the implementations vary), but coming up with the idea of /dev/fd, having it implemented, and then having it spread widely enough that shells could reliably use it took a while. My impression is that process substitution in shells didn't start to be common until then, and even today isn't necessarily in wide use.

(Unfortunately I'm not sure where /dev/fd was first invented and introduced. It's possible that it comes from later versions of Research Unix, since the V10 version of rc apparently had this and I can't imagine the Bell Labs people implementing it with named FIFOs. /dev/fd itself took some Unix innovations after V7, but that's for another entry.)

PS: Considering that Bash apparently had process substitution no later than 1994, my standards for a 'late shell feature' may be a bit off from many people's. However, I think process substitution is still not in the shell section of the current version of POSIX, although named FIFOs are.
