Why the popen() API works but more complex versions blow up
Years ago I wrote about a long-standing Unix issue with more
sophisticated versions of popen(); my specific
example was writing a large amount of stuff to a subprogram through
a pipe and then reading its output, where both sides stall trying
to write to full pipes. Of course this is not the only way to have
this problem bite you; recently I ran across Andrew Jorgensen's
A Tale of Two Pipes, where
the same problem comes up when a subprogram writes to both standard
output and standard error and you consume them one at a time.
Things like Python's subprocess module and many other imitators
generally trace their core idea back to the venerable Unix popen(3)
library function, which first appeared in V7 Unix.
However, popen() itself does not actually have this problem; only
more sophisticated and capable interfaces based on it do.
The reason popen() doesn't have the problem is straightforward
and points to the core problem with more elaborate versions of the
API. popen() doesn't have a problem because it only gives you
a single IO stream, either the sub-program's standard input or its
standard output. More sophisticated APIs give you multiple streams,
and multiple streams are where you get into trouble. You get into
trouble because more sophisticated APIs with multiple streams are
implicitly pretending that the streams can be dealt with independently
and serially, ie that you can fully process one stream before looking
at another one at all. As A Tale of Two Pipes makes clear, this
is not so. In actuality the streams are inter-dependent and have
to be processed together, although Unix pipe buffers can hide this
from you for a while.
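
To make the stall concrete, here is a minimal Python sketch of the serial pattern. It assumes a Unix-like system with a pipe buffer well under 1 MiB (commonly 64 KiB on Linux). The helper name `read_serially`, the child script, and the watchdog timeout are all made up for illustration. The child writes to standard error first and only then to standard output, while the parent insists on draining standard output first.

```python
import subprocess
import sys
import threading

# Hypothetical child: write N bytes to stderr, then a short message to stdout.
CHILD = (
    "import sys\n"
    "n = int(sys.argv[1])\n"
    "sys.stderr.write('x' * n)\n"
    "sys.stderr.flush()\n"
    "sys.stdout.write('done')\n"
)

def read_serially(stderr_bytes, timeout=2.0):
    """Read stdout to EOF, then stderr. Returns (stdout, stderr_len, stalled)."""
    p = subprocess.Popen(
        [sys.executable, "-c", CHILD, str(stderr_bytes)],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    stalled = threading.Event()

    def give_up():
        # Watchdog: if the serial reads never finish, kill the child so
        # that the demonstration itself does not hang forever.
        stalled.set()
        p.kill()

    timer = threading.Timer(timeout, give_up)
    timer.start()
    try:
        out = p.stdout.read()  # blocks until the child closes stdout
        err = p.stderr.read()  # stderr is only looked at afterwards
    finally:
        timer.cancel()
        p.wait()
    return out, len(err), stalled.is_set()
```

With a small `stderr_bytes` the pipe buffer absorbs everything and the serial reads appear to work. With an amount larger than the buffer, the child blocks writing to standard error while the parent blocks reading standard output, and only the watchdog ends the standoff.
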
Of course you can handle the streams properly yourself, resorting
to poll() or some similar measure. But you shouldn't have to
remember to do that, partly because as long as you have to take
additional complex steps to make things work right, people are going
to be forgetting this requirement. In the name of looking simple
and generic, these APIs have armed a gun that is pointed straight
at your feet. A more honest API would make the inter-dependency
clear, perhaps by returning a Subprocess object that you register
callbacks on. Callbacks have a bad reputation but they at least
make it clear that things can (and will) happen concurrently, instead
of one stream being fully handled before another stream is even
touched.
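
As a rough sketch of what handling the streams together might look like, here is a small Python helper built on the standard `selectors` module (a wrapper around poll() and friends). The name `run_with_callbacks` and its callback signature are hypothetical, not an existing API, and the sketch assumes a Unix-like system where pipes can be polled.

```python
import os
import selectors
import subprocess
import sys

def run_with_callbacks(argv, on_stdout, on_stderr):
    """Run argv, passing output chunks to the callbacks as they arrive."""
    p = subprocess.Popen(argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    sel = selectors.DefaultSelector()
    # The callback for each stream rides along as the selector key's data.
    sel.register(p.stdout, selectors.EVENT_READ, on_stdout)
    sel.register(p.stderr, selectors.EVENT_READ, on_stderr)
    # Keep going until both pipes have reported end-of-file.
    while sel.get_map():
        for key, _events in sel.select():
            # os.read() returns whatever is available, up to the limit,
            # so neither stream can starve the other.
            chunk = os.read(key.fd, 65536)
            if chunk:
                key.data(chunk)
            else:
                sel.unregister(key.fileobj)
                key.fileobj.close()
    sel.close()
    return p.wait()

if __name__ == "__main__":
    child = "import sys; sys.stderr.write('x' * 200000); sys.stdout.write('done')"
    out, err = [], []
    status = run_with_callbacks([sys.executable, "-c", child], out.append, err.append)
    print(status, b"".join(out), sum(len(c) for c in err))
```

Because both pipes are drained as data shows up, a child that writes a lot to standard error before touching standard output no longer stalls, and the callbacks make it visible that chunks from the two streams may arrive interleaved.
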
(Go has an interesting approach to the problem that is sort of half
solution and half not. In its core os/exec API for this, you provide
streams which will be read from or written to asynchronously. However
there are helper methods that give you a more traditional 'here is a
stream' interface and with it the traditional problems.)
Sidebar: Why people keep creating these flawed subprogram APIs on Unix
These APIs keep getting created because they're attractive. How the
API appears to behave (ie, without the deadlock issues) is how
people often want to deal with subprograms. Most of the time you're
not interacting with them step by step, sending in some input and
collecting some output; instead you're sending in the input,
collecting the output, and maybe collecting standard error as well
in case something blew up. People don't want to write poll()-based
loops or callbacks or anything complicated, because concurrency is at
least annoying. They just want the simple API to work.
Possibly libraries should make the straightforward user code work by handling all of the polling and so on internally and being willing to buffer unlimited amounts of standard output and standard error. This would probably blow up less often than the current scheme does, and you could provide various options for how much to buffer and how to deal with overflow for advanced users.
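
Python's own subprocess module already offers something in this spirit: `Popen.communicate()` (and `subprocess.run()`, which uses it) feeds standard input and collects both output streams concurrently, buffering everything in memory. A minimal sketch, where the helper name `shout` and the child script are made up for illustration:

```python
import subprocess
import sys

# Hypothetical streaming child: upper-cases each input line on stdout
# and writes one progress dot per line to stderr.
CHILD = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    sys.stdout.write(line.upper())\n"
    "    sys.stderr.write('.')\n"
)

def shout(data):
    """Send data to the child and return (returncode, stdout, stderr)."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        input=data,           # written to the child's stdin for us
        capture_output=True,  # stdout and stderr both buffered in memory
    )
    return result.returncode, result.stdout, result.stderr

if __name__ == "__main__":
    code, out, err = shout(b"hello\n" * 50000)
    print(code, len(out), len(err))
```

The input and both outputs here are each larger than a pipe buffer is likely to be, so a naive write-everything-then-read approach would risk stalling. The cost is the one described above: everything is held in memory, so this suits bounded output rather than unlimited streams.
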