2014-10-29
Unnoticed nonportability in Bourne shell code (and elsewhere)
In response to my entry on how Bashisms in #!/bin/sh scripts aren't
necessarily bugs, FiL wrote:
If you gonna use bashism in your script why don't you make it clear in the header specifying #!/bin/bash instead [of] #!/bin/sh? [...]
One of the historical hard problems for Unix portability is people writing non-portable code without realizing it, and Bourne shell code is no exception. This is true even for well-intentioned people writing code that they want to be portable.
One problem, perhaps the root problem, is that very little you do on Unix will come with explicit (non-)portability warnings and you almost never have to go out of your way to use non-portable features. This makes it very hard to know whether or not you're actually writing portable code without trying to run it on multiple environments. The other problem is that it's often both hard to remember and hard to discover what is non-portable versus what is portable. Bourne shell programming is an especially good example of both issues (partly because Bourne shell scripts often use a lot of external commands), but there have been plenty of others in Unix's past (including 'all the world's a VAX' and all sorts of 64-bit portability issues in C code).
So one answer to FiL's question is that a lot of people are using
bashisms in their scripts without realizing it, just as a lot of
people have historically written non-portable Unix C code without
intending to. They think they're writing portable Bourne shell scripts,
but because their /bin/sh is Bash and nothing in Bash warns about
such things, the issues sail right by. Then one day you wind up changing
/bin/sh to be Dash and all sorts of bits of the world explode,
sometimes in really obscure ways.
All of this sounds abstract, so let me give you two examples of
accidental Bashisms I've committed. The first and probably quite
common one is using '==' instead of '=' in '[ ... ]' conditions.
Many other languages use == as their string equality check, so at some
point I slipped and started using it in 'Bourne' shell scripts. Nothing
complained, everything worked, and I thought my shell scripts were fine.
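To illustrate (using a made-up $answer variable), the difference is a single character:

    # works under Bash's [ builtin, but '==' isn't in POSIX test/[
    if [ "$answer" == "yes" ]; then echo matched; fi
    # the portable spelling is a single '='
    if [ "$answer" = "yes" ]; then echo matched; fi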
The second I just discovered today. Bourne shell pattern matching allows
character classes, using the usual '[...]' notation, and it even has
negated character classes. This means that you can write something like
the following to see if an argument has any non-number characters in it:
case "$arg" in *[^0-9]*) echo contains non-number; exit 1;; esac
Actually I lied in that code. Official POSIX Bourne shell doesn't
negate character classes with the usual '^' character that Unix
regular expressions use; instead it uses '!'. But Bash accepts
'^' as well. So I wrote code that used '^', tested it, had it
work, and again didn't realize that I was non-portable.
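For the record, the portable spelling of that check negates the class with '!' instead:

    case "$arg" in *[!0-9]*) echo contains non-number; exit 1;; esac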
(Since having a '^' in your character class is not an error in
a POSIX Bourne shell, the failure mode for this one is not a
straightforward error.)
This is also a good example of how hard it is to test for
non-portability, because even when you use 'set -o posix' Bash
still accepts and matches this character class in its way (with
'^' interpreted as class negation). The only way to test or find
this non-portability is to run the script under a different shell
entirely. In fact, the more theoretically POSIX-compatible shells
you test on, the better.
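For example, one low-tech way of doing this (assuming you have some of these installed; dash and posh are two common small mostly-POSIX shells, and Debian's devscripts package ships a 'checkbashisms' checker) is simply:

    # run the same script (here a hypothetical 'myscript.sh') under
    # several different shells and see whether they all behave the same
    dash ./myscript.sh
    posh ./myscript.sh
    # checkbashisms flags many, though not all, common Bashisms
    checkbashisms ./myscript.sh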
(In theory you could try to have a perfect memory for what is POSIX compliant and not need any testing at all, or cross-check absolutely everything against POSIX and never make a mistake. In practice humans can't do that any more than they can write or check perfect code all the time.)
2014-10-07
Why blocking writes are a good Unix API (on pipes and elsewhere)
One of the principles of good practical programming is that when your program can't make forward progress, it should do nothing rather than, say, continue to burn CPU while it waits for something to do. You want your program to do what work it can and then generally go to sleep, and thus you want APIs that encourage this to happen by default.
Now consider a chain of programs (or processes or services), each one feeding the next. In a multi-process environment like this you usually want something that gets called 'backpressure', where if any one component gets overloaded or can't make further progress it pushes back on the things feeding it so that they stop in turn (and so on back up the chain until everything quietly comes to a stop, not burning CPU and so on).
(You also want an equivalent for downstream services, where they process any input they get (if they can) but then stop doing anything if they stop getting any input at all.)
I don't think it's a coincidence that this describes classic Unix
blocking IO to both pipes and files. Unix's blocking writes do
backpressure pretty much exactly the way you want to happen; if any
stage in a pipeline stalls for some reason, pretty soon all processes
involved in it will block and sleep in write()s to their output
pipe. Things like disk IO speed limits or slow processing or whatever
will naturally do just what you want. And the Unix 'return what's
available' behavior on reads does the same thing for the downstream
of a stalled process; if the process wrote some output you can
process it, but then you'll quietly go to sleep as you block for
input.
And this is why I think that Unix having blocking pipe writes by default is not just a sensible API decision but a good one. This decision makes pipes just work right.
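A quick way to watch this happen (the exact pipe buffer size varies; it's commonly 64 KB on modern Linux) is something like:

    # 'yes' can produce output far faster than this loop consumes it.
    # once the kernel's pipe buffer fills, yes blocks in write() and
    # stops using CPU, waking only as the slow reader drains the pipe.
    yes | while read line; do sleep 1; done

(Watch yes's CPU usage in top while this runs; it will be essentially zero.)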
(Having short reads also makes the implementation of pipes simpler,
because you don't have complex handling in the situation where eg
process B is doing a read() of 128 megabytes while process A is
trying to write() 64 megabytes to it. The kernel can make this
work right, but it needs to go out of its way to do so.)
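You can also see the 'return what's available' read behavior directly with dd, which will report a partial record because its single large read() came back with only the few bytes that were actually in the pipe:

    # dd asks for one 64 KB read but gets just echo's six bytes,
    # which it reports as '0+1 records in'
    echo hello | dd bs=65536 count=1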
2014-10-06
Why it's sensible for large writes to pipes to block
Back in this entry I said that large writes to pipes blocking instead of immediately returning with a short write was a sensible API decision. Today let's talk about that, by way of looking at how deciding the other way would have made for a bad API.
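As a quick illustration of the behavior we're talking about (this uses GNU dd's block size notation, and the pipe buffer size varies from system to system, commonly 64 KB on modern Linux):

    # dd tries to write a single 1 MB block to the pipe, far more than
    # the pipe buffer will hold.  the write() neither fails nor returns
    # short; dd just sleeps in it until 'sleep' exits, the read end of
    # the pipe goes away, and dd is killed off by SIGPIPE.
    dd if=/dev/zero bs=1M count=1 | sleep 5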
Let's start with a question: in a typical Unix pipeline program like
grep, what would be the sensible reaction if an attempt to write a large
amount of data returned a short write indicator? This is clearly not
an error that should cause the program to abort (or even to print a
warning); instead it's a perfectly normal thing if you're producing
output faster than the other side of the pipe can consume it. For most
programs, that means the only thing you can really do is pause until you
can write more to the pipe. The conclusion is pretty straightforward;
in a hypothetical world where such too-large pipe writes returned short
write indicators instead of blocking, almost all programs would either
wrap their writes in code that paused and retried them or arrange to set
a special flag on the file descriptor to say 'block me until everything
is written'. Either or both would probably wind up being part of stdio.
If everything is going to have code to work around or deal with something, this suggests that you are picking the wrong default. Thus large writes to pipes blocking by default is the right API decision because it means everyone can write simpler and less error-prone code at the user level.
(There are a number of reasons this is less error-prone, including both
programs that don't usually expect to write to pipes (but you tell them
to write to /dev/stdout) and programs that usually do small writes
that don't block and so don't handle short writes, resulting in silently
not writing some amount of their output some of the time.)
There's actually a reason why this is not merely a sensible API but a good one, but that's going to require an additional entry rather than wedging it in here.
Sidebar: This story does not represent actual history
The description I've written above more or less requires that there is
some way to wait for a file descriptor to become ready for IO, so that
when your write is short you can find out when you can usefully write
more. However there was no such mechanism in early Unixes; select()
only appeared in UCB BSD (and poll() and friends are even later).
This means that having nonblocking pipe writes in V7 Unix would have
required an entire set of mechanisms that only appeared later, instead
of just a 'little' behavior change.
(However I do suspect that the Bell Labs Unix people actively felt
that pipe writes should block just like file writes blocked until
complete, barring some error. Had they felt otherwise, the Unix API
would likely have been set up somewhat differently and V7 might
have had some equivalent of select().)
If you're wondering how V7 could possibly not have something like
select(), note that V7 didn't have any networking (partly because
networks were extremely new and experimental at the time). Without
networking and the problems it brings, there's much less need (or use)
for a select().