Wandering Thoughts archives

2014-10-29

Unnoticed nonportability in Bourne shell code (and elsewhere)

In response to my entry on how Bashisms in #!/bin/sh scripts aren't necessarily bugs, FiL wrote:

If you gonna use bashism in your script why don't you make it clear in the header specifying #!/bin/bash instead [of] #!/bin/sh? [...]

One of the historical hard problems for Unix portability is people writing non-portable code without realizing it, and Bourne shell code is no exception. This is true even for well-intentioned people writing code that they want to be portable.

One problem, perhaps the root problem, is that very little you do on Unix will come with explicit (non-)portability warnings and you almost never have to go out of your way to use non-portable features. This makes it very hard to know whether or not you're actually writing portable code without trying to run it on multiple environments. The other problem is that it's often both hard to remember and hard to discover what is non-portable versus what is portable. Bourne shell programming is an especially good example of both issues (partly because Bourne shell scripts often use a lot of external commands), but there have been plenty of others in Unix's past (including 'all the world's a VAX' and all sorts of 64-bit portability issues in C code).

So one answer to FiL's question is that a lot of people are using Bashisms in their scripts without realizing it, just as a lot of people have historically written non-portable Unix C code without intending to. They think they're writing portable Bourne shell scripts, but because their /bin/sh is Bash and nothing in Bash warns them otherwise, the issues sail right by. Then one day you wind up changing /bin/sh to be Dash and all sorts of bits of the world explode, sometimes in really obscure ways.

All of this sounds abstract, so let me give you two examples of accidental Bashisms I've committed. The first, and probably quite a common one, is using '==' instead of '=' in '[ ... ]' conditions. Many other languages use == as their string equality check, so at some point I slipped and started using it in 'Bourne' shell scripts. Nothing complained, everything worked, and I thought my shell scripts were fine.
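
To make the slip concrete, here's roughly what it looks like (the variable and the value are made up for illustration). The first form is the Bashism; the second is the portable spelling:

# accepted by Bash's [ builtin but not guaranteed elsewhere:
if [ "$answer" == "yes" ]; then
   echo matched
fi

# the portable POSIX form:
if [ "$answer" = "yes" ]; then
   echo matched
fi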

The second I just discovered today. Bourne shell pattern matching allows character classes, using the usual '[...]' notation, and it even has negated character classes. This means that you can write something like the following to see if an argument has any non-number characters in it:

case "$arg" in
   *[^0-9]*) echo contains non-number; exit 1;;
esac

Actually I lied in that code. Official POSIX Bourne shell doesn't negate character classes with the usual '^' character that Unix regular expressions use; instead it uses '!'. But Bash accepts '^' as well. So I wrote code that used '^', tested it, had it work, and again didn't realize that I was non-portable.

(Since having a '^' in your character class is not an error in a POSIX Bourne shell, the failure mode for this one is not a straightforward error.)
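
For completeness, the portable version of that check negates the class with '!' instead:

case "$arg" in
   *[!0-9]*) echo contains non-number; exit 1;;
esac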

This is also a good example of how hard it is to test for non-portability, because even when you use 'set -o posix' Bash still accepts and matches this character class in its own way (with '^' interpreted as class negation). The only way to test or find this non-portability is to run the script under a different shell entirely. In fact, the more theoretically POSIX-compatible shells you test on, the better.
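
One relatively low-effort way to do this sort of testing is to run the same script under whatever other shells you have lying around and compare the results. This is just a sketch; the list of shells is illustrative (use whatever is actually installed on your system) and 'myscript.sh' is a stand-in for your real script:

# run the script under several shells and eyeball the differences
for sh in dash ksh mksh bash; do
   echo "== $sh"
   "$sh" ./myscript.sh 'x123'
done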

(In theory you could try to have a perfect memory for what is POSIX compliant and not need any testing at all, or cross-check absolutely everything against POSIX and never make a mistake. In practice humans can't do that any more than they can write or check perfect code all the time.)

UnnoticedNonportability written at 00:43:47

2014-10-07

Why blocking writes are a good Unix API (on pipes and elsewhere)

One of the principles of good practical programming is that when your program can't make forward progress, it should do nothing rather than, say, continue to burn CPU while it waits for something to do. You want your program to do what work it can and then generally go to sleep, and thus you want APIs that encourage this to happen by default.

Now consider a chain of programs (or processes or services), each one feeding the next. In a multi-process environment like this you usually want something that gets called 'backpressure', where if any one component gets overloaded or can't make further progress it pushes back on the things feeding it so that they stop in turn (and so on back up the chain until everything quietly comes to a stop, not burning CPU and so on).

(You also want an equivalent for downstream services, where they process any input they get (if they can) but then stop doing anything if they stop getting any input at all.)

I don't think it's a coincidence that this describes classic Unix blocking IO to both pipes and files. Unix's blocking writes do backpressure pretty much exactly the way you want it to happen; if any stage in a pipeline stalls for some reason, pretty soon all processes involved in it will block and sleep in write()s to their output pipe. Things like disk IO speed limits or slow processing or whatever will naturally do just what you want. And the Unix 'return what's available' behavior on reads does the same thing for the downstream of a stalled process; if the process wrote some output you can process it, but then you'll quietly go to sleep as you block for input.
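
You can watch this backpressure happen with a shell one-liner. In the pipeline below, 'yes' wants to write output as fast as it can but 'sleep' never reads from the pipe, so once the (finite) pipe buffer fills up, 'yes' blocks in write() and stops using any CPU; you can confirm this with top or ps in another window. When sleep exits and the read side of the pipe goes away, the blocked write fails and 'yes' dies from SIGPIPE, so the whole thing cleans itself up:

# 'yes' stalls once the pipe buffer is full, because nothing reads from it
yes | sleep 60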

And this is why I think that Unix having blocking pipe writes by default is not just a sensible API decision but a good one. This decision makes pipes just work right.

(Having short reads also makes the implementation of pipes simpler, because you don't have complex handling in the situation where eg process B is doing a read() of 128 megabytes while process A is trying to write() 64 megabytes to it. The kernel can make this work right, but it needs to go out of its way to do so.)
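
The 'return what's available' read behaviour is just as easy to see from the shell. In this sketch (the sizes and the delay are arbitrary), dd asks for up to 64 KB in a single read() and stops after one input block, so all it copies is the handful of bytes that happened to be in the pipe when that read returned, namely 'hello ':

# dd's single large read() returns only the bytes already in the pipe
{ printf 'hello '; sleep 2; printf 'world'; } | dd bs=64k count=1 2>/dev/null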

BlockingWritesAndBackpressure written at 00:22:09

2014-10-06

Why it's sensible for large writes to pipes to block

Back in this entry I said that having large writes to pipes block instead of immediately returning with a short write was a sensible API decision. Today let's talk about why, by way of looking at how deciding the other way would make for a bad API.
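
As a concrete illustration of the current behaviour before we get to the hypothetical, here's a sketch you can run from a shell. It assumes a pipe buffer smaller than a megabyte (64 KB is a common default), and the five second delay is arbitrary. The writer hands write() a single one-megabyte chunk while the reader refuses to read anything for five seconds; 'writer done' only appears once the reader starts draining the pipe, because the big write() blocks until all of it has been accepted:

# the 1 MB write() blocks until the slow reader drains the pipe
{ dd if=/dev/zero bs=1048576 count=1 2>/dev/null; echo 'writer done' >&2; } | { sleep 5; cat >/dev/null; }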

Let's start with a question: in a typical Unix pipeline program like grep, what would be the sensible reaction if an attempt to write a large amount of data returned a short write indicator? This is clearly not an error that should cause the program to abort (or even print a warning); instead it's a perfectly normal thing if you're producing output faster than the other side of the pipe can consume it. For most programs, that means the only thing you can really do is pause until you can write more to the pipe. The conclusion is pretty straightforward: in a hypothetical world where such too-large pipe writes returned short write indicators instead of blocking, almost all programs would either wrap their writes in code that paused and retried them or arrange to set a special flag on the file descriptor to say 'block me until everything is written'. Either or both would probably wind up being part of stdio.

If everything is going to have code to work around or deal with something, this suggests that you are picking the wrong default. Thus having large writes to pipes block by default is the right API decision, because it means everyone can write simpler and less error-prone code at the user level.

(There are a number of reasons this is less error-prone, including both programs that don't usually expect to write to pipes (but you tell them to write to /dev/stdout) and programs that usually do small writes that wouldn't block and so never bother handling short writes, resulting in them silently losing some of their output some of the time.)

There's actually a reason why this is not merely a sensible API but a good one, but that's going to require an additional entry rather than wedging it in here.

Sidebar: This story does not represent actual history

The description I've written above more or less requires that there is some way to wait for a file descriptor to become ready for IO, so that when your write is short you can find out when you can usefully write more. However there was no such mechanism in early Unixes; select() only appeared in UCB BSD (and poll() and friends are even later). This means that having nonblocking pipe writes in V7 Unix would have required an entire set of mechanisms that only appeared later, instead of just a 'little' behavior change.

(However I do suspect that the Bell Labs Unix people actively felt that pipe writes should block just like file writes blocked until complete, barring some error. Had they felt otherwise, the Unix API would likely have been set up somewhat differently and V7 might have had some equivalent of select().)

If you're wondering how V7 could possibly not have something like select(), note that V7 didn't have any networking (partly because networks were extremely new and experimental at the time). Without networking and the problems it brings, there's much less need (or use) for a select().

BlockingLargePipeWrites written at 01:03:58

