2011-03-13
Why programs traditionally used sparse files on Unix
Unix has had sparse files for a very long time, and periodically they cause confusion and heartburn. At a low level, the possibility for sparse files is more or less inherent in the ability to have files that are not physically contiguous on disk; once you have non-contiguous files you need a way to map from file offsets to disk blocks, and once you have this you can have sections that don't map to any blocks at all. All of that's fine, but a more interesting question is why programs actually create and use sparse files.
One of the ways that programs create sparse files is by mistake, where
you accidentally lseek() past the end of the file and write something
out there. A slight variant on this is to have a file truncated out from
underneath you; in many cases, when you next write a block, it will go
at the old end of file position and immediately create a sparse file up
to that point. One classic way to do this is, roughly:
    someprogram >somefile &
    tail -f somefile
    ^C
    someprogram >somefile &
Oops.
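The same effect can be reproduced without any long-running program. What follows is a minimal sketch, assuming a POSIX-style sh with mktemp available; the file descriptor number and the 1 MiB size are arbitrary choices for illustration. Descriptor 3 stands in for the first someprogram, and the bare redirection stands in for the shell truncating the file when the second one starts.

```sh
#!/bin/sh
# Sketch: a writer whose file is truncated out from underneath it.
f=$(mktemp) || exit 1

exec 3>"$f"        # the "first program": file opened without O_APPEND
# Write 1 MiB through fd 3; its file offset is now 1048576.
dd if=/dev/zero bs=1024 count=1024 2>/dev/null >&3

: >"$f"            # the "second program" starting: truncates the file to zero length

printf 'x' >&3     # the next write through fd 3 still lands at the old offset
exec 3>&-

size=$(wc -c <"$f" | tr -d ' ')
echo "apparent size: $size bytes"
du -k "$f"         # allocated space; how small this is depends on the filesystem
rm -f "$f"
```

The file ends up one byte longer than the old end of file even though only one byte has been written since the truncation; everything before that byte is a hole. Opening the file in append mode (>>somefile) avoids this, because each write then goes to the current end of file.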
The usual reason to use sparse files deliberately is to use them as sparse
'address' space, typically by low-level database libraries. Imagine that
you have a database system that can condense a record key to a 16-bit
number (obviously with collisions). Rather than have lookup tables and
so on to determine the seek offset for the record for any particular
key, you can just map keys directly to, say, a 16 Kbyte section of your
storage file. Even though this results in seek offsets that are all over
the map (including very far into the file), sparse files mean that you
only allocate disk blocks for the keys and records that actually exist
(ie that you have written). People may get confused and alarmed at
ls's output, but that's a minor issue.
(There are various ways to make linear scans of the keyspace still work.)
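As a rough illustration of the key-to-offset mapping, here is a sketch that uses dd to seek directly to a key's slot. The put and get helpers, the sample keys, and the assumption that records contain no NUL bytes are all made up for this example; a real database library would do this with lseek() and its own record format.

```sh
#!/bin/sh
# Sketch: one fixed 16 Kbyte slot per key, addressed purely by seeking.
SLOT=16384
store=$(mktemp) || exit 1
trap 'rm -f "$store"' EXIT

put() {   # put KEY DATA: write DATA at the start of KEY's slot
    # conv=notrunc keeps dd from truncating records stored at higher offsets
    printf '%s' "$2" | dd of="$store" bs="$SLOT" seek="$1" conv=notrunc 2>/dev/null
}

get() {   # get KEY: read KEY's slot and drop the NUL bytes that fill the rest
    dd if="$store" bs="$SLOT" skip="$1" count=1 2>/dev/null | tr -d '\000'
}

put 1000 'hello'      # lands at byte offset 1000 * 16384
put 7 'low key'       # lands at byte offset 7 * 16384

get 1000; echo
get 7; echo

size=$(wc -c <"$store" | tr -d ' ')
echo "apparent size: $size bytes"
du -k "$store"        # allocated space; depends on the filesystem's hole support
```

No lookup table is involved: the key alone determines where the record lives. A slot that was never written reads back as nothing but NUL bytes, so get returns an empty string for it.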
These days, using sparse files this way is a little bit out of fashion for various reasons, including that people have noticed that good support for sparse files is a little bit spotty; every so often you trip over another program that explodes under some circumstances. (Backup systems are an especially popular thing to have blow up.)
2011-03-02
The POSIX shell and the three sorts of Unixes
Avery Pennarun recently wrote Insufficiently known POSIX shell features, where he talked about a number of nice shell things that are not Bash but are instead POSIX shell features. Although he footnoted this in his entry, I want to draw your attention to how there are three sorts of Unix machines (or, well, Unixes):
- machines on which /bin/sh is Bash.
- machines on which /bin/sh is POSIX-compatible but is not Bash.
- machines which have a POSIX-compatible shell, but it is not /bin/sh.
Every so often some well-intentioned person attempts to transition the first sort of Unix into the second sort. Busy sysadmins usually immediately reverse the transition because we have better things to do with our time than debug Bashisms that have crept into administrative scripts or, worse, explain to users why their shell scripts just broke and how, no, we are not going to do anything about it, although we could, because it is good for them, honest.
(Rewriting shell scripts to pointlessly avoid Bashisms is the very opposite of productive work. It's even less productive than browsing Slashdot, because there is some vague chance that you could learn something from Slashdot.)
The third sort of Unix is a pain in the rear for everyone. In many
respects it might as well not have a POSIX shell, because you can't
easily use it in portable scripts. If you are a big project like
redo you can work around the
difference and find yourself the right shell, but if you are an ordinary
person writing cross-machine shell scripts, ones that you want to run
without an installer step, well, you lose. Your scripts all start with
'#!/bin/sh' because that's the only reliable cross-Unix name for a
Bourne shell, so you can't count on POSIX features.
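For a sense of what "finding yourself the right shell" can involve, here is one possible sketch. It is not how redo does it; the candidate list and the probe are assumptions made for illustration (/usr/xpg4/bin/sh is where Solaris keeps its POSIX shell), and it has not been tested against a genuinely old Bourne shell. The outer script deliberately sticks to old Bourne syntax and keeps the POSIX-only constructs inside a quoted string.

```sh
#!/bin/sh
# Sketch: locate a POSIX-capable shell from a script that may itself be
# running under an old Bourne /bin/sh. The outer code avoids $(...) and
# ${var#pattern}; those appear only inside the quoted probe string.

probe='x=$(echo ok) && y=a/b && [ "$x" = ok ] && [ "${y#*/}" = b ]'

is_posix_sh() {   # is_posix_sh PATH: succeed if PATH can run the probe
    [ -x "$1" ] && "$1" -c "$probe" >/dev/null 2>&1
}

find_posix_sh() {   # print the first candidate that passes the probe
    # Hypothetical candidate list; adjust it for the systems you care about.
    for candidate in /bin/sh /usr/xpg4/bin/sh /bin/ksh /usr/bin/ksh \
                     /bin/bash /usr/bin/bash
    do
        if is_posix_sh "$candidate"; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

if posix_sh=`find_posix_sh`; then
    echo "POSIX shell found: $posix_sh"
else
    echo "no POSIX shell found" >&2
fi
```

A script would then typically re-execute itself under the shell it found, with some guard to avoid looping. That is exactly the sort of extra step an ordinary cross-machine script would rather not carry around.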
Fortunately the third sort of Unix is mostly dying out. The largest holdout in our environment is Solaris, which we don't let users log on to and barely run anything on. Even then, the differences sometimes get to us.
Honestly, I suggest that you ignore the third sort of Unix unless you can't because you have one. And if you want to write portable POSIX shell scripts, make sure that you use the second sort of Unix right from the start. (Or that someone involved in the project does.)