2011-03-13
Why programs traditionally used sparse files on Unix
Unix has had sparse files for a very long time, and periodically they cause confusion and heartburn. At a low level, the possibility for sparse files is more or less inherent in the ability to have files that are not physically contiguous on disk; once you have non-contiguous files you need a way to map from file offsets to disk blocks, and once you have this you can have sections that don't map to any blocks at all. All of that's fine, but a more interesting question is why programs actually create and use sparse files.
One of the ways that programs create sparse files is by mistake, where you accidentally lseek() past the end of the file and write something out there. A slight variant on this is to have a file truncated out from underneath you; in many cases, when you next write a block, it will go at the old end-of-file position and immediately create a sparse file up to that point. One classic way to do this is, roughly:
    someprogram >somefile &
    tail -f somefile
    ^C
    someprogram >somefile &
Oops.
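To make the mechanism concrete, here is a minimal Python sketch of what the shell sequence above does at the system call level. The function names, the file name, and the 1,000,000-byte size are made up for illustration. Whether the gap actually ends up unallocated depends on the filesystem, so the sketch only reports the numbers rather than promising a result.

```python
import os
import tempfile


def write_after_truncate(path):
    """Mimic an old writer that keeps its file offset after a truncation."""
    # First 'someprogram >somefile': opened without O_APPEND, writes output.
    old_fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    os.write(old_fd, b"x" * 1_000_000)

    # Second 'someprogram >somefile': the shell redirection truncates the
    # file to zero length while the first writer still has it open.
    new_fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    os.close(new_fd)

    # The first writer's offset is still 1,000,000, so this write lands far
    # past the new end of file and leaves a hole in front of it.
    os.write(old_fd, b"more output\n")
    os.close(old_fd)

    st = os.stat(path)
    # st_blocks counts 512-byte units; on a filesystem with sparse file
    # support this is usually much smaller than st_size here.
    return st.st_size, st.st_blocks * 512


def demo():
    """Run the sketch in a scratch directory (hypothetical helper)."""
    with tempfile.TemporaryDirectory() as d:
        return write_after_truncate(os.path.join(d, "somefile"))
```

Calling `demo()` returns the apparent size and the allocated bytes. The apparent size includes the old offset even though almost none of that range was written after the truncation.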
The usual reason to use sparse files deliberately is to use them as sparse 'address' space, typically by low-level database libraries. Imagine that you have a database system that can condense a record key to a 16-bit number (obviously with collisions). Rather than have lookup tables and so on to determine the seek offset for the record for any particular key, you can just map keys directly to, say, a 16 Kbyte section of your storage file. Even though this results in seek offsets that are all over the map (including very far into the file), sparse files mean that you only allocate disk blocks for the keys and records that actually exist (i.e. that you have written). People may get confused and alarmed at ls's output, but that's a minor issue.
(There are various ways to make linear scans of the keyspace still work.)
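As a rough sketch of the key-to-offset scheme described above, the following Python maps a key to a 16-bit number and uses it to pick a 16 Kbyte slot in the storage file. The use of a truncated CRC32 as the 16-bit condensing function, and the names `slot_offset`, `put`, and `get`, are assumptions for illustration only; collision handling is left out entirely.

```python
import os
import tempfile
import zlib

SLOT_SIZE = 16 * 1024  # one 16 Kbyte section per 16-bit key value


def slot_offset(key: bytes) -> int:
    """Condense the key to 16 bits and turn that into a seek offset."""
    # Truncated CRC32 is only a stand-in for whatever the database uses.
    return (zlib.crc32(key) & 0xFFFF) * SLOT_SIZE


def put(fd: int, key: bytes, record: bytes) -> None:
    """Write the record at the start of the key's slot (no collision check)."""
    if len(record) > SLOT_SIZE:
        raise ValueError("record does not fit in one slot")
    os.pwrite(fd, record, slot_offset(key))


def get(fd: int, key: bytes, length: int) -> bytes:
    """Read up to `length` bytes back from the key's slot."""
    return os.pread(fd, length, slot_offset(key))


def demo():
    """Store one record and report apparent vs. allocated size."""
    with tempfile.TemporaryDirectory() as d:
        fd = os.open(os.path.join(d, "storage"), os.O_RDWR | os.O_CREAT, 0o600)
        try:
            put(fd, b"some key", b"some record")
            st = os.fstat(fd)
            return get(fd, b"some key", 11), st.st_size, st.st_blocks * 512
        finally:
            os.close(fd)
```

With 65536 possible slots of 16 Kbytes each, the apparent file size can reach 1 Gbyte even when only a handful of records exist, which is the sort of ls output that tends to alarm people.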
These days, using sparse files this way is a little bit out of fashion for various reasons, including that people have noticed that good support for sparse files is a little bit spotty; every so often you trip over another program that explodes under some circumstances. (Backup systems are an especially popular thing to have blow up.)