Why programs traditionally used sparse files on Unix

March 13, 2011

Unix has had sparse files for a very long time, and periodically they cause confusion and heartburn. At a low level, the possibility of sparse files is more or less inherent in the ability to have files that are not physically contiguous on disk; once you have non-contiguous files you need a way to map from file offsets to disk blocks, and once you have this you can have sections that don't map to any blocks at all. All of that's fine, but a more interesting question is why programs actually create and use sparse files.
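To make the 'sections that don't map to any blocks' idea concrete, here is a minimal Python sketch that makes a file with a hole in it and compares its apparent size to what is actually allocated. The function name make_sparse_file and the 1 MiB hole size are mine, picked purely for illustration, and the 512-byte unit for st_blocks is the usual Linux convention rather than something every Unix promises. On a filesystem without sparse file support the two numbers will come out about the same.

```python
import os
import tempfile

def make_sparse_file(path, hole_size=1024 * 1024):
    """Create a file with a hole of hole_size bytes followed by 4 bytes of data."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.lseek(fd, hole_size, os.SEEK_SET)  # seek well past the (empty) end of file
        os.write(fd, b"data")                 # only this write needs real disk blocks
    finally:
        os.close(fd)
    return os.stat(path)

with tempfile.TemporaryDirectory() as tmpdir:
    st = make_sparse_file(os.path.join(tmpdir, "sparse.bin"))
    apparent_size = st.st_size
    allocated = st.st_blocks * 512  # assumes st_blocks counts 512-byte units
    print("apparent size:", apparent_size)
    print("allocated bytes:", allocated)
```

The apparent size is what ls -l reports; the allocated figure is roughly what du would show you.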

One of the ways that programs create sparse files is by mistake, where you accidentally lseek() past the end of the file and write something out there. A slight variant on this is to have a file truncated out from underneath you; in many cases, when you next write a block, it will go at the old end-of-file position and immediately create a sparse file up to that point. One classic way to do this is, roughly:

someprogram >somefile &     # first run; it keeps its own write offset
tail -f somefile
someprogram >somefile &     # second run truncates somefile under the first
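The same effect can be reproduced inside a single process. This is only a sketch: the helper name truncate_under_writer and the sizes are invented for the example, and it simplifies things by having the second opener truncate the file and then write nothing at all.

```python
import os
import tempfile

def truncate_under_writer(path):
    """Return (size, leading bytes) after a file is truncated under an open writer."""
    first = open(path, "wb", buffering=0)   # like the first 'someprogram >somefile'
    first.write(b"A" * 4096)                # this writer's offset is now 4096

    second = open(path, "wb", buffering=0)  # a second '>somefile' truncates to length 0
    second.close()

    first.write(b"B" * 10)                  # lands at offset 4096, not at offset 0
    first.close()

    with open(path, "rb") as f:
        head = f.read(4096)
    return os.path.getsize(path), head

with tempfile.TemporaryDirectory() as tmpdir:
    size, head = truncate_under_writer(os.path.join(tmpdir, "somefile"))
    print(size)                   # old offset plus the length of the new write
    print(head == b"\0" * 4096)   # the truncated-away region reads back as zeros
```

Note that a writer that opened the file in append mode would not do this, since each of its writes goes to the current end of the file.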


The usual reason to use sparse files deliberately is to use them as sparse 'address' space, typically by low-level database libraries. Imagine that you have a database system that can condense a record key to a 16-bit number (obviously with collisions). Rather than have lookup tables and so on to determine the seek offset for the record for any particular key, you can just map keys directly to, say, a 16 Kbyte section of your storage file. Even though this results in seek offsets that are all over the map (including very far into the file), sparse files mean that you only allocate disk blocks for the keys and records that actually exist (ie that you have written). People may get confused and alarmed at ls's output, but that's a minor issue.

(There are various ways to make linear scans of the keyspace still work.)
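As a rough illustration of the key-to-offset mapping, here is a small Python sketch. Everything in it is an assumption of mine rather than how any particular database library does it: the use of CRC32 to condense keys to 16 bits, the 4-byte length prefix, and the names slot_offset, put and get. It ignores collisions entirely (a colliding key simply overwrites the section) and can't tell an empty value apart from a section that was never written.

```python
import os
import tempfile
import zlib

SLOT_SIZE = 16 * 1024   # each key owns a 16 Kbyte section of the file
NUM_SLOTS = 1 << 16     # keys are condensed to a 16-bit number

def slot_offset(key):
    """Map a key (bytes) to the file offset of its section; CRC32 is an arbitrary choice."""
    return (zlib.crc32(key) & 0xFFFF) * SLOT_SIZE

def put(fd, key, value):
    """Store a length-prefixed value in the key's section (collisions just overwrite)."""
    if len(value) > SLOT_SIZE - 4:
        raise ValueError("value too large for one section")
    record = len(value).to_bytes(4, "big") + value
    os.pwrite(fd, record, slot_offset(key))

def get(fd, key):
    """Return the stored value, or None if nothing was written in the key's section."""
    offset = slot_offset(key)
    header = os.pread(fd, 4, offset)
    if len(header) < 4:
        return None  # the section lies beyond the current end of file
    length = int.from_bytes(header, "big")
    if length == 0:
        return None  # a hole or never-written section reads back as zeros
    return os.pread(fd, length, offset + 4)

with tempfile.TemporaryDirectory() as tmpdir:
    fd = os.open(os.path.join(tmpdir, "store.db"), os.O_RDWR | os.O_CREAT, 0o644)
    try:
        put(fd, b"alice", b"first record")
        print(get(fd, b"alice"))
        st = os.fstat(fd)
        print("apparent size:", st.st_size, "allocated:", st.st_blocks * 512)
    finally:
        os.close(fd)
```

With a 16-bit key space and 16 Kbyte sections the file can appear to be up to 1 Gbyte in size, while on a filesystem with sparse file support only the sections that have been written take up disk blocks.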

These days, using sparse files this way is a little bit out of fashion for various reasons, including that people have noticed that good support for sparse files is a little bit spotty; every so often you trip over another program that explodes under some circumstances. (Backup systems are an especially popular thing to have blow up.)

