The Unix directory problem and the history of directories
October 11, 2010
Back in the beginning, Unix was a surprisingly simple, straightforward, and limited operating system. Many things were implemented in pretty much the most straightforward way possible, including the filesystem and especially including directories.
In V7, directories were very simple. Directory entries were 16-byte structures, with 2 bytes for the inode number and 14 bytes for the file name. A directory was just a linear array of these structures, of whatever size was needed (stored in a series of disk blocks, like a regular file); unused entries had an inode number of zero. BSD complicated things, but only a bit; when they added long file names in their 'Fast File System' modifications, they changed things only enough to introduce variable-length filenames instead of fixed-length ones (the idea of using fixed-size 256-byte structures when most filenames are much shorter was clearly absurd). Various reimplementations of these same basic concepts, especially Linux's ext2 filesystem, have kept basically the same approach.
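As a concrete illustration, here is a sketch in C of that V7 on-disk entry layout. The field names follow the historical `struct direct` from V7's `sys/dir.h`, but the exact declaration below (the modern fixed-width types in particular) is a reconstruction for illustration, not the literal header:

```c
#include <assert.h>
#include <stdint.h>

/* A sketch of a V7-style on-disk directory entry: 2 bytes of inode
   number plus 14 bytes of name, 16 bytes in all.  An inode number of
   zero marks an unused slot. */
#define DIRSIZ 14

struct v7_direct {
    uint16_t d_ino;           /* inode number; 0 means 'empty entry' */
    char     d_name[DIRSIZ];  /* name, NUL-padded, but NOT NUL-terminated
                                 when it is a full 14 characters long */
};
```

A directory file was then nothing more than an array of these, one after another, with the kernel walking the array entry by entry.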
The problem with directories in all of the variants of the basic V7 filesystem design is that at their heart they are linear arrays, and thus they suffer from all of the general drawbacks of unsorted linear arrays. To find a file that's present in a directory, the kernel must scan an average of half of the directory; to verify that a file is not present (including before creating it), the kernel must scan all of the directory. Because we're dealing with disks, such scans may need to do disk IO, with the attendant delays.
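To make that cost concrete, here is a minimal sketch in C of a linear lookup over V7-style entries. The entry layout and the `dir_lookup` helper (including its `scanned` counter) are my own illustration, not actual kernel code; the point is simply that a hit stops partway through while a miss always examines every entry:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DIRSIZ 14

/* Hypothetical V7-style directory entry, as described above. */
struct dirent16 {
    uint16_t d_ino;           /* 0 marks an unused slot */
    char     d_name[DIRSIZ];  /* NUL-padded name */
};

/* Linear lookup: returns the inode number, or 0 if the name is absent.
   *scanned counts the entries examined, to make the O(n) cost visible. */
static uint16_t dir_lookup(const struct dirent16 *dir, size_t n,
                           const char *name, size_t *scanned)
{
    *scanned = 0;
    for (size_t i = 0; i < n; i++) {
        (*scanned)++;
        if (dir[i].d_ino != 0 &&
            strncmp(dir[i].d_name, name, DIRSIZ) == 0)
            return dir[i].d_ino;   /* hit: stop here */
    }
    return 0;   /* miss: we had to scan every entry to be sure */
}
```

On average a successful lookup examines half the entries, and an unsuccessful one (which is exactly what creating a new file requires, to check for a name collision) examines all of them, in every directory block.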
(One effect of this is that large directories hurt much more on a busy system than on a quiet one. On a quiet one with only a few large directories, their data blocks may well stay cached in RAM; on a busy one, that's much less likely due to the higher cache pressure. You can easily reach an overload point that kills performance.)
For the relatively small systems V7 was designed for, this was not a serious issue (or if it was, it was a tolerable one); disks were small, systems were modest, and users were sensible. But as Unix grew, the drawbacks of linear searches that involve disk IO became more and more readily apparent, and Unix vendors started trying to do something about it.
Unfortunately, doing better is complicated (for reasons that deserve another entry all by themselves), and so progress on this for existing filesystems and filesystem designs has generally been somewhat slow. Things have partly been slow because the problem has usually not been urgent; most systems are still relatively small, and Unix users have learned not to use really big directories because the performance is so bad.