Why non-linear directories have not been popular in filesystems

October 12, 2010

You might wonder why non-linear directories are so complicated that they haven't been used for most of Unix's history.

Computer science has a lot of data structures that are much better than unordered linear arrays for both lookups and updates. However, pretty much all of them are designed with the assumption that they're living in RAM, and even today RAM has very different characteristics from disks. Specifically, in RAM both fine-grained lookups and fine-grained shuffling of entries are relatively inexpensive. Neither is true for things on disk, so on-disk data structures need to be designed to minimize the amount of IO, especially the amount of random IO. You can do this, but for much of Unix's life it meant that the job was not as simple as implementing a well-known data structure.
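
To make this concrete, here is a minimal sketch in Python of the classic unordered linear directory lookup; the read_block() interface and the (name, inode) entry layout are illustrative assumptions, not any real filesystem's on-disk format.

    # A sketch only: the cost that matters is how many blocks get read,
    # not the per-entry comparisons done once a block is in memory.
    def lookup_linear(read_block, nblocks, name):
        """Look up 'name' in an unordered linear directory.

        read_block(i) is assumed to return the list of (entry_name, inode)
        pairs stored in directory block i; in a real filesystem each call
        is one disk IO (or a hit in the block cache).
        """
        blocks_read = 0
        for blk in range(nblocks):
            entries = read_block(blk)          # the expensive part: disk IO
            blocks_read += 1
            for entry_name, inode in entries:  # the cheap part: a RAM scan
                if entry_name == name:
                    return inode, blocks_read
        return None, blocks_read

A better on-disk structure changes how many blocks that loop has to touch for a large directory; it doesn't change the fact that blocks read, not comparisons, is the figure of merit.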

(One wrinkle here is that these data structures should keep working well even in a deliberately hostile environment, one where an attacker is trying to create a worst-case set of directory contents. This is a situation generally not faced by in-RAM algorithms.)

Next come the complications. Pretty much any efficient disk-based fast-lookup data structure is going to have rebalancing requirements, where adding or removing an entry may require shuffling other entries around. The first complication is that filesystems have to worry about untimely interruptions when they are partway through such rebalancing operations. In fact, they have to stay broadly consistent if such interruptions happen; after all, users will be very unhappy if 'touch file1' can result in the total loss of a large directory because the filesystem was halfway through a major reorganization when a power failure hit.

Second, a filesystem probably wants to allow concurrent updates to directories while a program such as ls is iterating through the directory. Since any update can cause a rebalancing, you then have to at least make sure that the iterating process does not wind up reading from, say, a now-released directory data block. Ideally you want the program to get sensible results despite the actual data structure changing majorly underneath it.

(This is somewhat challenging even for copy on write filesystems because you don't want ls to see a filename in a directory after the file has been deleted or renamed.)

The next question is whether non-linear directories are worth it in general on typical systems. A contiguous linear directory of even semi-modest size can often be completely read in a single disk IO, and after it's in memory a linear scan usually takes trivial time compared to disk IO; the degenerate case is a single-block directory. This means that there is a minimum directory size (in number of entries and on-disk size) below which non-linear directories get you no benefit, because searching them takes just as much disk IO as searching a linear directory. How many directories on a typical system are over this size, and how much are they over? In addition, Unix systems have usually had very high cache hit ratios for 'hot' directories, which reduces the IO advantage that non-linear directories have.
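
(As a rough illustration of where that minimum size can land, here is a back-of-envelope calculation in Python; both numbers are assumptions for illustration, not any particular filesystem's format.)

    BLOCK_SIZE = 4096        # one directory block, readable in a single IO
    AVG_ENTRY_SIZE = 32      # entry header plus a fairly short file name

    entries_per_block = BLOCK_SIZE // AVG_ENTRY_SIZE
    print(entries_per_block)  # 128: a directory this size is one IO to
                              # search no matter how it's organized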

(There are certainly situations where non-linear directories win big, but these situations tend to be uncommon precisely because those situations have been big performance losers for much of Unix's life.)

Linear directories are not as efficient as better data structures, but they are much, much simpler to implement in a filesystem. With linear directories, many of these issues have straightforward answers, and often simple code will do the right thing. Non-linear directories are generally going to be much more complicated, with much more code and many more issues, and they don't get you very much except in uncommon situations. It should not be a surprise that most Unix filesystem implementors have had higher priorities and most filesystems have used some form of linear directories.

(Honesty in blogging time: I'm not a filesystem designer and I haven't gone looking to find out what the state of the art is here. It's possible that modern designs have found simple and effective solutions to these issues, so the problem isn't as hard as I think it is.)


Comments on this page:

From 70.242.227.231 at 2010-10-12 03:40:55:

Modern filesystems are fond of B+ trees, as are RDBMS's. This is the limit of my knowledge.

Steve

From 65.172.155.230 at 2010-10-12 10:09:18:

You make this sound non-historical ... are there any systems where you have to care about this anymore?

thinks ... I'm not sure about RHEL-3, I'd probably bet it was fine but not a lot :). I've no idea about non-zfs on *BSD. Are you using ext2 or turning directory hashing off on ext3, for some reason?

From 76.113.53.175 at 2010-10-12 16:43:30:

Apropos directory tree implementation, check this out:

http://www.cofault.com/2006/02/ext3-magic-more-magic.html

By cks at 2010-10-13 18:45:54:

That link on ext3 is an interesting read; thank you. I wish there were more writeups of the complex bits of balanced directory structures (or at least that I was better at finding them).

It looks like most of our Linux machines have directory indexes turned on, but not all; we have at least one significant Ubuntu 6.06 with it either not available or turned off. Both sides of this sort of surprise me.

I think I'm going to have to revise some of my assumptions about what's efficient and not efficient in modern Unix environments.

(The state of directories on FreeBSD is an interesting question, and I actually do care about it for one or two systems.)

By nothings at 2010-10-15 00:47:34:

One interesting experience I had in Windows was migrating from FAT32 (no balanced trees) to NTFS (balanced trees).

I guess the theory is that large directories on NTFS are supposed to be faster. This is probably true; I've never tested this. And in theory they tuned it so it wouldn't be worse than FAT32 on common cases.

However, the operation "traverse an entire subdirectory hierarchy (enumerating the files)" has gotten much slower. This doesn't require super-large directories; even if you limit it to a complex tree where the leaves have, say, 100 files each, traversing a large tree that's not in cache takes something like 10x as long in NTFS as in FAT32.

Moreover, I've found that if I cache a copy of the directory/file hierarchy and update it on the fly by performing 'stat' on every directory and checking the timestamp (in NTFS a directory's last-update time is changed if a file within it is added/removed), then I can get back the old performance. (In other words, to be explicit: NTFS can stat every directory in such a tree much faster than it can traverse the tree and enumerate every file in the tree.) The downside of this cache is that it is only accurate for file presence/absence; if a file is modified, the directory update time isn't changed, so my cache cannot accurately reflect file metadata like update time and file size.
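
(Sketched in Python with POSIX-style calls rather than the Win32 API, the idea looks roughly like this; it leans on the same property that a directory's modification time changes when entries are added or removed, and it has the same limitation about per-file metadata.)

    import os

    def refresh_cache(cache):
        """cache maps a directory path to (mtime, list of entry names).

        Stat every known directory and re-enumerate only the ones whose
        modification time has changed; unchanged directories keep their
        cached listing.  Newly created subdirectories of a changed
        directory would also need to be added to the cache (omitted here).
        """
        for dirpath, (old_mtime, _entries) in list(cache.items()):
            try:
                mtime = os.stat(dirpath).st_mtime
            except FileNotFoundError:
                del cache[dirpath]             # the directory itself is gone
                continue
            if mtime != old_mtime:
                cache[dirpath] = (mtime, os.listdir(dirpath))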

By cks at 2010-10-15 11:34:44:

Does directory traversal on Windows (or at least the interface you're using) have some ordering requirement that might make it slower on directories with a more complicated internal structure?

(I suppose that directory traversal could just be slower in general because NTFS's balanced trees require it to read more disk blocks to get all entries.)

By nothings at 2010-10-18 07:22:48:

The API definitely doesn't require the filesystem to return ordered data, although obviously the balanced tree means that the natural traversal probably is ordered anyway. (I switched to the underlying win32 interface instead of using the C wrapper around it because of this sort of possibility, but it made no difference.)

I'm guessing it's not just more disk blocks, but more seeks. One thing notorious about NTFS: on FAT32, if you moved all the files off the drive and moved them back on, you would end up with an entirely defragmented and basically optimally-arranged disk. In NTFS, doing this will result in directories being fragmented to heck.

I suspect the problem is that they just got very second-system-y in thinking about metadata as data and being able to be super-clever and put it anywhere, and in return metadata is generally slower to get to. (They also don't cache it as well as they should, given the cost of reloading it.) E.g. it's possible that the directory traversal API actually returns metadata that I don't need and that requires it to access further additional blocks beyond the main traversal. But there's no other exposed API that asks for less, so I have no way to check.

