Filesystems versus general tree structures

July 3, 2022

A while back I read Felix Kohlgrüber's The Tree Structure of File Systems (via). As I read it, it argues that filesystem directories should be a richer and more general data structure, one that could contain both data and children (which would make web paths map better on to the 'filesystem API'). As it happens, I think there is a relatively good reason that directories are organization this way, one beyond simple history.

To put it simply, filesystem directories are optimized for out of memory traversal in order to look up names. Generally their actual representation is some form of data structure filled with <name, reference> pairs, where the reference tells you where (on disk) to find almost all substantive information about the name. In the old days and in basic filesystems, the containing data structure was a list; in modern filesystems, it's usually some form of balanced tree (sometimes once the directory is big enough). Directories are optimized to be searched through with as few disk reads as possible, because disk reads are assumed to be expensive.

(Modern filesystems often put some sufficiently commonly used metadata about each name into the directory as well, because that way you can tell programs about it without them having to a separate disk read, possibly a separate disk read for each name in the directory.)

If you augment this basic data structure with data content associated with the directory (well, the directory name) itself, you don't want to put this new data in line with the lookup data structure, where it will force you to read more data from disk in order to do the same traversals. Instead it's mostly likely going to be referred to separately. There will be one set of disk storage associated with the inventory of the directory's children and a second set of disk storage for the data content. Entities without children will have a 'no such thing' marker for their first sort of disk storage and ones without data will have a 'no such thing' marker for their second. If a lot of filesystem entities only have one or the other (ie, if they're conventional files or directories), then a lot of the time having two slots is a waste of (limited) space in an area where filesystems have traditionally cared about (in the metadata kept for most or all filesystem entities).

More broadly, I think it's a mistake to look at filesystems through the eyes of general tree structures. Filesystems originated in a very constrained situation and continue to be focused on fairly constrained one, one where any indirect reference to something is very slow and the less that you need to access the better.

(It's true that modern memory references have gotten slower and slower relative to raw CPU speed, but they're still faster than even NVMe speeds. Plus, a lot of in memory data structures are still not being designed by programmers to minimize references and pack things as densely as possible, for various reasons.)


Comments on this page:

By Johannes at 2022-07-04 07:57:32:

It's also a non-obvious decision in filesystems (ext4) to put only the name into the directory but all other metadata into the inode. As you mention, that allows to pack the (name, inode) pairs more tightly and thus speeds up any single path traversal. On the other hand, doing `ls -l` will require to follow the references to read the inodes of all entries.

Regarding "general trees", there's no rule that says there must be data at inner nodes and leaves. There are so many different kinds of trees and all have different requirements.

«To put it simply, filesystem directories are optimized for out of memory traversal in order to look up names. Generally their actual representation is some form of data structure filled with <name, reference> pairs, where the reference tells you where (on disk) to find almost all substantive information about the name. In the old days and in basic filesystems, the containing data structure was a list; in modern filesystems, it's usually some form of balanced tree»

That applies only to UNIX filesystems, and the original UNIX filesystem layout was terrible for the high-latency drives of a PDP-11 (or even worse of a PDP-8).

The original UNIX filesystem was simply designed to save memory, which was extremely scarce, and to be as simple as possible (probably also to save memory, I-space memory).

Other than UNIX filesystems there were several others based on hashes or trees; for example the ancient TSS/360 OS from IBM (1967) used VSAM (B-Tree) files even for text editing.

There have been more interesting forms of filesystems, such as keyword based filesystems, where the path components are keywords, so for example the components of "/usr/lib/gcc/x86_64-linux-gnu/11/cc1" can be given in any non ambiguous order or subset, or arbitrary key-value tags can be attached to files (in a Stanford PhD thesis for example).

Consider my brief summary in "Filesystem Structure" in

http://www.sabi.co.uk/Notes/linuxPackageWhy15.html

By Walex at 2022-07-04 19:21:20:

«It's also a non-obvious decision in filesystems (ext4) to put only the name into the directory but all other metadata into the inode.</i>»

Actually "ext4" in particular puts the type of the inode in the directory entry, because that is an immutable characteristics of the inode, unlike size, timestamps, permissions, ownership.

«As you mention, that allows to pack the (name, inode) pairs more tightly and thus speeds up any single path traversal. On the other hand, doing `ls -l` will require to follow the references to read the inodes of all entries.»

If you put mutable i-node metadata in directory entries how can you easily do hard links? :-)

It is actually possible to preserve hard links and have mutable i-node metadata and quick access in the most common case from directory entry to i-node data by having interleaved directory and i-node entries, which works for the common case of singly-linked i-nodes. This is similar to pre-joined, interleaved tablespaces for relational databases ("join attribute clustered tables" in Oracle jargon AFAIK).

Written on 03 July 2022.
« Why reproducible machines didn't used to be a priority (I think)
Why an empty (executable) file is generally true in Unix »

Page tools: View Source, View Normal, Add Comment.
Search:
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Sun Jul 3 22:32:01 2022
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.