About the order that readdir() returns entries in

November 26, 2011

In a mostly unrelated article (seen via Planet Sysadmin) I recently noticed the following:

readdir(3) just returns the directory entries in the order they are linked together, which is also not related to inode numbering but as best as I can tell is from outer leaf inwards (since the most recently created file is listed first).

On most systems, readdir(3) is a modestly warmed over version of the underlying 'read directory entries' system call, and returns directory entries in the same order that the system call does. In theory a Unix kernel can return directory entries in whatever order it wants (including, say, sorted alphabetically). In practice kernels almost always give you directory entries in what I will call 'traversal order', whatever the natural order is for entries in the on-disk data structures that represent a directory.

For a very long time, Unix directories on disk were simple linear arrays (first with fixed-size entries and then with variable sized ones when BSD introduced long filenames). When a new entry was added, it was generally put in the first spot where there was room; at the end if none of the directory's entries had been deleted, and perhaps earlier if a filename of a suitable length had been deleted earlier. The kernel read directories in forward linear order, starting at the first block of the directory and going up, and so returned entries in this order.

(In the original simple Unix filesystems, inode numbers were also allocated in a straightforward 'first free number' order, so the order of directory entry creation could correspond quite well to inode order. The Berkeley FFS changed this somewhat by allocating inodes in a more scattered manner.)

Modern Unix systems commonly use some sort of non-linear directories under at least some circumstances (a linear data structure may still be more efficient for small directories); generally these are some variant of balanced trees. The natural traversal order is tree order, but what that is is very implementation dependent. I believe it's common to hash filenames and then insert entries into the tree in hash order, but hashes (and thus the hash order) vary tremendously between filesystems, and I'm sure that somewhere there is a filesystem that doesn't hash names and just inserts them straight in some order.

(Because this is a per-filesystem thing, it follows that the traversal order can be different for different directories on the same system even if they have the same entries created in the same order, either because the directories are using different filesystem types or just because some parameters were set differently on the different filesystems.)

Written on 26 November 2011.
« Python instance dictionaries, attribute names, and memory use
Why processing things in inode order is a good idea »

Page tools: View Source, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Sat Nov 26 01:47:30 2011
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.