Why Unix kernels have grown caches for directory entries ('name caches')

December 2, 2023

An interesting feature of modern Unix kernels is that they generally know the names of things like current directories and open files. Traditionally the only thing Unix knew about open files, current directories, active memory mapped files, and so on was their inode (as an in-kernel data structure, including pointers to the inode's mount point and so on). However, some time back various Unixes added in kernel caches of directory entry names and associated data (in Linux these are dentries and the dcache; in FreeBSD there is the name cache). Once a Unix kernel had such a general cache, it could pin all of the entries for active file and directory objects and so generally be able to supply their names, either for system monitoring purposes (such as Linux's /proc/<pid>/fd subdirectory) or so they could support a system call to return the name of the current directory if it had one.

The reason that several Unixes all added these name caches is straightforward; running Unix systems generally do a lot of directory name lookups. The steady addition of shared libraries (which may live in a number of different places), data files for locales and timezones, lots of $PATH entries, and so on didn't improve the situation. Before name caches, each of these lookups had to call into the specific filesystem, which would generally check through whatever the on-disk data structure for directories was; hopefully the actual disk blocks for these directories would already be in the kernel's disk cache, so they didn't have to be read in.

A kernel name cache provides a fast path for all of these lookups. This cache is especially useful for looking up things that are almost certainly already in active use, such as /bin/sh, the core shared library loader, or the C shared library. These are almost always in memory already, so with the right efficient in-memory data structures for name caches, the kernel can go from "/bin/sh" to an inode quite efficiently (and directly, without having to do a bunch of indirection through things like its Virtual Filesystem Switch.

An explicit kernel name cache also has the additional benefit that it can store negative entries (in Linux, negative dentries), which say that a particular name isn't present. There are a fair number of situations on modern Unixes where programs will attempt to find a file in a succession of directories; with negative entries, those checks of all of the directories that the file isn't in can still be pretty efficient. Without some sort of support for 'this name is definitely not here' in the name cache, the kernel would have no choice but to ask the filesystem to search the on-disk directory for the name.

I don't know if there are performance studies for current name caches in current Unix kernels, but I'm sure that they make a real difference (both in lookup speed and in reducing kernel CPU usage). Even in the late 1980s, name lookups were a quite common thing and they relied very heavily on high hit rates in the kernel block cache (I was once involved in studying this in a BSD derived kernel, and I remember hit rates in the high 90%s).

(An interesting read on kernel name translation overhead and optimizing it is the relevant sections in the 4.4 BSD Lite "System Performance" paper.)

PS: Since I looked it up, all of Linux, FreeBSD, OpenBSD, and NetBSD have some form of kernel name caches. I don't know about Illumos or the few surviving commercial Unixes.

Sidebar: The (potential) names of filesystem objects

A directory in a conventional Unix filesystem has either one name or no name (if it's been removed). Because of this, the kernel's name cache can always know the directory's current name if it has one. If it wants to, the name cache can go further and provide the last name that the directory was known by before it was deleted, along with a mark that it was deleted.

A file can have no name (if it's been removed since it was opened or mmap()'d), it can have one name, or it can have several names because there are several hardlinks to it. Because of this the kernel name cache may not necessarily know the current name of an open file. If it started out having multiple hardlinks, was opened through one hardlink, and then that hardlink was removed, the name cache may not know the name of the other remaining hardlink(s).

Even if the name cache does know other names for the file, it's a policy decision if the name cache should provide them or if it should return the original name the file was opened under, along with an indication that the name was removed. In at least some implementations of /proc/<pid>/fd or the equivalent, you can still read the data of now-deleted files, so you don't need a current name to do this and knowing the original now-deleted name the program used may be more useful than knowing a current alternate name.


Comments on this page:

By Konrad at 2023-12-03 12:53:13:

I don't know if there are performance studies for current name caches in current Unix kernels, but I'm sure that they make a real difference (both in lookup speed and in reducing kernel CPU usage).

QNX does not have a name cache—of course not in the microkernel, though pathmgr is actually a userspace process. While message passing is quite fast, compiling programs is just about the worst thing one can do for performance. The company used to ship native compilers, but got rid of all "QNX desktop" things around 2010; even the official binaries are all cross-compiled from Linux (visible by running "strings" on them).

I don't recall exact numbers, but the performance difference when I tested this was shocking. Anyone who still has a QNX version with gcc and make could see this by building a large project and comparing against Linux. The QNX System Profiler will show the impact of name lookups; or just run under "strace -f" on LInux, check out the number of "openat" calls, and keep in mind that each call requires a lookup per path component. Linux runs a "make" many times faster, and dentry caching is the #1 reason—particularly in its negative form, given the way gcc looks up headers and libraries (no readdir() into a nice data structure, just a bunch of "open" calls till one works; this effectively offloads optimization onto the OS, and it works great on most).

There's at least one "openqnx" mirror on Github, by the way, for those who are curious about the details. (Probably unauthorized, and not "open" in the usual meaning of "open source"—"You must obtain a written license from and pay applicable license fees to QNX Software Systems before you may reproduce, modify or distribute this software". But it's sourced from code that was officially published and later withdrawn, not some illegal leak.)

By Konrad at 2023-12-03 13:13:00:

A directory in a conventional Unix filesystem has either one name or no name (if it's been removed).

Unrelated to my previous comment, do any systems allow files to be created in unnamed directories? Linux doesn't; rmdir only works on an empty directory, and though it's capable of deleting the current working directory (and /proc remembers the former name), no files can then be created in it.

Now that we have openat(), it'd allow a convenient "directory version" of tmpfile() that doesn't leave garbage behind when a process exits improperly. There's a way to do it on Linux by creating a tmpfs that's not attached to a path, but it's non-portable and quite baroque (and may involve creating a new user namespace to gain "privilege").

By cks at 2023-12-05 11:27:38:

I suspect that Linux (and other Unixes) forbid creating new files in a deleted directory to avoid having to delete them all when the directory reference is finally dropped (by everything cd'ing out of the directory and perhaps closing any open file descriptors to it). Supporting this would make life much more complicated in the kernel, possibly complete with arbitrary directory tree walks.

(Today I learned from another source that Linux actually supports unnamed temporary files through the O_TMPFILE option to open(), provided that the underlying filesystem supports it.)

By Konrad at 2023-12-05 16:09:47:

What's more interesting to me about O_TMPFILE is that the never-named file can later be linked into the filesystem, provided O_EXCL was not used. That means files can appear "fully formed", such that a crashing writer won't leave a partial one. In most other cases, there's not much reason to put a temporary file on a specific and "real" filesystem; when the files are expected to fit in memory, there's shm_open(SHM_ANON) on most systems, and memfd_create() on Linux (which can exactly emulate SHM_ANON, though neither libc or musl do that).

The point about having to walk the directory tree on deletion is a good one for directories that are part of real filesystems, whose data could be shared (via hard links or "refilnk") with files outside that tree. I suppose that hints at why the "temporary tmpfs" trick can work: the whole thing can trivially be deleted atomically. It'd be nice if systems made that easier to do. In principle, they could even add something like O_TMPFILE|O_DIRECTORY for those who want it on disk—either by using a real unnamed file as backing for something like tmpfs, or by allowing filesystems to "opt in" to a similar thing that would have the full feature set of the filesystem (which might enforce separation from the rest of the filesystem, such as by using an unnamed subvolume, for easy cleanup).

By cks at 2023-12-05 16:59:44:

For a real filesystem, the concern is broader than that new files in the directory might be shared outside of it. Unless the filesystem does something very special, everything created inside the now-deleted directory is a regular filesystem object (file, subdirectory, etc), with a regular on-disk presence. If the filesystem simply drops and removes the deleted top directory, all of those files and subdirectories leak; they're unreferenced from the outside, but they're allocated and using space. So the filesystem must walk the entire tree, explicitly unlinking or rmdir'ing each object (and then dealing with the possibility that some of them may still have lingering references).

You could also get rather surprising behavior. Consider a program that walks a portion of the filesystem hierarchy by chdir()'ing into each subdirectory and then using chddir("..") to back up. If such a program was the sole reference for a now-deleted directory with subdirectories, the moment it chdir()'d into a subdirectory, the top directory would be unreferenced and would be cleaned up, leaving the filesystem walker with no way to back up and reach any other portion of the tree. You'd get the same effect in a shell; 'mkdir temp; cd temp' and boom.

It's both less surprising and much easier to simply disallow creating new filesystem objects in a deleted directory.

By Konrad at 2023-12-05 19:10:46:

If the filesystem simply drops and removes the deleted top directory, all of those files and subdirectories leak

Naturally. But I was thinking that what appears to be a "simple" unlink call is kind of that way too: the filesystem needs to clean up a directory entry, maybe an inode and/or its backreference, some extended attributes, an extent map, update the free block list.... So I suspect it'd be practical as long as the OS knew it could rely on the filesystem to do that cleanup; hence the "opt-in" condition, as was done for O_TMPFILE.

While my interest in this hypothetical on-disk version is mostly a thought experiment, I'd expect an in-memory tmpdir() call to find practical uses; for example, it'd be a good way for a high-availability server to record its state (without having to write a serializer and deserializer—that is, an ad-hoc filesystem implementation—or deal with real writable disk space) and pass it to a newer version of itself. Well, as I said, it's possible now on Linux, and namespace-management tools do similar things implicitly (bubblewrap's root directory is a tmpfs with no binding outside its namespace); but I'm not sure many other people have realized that yet, nor tried to implement it on other Unices. (For reference, the relevant Linux calls are fsopen(), fsconfig(), and fsmount(), the latter returning an FD that can be used with openat() even if not bound to any path. Privilege may be needed but can be often faked with "unshare -mr --keep-caps" or similar.)

Your filesystem walker example is more interesting than you may realize. I'd consider any subdirectory to hold a reference to its parent, whether or not the ".." entry exists as a real hard link (it often does, and that's observable in a directory's link count); so, in my mind, the problem won't pop up there, and the parent(s) will just stick around. But the failure condition you described is reproducible in unmodified Linux: both chdir(".") and chdir("..") can fail if you lose permission to the directory, and a robust filesystem-walker must be prepared for it. For example, run "umask 022; mkdir -p ~/foo/bar" in your main user account; after you have another user "cd" into there, chmod both directories 0700, and the other user is "stuck". "cd -P ." and "cd -P .." will both fail, and an strace rules out any shell-related shenanigans; chdir(".") and chdir("..") are the failing syscalls.

Written on 02 December 2023.
« The Unix V6 shell and how control flow worked in it
A bit more trivia on the Unix V6 shell and its control flow »

Page tools: View Source, View Normal, Add Comment.
Search:
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Sat Dec 2 22:51:48 2023
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.