The history of file type information being available in Unix directories

August 25, 2018

The two things that Unix directory entries absolutely have to have are the name of the directory entry and its 'inode', by which we generically mean some stable kernel identifier for the file that will persist if it gets renamed, linked to other directories, and so on. Unsurprisingly, directory entries have had these since the days when you read the raw bytes of directories with read(), and for a long time that was all they had; if you wanted more than the name and the inode number, you had to stat() the file, not just read the directory. Then, well, I'll quote myself from an old entry on a find optimization:

[...], Unix filesystem developers realized that it was very common for programs reading directories to need to know a bit more about directory entries than just their names, especially their file types (find is the obvious case, but also consider things like 'ls -F'). Given that the type of an active inode never changes, it's possible to embed this information straight in the directory entry and then return this to user level, and that's what developers did; on some systems, readdir(3) will now return directory entries with an additional d_type field that has the directory entry's type.

On Twitter, I recently grumbled about Illumos not having this d_type field. The ensuing conversation wound up with me curious about exactly where d_type came from and how far back it went. The answer turns out to be a bit surprising due to there being two sides of d_type.

On the kernel side, d_type appears to have shown up in 4.4 BSD. The 4.4 BSD /usr/src/sys/dirent.h has a struct dirent that has a d_type field, but the field isn't documented in either the comments in the file or in the getdirentries(2) manpage; both of those admit only to the traditional BSD dirent fields. This 4.4 BSD d_type was carried through to things that inherited from 4.4 BSD (Lite), specifically FreeBSD, but it continued to be undocumented for at least a while.

(In FreeBSD, the most convenient history I can find is here, and the d_type field is present in sys/dirent.h as far back as FreeBSD 2.0, which seems to be as far as the repo goes for releases.)

Documentation for d_type appeared in the getdirentries(2) manpage in FreeBSD 2.2.0, where the manpage itself claims to have been updated on May 3rd 1995 (cf). In FreeBSD, this appears to have been part of merging 4.4 BSD 'Lite2', which seems to have been done in 1997. I stumbled over a repo of UCB BSD commit history, and in it the documentation appears in this May 3rd 1995 change, which at least has the same date. It appears that FreeBSD 2.2.0 was released some time in 1997, which is when this would have appeared in an official release.

In Linux, it seems that a dirent structure with a d_type member appeared only just before 2.4.0, which was released at the start of 2001. Linux took this long because the d_type field only appeared in the 64-bit 'large file support' version of the dirent structure, and so was only returned by the new 64-bit getdents64() system call. This would have been a few years after FreeBSD officially documented d_type, and probably many years after it was actually available if you peeked at the structure definition.

(See here for an overview of where to get ancient Linux kernel history from.)

As far as I can tell, d_type is present on Linux, FreeBSD, OpenBSD, NetBSD, Dragonfly BSD, and Darwin (aka MacOS or OS X). It's not present on Solaris and thus Illumos. As far as other commercial Unixes go, you're on your own; all the links to manpages for things like AIX from my old entry on the remaining Unixes appear to have rotted away.

Sidebar: The filesystem also matters on modern Unixes

Even if your Unix supports d_type in directory entries, it doesn't mean that it's supported by the filesystem of any specific directory. As far as I know, every Unix with d_type support has support for it in their normal local filesystems, but it's not guaranteed to be in all filesystems, especially non-Unix ones like FAT32. Your code should always be prepared to deal with a file type of DT_UNKNOWN.
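
As a rough illustration of what being prepared for DT_UNKNOWN can look like, here is a minimal C sketch. The helper names (mode_to_dtype, entry_type, count_subdirs) and the fixed 4096-byte path buffer are made up for this example and are not from any library. It assumes a Unix where struct dirent has a d_type member at all, so it will not compile as-is on Solaris or Illumos. When the filesystem does not supply a type, it falls back to lstat().

```c
#define _DEFAULT_SOURCE /* glibc: expose d_type and the DT_* constants */
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical helper: map an st_mode from lstat() to a DT_* value. */
unsigned char mode_to_dtype(mode_t mode)
{
    if (S_ISREG(mode))  return DT_REG;
    if (S_ISDIR(mode))  return DT_DIR;
    if (S_ISLNK(mode))  return DT_LNK;
    if (S_ISCHR(mode))  return DT_CHR;
    if (S_ISBLK(mode))  return DT_BLK;
    if (S_ISFIFO(mode)) return DT_FIFO;
    if (S_ISSOCK(mode)) return DT_SOCK;
    return DT_UNKNOWN;
}

/*
 * Hypothetical helper: the type of one directory entry. Trust d_type
 * when the filesystem filled it in; otherwise pay for an lstat().
 * Returns DT_UNKNOWN if the type cannot be determined either way.
 */
unsigned char entry_type(const char *dirpath, const struct dirent *ent)
{
    char path[4096]; /* arbitrary size chosen for this sketch */
    struct stat sb;
    int n;

    assert(dirpath != NULL && ent != NULL);
    if (ent->d_type != DT_UNKNOWN)
        return ent->d_type;

    n = snprintf(path, sizeof path, "%s/%s", dirpath, ent->d_name);
    if (n < 0 || (size_t)n >= sizeof path)
        return DT_UNKNOWN; /* path did not fit in the buffer */
    if (lstat(path, &sb) != 0)
        return DT_UNKNOWN;
    return mode_to_dtype(sb.st_mode);
}

/* Count the subdirectories of dirpath; returns -1 if it can't be opened. */
int count_subdirs(const char *dirpath)
{
    DIR *dir = opendir(dirpath);
    struct dirent *ent;
    int count = 0;

    if (dir == NULL)
        return -1;
    while ((ent = readdir(dir)) != NULL) {
        if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
            continue;
        if (entry_type(dirpath, ent) == DT_DIR)
            count++;
    }
    closedir(dir);
    return count;
}
```

Calling count_subdirs("/usr") should give the same answer on any filesystem; the only difference is whether the answer comes straight from the directory entries or costs one lstat() per entry. Using lstat() rather than stat() matches what d_type reports, since a symbolic link shows up as DT_LNK instead of as whatever it points to.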

(Filesystems can implement support for file type information in directory entries in a number of different ways. The actual on disk format of directory entries is filesystem specific.)

It's also possible to have things the other way around, where you have a filesystem with support for file type information in directories that's on a Unix that doesn't support it. There are a number of plausible reasons for this to happen, but they're either obvious or beyond the scope of this entry.


Comments on this page:

From 78.58.206.110 at 2018-08-25 07:32:55:

A better source of historical Linux commits is: https://archive.org/download/git-history-of-linux

(alternatively: https://github.com/remram44/linux-full-history)

This includes all the repositories you've found, combined into one continuous chain -- so that you can git grep all the way to Linux 0.01.

A similar project for various BSDs is: https://github.com/dspinellis/unix-history-repo

By Late at 2018-12-21 06:37:47:

If you are interested also in QNX:

http://www.qnx.com/developers/docs/7.0.0/index.html#com.qnx.doc.neutrino.lib_ref/topic/r/readdir.html

So one needs to use special ugly-looking macros to get extra info out of struct dirent.

The rationale behind keeping metadata out of directories was to have it all in one place to allow for referential integrity, i.e. if it was in both the directory and the statb/inode, then one might have a disagreement between the two.

UNIX was austere in directory information to prevent too much of the underlying filesystem information/abstraction "leaking" into the application's field of view. Following MULTICS likewise.

As it is, this created problems for labelled security (B1) implementations with the dependencies on parent/current directories ('.', '..'), making ACLs function sensibly, as well as omitting directory entries (link count) when security level did not allow files to be present in the directory - effectively the contents of the directory were rewritten on the fly. Ran into this at HP on their government labelled secure HP-UX implementation.

With non-UNIX systems that have considerable attributes, trying to reconcile the application's view with the underlying filesystem's one is an even more extensive rewrite.

So the point of the controversial addition of d_type (which I disagreed with, which is why it didn't show up in 386BSD; I'd had this debate when at Sun Microsystems) was to do "hinting" so that filesystem walk functions like fts() would have an advantage. Since BSD was usually well documented, one reason to leave it less so might be to allow the hint to be present but not used. In such cases it was also to allow a follow-on (not to happen, as BSD was "wound down" after 4.4, wisely) for potentially similar hinting of a related but unresolved kind. So think of this as opaque inside of the structure, almost like field alignment.

My suggestion is that an opaque field in the dirent is what was desired long term.

Came across this as I was going through my 386BSD archives and putting them into github, including a request I'd gotten to allow interoperability with 4.4 UFS.

Believe Illumos doesn't have it because SunOS/Solaris didn't, because PSARC wanted it that way.

Written on 25 August 2018.


Last modified: Sat Aug 25 00:31:39 2018