How ZFS maintains file type information in directories

August 26, 2018

As an aside in yesterday's history of file type information being available in Unix directories, I mentioned that it was possible for a filesystem to support this even though its Unix didn't. By supporting it, I mean that the filesystem maintains this information in its on disk format for directories, even though the rest of the kernel will never ask for it. This is what ZFS does.

(One reason to do this in a filesystem is future-proofing it against a day when your Unix might decide to support this in general; another is if you ever might want the filesystem to be a first class filesystem in another Unix that does support this stuff. In ZFS's case, I suspect that the first motivation was larger than the second one.)

The easiest way to see that ZFS does this is to use zdb to dump a directory. I'm going to do this on an OmniOS machine, to make it more convincing (OmniOS being a Unix whose kernel never asks for this information), and it turns out that this has some interesting results. Since this is OmniOS, we don't have the convenience of just naming a directory in zdb, so let's find the root directory of a filesystem, starting from dnode 1 (as seen before).

# zdb -dddd fs3-corestaff-01/h/281 1
Dataset [....]
[...]
    microzap: 512 bytes, 4 entries
[...]
         ROOT = 3 

# zdb -dddd fs3-corestaff-01/h/281 3
    Object  lvl   iblk   dblk  dsize  lsize   %full  type
        3    1    16K     1K     8K     1K  100.00  ZFS directory
[...]
    microzap: 1024 bytes, 8 entries

         RESTORED = 4396504 (type: Directory)
         ckstst = 12017 (type: not specified)
         ckstst3 = 25069 (type: Directory)
         .demo-file = 5832188 (type: Regular File)
         .peergroup = 12590 (type: not specified)
         cks = 5 (type: not specified)
         cksimap1 = 5247832 (type: Directory)
         .diskuse = 12016 (type: not specified)
         ckstst2 = 12535 (type: not specified)

This is actually an old filesystem (it dates from Solaris 10 and has been transferred around with 'zfs send | zfs recv' since then), but various home directories for real and test users have been created in it over time (you can probably guess which one is the oldest one). Sufficiently old directories and files have no file type information, but more recent ones have this information, including .demo-file, which I made just now so this would have an entry that was a regular file with type information.
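If you want to count how many entries in a directory predate file type information, one option is to tally the 'type:' annotations in zdb's output. This is only a rough sketch: it assumes entry lines look exactly like the ones shown above ('name = objnum (type: ...)'), and `tally_types` and `ENTRY_RE` are names I made up for illustration, not anything that comes with ZFS.

```python
import re
from collections import Counter

# Matches zdb directory entry lines of the form seen above, e.g.
#     ckstst3 = 25069 (type: Directory)
# The exact output format is an assumption based on that one dump.
ENTRY_RE = re.compile(
    r"^\s*(?P<name>\S.*?) = (?P<obj>\d+) \(type: (?P<type>[^)]+)\)\s*$"
)

def tally_types(zdb_text):
    """Count directory entries by their reported type string."""
    counts = Counter()
    for line in zdb_text.splitlines():
        m = ENTRY_RE.match(line)
        if m:
            counts[m.group("type")] += 1
    return counts

# A few lines copied from the dump above.
SAMPLE = """\
         RESTORED = 4396504 (type: Directory)
         ckstst = 12017 (type: not specified)
         .demo-file = 5832188 (type: Regular File)
"""

if __name__ == "__main__":
    for type_name, count in sorted(tally_types(SAMPLE).items()):
        print(f"{type_name}: {count}")
```

Lines without a '(type: ...)' suffix, such as the 'ROOT = 3' entry in the master node, simply don't match and are skipped.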

Once I dug into it, this turned out to be a change introduced (or activated) in ZFS filesystem version 2, which is described in 'zfs upgrade -v' as 'enhanced directory entries'. As an actual change in (Open)Solaris, it dates from mid 2007, although I'm not sure what Solaris release it made it into. The upshot is that if you made your ZFS filesystem any time in the last decade, you'll have this file type information in your directories.

How ZFS stores this file type information is interesting and clever, especially when it comes to backwards compatibility. I'll start by quoting the comment from zfs_znode.h:

/*
 * The directory entry has the type (currently unused on
 * Solaris) in the top 4 bits, and the object number in
 * the low 48 bits.  The "middle" 12 bits are unused.
 */

In yesterday's entry I said that Unix directory entries need to store at least the filename and the inode number of the file. What ZFS is doing here is reusing the 64 bit field used for the 'inode' (the ZFS dnode number) to also store the file type, because it knows that object numbers have only a limited range. This also makes old directory entries compatible, by making type 0 (all 4 bits 0) mean 'not specified'. Since old directory entries only stored the object number and the object number is at most 48 bits, the higher bits are guaranteed to be all zero.

(It seems common to define DT_UNKNOWN to be 0; both FreeBSD and Linux do it.)
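To make the bit layout concrete, here is a minimal sketch of packing and unpacking such a 64-bit value. The field widths come straight from the zfs_znode.h comment; the function names are hypothetical, and the specific type codes (4 for a directory, 8 for a regular file) are the usual d_type values on Linux and FreeBSD, which I am assuming ZFS also uses here.

```python
# Field layout from the zfs_znode.h comment: type in the top 4 bits,
# object number in the low 48 bits, the middle 12 bits unused.
TYPE_SHIFT = 60
TYPE_MASK = 0xF
OBJ_MASK = (1 << 48) - 1

# Assumed type codes (the common dirent d_type values).
DT_UNKNOWN = 0
DT_DIR = 4
DT_REG = 8

def pack_dirent(obj, dtype=DT_UNKNOWN):
    """Combine an object number and a type code into one 64-bit value."""
    if obj & ~OBJ_MASK:
        raise ValueError("object number does not fit in 48 bits")
    if dtype & ~TYPE_MASK:
        raise ValueError("type does not fit in 4 bits")
    return (dtype << TYPE_SHIFT) | obj

def dirent_type(value):
    """Extract the type code from the top 4 bits."""
    return (value >> TYPE_SHIFT) & TYPE_MASK

def dirent_obj(value):
    """Extract the object number from the low 48 bits."""
    return value & OBJ_MASK

if __name__ == "__main__":
    new_style = pack_dirent(4396504, DT_DIR)   # like RESTORED above
    old_style = 12017                          # like ckstst: bare object number
    for value in (new_style, old_style):
        print(hex(value), dirent_type(value), dirent_obj(value))
```

An old-style entry is just a bare object number, so its type field naturally decodes as 0, which is what gives 'not specified' for free.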

The reason this needed a new ZFS filesystem version is now clear. If you tried to read directory entries with file type information on a version of ZFS that didn't know about them, the old version would likely see crazy (and non-existent) object numbers and nothing would work. In order to even read a 'file type in directory entries' filesystem, you need to know to only look at the low 48 bits of the object number field in directory entries.
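Here is a small illustration of that failure mode, under the same assumptions about the layout. The two reader functions are hypothetical stand-ins for 'ZFS code that predates this change' and 'ZFS code that knows about it'; neither is real ZFS code, and the type code 8 for a regular file is again the usual d_type value rather than something I have confirmed from the ZFS source.

```python
OBJ_MASK = (1 << 48) - 1   # low 48 bits hold the object number
DT_REG = 8                 # assumed type code for a regular file

# How an entry like .demo-file (object 5832188) might be stored
# once file type information is included.
raw_value = (DT_REG << 60) | 5832188

def old_reader_obj(value):
    """A pre-change reader: the whole 64-bit field is the object number."""
    return value

def new_reader_obj(value):
    """A reader that knows to look only at the low 48 bits."""
    return value & OBJ_MASK

if __name__ == "__main__":
    print("old reader sees object", old_reader_obj(raw_value))
    print("new reader sees object", new_reader_obj(raw_value))
    # The old reader's answer is far outside the 48-bit object number range.
    print("plausible object number?", old_reader_obj(raw_value) <= OBJ_MASK)
```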

(As before, I consider this a neat hack that cleverly uses some properties of ZFS and the filesystem to its advantage.)

Last modified: Sun Aug 26 00:43:13 2018