How ZFS makes things like 'zfs diff' report filenames efficiently
As a copy on write (file)system, ZFS can use the transaction group (txg) numbers that are embedded in ZFS block pointers to efficiently find the differences between two txgs; this is used in, for example, ZFS bookmarks. However, as I noted at the end of my entry on block pointers, this doesn't give us a filesystem level difference; instead, it essentially gives us a list of inodes (okay, dnodes) that changed.
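To make the txg trick concrete, here is a toy sketch in Python. It is not ZFS code; the `Block` class, its fields, and the `changed_since` helper are all invented for illustration. It assumes only that each block pointer records the txg its block was written in, and that copy on write means a block is never older than anything underneath it, so an old enough pointer lets us skip the whole subtree.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Block:
    """Hypothetical stand-in for a block reached through a block pointer."""
    birth_txg: int                      # txg in which this block was written
    dnode: Optional[int] = None         # set only on leaves in this toy model
    children: List["Block"] = field(default_factory=list)


def changed_since(block: Block, old_txg: int) -> Iterator[int]:
    """Yield the dnode numbers of leaves written after old_txg."""
    if block.birth_txg <= old_txg:
        # Copy on write: nothing below this block can be newer than it,
        # so the entire subtree is skipped without being read.
        return
    if block.dnode is not None:
        yield block.dnode
    for child in block.children:
        yield from changed_since(child, old_txg)


tree = Block(30, children=[
    Block(10, children=[Block(10, dnode=1), Block(5, dnode=2)]),
    Block(30, children=[Block(30, dnode=3), Block(12, dnode=4)]),
])
print(list(changed_since(tree, 20)))  # [3]
```

Note that what comes out is a list of object numbers, not names, which is exactly the problem the rest of this entry is about.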
In theory, turning an inode or dnode number into the path to a file
is an expensive operation; you basically have to search the entire
filesystem until you find it. In practice, if you've ever run
'zfs diff', you've likely noticed that it runs pretty fast. Nor is
this the only place that ZFS quickly turns dnode numbers into full
paths, as it comes up in 'zpool status' reports about permanent
errors. At one level, zfs diff and zpool status do this so rapidly
because they ask the ZFS code in the kernel to do it for them. At
another level, the question is how the kernel's ZFS code can be so
fast.
The interesting and surprising answer is that ZFS cheats, in a way
that makes things very fast when it works and almost always works
in normal filesystems and with normal usage patterns. The cheat is
that ZFS dnodes record their parent's object number. Here, let's
show this in zdb:

    # zdb -vvv -bbbb -O ssddata/homes cks/tmp/a/b
        Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
       1285414    1   128K    512      0     512    512    0.00  ZFS plain file
    [...]
        parent  1284472
    [...]

    # zdb -vvv -bbbb -O ssddata/homes cks/tmp/a
        Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
       1284472    1   128K    512      0     512    512  100.00  ZFS directory
    [...]
        parent  52906
    [...]
        microzap: 512 bytes, 1 entries
            b = 1285414 (type: Regular File)

The b file has a parent field that points to cks/tmp/a, the
directory it's in, and the a directory has a parent field that
points to cks/tmp, and so on. When the kernel wants to get the
name for a given object number, it can just fetch the object, look
at its parent, and start going back up the filesystem.
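Here is a toy sketch of that walk in Python. This is a model only, not ZFS's actual code: the dictionaries stand in for dnodes and directories, and the `name_in` and `obj_to_path` helpers are invented for illustration. The object numbers come from the zdb output above, and I treat 52906 (cks/tmp) as the top of the walk purely to keep the example small.

```python
# Toy model only; none of these names exist in ZFS.
ROOT = 52906  # cks/tmp, used as the stopping point for this example

# object number -> recorded parent object number
parent = {1285414: 1284472, 1284472: 52906}

# directory object number -> {entry name: object number}
entries = {
    52906: {"a": 1284472},
    1284472: {"b": 1285414},
}


def name_in(dir_obj, obj):
    """Find the name under which obj appears in directory dir_obj."""
    for name, child in entries[dir_obj].items():
        if child == obj:
            return name
    raise LookupError(f"object {obj} not found in directory {dir_obj}")


def obj_to_path(obj):
    """Build a path by following parent pointers up to ROOT."""
    parts = []
    while obj != ROOT:
        up = parent[obj]
        parts.append(name_in(up, obj))
        obj = up
    return "/".join(reversed(parts))


print(obj_to_path(1285414))  # a/b
```

The cost is one step per directory level rather than a search of the whole filesystem. What real ZFS does when the recorded parent is wrong is not modeled here.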
If you're familiar with the twists and turns of Unix filesystems,
you're now wondering how ZFS deals with hardlinks, which can cause
a file to be in several directories at once and so have several
parents (and then it can be removed from some of the directories).
The answer is that ZFS doesn't; a dnode only ever tracks a single
parent, and ZFS accepts that this parent information can be
inaccurate. I'll quote the relevant comment from the ZFS source code:

    When a link is removed [the file's] parent pointer is not changed
    and will be invalid. There are two cases where a link is removed
    but the file stays around, when it goes to the delete queue and
    when there are additional links.

Before I get into the details, I want to say that I appreciate the
brute force elegance of this cheat. The practical reality is that
most Unix files today don't have extra hardlinks, and when they do
most hardlinks are done in ways that won't break ZFS's parent
stuff. The result is that ZFS has picked an efficient implementation
that works almost all of the time; in my opinion, the great benefits
we get from having it around are more than worth the infrequent
cases where it fails or malfunctions. Both zfs diff and having
filenames show up in zpool status permanent error reports are
very useful (and there may be other cases where this gets used).
The current details are that any time you hardlink a file to somewhere
or rename it, ZFS updates the file's parent to point to the new
directory. Often this will wind up with a correct parent
after all of the dust settles; for example, a common pattern is to
write a file to an initial location, hardlink it to its final
destination, and then remove the initial location version. In this
case, the parent will be correct and you'll get the right name.
The time when you get an incorrect
parent is this sequence:

    ; mkdir a b; touch a/demo
    ; ln a/demo b/
    ; rm b/demo

Here, a/demo is the remaining path, but demo's dnode will claim
that its parent is b. I believe that zfs diff will even report
this as the path, because the kernel doesn't do the extra work
to scan the b directory to verify that demo is present in it.
(This behavior is undocumented and thus is subject to change at the convenience of the ZFS people.)
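
To see why one sequence ends up fine and the other doesn't, here is a toy Python model of the bookkeeping rule as I've described it: linking (or renaming) sets the single recorded parent, and unlinking leaves it alone. The `replay` function and the directory names `tmp` and `final` are made up for illustration; this is not ZFS code.

```python
def replay(ops):
    """Apply ('link' | 'unlink', directory) steps to a single file.

    Returns (directories that still hold a link, recorded parent).
    """
    links, parent = set(), None
    for op, directory in ops:
        if op == "link":
            links.add(directory)
            parent = directory       # hardlink or rename updates parent
        else:
            links.discard(directory)
            # parent is deliberately not touched on unlink
    return links, parent


# Write to an initial location, link to the final one, remove the initial.
good = replay([("link", "tmp"), ("link", "final"), ("unlink", "tmp")])

# The problem sequence: touch a/demo; ln a/demo b/; rm b/demo
bad = replay([("link", "a"), ("link", "b"), ("unlink", "b")])

print(good)  # ({'final'}, 'final')
print(bad)   # ({'a'}, 'b')
```

In the first case the recorded parent matches the one remaining link. In the second, the only remaining link is in a while the recorded parent still says b.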