How ZFS makes things like 'zfs diff' report filenames efficiently
As a copy on write (file)system, ZFS can use the transaction group (txg) numbers that are embedded in ZFS block pointers to efficiently find the differences between two txgs; this is used in, for example, ZFS bookmarks. However, as I noted at the end of my entry on block pointers, this doesn't give us a filesystem level difference; instead, it essentially gives us a list of inodes (okay, dnodes) that changed.
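To make the txg trick concrete, here is a minimal sketch of the idea in Python. It is a toy model, not ZFS code: the Block class, the block names, and the txg values are all made up for illustration. The one real property it leans on is that with copy on write, rewriting a block also rewrites every block above it, so a block's birth txg is at least as new as anything underneath it.

```python
from dataclasses import dataclass, field

@dataclass
class Block:
    """Toy stand-in for a block reached through a block pointer."""
    name: str
    birth_txg: int                       # txg in which this block was written
    children: list = field(default_factory=list)

def changed_since(block, from_txg):
    """Return the names of leaf blocks written after from_txg.

    A block whose birth txg is <= from_txg cannot have anything newer
    underneath it, so the whole subtree is skipped without being read.
    """
    if block.birth_txg <= from_txg:
        return []
    if not block.children:
        return [block.name]
    found = []
    for child in block.children:
        found.extend(changed_since(child, from_txg))
    return found

# Hypothetical tree: only dnode-10 was rewritten in txg 120.
root = Block("root", 120, [
    Block("indirect-0", 100, [Block("dnode-7", 90), Block("dnode-8", 100)]),
    Block("indirect-1", 120, [Block("dnode-9", 80), Block("dnode-10", 120)]),
])

changed = changed_since(root, 100)
print(changed)
```

Note that what comes out is a list of changed leaves (dnodes, in this toy), not a list of filenames, which is exactly the gap the rest of this entry is about.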
In theory, turning an inode or dnode number into the path to a file
is an expensive operation; you basically have to search the entire
filesystem until you find it. In practice, if you've ever run 'zfs
diff', you've likely noticed that it runs pretty fast. Nor is
this the only place that ZFS quickly turns dnode numbers into full
paths, as it comes up in 'zpool status' reports about permanent
errors. At one level, zfs diff and zpool status do this so rapidly
because they ask the ZFS code in
the kernel to do it for them. At another level, the question is how
the kernel's ZFS code can be so fast.
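For contrast, the expensive approach looks something like the following sketch, which finds a path from an inode number on an ordinary filesystem by walking everything. The function name find_paths_by_inode and the temporary directory layout are made up for the demonstration, and it assumes everything under the starting directory is on one filesystem (it doesn't compare st_dev).

```python
import os
import tempfile

def find_paths_by_inode(root, inode):
    """Walk every directory under root and return paths whose inode matches."""
    matches = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat() so that symlinks are examined, not followed.
            if os.lstat(path).st_ino == inode:
                matches.append(path)
    return matches

# Build a tiny throwaway tree and look one file up by inode number.
with tempfile.TemporaryDirectory() as top:
    os.makedirs(os.path.join(top, "cks", "tmp", "a"))
    target = os.path.join(top, "cks", "tmp", "a", "b")
    with open(target, "w") as f:
        f.write("hello\n")
    inode = os.lstat(target).st_ino
    found = find_paths_by_inode(top, inode)
    print(found)
```

The cost here grows with the size of the whole tree, not with the depth of the file you are looking for.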
The interesting and surprising answer is that ZFS cheats, in a way
that makes things very fast when it works and almost always works
in normal filesystems and with normal usage patterns. The cheat is
that ZFS dnodes record their parent's object number. Here, let's
show this in zdb:
    # zdb -vvv -bbbb -O ssddata/homes cks/tmp/a/b

        Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
       1285414    1   128K    512      0     512    512    0.00  ZFS plain file
    [...]
            parent  1284472
    [...]

    # zdb -vvv -bbbb -O ssddata/homes cks/tmp/a

        Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
       1284472    1   128K    512      0     512    512  100.00  ZFS directory
    [...]
            parent  52906
    [...]
            microzap: 512 bytes, 1 entries

                    b = 1285414 (type: Regular File)
The b file has a parent field that points to cks/tmp/a, the
directory it's in, and the a directory has a parent field that
points to cks/tmp, and so on. When the kernel wants to get the
name for a given object number, it can just fetch the object, look
at parent, and start going back up the filesystem.
(If you want to see this sausage being made, look at zfs_obj_to_path
and zfs_obj_to_pobj in zfs_znode.c. The parent field is a ZFS dnode
system attribute, specifically ZPL_PARENT.)
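The walk back up the filesystem can be sketched in a few lines of Python. This is a toy model and not the kernel's code: the object table is a plain dictionary, the object numbers for b, a, and cks/tmp are taken from the zdb output above, and the numbers for cks and the filesystem root are hypothetical. It also assumes that each path component is found by searching the parent directory's entries for the child's object number, which is a simplification made for this sketch.

```python
ROOT = 34      # hypothetical object number for the filesystem's root directory
CKS = 1000     # hypothetical object number for the cks directory

# object number -> its recorded parent and (for directories) its entries
objects = {
    ROOT:    {"parent": ROOT,    "entries": {"cks": CKS}},
    CKS:     {"parent": ROOT,    "entries": {"tmp": 52906}},
    52906:   {"parent": CKS,     "entries": {"a": 1284472}},
    1284472: {"parent": 52906,   "entries": {"b": 1285414}},
    1285414: {"parent": 1284472, "entries": {}},
}

def name_in_dir(dir_obj, child_obj):
    """Find the name under which child_obj appears in directory dir_obj."""
    for name, obj in objects[dir_obj]["entries"].items():
        if obj == child_obj:
            return name
    raise LookupError(f"object {child_obj} is not in directory {dir_obj}")

def toy_obj_to_path(obj):
    """Build a path by following parent pointers up to the root."""
    components = []
    while obj != ROOT:
        parent = objects[obj]["parent"]
        components.append(name_in_dir(parent, obj))
        obj = parent
    return "/".join(reversed(components))

path = toy_obj_to_path(1285414)
print(path)  # cks/tmp/a/b
```

The work done is proportional to the depth of the file (plus the size of each directory on the way up in this toy), not to the size of the filesystem.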
If you're familiar with the twists and turns of Unix filesystems,
you're now wondering how ZFS deals with hardlinks, which can cause
a file to be in several directories at once and so have several
parents (and then it can be removed from some of the directories).
The answer is that ZFS doesn't; a dnode only ever tracks a single
parent, and ZFS accepts that this parent information can be
inaccurate. I'll quote the comment in zfs_obj_to_pobj:
When a link is removed [the file's] parent pointer is not changed and will be invalid. There are two cases where a link is removed but the file stays around, when it goes to the delete queue and when there are additional links.
Before I get into the details, I want to say that I appreciate the
brute force elegance of this cheat. The practical reality is that
most Unix files today don't have extra hardlinks, and when they do
most hardlinks are done in ways that won't break ZFS's parent
stuff. The result is that ZFS has picked an efficient implementation
that works almost all of the time; in my opinion, the great benefits
we get from having it around are more than worth the infrequent
cases where it fails or malfunctions. Both zfs diff and having
filenames show up in zpool status permanent error reports are
very useful (and there may be other cases where this gets used).
The current details are that any time you hardlink a file to somewhere
or rename it, ZFS updates the file's parent to point to the new
directory. Often this will wind up with a correct parent even
after all of the dust settles; for example, a common pattern is to
write a file to an initial location, hardlink it to its final
destination, and then remove the initial location version. In this
case, the parent will be correct and you'll get the right name.
The time when you get an incorrect parent is this sequence:
    ; mkdir a b; touch a/demo
    ; ln a/demo b/
    ; rm b/demo
Here a/demo is the remaining path, but demo's dnode will claim
that its parent is b. I believe that zfs diff will even report
this as the path, because the kernel doesn't do the extra work
to scan the b directory to verify that demo is present in it.
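The bookkeeping rules described here (hardlinking updates the parent, removing a link leaves it alone) are simple enough to model directly. The following Python sketch is only a toy: the ToyFS class is invented for illustration, it records the parent as a directory name instead of an object number, and it takes no position on what path the kernel ends up reporting. It just shows where the single parent field ends up in the two sequences above.

```python
class ToyFS:
    """Toy model of a single recorded parent per file."""

    def __init__(self):
        self.dirs = {}      # directory name -> {entry name: object number}
        self.parent = {}    # object number -> directory recorded as parent
        self.next_obj = 1

    def mkdir(self, d):
        self.dirs[d] = {}

    def create(self, d, name):
        obj = self.next_obj
        self.next_obj += 1
        self.dirs[d][name] = obj
        self.parent[obj] = d
        return obj

    def link(self, obj, d, name):
        self.dirs[d][name] = obj
        self.parent[obj] = d          # a new hardlink updates the parent

    def unlink(self, d, name):
        del self.dirs[d][name]        # removing a link leaves the parent alone

    def real_dirs(self, obj):
        """Directories that actually contain an entry for obj."""
        return sorted(d for d, ents in self.dirs.items() if obj in ents.values())

# The failing sequence: mkdir a b; touch a/demo; ln a/demo b/; rm b/demo
bad = ToyFS()
bad.mkdir("a"); bad.mkdir("b")
demo = bad.create("a", "demo")
bad.link(demo, "b", "demo")
bad.unlink("b", "demo")
print(bad.parent[demo], bad.real_dirs(demo))    # b ['a']

# The common pattern: write to an initial spot, link to the final one,
# then remove the initial version.
good = ToyFS()
good.mkdir("a"); good.mkdir("b")
final = good.create("a", "demo")
good.link(final, "b", "demo")
good.unlink("a", "demo")
print(good.parent[final], good.real_dirs(final))  # b ['b']
```

In the first sequence the recorded parent is a directory that no longer holds the file; in the second, the parent is right even though the file was hardlinked along the way.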
(This behavior is undocumented and thus is subject to change at the convenience of the ZFS people.)