== How ZFS makes things like '_zfs diff_' report filenames efficiently

As a copy on write (file)system, ZFS can use the [[transaction group (txg) numbers ZFSTXGsAndZILs]] that are embedded in [[ZFS block pointers ZFSBlockPointers]] to efficiently find the differences between two txgs; this is used in, for example, [[ZFS bookmarks ZFSBookmarksMechanism]]. However, as I noted at the end of [[my entry on block pointers ZFSBlockPointers]], this doesn't give us a filesystem level difference; instead, it essentially gives us a list of inodes (okay, [[dnodes ZFSBroadDiskStructure]]) that changed.

In theory, turning an inode or dnode number into the path to a file is an expensive operation; you basically have to search the entire filesystem until you find it. In practice, if you've ever run '_zfs diff_', you've likely noticed that it runs pretty fast. Nor is this the only place that ZFS quickly turns dnode numbers into full paths, as it comes up in [['_zpool status_' reports about permanent errors ZFSPermanentErrorsMeaning]].

At one level, _zfs diff_ and _zpool status_ do this so rapidly because they ask the ZFS code in the kernel to do it for them. At another level, the question is how the kernel's ZFS code can be so fast. The interesting and surprising answer is that ZFS cheats, in a way that makes things very fast when it works and almost always works in normal filesystems and with normal usage patterns.

The cheat is that ~~ZFS dnodes record their parent's object number~~. Here, let's show this in _zdb_:

.pn prewrap on

 # zdb -vvv -bbbb -O ssddata/homes cks/tmp/a/b
    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
   1285414    1   128K    512      0     512    512    0.00  ZFS plain file
 [...]
        parent  1284472
 [...]
 # zdb -vvv -bbbb -O ssddata/homes cks/tmp/a
    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
   1284472    1   128K    512      0     512    512  100.00  ZFS directory
 [...]
        parent  52906
 [...]
        microzap: 512 bytes, 1 entries
                b = 1285414 (type: Regular File)

The _b_ file has a _parent_ field that points to _cks/tmp/a_, the directory it's in, and the _a_ directory has a _parent_ field that points to _cks/tmp_, and so on. When the kernel wants to get the name for a given object number, it can just fetch the object, look at _parent_, and start going back up the filesystem.

(If you want to see this sausage being made, look at ((zfs_obj_to_path)) and ((zfs_obj_to_pobj)) in [[zfs_znode.c https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/fs/zfs/zfs_znode.c]]. The _parent_ field is a [[ZFS dnode system attribute https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/fs/zfs/sa.c]], specifically ((ZPL_PARENT)).)

If you're familiar with the twists and turns of Unix filesystems, you're now wondering how ZFS deals with hardlinks, which can cause a file to be in several directories at once and so have several parents (and then it can be removed from some of the directories). The answer is that ZFS doesn't; ~~a dnode only ever tracks a single parent, and ZFS accepts that this parent information can be inaccurate~~. I'll quote the comment in ((zfs_obj_to_pobj)):

> When a link is removed [the file's] parent pointer is not changed and
> will be invalid. There are two cases where a link is removed but the
> file stays around, when it goes to the [[delete queue ZFSDeleteQueue]]
> and when there are additional links.

Before I get into the details, I want to say that I appreciate the brute force elegance of this cheat. The practical reality is that most Unix files today don't have extra hardlinks, and when they do most hardlinks are done in ways that won't break ZFS's _parent_ stuff. The result is that ZFS has picked an efficient implementation that works almost all of the time; in my opinion, the great benefits we get from having it around are more than worth the infrequent cases where it fails or malfunctions.
Both _zfs diff_ and having filenames show up in _zpool status_ permanent error reports are very useful (and there may be other cases where this gets used).

The current details are that any time you hardlink a file to somewhere or rename it, ZFS updates the file's _parent_ to point to the new directory. Often this will wind up with a correct _parent_ even after all of the dust settles; for example, a common pattern is to write a file to an initial location, hardlink it to its final destination, and then remove the initial location version. In this case, the _parent_ will be correct and you'll get the right name. The time when you get an incorrect _parent_ is this sequence:

> ; mkdir a b; touch a/demo
> ; ln a/demo b/
> ; rm b/demo

Here _a/demo_ is the remaining path, but _demo_'s dnode will claim that its parent is _b_. I believe that _zfs diff_ will even report this as the path, because the kernel doesn't do the extra work to scan the _b_ directory to verify that _demo_ is present in it.

(This behavior is undocumented and thus is subject to change at the convenience of the ZFS people.)
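
To make the bookkeeping concrete, here is a toy model of it in Python. This is only a sketch of the idea described above, not ZFS code: the ((ToyFS)) class, its method names, and the object numbers are all made up for illustration. The one rule it takes from the text is that creating or hardlinking an object sets its single _parent_ field, while removing a link leaves _parent_ alone.

```python
ROOT = 1  # made-up object number for the root directory of the toy filesystem


class ToyFS:
    """Toy model: every object records exactly one parent object number."""

    def __init__(self):
        self.parent = {ROOT: ROOT}   # object number -> recorded parent
        self.entries = {ROOT: {}}    # directory object -> {name: object number}
        self._next = ROOT + 1

    def _new_object(self, dirobj, name):
        obj = self._next
        self._next += 1
        self.entries[dirobj][name] = obj
        self.parent[obj] = dirobj
        return obj

    def mkdir(self, dirobj, name):
        obj = self._new_object(dirobj, name)
        self.entries[obj] = {}
        return obj

    def create(self, dirobj, name):
        return self._new_object(dirobj, name)

    def link(self, obj, dirobj, name):
        # A new hardlink repoints the single parent field at the new directory.
        self.entries[dirobj][name] = obj
        self.parent[obj] = dirobj

    def unlink(self, dirobj, name):
        # Removing a link deliberately leaves the parent field untouched.
        del self.entries[dirobj][name]

    def name_in_parent(self, obj):
        # Return the object's name in its recorded parent, or None if that
        # directory no longer has an entry for it (a stale parent).
        for name, child in self.entries[self.parent[obj]].items():
            if child == obj:
                return name
        return None

    def dir_path(self, dirobj):
        # Walk the parent fields upward from a directory to the root.
        parts = []
        while dirobj != ROOT:
            parts.append(self.name_in_parent(dirobj))
            dirobj = self.parent[dirobj]
        return "/" + "/".join(reversed(parts))


fs = ToyFS()
a = fs.mkdir(ROOT, "a")
b = fs.mkdir(ROOT, "b")
demo = fs.create(a, "demo")   # touch a/demo
fs.link(demo, b, "demo")      # ln a/demo b/
fs.unlink(b, "demo")          # rm b/demo

claimed_dir = fs.dir_path(fs.parent[demo])    # directory the object claims as parent
parent_is_stale = fs.name_in_parent(demo) is None
print(claimed_dir, parent_is_stale)
```

In this model the problem sequence leaves _demo_ claiming _/b_ as its parent even though its only remaining name is in _a_. The toy only detects that the recorded parent is stale; what the real kernel code reports in that situation is not something this sketch tries to model.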