ZFS scrubs check (much) less than you probably think they do

October 23, 2018

Several years ago I wrote an entry on the limits of what ZFS scrubs check. In that entry I said:

The simple version of what a ZFS scrub does is that it verifies the checksum for every copy of every (active) block in the ZFS pool. It also explicitly verifies parity blocks for RAIDZ vdevs (which a normal error-free read does not). In the process of doing this verification, the scrub must walk the entire object tree of the pool from the top downwards, which has the side effect of more or less verifying this hierarchy; certainly if there's something like a directory entry that points to an invalid thing, you will get a checksum error somewhere in the process.

(The emphasis is new.)

Both as I wrote it and as people are likely to read it, I am pretty sure that this is incorrect. At the time I did not understand how ZFS filesystems and pools are really structured, or how this makes ZFS scrubs fundamentally different from the way that fsck usually works.

The straightforward and ordinary way that fsck programs are written for conventional filesystems is that they start at the root directory of the filesystem and follow everything down from there, eventually looking at every live file and object. In the process they build up a map of the disk blocks and inodes that are in use and free, and how many links each inode is supposed to have, and so on, and they can detect various sorts of inconsistencies in this data. Because they walk through the entire filesystem directory tree, they always notice if your directories are corrupt; reading through your directories is how they figure out what to do next.
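As a rough illustration of that fsck-style approach, here is a minimal Python sketch over a toy in-memory filesystem. The inode table layout, the `fsck_walk` name, and the messages are all made up for this example; no real fsck is structured exactly like this. The point is only that the directory entries drive the traversal, so a bad entry gets noticed along the way.

```python
def fsck_walk(inodes, root):
    """Walk a toy filesystem from its root directory and report problems.

    `inodes` maps inode number -> {"type": "dir" or "file", "nlink": int,
    "entries": {name: inode number}} ("entries" only for directories).
    This layout is hypothetical and exists only for illustration.
    """
    problems = []
    found_links = {ino: 0 for ino in inodes}
    visited = set()
    stack = [root]
    while stack:
        ino = stack.pop()
        if ino in visited:
            continue
        visited.add(ino)
        node = inodes[ino]
        if node["type"] != "dir":
            continue
        # Reading the directory is how we learn what to visit next,
        # so a bad entry is noticed as a side effect of the walk.
        for name, child in sorted(node["entries"].items()):
            if child not in inodes:
                problems.append(
                    f"directory {ino}: entry {name!r} points to unallocated inode {child}"
                )
                continue
            found_links[child] += 1
            stack.append(child)
    # Cross-check what the walk found against what each inode claims.
    for ino, node in sorted(inodes.items()):
        if ino == root:
            continue
        if ino not in visited:
            problems.append(f"inode {ino}: allocated but not reachable from the root")
        elif found_links[ino] != node["nlink"]:
            problems.append(
                f"inode {ino}: nlink is {node['nlink']} but {found_links[ino]} entries refer to it"
            )
    return problems


toy_fs = {
    2: {"type": "dir", "nlink": 1, "entries": {"a.txt": 3, "ghost": 99, "sub": 4}},
    3: {"type": "file", "nlink": 1},
    4: {"type": "dir", "nlink": 1, "entries": {"b.txt": 5}},
    5: {"type": "file", "nlink": 2},  # claims two links, only one exists
    6: {"type": "file", "nlink": 1},  # allocated but orphaned
}
for problem in fsck_walk(toy_fs, root=2):
    print(problem)
```

In this toy, the dangling `ghost` entry, the wrong link count, and the orphaned inode are all reported, because the walk has to interpret every directory it passes through.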

ZFS scrubs famously don't verify that various sorts of filesystem metadata are correct; for example, the ZFS filesystem with bad ACLs that I mentioned in an earlier entry passes pool scrubs. But until recently I thought that ZFS scrubs still traversed your ZFS pool and filesystems in the same way that fsck did, and in the process they more or less verified the integrity of your ZFS filesystem directories for the same reason, because that's how they knew what to visit next. If you had a corrupt entry that pointed to nothing or to an unallocated dnode or something, a scrub would either complain or panic (but at least you'd know).

But ZFS filesystems and ZFS pools are not really organized this way, as I found out when I actually did my research. Instead, each ZFS filesystem is in essence an object set of dnodes plus some extra information. Each dnode is self-contained; given only a block pointer to a dnode, you can completely verify the checksums of all of the dnode's data, without really having to know much about what that data actually means. This means that if all you care about is that the checksums of everything in a filesystem are correct, all you have to do is fetch the filesystem's object set and then verify the checksums of every allocated dnode in it. ZFS doesn't have to walk through the filesystem's directory tree to verify all of its checksums, and I am pretty sure that ZFS scrubs and resilvers don't bother to do so.
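To make the contrast with a directory walk concrete, here is a toy Python model of checksum-only verification over an object set. Everything in it is an assumption for illustration: the dictionary layout, the `make_block_pointer` and `scrub_object_set` names, and the use of SHA-256 are all invented here, and real ZFS is far more involved. What it tries to show is that the loop goes over allocated dnodes by object number and never interprets what is inside a block.

```python
import hashlib


def checksum(data: bytes) -> str:
    # SHA-256 stands in for whatever checksum the pool actually uses.
    return hashlib.sha256(data).hexdigest()


def make_block_pointer(blocks, addr, data):
    """Store `data` at `addr` on the toy disk and return a block pointer
    that records the address and the expected checksum."""
    blocks[addr] = data
    return {"addr": addr, "checksum": checksum(data)}


def scrub_object_set(object_set, blocks):
    """Verify every block of every allocated dnode.

    `object_set` maps object number -> dnode, or None for an unallocated
    slot. Block contents are never parsed; only checksums are compared.
    Returns a list of (object number, block address) checksum failures.
    """
    errors = []
    for objnum, dnode in sorted(object_set.items()):
        if dnode is None:
            continue
        for bp in dnode["block_pointers"]:
            if checksum(blocks[bp["addr"]]) != bp["checksum"]:
                errors.append((objnum, bp["addr"]))
    return errors


blocks = {}
object_set = {
    # A "directory" whose only entry names object 99, which does not exist.
    1: {"block_pointers": [make_block_pointer(blocks, 0, b"hello.txt -> object 99")]},
    2: {"block_pointers": [make_block_pointer(blocks, 1, b"file contents")]},
    3: None,  # unallocated dnode slot
}

print(scrub_object_set(object_set, blocks))  # the dangling entry goes unnoticed
blocks[1] = b"file c0ntents"  # simulate on-disk corruption
print(scrub_object_set(object_set, blocks))  # now a checksum error is reported
```

In this model the directory that points at a nonexistent object passes cleanly, because its block still matches its checksum; only actual data damage is reported.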

As a result, provided that all of the block checksums verify, ZFS scrubs are very likely to be splendidly indifferent to things like what is actually in your filesystem directories and what dnode object numbers your files claim to be and so on. Scrubs need to use and thus verify a bit of the dnode structure simply in order to find all of each dnode's data blocks through indirect blocks, but they don't need to even look at a lot of other things associated with dnodes (such as the structure of system attributes). It's possible that verifying the block checksums of filesystem directories requires some analysis of their general structure, but that general structure is generic.
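The part about indirect blocks can be sketched in the same toy style. In the Python model below, an indirect block is just a JSON-encoded list of child block pointers; that encoding, the `level` field, and the function names are assumptions made up for this example and are not the ZFS on-disk format. The verifier has to understand this much structure in order to find the data blocks, but it still never looks at what a level 0 block contains.

```python
import hashlib
import json


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def write_block(disk, data: bytes, level=0):
    """Append `data` to the toy disk and return a block pointer to it."""
    disk.append(data)
    return {"addr": len(disk) - 1, "level": level, "checksum": checksum(data)}


def write_indirect(disk, child_pointers, level):
    """Write an indirect block whose payload is a list of child pointers."""
    return write_block(disk, json.dumps(child_pointers).encode(), level)


def verify_tree(disk, bp):
    """Return the addresses of all blocks under `bp` that fail their checksum."""
    data = disk[bp["addr"]]
    if checksum(data) != bp["checksum"]:
        # The contents cannot be trusted, so this toy does not descend further.
        return [bp["addr"]]
    if bp["level"] == 0:
        return []  # a data block: the checksum matched, its meaning is ignored
    bad = []
    for child in json.loads(data):
        bad.extend(verify_tree(disk, child))
    return bad


disk = []
leaf_a = write_block(disk, b"first half of a file")
leaf_b = write_block(disk, b"second half of a file")
top = write_indirect(disk, [leaf_a, leaf_b], level=1)

print(verify_tree(disk, top))  # everything verifies
disk[leaf_b["addr"]] = b"second half of a fi1e"
print(verify_tree(disk, top))  # the damaged leaf is found through the indirect block
```

The only structure this verifier checks is the generic pointer tree; anything a data block says about names, object numbers, or attributes is invisible to it.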

(ZFS filesystem directories are ZAP objects, which are a generic ZFS thing used to store name/value pairs. You can read through all of the disk blocks of a ZAP object without knowing what the keys and their values mean or if they mean anything, although I think you'll basically verify that the actual hash table structure is correct.)

(What I wrote is potentially technically correct in that there are DSL (Dataset and Snapshot Layer) directories and so on, and scrubs may have to traverse through them to find the object sets of your filesystems (see the discussion in my broad overview of how ZFS is structured on disk). But I didn't even really understand those when I wrote my entry, and I was talking about ZFS filesystem directories.)

Written on 23 October 2018.