Wandering Thoughts archives

2019-07-26

Some things on the GUID checksum in ZFS pool uberblocks

When I talked about how 'zpool import' generates its view of a pool's configuration, I mentioned that the kernel makes an additional check of the pool configuration: ZFS uberblocks contain a simple 'checksum' of all of the GUIDs of the vdev tree. When the kernel is considering a pool configuration, it rejects the configuration if the sum of the GUIDs in its vdev tree doesn't match the GUID sum recorded in the uberblock.

(The documentation of the disk format claims that it's only the checksum of the leaf vdevs, but as far as I can see from the code it's all vdevs.)
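
(As a concrete illustration, here is a minimal sketch in Python of the check as I understand it. The dict-based vdev tree, the function names, and the explicit 64-bit wraparound are my own illustrative assumptions; the real kernel code walks its own vdev structures.)

  # A sketch of the uberblock GUID sum check, not the real ZFS code. The
  # vdev tree is represented as nested dicts with 'guid' and 'children'
  # keys; the real kernel works with nvlists and vdev_t structures.
  MASK64 = (1 << 64) - 1

  def vdev_guid_sum(vdev):
      # Sum the GUID of this vdev and of everything under it (all vdevs,
      # not just the leaves), wrapping at 64 bits.
      total = vdev['guid']
      for child in vdev.get('children', []):
          total = (total + vdev_guid_sum(child)) & MASK64
      return total

  def guid_sum_matches(root_vdev, ub_guid_sum):
      # The kernel refuses the pool configuration when this is False.
      return vdev_guid_sum(root_vdev) == ub_guid_sum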

I was all set to write about how this interacts with the vdev configurations that are in ZFS labels, but as it turns out this is no longer applicable. In versions of ZFS that have better ZFS pool recovery, the vdev tree that's used is the one that's read from the pool's Meta Object Set (MOS), not the pool configuration that was passed in from user level by 'zpool import'. Any mismatch between the uberblock GUID sum and the vdev tree GUID sum likely indicates a serious consistency problem somewhere.

(For the user level vdev tree, the difference between having a vdev's configuration and having all of its disks available is potentially important. As we saw yesterday, the ZFS label of every device that's part of a vdev has a complete copy of that vdev's configuration, including all of the GUIDs of its elements. Given a single intact ZFS label for a vdev, you can construct a configuration with all of the GUIDs filled in and thus pass the uberblock GUID sum validation, even if you don't have enough disks to actually use the vdev.)
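
(Continuing the sketch above, the vdev configuration from a single surviving label is enough to work out that vdev's entire contribution to the GUID sum. The GUIDs here are taken from the example label in the entry below; representing the configuration as a dict is again my own simplification.)

  # The mirror vdev as reconstructed from one intact label; only one of
  # the two disks needs to be present in order to read this.
  mirror = {
      'guid': 4603657949260704837,        # the mirror vdev itself
      'children': [
          {'guid': 7328257775812323847},  # a disk we may not have
          {'guid': 13307730581331167197}, # the disk whose label we read
      ],
  }

  # vdev_guid_sum(mirror) now gives this vdev's full contribution to the
  # uberblock GUID sum, whether or not both disks are actually present.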

The ZFS uberblock update sequence guarantees that the ZFS disk labels and their embedded vdev configurations should always be up to date with the current uberblock's GUID sum. Now that I know about the embedded uberblock GUID sum, it's pretty clear why the uberblock must be synced on all vdevs when the vdev or pool configuration is considered 'dirty'. The moment that the GUID sum of the current vdev tree changes, you'd better update everything to match it.

(The GUID sum changes if any rearrangement of the vdev tree happens. This includes replacing one disk with another, since each disk has a unique GUID. In case you're curious, the ZFS disk label always has the full tree for a top level vdev, including the special 'replacing' and 'spare' sub-vdevs that show up during these operations.)
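
(As a small worked example of why everything has to be updated in step: a disk replacement effectively swaps one GUID for another in the sum. The specific numbers below are made up for illustration, apart from the outgoing disk's GUID, which is taken from the example label in the next entry.)

  MASK64 = (1 << 64) - 1

  old_guid_sum  = 0x123456789abcdef0     # pretend current pool GUID sum
  old_disk_guid = 7328257775812323847    # GUID of the disk being replaced
  new_disk_guid = 4242424242424242424    # made-up GUID of its replacement

  # Ignoring the transient 'replacing' sub-vdev, the pool's GUID sum
  # shifts by the difference between the old and new disk GUIDs, and the
  # uberblocks everywhere must be rewritten with the new sum.
  new_guid_sum = (old_guid_sum - old_disk_guid + new_disk_guid) & MASK64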

PS: My guess from a not very extensive look through the kernel code is that it's very hard to tell from user level whether you have a genuine uberblock GUID sum mismatch or some other problem that returns the same extended error code. The good news is that I think the only other case that returns VDEV_AUX_BAD_GUID_SUM is if you have missing log device(s).

solaris/ZFSUberblockGUIDSumNotes written at 22:51:41

How 'zpool import' generates its view of a pool's configuration

Full bore ZFS pool import happens in two stages, where 'zpool import' puts together a vdev configuration for the pool, passes it to the kernel, and then the kernel reads the real pool configuration from ZFS objects in the pool's Meta Object Set. How 'zpool import' does this is outlined at a high level by a comment in zutil_import.c; to summarize the comment, the configuration is created by assembling and merging together information from the ZFS label of each device. There is an important limitation to this process, which is that the ZFS label only contains information on the vdev configuration, not on the overall pool configuration.

To show you what I mean, here are the relevant portions of a ZFS label (as dumped by 'zdb -l') for a device from one of our pools:

   txg: 5059313
   pool_guid: 756813639445667425
   top_guid: 4603657949260704837
   guid: 13307730581331167197
   vdev_children: 5
   vdev_tree:
       type: 'mirror'
       id: 3
       guid: 4603657949260704837
       is_log: 0
       children[0]:
           type: 'disk'
           id: 0
           guid: 7328257775812323847
           path: '/dev/disk/by-path/pci-0000:19:00.0-sas-phy3-lun-0-part6'
       children[1]:
           type: 'disk'
           id: 1
           guid: 13307730581331167197
           path: '/dev/disk/by-path/pci-0000:00:17.0-ata-4-part6'

(For much more detail, some of it now out of date, see the ZFS On-Disk Specifications [pdf].)

Based on this label, 'zpool import' knows what the GUID of this vdev is, which disk of the vdev it's dealing with and where the other disk or disks in it are supposed to be found, the pool's GUID, how many vdevs the pool has in total (it has 5) and which specific vdev this is (it's the fourth of five; vdev numbering starts from 0). But it doesn't know anything about the other vdevs, except that they exist (or should exist).

When zpool assembles the pool configuration, it will use the best information it has for each vdev, where the 'best' is taken to be the vdev label with the highest txg (transaction group number). The label with the highest txg for the entire pool is used to determine how many vdevs the pool is supposed to have. Note that there's no check that the best label for a particular vdev has a txg that is anywhere near the pool's (assumed) current txg. This means that if all of the modern devices for a particular vdev disappear and a very old device for it reappears, it's possible for zpool to assemble a (user-level) configuration that claims that the old device is that vdev (or the only component available for that vdev, which might be enough if the vdev is a mirror).
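
(My best guess at this process in Python form is the sketch below. I'm assuming each label has already been read into a dict with 'txg', 'vdev_children', and 'vdev_tree' keys, following the field names in the zdb -l output above; the real code in zutil_import.c works with nvlists and is rather more involved.)

  def assemble_config(labels):
      # The label with the highest txg overall says how many top-level
      # vdevs the pool is supposed to have.
      newest = max(labels, key=lambda l: l['txg'])
      nvdevs = newest['vdev_children']

      # For each top-level vdev, take the vdev_tree from the label with
      # the highest txg that describes it. Nothing checks that this txg
      # is anywhere near the pool's current txg.
      best = {}
      for label in labels:
          vid = label['vdev_tree']['id']
          if vid not in best or label['txg'] > best[vid]['txg']:
              best[vid] = label

      # Vdevs with no labels at all get an artificial 'missing' marker.
      children = []
      for vid in range(nvdevs):
          if vid in best:
              children.append(best[vid]['vdev_tree'])
          else:
              children.append({'type': 'missing', 'id': vid})
      return children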

If zpool can't find any labels for a particular vdev, all it can do in the configuration is fill in an artificial 'there is a vdev missing' marker; it doesn't even know whether it was a raidz or a mirrored vdev, or how much data is on it. When 'zpool import' prints the resulting configuration, it doesn't explicitly show these missing vdevs; if I'm reading the code right, your only clue as to where they are is that the pool configuration will abruptly skip from, eg, 'mirror-0' to 'mirror-2' without reporting 'mirror-1'.

There's an additional requirement for a working pool configuration, although it's only checked by the kernel, not zpool. The pool uberblocks have a ub_guid_sum field, which must match the sum of all GUIDs in the vdev tree. If the GUID sum doesn't match, you'll get one of those frustrating 'a device is missing somewhere' errors on pool import. An entirely missing vdev naturally forces this to happen, since all of its GUIDs are unknown and obviously not contributing what they should be to this sum. I don't know how this interacts with better ZFS pool recovery.

solaris/ZFSZpoolImportAssembly written at 01:18:58

