2019-07-26
Some things on the GUID checksum in ZFS pool uberblocks
When I talked about how 'zpool import' generates its view of a
pool's configuration, I mentioned that an
additional kernel check of the pool configuration is that ZFS
uberblocks have a simple 'checksum' of all of
the GUIDs of the vdev tree. When the kernel is considering
a pool configuration, it rejects it if the sum of the GUIDs in the
vdev tree doesn't match the GUID sum from the uberblock.
(The documentation of the disk format claims that it's only the checksum of the leaf vdevs, but as far as I can see from the code it's all vdevs.)
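To make the check concrete, here's a minimal sketch of the idea in
Python. This is purely an illustration, not the actual kernel code
(which walks C vdev structures); the nested-dict 'guid'/'children'
layout just mirrors how a vdev tree looks in a ZFS label, and I'm
assuming the 64-bit sum simply wraps around.

    # Illustrative sketch only, not the actual ZFS kernel code.
    # A vdev tree is represented as nested dicts with 'guid' and
    # 'children', mirroring how the tree appears in a ZFS label.

    def guid_sum(vdev):
        """Sum the 'guid' of this vdev and, recursively, of all its children."""
        total = vdev['guid']
        for child in vdev.get('children', []):
            total += guid_sum(child)
        return total & 0xFFFFFFFFFFFFFFFF   # assume uint64 wrap-around

    def config_matches_uberblock(root_vdev, ub_guid_sum):
        # The kernel refuses the pool configuration when these disagree.
        return guid_sum(root_vdev) == ub_guid_sum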
I was all set to write about how this interacts with the vdev
configurations that are in ZFS labels, but
as it turns out this is no longer applicable. In versions of ZFS
that have better ZFS pool recovery,
the vdev tree that's used is the one that's read from the pool's
Meta Object Set (MOS), not the pool configuration that was passed
in from user level by 'zpool import'. Any mismatch between the
uberblock GUID sum and the vdev tree GUID sum likely indicates a
serious consistency problem somewhere.
(For the user level vdev tree, the difference between having a vdev's configuration and having all of its disks available is potentially important. As we saw yesterday, the ZFS label of every device that's part of a vdev has a complete copy of that vdev's configuration, including all of the GUIDs of its elements. Given a single intact ZFS label for a vdev, you can construct a configuration with all of the GUIDs filled in and thus pass the uberblock GUID sum validation, even if you don't have enough disks to actually use the vdev.)
The ZFS uberblock update sequence guarantees that the ZFS disk labels and their embedded vdev configurations should always be up to date with the current uberblock's GUID sum. Now that I know about the embedded uberblock GUID sum, it's pretty clear why the uberblock must be synced on all vdevs when the vdev or pool configuration is considered 'dirty'. The moment that the GUID sum of the current vdev tree changes, you'd better update everything to match it.
(The GUID sum changes if any rearrangement of the vdev tree happens.
This includes replacing one disk with another, since each disk has
a unique GUID. In case you're curious, the ZFS disk label always
has the full tree for a top level vdev, including the special
'replacing' and 'spare' sub-vdevs that show up during these
operations.)
PS: My guess from a not very extensive look through the kernel code
is that it's very hard to tell from user level whether you have a
genuine uberblock GUID sum mismatch or another problem that returns
the same extended error code. The good news is that I
think the only other case that returns VDEV_AUX_BAD_GUID_SUM
is if you have missing log device(s).
How 'zpool import' generates its view of a pool's configuration
Full bore ZFS pool import happens in two stages,
where 'zpool import' puts together a vdev configuration for the
pool, passes it to the kernel, and then the kernel reads the real
pool configuration from ZFS objects in the pool's Meta Object Set.
How 'zpool import' does this is outlined at a high level by a
comment in zutil_import.c;
to summarize the comment, the configuration is created by assembling
and merging together information from the ZFS label of each device.
There is an important limitation to this process, which is that a
device's ZFS label only contains information about that device's
own vdev, not about the configuration of the rest of the pool.
To show you what I mean, here's relevant portions of a ZFS label
(as dumped by 'zdb -l') for a device from one of our pools:
    txg: 5059313
    pool_guid: 756813639445667425
    top_guid: 4603657949260704837
    guid: 13307730581331167197
    vdev_children: 5
    vdev_tree:
        type: 'mirror'
        id: 3
        guid: 4603657949260704837
        is_log: 0
        children[0]:
            type: 'disk'
            id: 0
            guid: 7328257775812323847
            path: '/dev/disk/by-path/pci-0000:19:00.0-sas-phy3-lun-0-part6'
        children[1]:
            type: 'disk'
            id: 1
            guid: 13307730581331167197
            path: '/dev/disk/by-path/pci-0000:00:17.0-ata-4-part6'
(For much more detail that is somewhat out of date, see the ZFS On-Disk Specifications [pdf].)
Based on this label, 'zpool import' knows what the GUID of this
vdev is, which disk of the vdev it's dealing with and where the
other disk or disks in it are supposed to be found, the pool's GUID,
how many vdevs the pool has in total (it has 5) and which specific
vdev this is (it's the fourth of five; vdev numbering starts from
0). But it doesn't know anything about the other vdevs, except
that they exist (or should exist).
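As an illustration of that reasoning, here's a small Python sketch
that pulls the same facts out of one label, represented as a dict
with the fields that 'zdb -l' printed above. This is my own
illustration, not how zutil_import.c actually works (it operates on
unpacked nvlists in C).

    # Illustrative only: what one label tells us, using the zdb field names above.
    def describe_label(label):
        tree = label['vdev_tree']
        facts = {
            'pool_guid': label['pool_guid'],        # which pool this device belongs to
            'total_vdevs': label['vdev_children'],  # how many top-level vdevs should exist
            'vdev_id': tree['id'],                  # which of those vdevs this is
            'vdev_guid': tree['guid'],
            'vdev_type': tree['type'],
            'disks': [],
        }
        # The top-level 'guid' names this particular disk; the children list
        # tells us where its siblings in the vdev are supposed to be found.
        for child in tree.get('children', []):
            role = 'this disk' if child['guid'] == label['guid'] else 'sibling'
            facts['disks'].append((role, child['path']))
        return facts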
When zpool assembles the pool configuration, it will use the best
information it has for each vdev, where the 'best' is taken to be
the vdev label with the highest txg (transaction group number).
The label with the highest txg for the entire pool is used to
determine how many vdevs the pool is supposed to have. Note that
there's no check that the best label for a particular vdev has a
txg that is anywhere near the pool's (assumed) current txg. This
means that if all of the modern devices for a particular vdev
disappear and a very old device for it reappears, it's possible for
zpool to assemble a (user-level) configuration that claims that the
old device is that vdev (or the only component available for that
vdev, which might be enough if the vdev is a mirror).
If zpool can't find any labels for a particular vdev, all it can
do in the configuration is fill in an artificial 'there is a vdev
missing' marker; it doesn't even know whether it was a raidz or a
mirrored vdev, or how much data is on it. When 'zpool import'
prints the resulting configuration, it doesn't explicitly show these
missing vdevs; if I'm reading the code right, your only clue as to
where they are is that the pool configuration will abruptly skip
from, eg, 'mirror-0' to 'mirror-2' without reporting 'mirror-1'.
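Putting the last two paragraphs together, here's a hedged Python
sketch of the assembly logic as I understand it. It follows the
description above rather than the actual zutil_import.c code, and
the 'missing' placeholder is just my stand-in for whatever marker
zpool really uses internally.

    # Sketch of assembling the user-level pool configuration from device
    # labels, following the description above (highest-txg label wins per
    # vdev, the pool-wide newest label sets the vdev count, and vdevs with
    # no label at all become placeholders). Not the real zutil_import.c logic.
    def assemble_config(labels):
        # Pick the best (highest-txg) label for each top-level vdev.
        best = {}
        for lb in labels:
            vid = lb['vdev_tree']['id']
            if vid not in best or lb['txg'] > best[vid]['txg']:
                best[vid] = lb

        # The highest-txg label overall says how many vdevs the pool should have.
        newest = max(labels, key=lambda lb: lb['txg'])

        vdev_tree = []
        for vid in range(newest['vdev_children']):
            if vid in best:
                vdev_tree.append(best[vid]['vdev_tree'])
            else:
                # No label found: all we can record is an artificial placeholder.
                vdev_tree.append({'type': 'missing', 'id': vid})
        return {'pool_guid': newest['pool_guid'], 'vdev_tree': vdev_tree}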
There's an additional requirement for a working pool configuration,
although it's only checked by the kernel, not zpool. The pool
uberblocks have a ub_guid_sum field, which must match the sum
of all GUIDs in the vdev tree. If the GUID sum doesn't match, you'll
get one of those frustrating 'a device is missing somewhere' errors
on pool import. An entirely missing vdev naturally forces this to
happen, since all of its GUIDs are unknown and obviously not
contributing what they should be to this sum. I don't know how this
interacts with better ZFS pool recovery.