How 'zpool import' generates its view of a pool's configuration

July 26, 2019

A full ZFS pool import happens in two stages. First, 'zpool import' puts together a vdev configuration for the pool and passes it to the kernel; then the kernel reads the real pool configuration from ZFS objects in the pool's Meta Object Set. How 'zpool import' does its part is outlined at a high level by a comment in zutil_import.c; to summarize the comment, the configuration is created by assembling and merging together information from the ZFS label of each device. There is an important limitation to this process, which is that a device's ZFS label only contains information on the configuration of its own vdev, not on the overall pool configuration.

To show you what I mean, here are the relevant portions of a ZFS label (as dumped by 'zdb -l') for a device from one of our pools:

   txg: 5059313
   pool_guid: 756813639445667425
   top_guid: 4603657949260704837
   guid: 13307730581331167197
   vdev_children: 5
       type: 'mirror'
       id: 3
       guid: 4603657949260704837
       is_log: 0
           type: 'disk'
           id: 0
           guid: 7328257775812323847
           path: '/dev/disk/by-path/pci-0000:19:00.0-sas-phy3-lun-0-part6'
           type: 'disk'
           id: 1
           guid: 13307730581331167197
           path: '/dev/disk/by-path/pci-0000:00:17.0-ata-4-part6'

(For much more detail, although it's somewhat out of date, see the ZFS On-Disk Specifications [pdf].)

Based on this label, 'zpool import' knows the pool's GUID, the GUID of this vdev, which disk of the vdev it's dealing with, and where the other disk or disks of the vdev are supposed to be found. It also knows how many vdevs the pool has in total (it has 5) and which specific vdev this is (it's the fourth of five; vdev numbering starts from 0). But it doesn't know anything about the other vdevs, except that they exist (or should exist).
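To make that concrete, here is a minimal Python sketch of what can and can't be derived from this one label. The dict layout (including the 'vdev_tree' and 'children' nesting) and the summarize_label() function are my own illustration, not how zpool actually represents things; the values are the ones from the label above.

```python
# Hypothetical dict form of the label shown above. The field names follow
# the 'zdb -l' output; the "vdev_tree"/"children" nesting is an assumption
# made for this sketch.
label = {
    "txg": 5059313,
    "pool_guid": 756813639445667425,
    "top_guid": 4603657949260704837,
    "guid": 13307730581331167197,
    "vdev_children": 5,
    "vdev_tree": {
        "type": "mirror",
        "id": 3,
        "guid": 4603657949260704837,
        "children": [
            {"type": "disk", "id": 0, "guid": 7328257775812323847,
             "path": "/dev/disk/by-path/pci-0000:19:00.0-sas-phy3-lun-0-part6"},
            {"type": "disk", "id": 1, "guid": 13307730581331167197,
             "path": "/dev/disk/by-path/pci-0000:00:17.0-ata-4-part6"},
        ],
    },
}


def summarize_label(label):
    """Collect what a single device label tells us about the pool."""
    tree = label["vdev_tree"]
    # The label's own 'guid' identifies which child of the vdev this disk is.
    me = next(c for c in tree["children"] if c["guid"] == label["guid"])
    siblings = [c["path"] for c in tree["children"] if c["guid"] != label["guid"]]
    return {
        "pool_guid": label["pool_guid"],
        "vdev_guid": tree["guid"],
        "vdev_index": tree["id"],
        "vdev_count": label["vdev_children"],
        "my_child_id": me["id"],
        "sibling_paths": siblings,
        # Vdevs we know must exist but know nothing else about.
        "unknown_vdev_indexes": [
            i for i in range(label["vdev_children"]) if i != tree["id"]
        ],
    }


summary = summarize_label(label)
```

Here summary["unknown_vdev_indexes"] comes out as [0, 1, 2, 4]: four vdevs that this label says exist and tells us nothing more about.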

When zpool assembles the pool configuration, it will use the best information it has for each vdev, where the 'best' is taken to be the vdev label with the highest txg (transaction group number). The label with the highest txg for the entire pool is used to determine how many vdevs the pool is supposed to have. Note that there's no check that the best label for a particular vdev has a txg that is anywhere near the pool's (assumed) current txg. This means that if all of the modern devices for a particular vdev disappear and a very old device for it reappears, it's possible for zpool to assemble a (user-level) configuration that claims that the old device is that vdev (or the only component available for that vdev, which might be enough if the vdev is a mirror).
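The selection rule can be sketched in a few lines of Python. This is only an illustration of the logic as I've described it, not a translation of the C code; the flattened label dicts, the assemble_config() name, and the device paths are all hypothetical.

```python
def assemble_config(labels):
    """Pick the best label per vdev and the pool-wide vdev count.

    Each label is a dict with 'vdev_id', 'txg' and 'vdev_children'
    (a simplified, hypothetical form of a real ZFS label).
    """
    labels = list(labels)
    best = {}
    for lab in labels:
        cur = best.get(lab["vdev_id"])
        # Highest txg wins for each vdev. Note there is no comparison
        # against the pool-wide newest txg at all.
        if cur is None or lab["txg"] > cur["txg"]:
            best[lab["vdev_id"]] = lab
    # The label with the highest txg overall decides how many vdevs exist.
    newest = max(labels, key=lambda lab: lab["txg"])
    return newest["vdev_children"], best


labels = [
    {"vdev_id": 0, "txg": 5059313, "vdev_children": 2, "path": "/dev/sda1"},
    {"vdev_id": 0, "txg": 5059310, "vdev_children": 2, "path": "/dev/sdb1"},
    # Only a long-detached disk of vdev 1 is visible (made-up example).
    {"vdev_id": 1, "txg": 1200, "vdev_children": 2, "path": "/dev/sdz1"},
]
vdev_count, best = assemble_config(labels)
```

In this made-up case the very old /dev/sdz1 label is accepted as the best available description of vdev 1, simply because nothing newer was found for that vdev.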

If zpool can't find any labels for a particular vdev, all it can do in the configuration is fill in an artificial 'there is a vdev missing' marker; it doesn't even know whether it was a raidz or a mirrored vdev, or how much data is on it. When 'zpool import' prints the resulting configuration, it doesn't explicitly show these missing vdevs; if I'm reading the code right, your only clue as to where they are is that the pool configuration will abruptly skip from, e.g., 'mirror-0' to 'mirror-2' without reporting 'mirror-1'.
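Here is a small Python sketch of that gap-filling and of why the printed configuration skips a number. The marker dict, the function names, and the 'type-index' naming are simplifications of mine; the real output is produced by C code that I am only paraphrasing.

```python
# Placeholder used when no label at all was found for a vdev slot.
# (Hypothetical representation for this sketch.)
MISSING = {"type": "missing"}


def fill_missing(vdev_count, found):
    """Build the top-level vdev list, one entry per slot 0..vdev_count-1.

    'found' maps vdev index -> best config assembled from labels.
    """
    return [found.get(i, MISSING) for i in range(vdev_count)]


def displayed_names(top_level):
    """Names as they would be listed, with missing vdevs silently omitted."""
    return [
        f"{vdev['type']}-{i}"
        for i, vdev in enumerate(top_level)
        if vdev["type"] != "missing"
    ]


# Labels were found for vdevs 0 and 2, but nothing for vdev 1.
found = {0: {"type": "mirror"}, 2: {"type": "mirror"}}
top_level = fill_missing(3, found)
names = displayed_names(top_level)
```

With nothing found for slot 1, names is ['mirror-0', 'mirror-2'], and the jump in numbering is the only hint that something is absent.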

There's an additional requirement for a working pool configuration, although it's only checked by the kernel, not zpool. The pool uberblocks have a ub_guid_sum field, which must match the sum of all GUIDs in the vdev tree. If the GUID sum doesn't match, you'll get one of those frustrating 'a device is missing somewhere' errors on pool import. An entirely missing vdev naturally forces this to happen, since all of its GUIDs are unknown and so can't contribute what they should to this sum. I don't know how this interacts with better ZFS pool recovery.
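As a rough illustration of the GUID sum check, here is a Python sketch. I'm assuming that the sum wraps around as an unsigned 64-bit value and that the root of the vdev tree carries the pool's GUID; the mirror's GUIDs come from the label above, while the second vdev and its GUIDs are made up for the example.

```python
MASK64 = (1 << 64) - 1  # assumption: the sum is kept as an unsigned 64-bit value


def guid_sum(vdev):
    """Sum the GUIDs of a vdev and everything under it, wrapping at 64 bits."""
    total = vdev.get("guid", 0)  # a 'missing' placeholder has no known GUID
    for child in vdev.get("children", []):
        total += guid_sum(child)
    return total & MASK64


# The mirror from the label shown earlier.
mirror = {"guid": 4603657949260704837, "children": [
    {"guid": 7328257775812323847},
    {"guid": 13307730581331167197},
]}
# A second vdev with invented GUIDs, purely for illustration.
other = {"guid": 1111, "children": [{"guid": 2222}, {"guid": 3333}]}

pool_guid = 756813639445667425
root_full = {"guid": pool_guid, "children": [other, mirror]}
# Same pool, but no labels were found for the first vdev.
root_partial = {"guid": pool_guid, "children": [{"type": "missing"}, mirror]}

ub_guid_sum = guid_sum(root_full)  # what the uberblock would record
import_ok = guid_sum(root_partial) == ub_guid_sum
```

Here import_ok ends up False, because the missing vdev's GUIDs never make it into the sum that gets compared against ub_guid_sum.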

