ZFS pool imports happen in two stages of pool configuration processing

July 24, 2019

How ZFS pools are imported is one of the more obscure areas of ZFS, which is a potential problem given that imports can go very wrong (often with quite unhelpful errors). One special thing about ZFS pool importing is that it effectively happens in two stages, first with user-level processing and then again in the kernel, and these two stages use two potentially different pool configurations. My primary source for this is the discussion from Illumos issue #9075:

[...] One of the first tasks during the pool load process is to parse a config provided from userland that describes what devices the pool is composed of. A vdev tree is generated from that config, and then all the vdevs are opened.

The Meta Object Set (MOS) of the pool is accessed, and several metadata objects that are necessary to load the pool are read. The exact configuration of the pool is also stored inside the MOS. Since the configuration provided from userland is external and might not accurately describe the vdev tree of the pool at the txg that is being loaded, it cannot be relied upon to safely operate the pool. For that reason, the configuration in the MOS is read early on. [...]

Here's my translation of that. In order to tell the kernel to load a pool, 'zpool import' has to come up with a vdev configuration for the pool and then provide it to the kernel. However, this is not the real pool configuration; the real pool configuration is stored in the pool itself (in regular ZFS objects that are part of the MOS), and the kernel reads it from there for itself as it imports the pool.

Although not mentioned explicitly, the pool configuration that 'zpool import' comes up with and passes to the kernel is not read from the canonical pool configuration, because reading those ZFS objects from the MOS requires a relatively full implementation of ZFS, which 'zpool import' does not have (the kernel obviously does). One source of the pool configuration for 'zpool import' is the ZFS cache file, /etc/zfs/zpool.cache, which in theory contains current pool configurations for all active pools. How 'zpool import' generates a pool configuration for exported or destroyed pools is sufficiently complicated to need an entry of its own.

This two-stage process means that there are at least two different things that can go wrong with a ZFS pool's configuration information. First, 'zpool import' may not be able to put together what it thinks is a valid pool configuration, in which case I believe that it doesn't even try to pass it to the kernel. Second, the kernel may dislike the configuration that it's handed for its own reasons. In older versions of ZFS (before better ZFS pool recovery landed), any mismatch between the actual pool configuration and the claimed configuration from user level was apparently fatal; now, only some problems are fatal.

As far as I know, 'zpool import' doesn't clearly distinguish between these two cases in its error messages when you're actually trying to import a pool. If you're just running it to see what pools are available, I believe that all of what 'zpool import' reports comes purely from its own limited and potentially imperfect configuration assembly, with no kernel involvement.

(When a pool is fully healthy and in good shape, the configuration that 'zpool import' puts together at the user level will completely match the real configuration in the MOS. It's when the two don't match that you run into potential problems.)


Comments on this page:

It's been a while since I looked into ZFS details, but from what I remember, getting the config from a vdev is easy. Unless I'm misremembering, the pool config is stored in the vdev labels, not inside the MOS. And reading the label is pretty simple: just open the potential vdev, get the size, and read the 4 labels at the fixed offsets. Then, parse the config with libnvpair.
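As a rough illustration of "read the 4 labels at the fixed offsets", here is a minimal Python sketch. It assumes the usual on-disk layout: each label is 256 KiB, two sit at the start of the device and two at the end (of the size rounded down to a 256 KiB multiple), and the packed nvlist config starts 16 KiB into each label and runs for 112 KiB. The function names are made up for this example, and it only returns the raw bytes; actually decoding them is a job for libnvpair, which isn't attempted here.

```python
import os

LABEL_SIZE = 256 * 1024     # size of one vdev label
NVLIST_OFFSET = 16 * 1024   # config nvlist follows 8 KiB blank + 8 KiB boot header
NVLIST_SIZE = 112 * 1024    # space reserved for the packed nvlist


def label_offsets(device_size):
    """Return the byte offsets of the four labels (hypothetical helper)."""
    # Labels are placed relative to the size rounded down to a label multiple.
    usable = device_size - (device_size % LABEL_SIZE)
    return [0, LABEL_SIZE, usable - 2 * LABEL_SIZE, usable - LABEL_SIZE]


def read_label_nvlists(path):
    """Read the raw (still packed) nvlist area of each label from a device or file."""
    blobs = []
    with open(path, "rb") as f:
        size = f.seek(0, os.SEEK_END)   # seek() returns the new position
        for off in label_offsets(size):
            f.seek(off + NVLIST_OFFSET)
            blobs.append(f.read(NVLIST_SIZE))
    return blobs
```

On a live system, 'zdb -l' on the device does the reading and the decoding for you; the sketch is only meant to show how little is involved in finding the labels.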

Of course getting the config from one vdev isn't quite enough. During the pool reassembly process, the configs from each vdev need to be "merged" and updated - disks can move around, and disks that used to be offline/inaccessible may have become online/accessible again.
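To make the "merging" idea a bit more concrete, here is a deliberately simplified Python sketch. It assumes each label has already been decoded into a dict with 'pool_guid' and 'txg' keys (those are the key names the label nvlist uses, as far as I know), groups the label configs by pool, and keeps the one with the highest txg for each pool. The function name and the dict representation are invented for illustration; the real reassembly has to do considerably more than this, such as working out device paths and dealing with missing vdevs.

```python
from collections import defaultdict


def pick_newest_configs(label_configs):
    """For each pool GUID, keep the label config with the highest txg.

    label_configs is a list of dicts standing in for decoded label nvlists;
    this is a toy model, not how libzfs actually represents them.
    """
    by_pool = defaultdict(list)
    for cfg in label_configs:
        by_pool[cfg["pool_guid"]].append(cfg)
    return {
        guid: max(cfgs, key=lambda c: c["txg"])
        for guid, cfgs in by_pool.items()
    }


# Two disks of one pool, one of which was offline for a while and so
# carries a label from an older txg.
labels = [
    {"pool_guid": 1111, "txg": 500, "name": "tank"},
    {"pool_guid": 1111, "txg": 420, "name": "tank"},
    {"pool_guid": 2222, "txg": 77, "name": "scratch"},
]
newest = pick_newest_configs(labels)
```

The point of preferring the highest txg is that a disk which was offline or inaccessible may carry a stale label that doesn't reflect later changes to the pool.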



Last modified: Wed Jul 24 00:52:07 2019