Things I do and don't know about how ZFS brings pools up during boot

August 16, 2017

If you import a ZFS pool explicitly, through 'zpool import', the user-mode side of the process normally searches through all of the available disks in order to find the component devices of the pool. Because it does this explicit search, it will find pool devices even if they've been shuffled around in a way that causes them to be renamed, or even (I think) drastically transformed, for example by being dd'd to a new disk. This is pretty much what you'd expect, since ZFS can't really read what the pool thinks its configuration is until it assembles the pool. When it imports such a pool, I believe that ZFS rewrites the information kept about where to find each device so that it's correct for the current state of your system.
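To make the explicit search concrete, here is a minimal toy model of it in Python. This is only a sketch under stated assumptions, not how 'zpool import' is actually implemented: the function scan_for_pools, the device names, and the flat label dictionaries are all hypothetical stand-ins for the real on-disk labels. The point it illustrates is that pool membership comes from what each disk says about itself, not from what the disk is currently called.

```python
from collections import defaultdict


def scan_for_pools(visible_disks):
    """Group labelled disks by pool GUID, ignoring what they are called.

    visible_disks maps a current device name to a (hypothetical, heavily
    simplified) label dict with 'pool_guid' and 'vdev_guid' keys, or to
    None for a disk that carries no label at all.
    """
    pools = defaultdict(dict)
    for name, label in visible_disks.items():
        if label is None:
            continue  # not a pool member, skip it
        pools[label["pool_guid"]][label["vdev_guid"]] = name
    return {guid: dict(members) for guid, members in pools.items()}


# The same two pool disks before and after being moved to a different
# controller, which changes their device names (all names are made up).
before = {
    "c0t0d0": {"pool_guid": 111, "vdev_guid": 1},
    "c0t1d0": {"pool_guid": 111, "vdev_guid": 2},
    "c0t2d0": None,
}
after = {
    "c3t5d0": {"pool_guid": 111, "vdev_guid": 1},
    "c3t4d0": {"pool_guid": 111, "vdev_guid": 2},
    "c0t2d0": None,
}

old_paths = scan_for_pools(before)[111]
new_paths = scan_for_pools(after)[111]
```

In this model both scans find the same two vdevs for pool 111, just under different names; the second result is the sort of information I believe gets written back into the pool configuration on import.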

This is not what happens when the system boots. To the best of my knowledge, for non-root pools the ZFS kernel module directly reads /etc/zfs/zpool.cache during module initialization and converts it into a series of in-memory pool configurations, all of which start out in an unactivated state. At some point, magic things attempt to activate some or all of these pools, which causes the kernel to attempt to open all of the devices listed as part of the pool configuration and verify that they are indeed part of the pool. The process of opening devices only uses the names and other identification of the devices that are in the pool configuration; however, one identification is a 'devid', which for many devices is basically the model and serial number of the disk. So I believe that under at least some circumstances the kernel will still be able to find disks that have been shuffled around, because it will basically seek out that model plus serial number wherever it's (now) connected to the system.

(See vdev_disk_open in vdev_disk.c for the gory details, but you also need to understand Illumos devids. The various device information available for disks in a pool can be seen with 'zdb -C <pool>'.)
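As a rough illustration of the difference from an import-time search, here is a toy Python model of opening a device using only what the cached configuration recorded about it. Everything in it is an assumption made for illustration: the function open_vdev, the field names, the devid strings, and especially the order of attempts (recorded path first, then devid) are mine, and the real logic in vdev_disk_open is more involved.

```python
def open_vdev(recorded, present):
    """Find a pool device using only its recorded identification.

    recorded: dict with 'path', 'devid' (may be None) and 'guid', as
              remembered in the cached pool configuration.
    present:  dict mapping each current device path to a dict with the
              'devid' and 'guid' that device reports right now.
    Returns the path the device was found at, or None. There is no
    fallback scan of every disk's label here.
    """
    # Attempt 1: the recorded path, accepted only if the right vdev is there.
    dev = present.get(recorded["path"])
    if dev is not None and dev["guid"] == recorded["guid"]:
        return recorded["path"]

    # Attempt 2: the same devid (roughly model plus serial) anywhere else.
    if recorded["devid"] is not None:
        for path, dev in present.items():
            if dev["devid"] == recorded["devid"] and dev["guid"] == recorded["guid"]:
                return path

    return None


# Hypothetical system state after two disks were moved to new paths.
present = {
    "c3t5d0": {"devid": "MODEL-A/SERIAL-1", "guid": 1},
    "c3t4d0": {"devid": None, "guid": 2},
}

with_devid = {"path": "c0t0d0", "devid": "MODEL-A/SERIAL-1", "guid": 1}
without_devid = {"path": "c0t1d0", "devid": None, "guid": 2}

found = open_vdev(with_devid, present)
missing = open_vdev(without_devid, present)
```

In this sketch the moved disk that has a devid is found at its new path, while the moved disk without one is not found at all, even though a full label scan in the style of 'zpool import' would have turned it up.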

To the best of my knowledge, this in-kernel activation makes no attempt to hunt around on other disks to complete the pool's configuration the way that 'zpool import' will. In theory, assuming that finding disks by their devid works, this should rarely or never matter; if the disk is there at all, it should be reporting its model and serial number and I think the kernel will find it. But I don't know for sure. I also don't know how the kernel acts if some disks take a while to show up, for example iSCSI disks.

(I suspect that the kernel only makes one attempt at pool activation and doesn't retry things if more devices show up later. But this entire area is pretty opaque to me.)

These days your root filesystem is also on a ZFS pool, the root pool. There are definitely some special code paths that seem to be invoked during boot for a ZFS root pool, but I don't have enough knowledge of the Illumos boot time environment to understand how they work and how they're different from the process of loading and starting non-root pools. I used to hear that root pools were more fragile if devices moved around, and that you might have to boot from alternate media in order to explicitly 'zpool import' and 'zpool export' the root pool to reset its device names, but that may be only folklore and superstition at this point.


Comments on this page:

The root pool is definitely fragile, and that's still true today. Any attempt to boot a system with the pool in a different location will lead to a panic in vfs_mountroot.

This is an annoyance if you ever need to move a system disk from one machine to another, as the paths will likely change. Changing the mode of some controllers has the same effect.

It also makes life very difficult if you're trying to bring up illumos on AWS. There, booting from "media" isn't really an option, and you don't have any console access to work around it. So you either need to start from an existing AMI with a working root pool, or build the image using Xen set up exactly the same way AWS is. Fortunately you only need to solve this once, but it's something I would rather not have had to solve at all.

Written on 16 August 2017.


Last modified: Wed Aug 16 00:36:46 2017