2009-10-12
ZFS GUIDs vs device names
One thing about ZFS GUIDs that I didn't get around to
covering is when commands like 'zpool status' are misleading you about
the device names that they show.
When things are fine, ie when ZFS can find the GUID that it is looking
for, the device name is correct (in fact hyper-correct, as 'zpool
status' also checks that the device ID is the right one). When things
are not fine, zpool will generally display the device name of the last
device that the GUID was found on; it does so even if that device name
doesn't actually exist on your system at the moment. (iSCSI devices are
good for this, since they can disappear if you sneeze on them.)
(Very recent versions of OpenSolaris will actually display the GUID and then add 'most recently seen on <device>', which is the far more honest and helpful approach. Apparently someone at Sun woke up.)
This usually doesn't matter for doing things to pool devices, because
zpool will use the pool configuration to map device names to the GUIDs
it has to give the kernel. This is what lets you remove nonexistent
devices (or at least it usually lets you remove them; there are some
recently fixed ZFS bugs concerning removing nonexistent spares). If you have a
situation where the same device name represents two
different GUIDs in the same pool, this mapping breaks down, but zpool doesn't
seem to notice. (I think you usually get the GUID of the first instance
of the device name, which is usually the broken one.)
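As a hedged illustration of this mapping (the pool name 'tank' and the device name c4t12d0 are both made up here), removing a spare whose device has long since vanished still works, because zpool translates the stale name back into the GUID via the pool configuration before talking to the kernel:
# zpool remove tank c4t12d0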
2009-10-08
What is going on with faulted ZFS spares
Suppose that you have several pools with a shared spare disk. One day
you reboot your machine, and suddenly 'zpool status' for most of your
pools starts reporting that your spare is faulted:
# zpool status tank
  pool: tank
[...]
config:

        NAME          STATE     READ WRITE CKSUM
[...]
        spares
          cXtYdZ      FAULTED   corrupted data
Unless you are running a very recent version of OpenSolaris, you will
probably be unable to zpool remove these faulted spares. If this
happens to you, do not try re-adding the spare(s) to your pools.
What has probably happened to you is a ZFS disk GUID mismatch problem, where the pool configuration claims that the spare should have one GUID but the actual device has a valid ZFS label with another GUID. When the kernel discovers this situation, such as when it brings pools up during boot, it will throw up its hands and declare the spare faulted.
Fortunately, the problem turns out to be easy to cure; all you have to
do is zpool export and then zpool import each affected pool, because
zpool import rewrites the pool's spare configuration to have the spare
device's current GUID during the import process.
(If a spare device has no valid ZFS disk labels, zpool import will
fix that too. It's really a helpful command, perhaps a bit too helpful
in a SAN environment.)
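Concretely, if the affected pools were named 'tank' and 'data' (both names invented for this sketch), the repair is just:
# zpool export tank
# zpool import tank
# zpool export data
# zpool import data
Each pool is unavailable between its export and its import, so you want to do this at a quiet time; afterwards 'zpool status' should show the spare as AVAIL again.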
Our theory about how this situation can happen naturally is that there
is some sort of race when adding the same spare to multiple pools in
close succession, such as if you do it from a script. (You can induce
the problem artificially by adding a spare to one pool, destroying its
ZFS disk labels with dd, and then adding it to another
pool, which will create new disk labels with a different GUID.)
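A hedged sketch of that artificial reproduction, with made-up pool and device names (ZFS keeps two 256 KB labels at the front of the slice and two more at the end, so this dd only clobbers the front pair):
# zpool add tank1 spare c4t12d0
# dd if=/dev/zero of=/dev/rdsk/c4t12d0s0 bs=256k count=2
# zpool add -f tank2 spare c4t12d0
The '-f' (and zeroing the tail of the slice as well) may or may not be necessary, depending on how much label residue zpool still notices.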
The two ways we have seen of having the kernel choke on this situation are to either reboot your machine or to add another spare device to the pools (this apparently causes the kernel to re-check all spare devices and thus notice the inconsistency and fault the affected spares). If you don't reboot your machine or add more spares, a system with this problem can run for months without anything noticing it (which is what happened to us).
Sidebar: what re-adding the spares does
If you try to re-add the spares, you will get two unpleasant surprises:
first, it will work, and second, zpool import will no longer fix your
problem.
Remember how I said that ZFS really identifies disks by GUIDs? Well, if you re-added the faulted spare, you're seeing a
really vivid illustration of this in action. As far as ZFS is concerned,
you have two completely separate spares, with different GUIDs, that
just happen to think they should be found on the same device. Once this
happens, zpool import won't rewrite the GUID any more, presumably
because that would create a situation where there's a duplicate spare.
2009-10-07
A brief introduction to ZFS (disk) GUIDs
Although ZFS commands like 'zpool status' will generally not tell you
this, ZFS does not actually identify pool devices by their device name.
Well, mostly it doesn't. The real situation is somewhat complex.
ZFS likes identifying pool-related objects with what it calls a 'GUID',
a random and theoretically unique 64-bit identifier. Pools have GUIDs,
vdevs have GUIDs, and, specifically, disks (in pool configurations) have
GUIDs. ZFS internally uses the GUID for most operations; for instance,
almost all of the kernel interfaces that zpool uses actually take
GUIDs to identify what to change, instead of device names.
(The 'numeric identifier' that you can use to have 'zpool import'
import a pool is the pool's GUID.)
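For example (with an invented pool name and an obviously made-up id), a bare 'zpool import' lists exported pools along with their numeric ids, and you can then import by the id instead of the name, which is handy when two visible pools share a name:
# zpool import
  pool: tank
    id: 1234567890123456789
[...]
# zpool import 1234567890123456789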
In a pool's configuration, entries for disks have a bunch of information
to help ZFS identify the right device: the GUID of the disk, the device
it's expected to be found on, and the physical path and device ID of
that device. You can see most of a pool's raw configuration, complete
with this information about each disk, with 'zdb -C <pool>'.
(Unfortunately, zdb doesn't print information about spare disks,
only about disks that are in vdevs.)
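To see this for yourself, the command is just (the pool name is a placeholder):
# zdb -C tank
Among the fields printed for each disk entry are its 'guid', the 'path' it is expected to be found on, and the 'devid' and 'phys_path' of that device.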
As you might guess, disks being used by ZFS have an on-disk label (in
fact they have four copies of it, two at the start of the ZFS slice and
two at the end). Among other things, this disk label has the disk's
GUID. You can dump a disk's ZFS label with 'zdb -l' on the ZFS slice
(normally slice 0, 's0', if you have given ZFS the full disk).
(On disks that are part of a live vdev, the disk label also has a copy of the vdev's information; on spare disks, all the label has is the disk's GUID, the version, and the state.)
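As a hedged example on a hypothetical spare disk (the device name and the numbers are placeholders; real output repeats this block for all four labels):
# zdb -l /dev/dsk/c4t12d0s0
--------------------------------------------
LABEL 0
--------------------------------------------
    version=14
    state=3
    guid=1234567890123456789
[...]
Here the guid is the disk GUID that pool configurations refer to, and state 3 is (as far as I can tell) the 'this disk is a spare' pool state.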
Conceptually, 'zpool import <pool>' works by finding a copy of the
nominal full pool configuration and then searching all of the disks to
find the disk GUIDs mentioned in the pool configuration. If the system
can find enough of the disks, it can actually import and bring up the
pool.
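You can see a faint echo of this search from user level (the pool name here is a placeholder, and /dev/dsk is the default search directory anyway, so the '-d' is just for illustration):
# zpool import -d /dev/dsk
[...]
# zpool import tank
The first command scans the devices it can find and reports which pools it could assemble from them; the second actually imports one of them.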
(I'm not entirely clear where the full pool configuration is stored;
it's in the pool somewhere, but it's unfortunately not in the disk
labels, so it's not trivial to dump it with zdb.)
Note that ZFS GUIDs are not real GUIDs. Real GUIDs are 128-bit objects and are conventionally printed in a special format; ZFS GUIDs are only 64-bit ones and are conventionally printed as plain decimal numbers.
Sidebar: zdb versus the full pool configuration
What seems to be going on with 'zdb -C' is that it dumps out the
in-kernel pool configuration nvlist, and while
the kernel keeps track of spares and other things, it does not keep them
in the in-kernel pool config nvlist; instead it stuffs them into other
bits of data structures that zdb does not print out.
Things like 'zpool status' look at the full pool configuration, but
they don't print the raw nvlist; instead they helpfully process it for
you and hide various bits of what is going on.
(I wound up writing a full pool config nvlist dumper myself; you can get a copy of the current source code here.)