Our ZFS spares handling system (part 2)
In part 1, I casually tossed off that our new spares handling system 'reads out the state of all of the pools [and] sees if any disks need replacement', as if it's a trivial thing to do. In fact, it's not; one of ZFS's problems is that it doesn't export this information in any way that's particularly useful for other programs. As a result, my quick summary papered over what is in many ways the most complex part of our spares system.
It turned out that the easiest way to do all of this was to split it into two portions. The first part is a C program (called zpstatus) that reads out the state of ZFS pools by making direct calls to libzfs to get the pool's configuration nvlist and then dumping the information in it. Walking an nvlist is the easy part, but ZFS encodes various pieces of important information in opaque ways that require calling more internal libzfs routines in order to decode them. I worked out what to do and what various structure elements meant by reading the OpenSolaris zdb source. This C program doesn't attempt to do any analysis of pool state; its job is purely to read out factual state information and report it (along with the pool configuration). It produces output in pseudo-JSON, because this was the easiest low-rent way to transport the data into the other portion.
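As a rough illustration of the overall shape of the dumper, here is a minimal Python sketch of the same idea: recursively walking a nested configuration tree and emitting it as JSON-ish text. The real program does this in C against libzfs nvlists; the keys and structure below are invented for illustration and aren't zpstatus's actual output format.

```python
import json

def dump_config(config):
    # json.dumps does the recursive walk for us; the C version has to
    # iterate over nvpairs by hand and recurse into embedded nvlists
    return json.dumps(config, indent=1)

# an invented pool configuration tree, for illustration only
example = {
    "name": "tank",
    "vdev_tree": {
        "type": "root",
        "children": [
            {"type": "mirror",
             "children": [{"type": "disk", "path": "c0t0d0"},
                          {"type": "disk", "path": "c0t1d0"}]},
        ],
    },
}
```

The point of dumping the configuration tree more or less verbatim is that the C side stays dumb and all of the interesting decisions live in the consumer.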
(Since all of this relies on reverse-engineered internal interfaces, it's brittle; Solaris can and does change things around in patches, and OpenSolaris code doesn't necessarily correspond to our Solaris release. I feel that such a program is still justified because the information is too valuable and too hard to obtain in any other way. I may have cause to regret this if Oracle doesn't release Solaris 11 source code and we ever upgrade to it.)
The analysis of pool state is done in Python code, because it's much easier to do tree walking, tree manipulation, and so on in a high-level language than in C. The code starts with the raw information from zpstatus and progressively annotates it with additional information such as the redundancy state of vdevs, whether they are involved in resilvers, the state of a resilvering or replacing operation, and so on. At the end of the process it works out whole-pool information such as how healthy the pool is and how many disks are needed to bring the pool back to full redundancy.
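A minimal sketch of what such a bottom-up annotation pass can look like, assuming an invented tree layout (the field names and the exact redundancy rules here are illustrative, not our actual code):

```python
def annotate(vdev):
    """Attach 'healthy'/'degraded' annotations to every vdev, bottom-up."""
    kids = vdev.get("children", [])
    for child in kids:
        annotate(child)
    vtype = vdev["type"]
    if vtype == "disk":
        vdev["healthy"] = vdev["state"] == "ONLINE"
    elif vtype == "mirror":
        # a mirror works if at least one side is intact, and is
        # degraded if any side is not
        vdev["healthy"] = any(c["healthy"] for c in kids)
        vdev["degraded"] = not all(c["healthy"] for c in kids)
    elif vtype.startswith("raidz"):
        # raidz1/2/3 tolerate that many failed children
        parity = int(vtype[-1]) if vtype[-1].isdigit() else 1
        failed = sum(1 for c in kids if not c["healthy"])
        vdev["healthy"] = failed <= parity
        vdev["degraded"] = 0 < failed <= parity
    else:
        # root vdev or plain stripe: every child must be intact
        vdev["healthy"] = all(c["healthy"] for c in kids)
    return vdev
```

Each pass only needs what the passes below it have already computed, which is what makes doing this as progressive annotation (rather than one monolithic analysis) attractive.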
(ZFS puts some of this information in the pool and vdev status, but we've found that we disagree with ZFS's status for vdevs in some cases so we wind up mostly ignoring ZFS's claimed status for anything except actual disks.)
Figuring out the status determination logic took a certain amount of work, partly because I needed to slant the status of vdevs towards the needs of the spares code instead of just reporting it. This means, for example, that we want to report a pool where disks are resilvering as more or less healthy if (and only if) the pool will be redundant if all of the resilvers finish successfully. I think that the end result wound up fairly simple, but there were certainly some false steps along the way where I wrote code that reported correct but useless results.
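The slanting described above can be sketched as a simple classifier: a pool with resilvers in flight counts as (more or less) healthy on the optimistic assumption that they will finish, while only actually faulted disks create work for the spares code. The states and names here are invented for illustration:

```python
def pool_status(disks):
    """Classify a pool for the spares code.

    disks: list of dicts with 'state' one of ONLINE/RESILVERING/FAULTED.
    Returns (status, disks_needed)."""
    faulted = [d for d in disks if d["state"] == "FAULTED"]
    resilvering = [d for d in disks if d["state"] == "RESILVERING"]
    if faulted:
        # these are the disks the spares system must replace
        return ("needs-spares", len(faulted))
    if resilvering:
        # optimistic: assume all in-flight resilvers succeed
        return ("healthy-resilvering", 0)
    return ("healthy", 0)
```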
Because this code is walking the tree anyways, it also transforms the raw ZFS device names into more useful forms; in our environment, this includes translating the cumbersome and opaque MPxIO device names for iSCSI target devices into short form names that are our preferred notation for them. The code also allows us to artificially declare certain disks to be dead, which is useful both for testing and for taking disks out of service before they've actually faulted out.
(I call zpstatus's output 'pseudo-JSON' because I'm not sure if it's actually correctly formatted, specification-compliant JSON. For reasons that don't fit within the margins of this entry, it's not read by a JSON parsing library.)
Sidebar: on C versus Python here
I started out planning to write the entire ZFS pool status reporter in C. However, it rapidly became obvious that we couldn't just use ZFS's own vdev status information directly and that doing a good job required significant analysis of the raw status information. I briefly considered doing this analysis in the C code, but while it was possible it would involve a whole lot of code to do various things with the nvpair information that is the pool configuration. Nvlists and nvpairs are basically associative dictionaries with key/value entries, and Python already has that data structure as a primitive.
(Using Python also avoids a lot of low level bookkeeping and other noise that would be required in C code that deals a lot with strings and string manipulation.)
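To make the comparison concrete, here is the kind of manipulation that is a one-liner in Python but real work against nvlists in C (the keys are invented for illustration):

```python
# an nvlist is essentially nested key/value pairs, which maps directly
# onto nested Python dicts
nv = {
    "name": "tank",
    "vdev_children": 2,
    "vdev_tree": {"type": "root", "children": []},
}
# looking up, adding, and rewriting entries is trivial here; the C
# equivalent needs a series of nvlist lookup and add calls with
# per-type variants and error checking
nv["vdev_tree"]["scan_stats"] = {"state": "resilvering"}
```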