Our ZFS spares handling system (part 2)

July 4, 2011

In part 1, I casually tossed off that our new spares handling system 'reads out the state of all of the pools [and] sees if any disks need replacement', as if this were a trivial thing to do. In fact it's not; one of ZFS's problems is that it doesn't export this information in any form that's particularly useful for other programs. As a result, my quick summary papered over what is in many ways the most complex part of our spares system.

It turned out that the easiest way to do all of this was to split it into two portions. The first part is a C program (called zpstatus) that reads out the state of ZFS pools by making direct calls to libzfs to get each pool's configuration nvlist and then dumping the information in it. Walking an nvlist is the easy part, but ZFS encodes various pieces of important information in opaque ways that require calling more internal libzfs routines to decode. I worked out what to do and what various structure elements meant by reading the OpenSolaris zpool and zdb source. This C program doesn't attempt any analysis of pool state; its job is purely to read out factual state information and report it (along with the pool configuration). It produces output in pseudo-JSON, because that was the easiest low-rent way to get the data to the other portion.
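
To give a concrete idea of what gets handed over, here is a hypothetical sketch of the sort of tree that zpstatus might emit for a simple mirrored pool (the field names and layout here are my illustration for this entry, not the actual output format):

    {"pool": "tank", "state": "DEGRADED",
     "vdev_tree":
      {"type": "root",
       "children": [
        {"type": "mirror", "state": "DEGRADED",
         "children": [
          {"type": "disk", "state": "ONLINE",
           "path": "/dev/dsk/c4t60A98000433469764E4A2D456D563456d0s0"},
          {"type": "disk", "state": "FAULTED",
           "path": "/dev/dsk/c4t60A98000433469764E4A2D464A385442d0s0"}]}]}}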

(Since all of this relies on reverse-engineered internal interfaces, it's brittle; Solaris can and does change things around in patches, and the OpenSolaris code doesn't necessarily correspond to our Solaris release. I feel that such a program is still justified because the information is too valuable and too hard to obtain in any other way. I may have cause to regret this if Oracle doesn't release Solaris 11 source code and we ever upgrade to it.)

The analysis of pool state is done in Python code, because it's much easier to do tree walking, tree manipulation, and so on in a high-level language than in C. The code starts with the raw information from zpstatus and progressively annotates it with additional information, such as the redundancy state of vdevs, whether they are involved in resilvers, the state of any resilvering or replacing operation, and so on. At the end of the process it works out whole-pool information, such as how healthy the pool is and how many disks are needed to bring it back to full redundancy.
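
As an illustration of the general approach (and only that; this is not our actual code), a simplified bottom-up annotation pass over the sort of tree sketched above might look like this in Python:

    # Illustrative only: real pool configs encode raidz parity in a
    # separate field and have many more states than this assumes.
    def annotate(vdev):
        kids = vdev.get("children", [])
        for k in kids:
            annotate(k)
        if vdev["type"] == "disk":
            # A leaf disk is either usable or it isn't.
            vdev["failed"] = 0 if vdev["state"] == "ONLINE" else 1
            vdev["can_lose"] = 0
        else:
            vdev["failed"] = sum(k["failed"] for k in kids)
            if vdev["type"] == "mirror":
                vdev["can_lose"] = len(kids) - 1
            elif vdev["type"].startswith("raidz"):
                # 'raidz' with no trailing digit is single-parity.
                p = vdev["type"][-1]
                vdev["can_lose"] = int(p) if p.isdigit() else 1
            else:
                vdev["can_lose"] = 0

    def pool_summary(root):
        # The pool is functional if every top-level vdev can absorb
        # its failures; the total failures are the disks to replace.
        annotate(root)
        tops = root.get("children", [])
        alive = all(v["failed"] <= v["can_lose"] for v in tops)
        return alive, sum(v["failed"] for v in tops)

With the sample tree above, this reports the pool as alive but one disk short, which is exactly the sort of answer the spares system needs.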

(ZFS puts some of this information in the pool and vdev status, but we've found that we disagree with ZFS's status for vdevs in some cases, so we wind up mostly ignoring ZFS's claimed status for anything except actual disks.)

Figuring out the status determination logic took a certain amount of work, partly because I needed to slant the status of vdevs towards the needs of the spares code instead of just reporting ZFS's view. This means, for example, that we want to report a pool with resilvering disks as more or less healthy if (and only if) the pool will be fully redundant once all of the resilvers finish successfully. I think the end result wound up fairly simple, but there were certainly some false steps along the way where I wrote code that reported correct but useless results.
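
In the terms of the earlier sketch, this rule can be expressed as a small hypothetical addition: a disk that is actively resilvering counts as good when asking 'will this pool be redundant when the dust settles', even though it's not fully there right now. The 'replacing' interior vdev matches how ZFS actually represents an in-progress disk replacement, but the rest of the details are my invention:

    # Hypothetical extension of the sketch above: judge health on the
    # optimistic assumption that in-flight resilvers will complete.
    def future_good(vdev):
        kids = vdev.get("children", [])
        if vdev["type"] == "disk":
            # A resilvering disk counts as if it were already good.
            return vdev["state"] == "ONLINE" or vdev.get("resilvering", False)
        if vdev["type"] == "replacing":
            # A replacement works out if either side will wind up good.
            return any(future_good(k) for k in kids)
        return all(future_good(k) for k in kids)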

Because this code is walking the tree anyway, it also transforms the raw ZFS device names into more useful forms; in our environment, this includes translating the cumbersome and opaque MPxIO device names for iSCSI target devices into the short-form names that are our preferred notation for them. The code also allows us to artificially declare certain disks dead, which is useful both for testing and for taking disks out of service before they've actually faulted out.
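
For illustration, a name translation and 'declared dead' pass over the same hypothetical tree might look something like this; both the short name scheme and the table contents are invented for this example:

    # Purely illustrative: map long MPxIO /dev/dsk names to short local
    # names via a table, and force 'dead' disks to look faulted.
    SHORTNAMES = {"c4t60A98000433469764E4A2D456D563456d0": "iscsi-5:3"}
    DECLARED_DEAD = {"iscsi-2:7"}

    def rename_and_kill(vdev):
        for k in vdev.get("children", []):
            rename_and_kill(k)
        if vdev["type"] != "disk":
            return
        dev = vdev.get("path", "").split("/")[-1]
        if dev.endswith("s0"):
            dev = dev[:-2]              # drop the slice suffix
        vdev["name"] = SHORTNAMES.get(dev, dev)
        if vdev["name"] in DECLARED_DEAD:
            vdev["state"] = "FAULTED"   # artificially out of service

Running a pass like this before the status analysis means that everything downstream, including reports to humans, talks in terms of the short names.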

(I call zpstatus's output 'pseudo-JSON' because I'm not sure if it's actually correctly formatted, specification compliant JSON. For reasons that don't fit within the margins of this entry, it's not read by a JSON parsing library.)

Sidebar: on C versus Python here

I started out planning to write the entire ZFS pool status reporter in C. However, it rapidly became obvious that we couldn't just use ZFS's own vdev status information directly and that doing a good job required significant analysis of the raw status information. I briefly considered doing this analysis in the C code, but while it was possible, it would have involved a whole lot of code to do various things with the nvpair information that makes up the pool configuration. Nvlists and nvpairs are basically associative dictionaries with key/value entries, and Python already has that data structure as a primitive.
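
To make the contrast concrete: pulling even a single field out of the configuration in C means an nvlist lookup call plus an error check for each level of the tree, while in Python the same configuration is just nested dicts and lists:

    # In C this would be nvlist_lookup_nvlist() for the vdev tree and
    # then nvlist_lookup_nvlist_array() for its children, each with
    # its own error check; in Python it's a single expression.
    top_vdevs = config["vdev_tree"]["children"]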

(Using Python also avoids a lot of low level bookkeeping and other noise that would be required in C code that deals a lot with strings and string manipulation.)
