Wandering Thoughts archives

2011-07-20

Our ZFS spares handling system (part 3)

In part 1 I mentioned that our spares system pulls the list of disks to use as spares from files, and that how those files are maintained was beyond the scope of that entry. Well, time to talk about that.

From more or less the beginning of our ZFS fileserver system we've had an administrative system that captured a record of all pools on each physical server and a list of all disks visible to that server and how they were being used by ZFS. This system is relatively crude; it's shell scripts that run once a day and then scp their output to a central location (which is then replicated back to all fileservers). By combining the information from all of the local disk usage files, the spares file building system can get a global view of both what disks exist and how they're used.
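To give a concrete feel for the merge, here is a minimal Python sketch; the one-line-per-disk file format and the central directory are hypothetical stand-ins for illustration, not our real files or paths.

    # A sketch of merging the per-server disk usage files into one global
    # view.  The 'disk used-by' per-line format and the central directory
    # here are hypothetical stand-ins for our real files.
    import glob

    def load_disk_usage(pattern="/central/diskinfo/*.disks"):
        usage = {}
        for fname in glob.glob(pattern):
            server = fname.rsplit("/", 1)[-1].split(".")[0]
            for line in open(fname):
                line = line.strip()
                if not line or line.startswith("#"):
                    continue
                disk, used_by = line.split(None, 1)
                usage[disk] = {"server": server, "used_by": used_by}
        return usage

    def unused_disks(usage):
        # Disks that no ZFS pool on any server claims are spare candidates.
        return [d for d, info in usage.items() if info["used_by"] == "unused"]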

(At this point I will pause to note that all through our system we translate iSCSI disk names from the local Solaris cNt... names to symbolic names that have the iSCSI target and logical disk involved. This is a hugely important step and avoids so many potential problems.)
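As a sketch of the kind of translation involved (the mapping file here is a hypothetical stand-in; how the real mapping is derived isn't described in this entry):

    # Translate local Solaris cNt... device names to symbolic
    # 'target:logical-disk' names, assuming a hypothetical mapping file
    # of 'cNtXdY  target:logical-disk' pairs.
    def load_name_map(fname="/central/diskinfo/iscsi-names"):
        namemap = {}
        for line in open(fname):
            fields = line.split()
            if len(fields) == 2:
                solaris_name, symbolic = fields
                namemap[solaris_name] = symbolic
        return namemap

    def to_symbolic(disk, namemap):
        # Fall back to the raw Solaris name if there's no translation.
        return namemap.get(disk, disk)

    # e.g. to_symbolic("c4t12d0", namemap) might give "lincoln:disk03"
    # (both names here are made up for illustration)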

Although we have disk usage information for physical servers, the spares files are built for our virtual fileservers; each virtual fileserver has its own list of spares, even if two virtual fileservers happen to be using the same physical server at the moment. We do this because each of our iSCSI backends is typically dedicated to a single virtual fileserver and we want to keep things that way even when we have to activate spares. The overall spares handling environment goes to some pains to make this work.

The whole process of building the spares files for the virtual fileservers is controlled by a configuration file with directives. There are two important sorts of directives in the file:

fs8 use backend lincoln

This means that the virtual fileserver fs8 should use as spare disks any unused disks on the iSCSI backend called lincoln.

all exclude pool fs2-core-01

This means that all virtual fileservers should avoid (as spares) any logical disks that share a physical disk with a disk used by the ZFS pool fs2-core-01 (which happens to be the pool that hosts our /var/mail, which is quite sensitive to the increased IO load of a resilver).

(There are variants of these directives that allow us to be more specific about things, but in practice we don't need them.)
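To sketch how these two directives could drive the build: the helper mappings below (disk to backend, disk to physical disk, pool to disks) come from the global view, and every name in this code is illustrative rather than our actual program.

    # Turn 'use backend' and 'exclude pool' directives plus the global
    # disk usage view into per-fileserver spares lists.
    def build_spares(config_lines, usage, disk_backend, disk_physical,
                     pool_disks, fileservers):
        use = {}          # fileserver -> set of backends it may take spares from
        excluded = set()  # physical disks that nobody may use as spares
        for line in config_lines:
            fields = line.split()
            if len(fields) != 4 or fields[0].startswith("#"):
                continue
            who, verb, what, name = fields
            if verb == "use" and what == "backend":
                use.setdefault(who, set()).add(name)
            elif verb == "exclude" and what == "pool":
                # This sketch only handles the 'all exclude pool' form.
                for disk in pool_disks.get(name, ()):
                    excluded.add(disk_physical[disk])

        spares = dict((fs, []) for fs in fileservers)
        for disk, info in usage.items():
            if info["used_by"] != "unused":
                continue
            if disk_physical.get(disk) in excluded:
                continue
            for fs in fileservers:
                if disk_backend.get(disk) in use.get(fs, set()):
                    spares[fs].append(disk)
                    break   # each spare disk goes to at most one fileserver
        return spares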

The spares-files build process is run on a single machine from cron, normally once a day. This low frequency of automated rebuilds is generally perfectly fine because disk usage information changes only very slowly. If we're replacing a backend there is a series of steps we have to do by hand to get the spares files rebuilt promptly, but that's an exceptional circumstance.

In theory we could have put all of this initial spares selection logic straight into the spares handling program. In practice, I feel that there's a very strong reason to keep them separate (in addition to this making both aspects of the spares problem simpler). Since we want each potential spare disk to only ever be usable by a single virtual fileserver, overall spares selection is inherently a global process. Global processes should be done once and centrally, because this avoids any chance that two systems will ever do them separately and disagree over what the answer should be. If we only ever generate spare disk lists in one place, we have a strong assurance that only serious program bugs will ever cause two fileservers to think that they can use the same disk as a spare. If the fileservers did this themselves, there are all sorts of (de)synchronization issues that could cause such duplication.

(We can also post-process the output files to check that this constraint holds true.)
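A minimal sketch of such a check might look like this, assuming a hypothetical one-disk-per-line spares file format and path:

    # Verify that no disk appears in more than one fileserver's spares file.
    import glob, sys

    def check_spares_files(pattern="/central/spares/*.spares"):
        seen = {}    # disk -> first spares file it appeared in
        ok = True
        for fname in glob.glob(pattern):
            for line in open(fname):
                disk = line.strip()
                if not disk:
                    continue
                if disk in seen and seen[disk] != fname:
                    sys.stderr.write("%s is listed in both %s and %s\n" %
                                     (disk, seen[disk], fname))
                    ok = False
                else:
                    seen.setdefault(disk, fname)
        return ok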

ZFSOurSparesSystemIII written at 00:16:01

2011-07-04

Our ZFS spares handling system (part 2)

In part 1, I casually tossed off that our new spares handling system 'reads out the state of all of the pools [and] sees if any disks need replacement', as if it's a trivial thing to do. In fact, it's not; one of ZFS's problems is that it doesn't export this information in any way that's particularly useful for other programs. As a result, my quick summary papered over what is in many ways the most complex part of our spares system.

It turned out that the easiest way to do all of this was to split it into two portions. The first part is a C program (called zpstatus) that reads out the state of ZFS pools by making direct calls to libzfs to get the pool's configuration nvlist and then dumping the information in it. Walking an nvlist is the easy part, but ZFS encodes various pieces of important information in opaque ways that require calling more internal libzfs routines in order to decode them. I worked out what to do and what various structure elements meant by reading the OpenSolaris zpool and zdb source. This C program doesn't attempt to do any analysis of pool state; its job is purely to read out factual state information and report it (along with the pool configuration). It produces output in pseudo-JSON, because this was the easiest low-rent way to transport the data into the other portion.

(Since all of this relies on reverse-engineered internal interfaces, it's brittle; Solaris can and does change things around in patches, and the OpenSolaris code doesn't necessarily correspond to our Solaris release. I feel that such a program is still justified because the information is too valuable and too hard to obtain in any other way. I may have cause to regret this if Oracle doesn't release Solaris 11 source code and we ever upgrade to it.)

The analysis of pool state is done in Python code, because it's much easier to do tree walking, tree manipulation, and so on in a high level language than in C. The code starts with the raw information from zpstatus and progressively annotates it with additional information such as the redundancy state of vdevs, whether they are involved in resilvers, the state of any resilvering or replacing operation, and so on. At the end of the process it works out whole-pool information such as how healthy the pool is and how many disks are needed to bring the pool back to full redundancy.

(ZFS puts some of this information in the pool and vdev status, but we've found that we disagree with ZFS's status for vdevs in some cases so we wind up mostly ignoring ZFS's claimed status for anything except actual disks.)
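To make that annotation pass a bit more concrete, here is a simplified sketch with the pool configuration as nested dicts (roughly the shape of what zpstatus dumps); the field names and the redundancy rules are my simplifying assumptions, not our real logic.

    # Walk the vdev tree bottom-up, annotating each vdev with whether it
    # is healthy, whether it is fully redundant, and how many disks it
    # is missing.
    def annotate_vdev(vdev):
        children = vdev.get("children", [])
        for child in children:
            annotate_vdev(child)
        if not children:
            # A leaf is an actual disk: it's either there or it isn't.
            vdev["healthy"] = vdev.get("state") == "ONLINE"
            vdev["missing"] = 0 if vdev["healthy"] else 1
            return
        bad = sum(1 for c in children if not c["healthy"])
        vdev["missing"] = sum(c["missing"] for c in children)
        vtype = vdev.get("type")
        if vtype == "mirror":
            # A mirror survives with one good side but is only fully
            # redundant with no bad sides.
            vdev["healthy"] = bad < len(children)
            vdev["redundant"] = (bad == 0)
        elif vtype == "raidz":
            parity = vdev.get("nparity", 1)
            vdev["healthy"] = bad <= parity
            vdev["redundant"] = (bad == 0)
        else:
            # Anything else (including the top-level root vdev) gets the
            # conservative rule: any unhealthy child is trouble.
            vdev["healthy"] = (bad == 0)
            vdev["redundant"] = (bad == 0)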

Figuring out the status determination logic took a certain amount of work, partly because I needed to slant the status of vdevs towards the needs of the spares code instead of just reporting it. This means, for example, that we want to report a pool where disks are resilvering as more or less healthy if (and only if) the pool will be fully redundant once all of the resilvers finish successfully. I think the end result wound up fairly simple, but there were certainly some false steps along the way where I wrote code that reported correct but useless results.
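A small sketch of that slant: a pool counts as healthy enough for the spares code if it is up and every failed disk already has a replacement resilvering in. The 'healthy' and 'being_replaced' annotations here are the hypothetical ones from the tree walk above, not real zpool status fields.

    def unreplaced_failures(vdev):
        # Count dead disks that don't already have a replacement on the way in.
        children = vdev.get("children", [])
        if not children:
            dead = not vdev.get("healthy", True)
            covered = vdev.get("being_replaced", False)
            return 1 if (dead and not covered) else 0
        return sum(unreplaced_failures(c) for c in children)

    def pool_mostly_healthy(root_vdev):
        return bool(root_vdev.get("healthy")) and unreplaced_failures(root_vdev) == 0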

Because this code is walking the tree anyways, it also transforms the raw ZFS device names into more useful forms; in our environment, this includes translating the cumbersome and opaque MPxIO device names for iSCSI target devices into short form names that are our preferred notation for them. The code also allows us to artificially declare certain disks to be dead, which is useful both for testing and for taking disks out of service before they've actually faulted out.
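A hedged sketch of those two transformations, with a hypothetical translation table and force-dead set:

    # Rewrite MPxIO device paths into our short symbolic names and
    # artificially declare some disks dead.
    FORCED_DEAD = set()    # e.g. {"lincoln:disk07"} to retire a disk early

    def rename_and_fault(vdev, namemap):
        for child in vdev.get("children", []):
            rename_and_fault(child, namemap)
        path = vdev.get("path")
        if path:
            short = namemap.get(path, path)
            vdev["name"] = short
            if short in FORCED_DEAD:
                # Pretend the disk has faulted so the spares logic treats it
                # as needing replacement even though ZFS thinks it's fine.
                vdev["state"] = "FAULTED"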

(I call zpstatus's output 'pseudo-JSON' because I'm not sure if it's actually correctly formatted, specification compliant JSON. For reasons that don't fit within the margins of this entry, it's not read by a JSON parsing library.)

Sidebar: on C versus Python here

I started out planning to write the entire ZFS pool status reporter in C. However, it rapidly became obvious that we couldn't just use ZFS's own vdev status information directly and that doing a good job required significant analysis of the raw status information. I briefly considered doing this analysis in the C code, but while it was possible it would involve a whole lot of code to do various things with the nvpair information that is the pool configuration. Nvlists and nvpairs are basically associative dictionaries with key/value entries, and Python already has that data structure as a primitive.

(Using Python also avoids a lot of low level bookkeeping and other noise that would be required in C code that deals a lot with strings and string manipulation.)
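For illustration, here is roughly the shape that a pool configuration takes once it's Python data (heavily simplified and with made-up values): the same nested key/value structure as the nvlist, but as plain dicts and lists.

    pool_config = {
        "name": "fs2-core-01",
        "vdev_tree": {
            "type": "root",
            "children": [
                {"type": "mirror",
                 "children": [
                     {"type": "disk", "path": "c4t12d0", "state": "ONLINE"},
                     {"type": "disk", "path": "c5t12d0", "state": "ONLINE"},
                 ]},
            ],
        },
    }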

ZFSOurSparesSystemII written at 02:38:25
