Wandering Thoughts archives

2011-06-29

Our ZFS spares handling system (part 1)

I've mentioned before that we replaced ZFS's normal spares handling with our own system, but I've never actually described that system. For background, see my earlier entry on our issues with ZFS's normal spares handling and the challenges we have with spares in our ZFS fileserver environment.

There are two ways to implement a spares system. Let us call them event driven and state driven. In an event driven spares system you activate spares in response to fault events, and one of your challenges is to somehow notice them; in Solaris, you could do this by writing a fmd module or by watching syslog for appropriate messages. In a state driven spares system, you activate spares when the pool state is damaged, and one of your challenges is to decode pool state and decide when it's damaged. The normal ZFS spares system is event driven (cf). We opted to make our new spares system a state driven system, because we felt that decoding pool state was both simpler (even with Solaris's roadblocks) and more reliable than trying to catch events.
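
As a small illustration of the state driven approach (a hedged sketch, not our actual code): a periodic check can simply ask zpool for each pool's health and treat anything that isn't ONLINE as damaged. The pool names below are made up.

    # Minimal sketch of a state driven check: decode pool health via
    # 'zpool list' and report the pools that are not ONLINE.
    import subprocess

    def damaged_pools(pools):
        out = subprocess.check_output(
            ["zpool", "list", "-H", "-o", "name,health"],
            universal_newlines=True)
        health = dict(line.split("\t") for line in out.splitlines() if line)
        return [p for p in pools if health.get(p, "ONLINE") != "ONLINE"]

    if __name__ == "__main__":
        print(damaged_pools(["fspool-1", "fspool-2"]))  # hypothetical pool names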

The main spares handling program (which is called sanspares) is given a list of pools to check and a list of potential spare disks that can be used if they're needed. It reads out the state of all of the pools and checks whether any disks need replacement; if some do and there are not too many resilvers already running, it picks the highest priority disk to replace and the best available potential spare disk, and issues the zpool commands to do the replacement.

(Normally we only do one resilver at a time, but sanspares can be told to do several in parallel if we want to. This trades off worse fileserver performance for a faster return to full redundancy; we typically do this if we have a significant iSCSI backend failure.)
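
To make the shape of a single pass concrete, here's a toy, self-contained sketch; the data structures and names are hypothetical stand-ins for what sanspares actually does, and the real work ends in a 'zpool replace'.

    # Toy sketch of one pass: respect the resilver limit, take the highest
    # priority damaged disk and the best candidate spare, and build the
    # zpool command that would start the resilver.
    def one_pass(damaged, spares, active_resilvers, max_resilvers=1):
        if not damaged or not spares:
            return None
        if active_resilvers >= max_resilvers:
            return None
        disk = damaged[0]     # assumed already sorted by priority
        spare = spares[0]     # assumed already sorted by topology
        return ["zpool", "replace", disk["pool"], disk["device"], spare]

    # Hypothetical device names; ours are long iSCSI-based names.
    damaged = [{"pool": "fspool-1", "device": "c2t1d0"}]
    print(one_pass(damaged, ["c3t1d0"], active_resilvers=0))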

Roughly speaking, sanspares' goal is to get back as much pool-level redundancy as quickly as possible; it prioritizes resilvering based on how many disks are needed to make the pool fully redundant, and then on how much data needs to be resilvered (lower is better in each case). Because it's specific to our environment, it's aware of our SAN topology and has a whole collection of heuristics to pick what we consider the topologically best potential spare.
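
Expressed as code (and leaving out everything topological), the core priority ordering is roughly a two-part sort key; the numbers below are invented.

    # Lower is better for both parts of the key: first how many disks the
    # pool still needs to be fully redundant, then how much data the
    # resilver would have to copy.
    candidates = [
        {"pool": "fspool-1", "device": "c2t1d0", "disks_short": 1, "data_gb": 400},
        {"pool": "fspool-2", "device": "c2t5d0", "disks_short": 2, "data_gb": 150},
        {"pool": "fspool-3", "device": "c4t2d0", "disks_short": 1, "data_gb": 90},
    ]
    candidates.sort(key=lambda d: (d["disks_short"], d["data_gb"]))
    print(candidates[0])   # fspool-3: one disk short and the least data to copy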

Potential spare disks are not configured as ZFS-level spares before they are actually activated; instead, we keep track of them in files. To avoid two machines trying to simultaneously use the same disk as a spare, each potential spare disk can only be a potential spare for a single (virtual) fileserver (on a single machine, both sanspares and ZFS itself will make sure that a potential spare disk is only ever used by one pool). Because each virtual fileserver has a pair of iSCSI backends that no other fileserver uses (to simplify slightly), we have simply decided that any unused disk on the fileserver's pair of backends is a potential spare for that fileserver.

(The files of potential spares are automatically generated by a separate program in a way that is sufficiently complex to not fit within the margins of this entry.)
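
(Purely as an illustration, since the actual format of those files isn't described here: a loader for the simplest possible potential-spares file might look like this, with a guessed-at one-device-per-line layout and a made-up path.)

    # Hypothetical loader for a per-fileserver file of potential spares;
    # the real file format and location are not what is shown here.
    def load_potential_spares(path):
        spares = []
        with open(path) as f:
            for line in f:
                line = line.split("#", 1)[0].strip()   # allow comments
                if line:
                    spares.append(line)
        return spares

    # e.g. load_potential_spares("/local/spares/fileserver-1")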

Sanspares has a number of additional options, of which two are particularly notable. First, it can be told to always consider certain disks to be bad; we use this if for some reason we no longer trust a specific backend and want to migrate all disk storage away from it. Second, it can be run in a mode where it will add mirrors to any non-redundant vdev instead of merely replacing damaged disks; this means that we don't have to keep disks from a problematic backend attached to pools in order to have them replaced.
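
As a hedged sketch of what the mirror-adding mode boils down to, the key operation is 'zpool attach', which adds a new device as a mirror of an existing one; the forced-bad list is conceptually just a set of devices to refuse. All of the names below are made up.

    import subprocess

    # Disks we've been told to always treat as bad, e.g. everything on a
    # distrusted backend (hypothetical device names).
    FORCED_BAD = {"c2t0d0", "c2t1d0"}

    def add_mirror(pool, existing_dev, spare_dev, dry_run=True):
        # 'zpool attach' adds spare_dev as a mirror of existing_dev,
        # turning a single-disk vdev back into a mirror.
        cmd = ["zpool", "attach", pool, existing_dev, spare_dev]
        if dry_run:
            print("would run:", " ".join(cmd))
        else:
            subprocess.check_call(cmd)

    add_mirror("fspool-1", "c3t1d0", "c4t1d0")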

Sanspares doesn't run all the time. Instead, we run it periodically (currently every fifteen minutes) as part of our 'frequent things' processing. Given how long resilvers typically take in our environment, we feel that this potential delay in starting a resilver is extremely unlikely to make a difference in the outcome of a failure.

Sidebar: more or less the exact SAN topology heuristics

For illustrative purposes, the current list of topology heuristics for picking a spare disk is more or less:

  1. pick a spare from the same iSCSI backend as the failed disk, if possible; this is not applicable if we're adding redundancy instead of replacing a failed disk.
  2. pick a 'symmetric' spare on a different backend from any existing disk in the vdev; a symmetric spare is one with the same physical disk number and LUN.
  3. pick a spare on a different backend from any existing disk in the pool (or, if no spare manages that, on a different backend from any existing disk in the vdev).
  4. failing even that, pick any spare.

('disk' here means the disks as seen by the Solaris fileservers, not the physical disks on the iSCSI backends.)

You might think that the second heuristic could never come up, but in fact it's what happens when we are replacing an entire backend.
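
As a toy illustration of this ordering (and nothing more): here each disk is reduced to a (backend, disk number, LUN) tuple, which leaves out almost everything the real code knows about our SAN topology; 'symmetric' is taken to mean matching the failed disk's disk number and LUN.

    def pick_spare(failed, vdev_disks, pool_disks, spares, adding_redundancy=False):
        """Toy spare picker; disks are (backend, disk_number, lun) tuples."""
        def backend(d): return d[0]
        def slot(d): return d[1:]

        # 1: a spare from the same backend as the failed disk (replacement only).
        if failed is not None and not adding_redundancy:
            for s in spares:
                if backend(s) == backend(failed):
                    return s
        # 2: a 'symmetric' spare on a backend the vdev doesn't already use.
        vdev_backends = {backend(d) for d in vdev_disks}
        if failed is not None:
            for s in spares:
                if backend(s) not in vdev_backends and slot(s) == slot(failed):
                    return s
        # 3: a spare on a backend the pool doesn't use, falling back to
        #    a backend the vdev doesn't use.
        pool_backends = {backend(d) for d in pool_disks}
        for used in (pool_backends, vdev_backends):
            for s in spares:
                if backend(s) not in used:
                    return s
        # 4: failing even that, anything at all.
        return spares[0] if spares else None

    # Hypothetical example: the failed disk's backend has no free disks,
    # so the symmetric spare on a third backend wins.
    failed = ("backend1", 3, 0)
    vdev = [("backend2", 3, 0)]
    print(pick_spare(failed, vdev, vdev, [("backend3", 7, 1), ("backend3", 3, 0)]))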

PS: in the process of writing this entry I've determined that I need to add more comments to this code. So it goes.

ZFSOurSparesSystemI written at 01:56:39

2011-06-23

The ZFS opacity problem and its effect on manageability

I've alluded to this before in passing, but one of my great frustrations with ZFS is that it has basically no public interface or API for getting information about its state; it is an almost completely opaque box. Do you want to know configuration information about your pools? Do you want to know state information about how healthy or damaged they are? Tough. You can't have it, or rather your programs and systems can't have it. Not with a public, reliable interface at any rate. There is no library that you can call, no program that dumps out comprehensive information in a parseable and reliable format. In short, ZFS is just not very observable.

Oh, sure, Solaris sort of makes some of the information you want available in the form of zpool status. But zpool is a frontend, not a tool; its output is intended for people to read, and so the information is incomplete and Solaris developers feel free to change the output around to make it look better. (And it sometimes helpfully lies to you.)

This opacity hurts. It hurts monitoring systems, which have no good reliable way of watching ZFS pool status so that they can do things like email you if a pool degrades. It hurts tools that want to live on top of ZFS in complex environments (such as SANs) in order to do things like check that your disk usage and layout constraints are being respected. It hurts attempts to add more sophisticated (and site-local) handling of things like spares replacement. It even hurts things like site inventories, where you want to make sure that you have a complete and accurate record of the filesystem setup on every server.

Solaris itself cannot possibly provide everything that everyone needs for ZFS management, and I wish that ZFS would stop trying to pretend otherwise.

Sidebar: extracting information from ZFS

If your systems need this information anyways, the current state of ZFS gives you two equally unappetizing choices. First, you can parse the output of zpool status and other ZFS commands, hoping that you can get what you need and can make the resulting lash-up work reliably. Second, you can use undocumented interfaces to directly get the information, at the cost of dealing with changes in them. (This was a lot easier in the days when OpenSolaris source code was being updated.)

We've done both. I'm not happy with either.
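
For what it's worth, here is a minimal example of the first choice: scraping the overall pool state out of zpool status output. It's exactly the kind of lash-up that depends on the current human-oriented format staying put.

    import subprocess

    def pool_state(pool):
        # Scrape the 'state:' line out of 'zpool status' output.
        # This breaks the moment the output format changes.
        out = subprocess.check_output(["zpool", "status", pool],
                                      universal_newlines=True)
        for line in out.splitlines():
            line = line.strip()
            if line.startswith("state:"):
                return line.split(":", 1)[1].strip()   # e.g. ONLINE, DEGRADED
        return None

    # e.g. pool_state("fspool-1")   # hypothetical pool name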

ZFSOpacityProblem written at 01:01:30
