Our ZFS spares handling system (part 1)

June 29, 2011

I've mentioned before that we replaced ZFS's normal spares handling with our own system, but I've never actually described that system. For background, see my earlier entry on our issues with ZFS's normal spares handling and the challenges we have with spares in our ZFS fileserver environment.

There are two ways to implement a spares system. Let us call them event driven and state driven. In an event driven spares system you activate spares in response to fault events, and one of your challenges is to somehow notice them; in Solaris, you could do this by writing a fmd module or by watching syslog for appropriate messages. In a state driven spares system, you activate spares when the pool state is damaged, and one of your challenges is to decode pool state and decide when it's damaged. The normal ZFS spares system is event driven (cf). We opted to make our new spares system a state driven system, because we felt that decoding pool state was both simpler (even with Solaris's roadblocks) and more reliable than trying to catch events.
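To make the distinction concrete, here is a minimal sketch of the state driven approach, in Python. Everything here is invented for illustration (the state format, the field names, the set of bad states); the point is only that you decode pool state into something you can inspect and then look for damage, instead of waiting for fault events.

```python
# Hypothetical sketch of state driven spare handling: periodically decode
# pool state (e.g. parsed out of 'zpool status' output) and look for
# disks whose state says they need replacing. Not the real code.

BAD_STATES = {"FAULTED", "UNAVAIL", "REMOVED", "DEGRADED"}

def damaged_disks(pool):
    """Return the names of disks in a pool that need replacement.

    'pool' is assumed to be a dict like
    {'name': ..., 'vdevs': [{'disks': [{'name': ..., 'state': ...}]}]}.
    """
    bad = []
    for vdev in pool["vdevs"]:
        for disk in vdev["disks"]:
            if disk["state"] in BAD_STATES:
                bad.append(disk["name"])
    return bad

pool = {
    "name": "tank",
    "vdevs": [
        {"disks": [{"name": "c1t0d0", "state": "ONLINE"},
                   {"name": "c2t0d0", "state": "FAULTED"}]},
    ],
}
print(damaged_disks(pool))  # -> ['c2t0d0']
```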

The main spares handling program (which is called sanspares) is given a list of pools to check and a list of potential spare disks that can be used as spares if they're needed. It reads out the state of all of the pools and checks whether any disks need replacement; if some do and there are not too many resilvers already running, it picks the highest priority disk to replace and the best available potential spare disk, and issues the zpool commands to do the replacement.

(Normally we only do one resilver at a time, but sanspares can be told to do several in parallel if we want to. This trades off worse fileserver performance for a faster return to full redundancy; we typically do this if we have a significant iSCSI backend failure.)
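The overall flow of a pass might look something like the following sketch. Every name and data structure here is an assumption made up for illustration (the real sanspares is not shown anywhere in this entry); it just demonstrates the 'one replacement per pass, bounded by a resilver limit' shape described above.

```python
# Hypothetical top-level flow of a sanspares-like pass; all names and
# structures are invented for illustration, not the real program.

def run_spares_pass(pools, spares, max_resilvers=1):
    """Replace at most one disk per pass, respecting the resilver limit.

    'pools' is a list of dicts like {'name': ..., 'resilvering': bool,
    'damaged': [disk names], 'priority': sort key (lower = more urgent)};
    'spares' is a list of unused disk names. Returns the chosen
    (pool name, disk, spare) tuple, or None if nothing was done.
    """
    active = sum(1 for p in pools if p["resilvering"])
    if active >= max_resilvers:
        return None                 # already resilvering as much as we allow
    needy = [p for p in pools if p["damaged"]]
    if not needy or not spares:
        return None                 # nothing to do, or nothing to do it with
    pool = min(needy, key=lambda p: p["priority"])
    disk = pool["damaged"][0]
    spare = spares.pop(0)           # stand-in for the topology heuristics
    # the real program would now issue: zpool replace <pool> <disk> <spare>
    return (pool["name"], disk, spare)

pools = [{"name": "tank", "resilvering": False,
          "damaged": ["c2t0d0"], "priority": (1, 100)}]
print(run_spares_pass(pools, ["c3t0d0"]))  # -> ('tank', 'c2t0d0', 'c3t0d0')
```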

Roughly speaking, sanspares' goal is to get back as much pool level redundancy as fast as possible; it prioritizes resilvering based on how many disks are needed to make the pool fully redundant, and then on how much data needs to be resilvered (lower is better in each case). Because it's specific to our environment, it's aware of our SAN topology and has a whole collection of heuristics to pick what we consider the topologically best potential spare.
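The prioritization itself reduces to a simple compound sort key, which could be sketched as follows (the field names are assumptions; only the ordering rule comes from the description above):

```python
# A sketch of the resilver prioritization described above: fewer disks
# needed to restore full redundancy wins first, then less data to
# resilver. Field names are invented for illustration.

def resilver_priority(pool):
    """Priority key for choosing which pool to resilver; lower sorts first."""
    return (pool["disks_needed"], pool["data_to_resilver"])

pools = [
    {"name": "a", "disks_needed": 2, "data_to_resilver": 50},
    {"name": "b", "disks_needed": 1, "data_to_resilver": 900},
    {"name": "c", "disks_needed": 1, "data_to_resilver": 100},
]
print(min(pools, key=resilver_priority)["name"])  # -> 'c'
```

Note that under this rule a pool missing two disks ("a" above) still loses to a pool missing one, even though it is arguably in worse shape; the idea is that the one-disk pool can be made fully redundant sooner.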

Potential spare disks are not configured as ZFS-level spares before they are actually activated; instead, we keep track of them in files. To avoid two machines trying to simultaneously use the same disk as a spare, each disk can be a potential spare for only a single (virtual) fileserver (on a single machine, both sanspares and ZFS itself will make sure that a potential spare disk is only ever used by one pool). Because each virtual fileserver has a pair of iSCSI backends that no other fileserver uses (to simplify slightly), we have simply decided that any unused disk on the fileserver's pair of backends is a potential spare for that fileserver.

(The files of potential spares are automatically generated by a separate program in a way that is sufficiently complex to not fit within the margins of this entry.)

Sanspares has a number of additional options, of which two are particularly notable. First, it can be told to always consider certain disks to be bad; we use this if for some reason we no longer trust a specific backend and want to migrate all disk storage away from it. Second, it can be run in a mode where it will add mirrors to any non-redundant vdev instead of merely replacing damaged disks; this means that we don't have to keep disks from a problematic backend attached to pools in order to have them replaced.

Sanspares doesn't run all the time. Instead, we run it periodically (currently every fifteen minutes) as part of our 'frequent things' processing. Given how long resilvers typically take in our environment, we feel that this potential delay on starting a resilver is extremely unlikely to make a difference in the outcome of a failure.

Sidebar: more or less the exact SAN topology heuristics

For illustrative purposes, the current list of topology heuristics for picking a spare disk is more or less:

  1. pick a spare from the same iSCSI backend as the failed disk, if possible; this is not applicable if we're adding redundancy instead of replacing a failed disk.
  2. pick a 'symmetric' spare on a different backend from any existing disk in the vdev; a symmetric spare is one with the same physical disk number and LUN.
  3. pick a spare on a different backend from any existing disk in the pool (or the vdev if there is no disk that doesn't conflict with the pool as a whole).
  4. failing even that, pick any spare.

('disk' here is disks as seen by the Solaris fileservers, not the physical disks on the iSCSI backends.)

You might think that the second heuristic could never come up, but in fact it's what happens when we are replacing an entire backend.
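The cascade above could be sketched roughly like this. The disk representation (backend, physical slot number, LUN) and all the names are invented for illustration; only the order of the heuristics comes from the list above.

```python
# Hypothetical sketch of the spare-picking cascade; everything here is
# invented for illustration, not the real sanspares heuristics code.

def pick_spare(spares, failed, vdev_disks, pool_disks,
               adding_redundancy=False):
    """Apply the four heuristics in order; return the first spare that fits.

    Disks are dicts like {'backend': ..., 'slot': ..., 'lun': ...}.
    'failed' is the disk being replaced (None when adding redundancy).
    """
    # 1. same iSCSI backend as the failed disk (only when replacing)
    if not adding_redundancy and failed is not None:
        for s in spares:
            if s["backend"] == failed["backend"]:
                return s
    vdev_backends = {d["backend"] for d in vdev_disks}
    # 2. 'symmetric' spare: same physical disk number and LUN as the
    #    failed disk, on a backend not already used by the vdev
    if failed is not None:
        for s in spares:
            if (s["backend"] not in vdev_backends
                    and s["slot"] == failed["slot"]
                    and s["lun"] == failed["lun"]):
                return s
    # 3. a spare on a backend unused by the pool; failing that, one on a
    #    backend unused by the vdev
    pool_backends = {d["backend"] for d in pool_disks}
    for backends in (pool_backends, vdev_backends):
        for s in spares:
            if s["backend"] not in backends:
                return s
    # 4. failing even that, any spare at all
    return spares[0] if spares else None
```

As a usage sketch: with a failed disk on backend B1 and a free disk on B1, heuristic 1 picks the B1 disk; if the only free disks are on B2, heuristic 2 prefers the one in the same slot and LUN as the failed disk.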

PS: in the process of writing this entry I've determined that I need to add more comments to this code. So it goes.
