Our ZFS spares handling system (part 1)
I've mentioned before that we replaced ZFS's normal spares handling with our own system, but I've never actually described that system. For background, see my earlier entry on our issues with ZFS's normal spares handling and the challenges we have with spares in our ZFS fileserver environment.
There are two ways to implement a spares system. Let us call them event driven and state driven. In an event driven spares system you activate spares in response to fault events, and one of your challenges is to somehow notice them; in Solaris, you could do this by writing an fmd module or by watching syslog for appropriate messages. In a state driven spares system, you activate spares when the pool state is damaged, and one of your challenges is to decode pool state and decide when it's damaged. The normal ZFS spares system is event driven. We opted to make our new spares system a state driven system, because we felt that decoding pool state was both simpler (even with Solaris's roadblocks) and more reliable than trying to catch events.
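As a rough illustration of the state driven idea (and not how sanspares actually decodes pool state), a periodic check can ask for each pool's health and treat anything other than ONLINE as needing a closer look. The sketch below parses text in the tab-separated format printed by 'zpool list -H -o name,health'. The function name and the sample pool names are invented for this example, and a real system needs per-disk state, not just overall pool health.

```python
def damaged_pools(zpool_list_output):
    """Return {pool: health} for every pool whose health is not ONLINE.

    Expects text in the format of 'zpool list -H -o name,health':
    one pool per line, with the two fields separated by a tab.
    """
    damaged = {}
    for line in zpool_list_output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, health = line.split("\t")
        if health != "ONLINE":
            damaged[name] = health
    return damaged


# Hypothetical sample output; these pool names are made up.
sample = "fs1-pool\tONLINE\nfs1-scratch\tDEGRADED\n"
print(damaged_pools(sample))
```

Nothing here needs to notice a fault at the moment it happens. The check just looks at whatever state the pools are in when it runs.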
The main spares handling program (which is called sanspares) is given
a list of pools to check and a list of potential spare disks that can be
used as spares if they're needed. It reads out the state of all of the
pools, sees if any disks need replacement, and if there are disks that
need replacement and there are not too many resilvers already running,
it picks the highest priority disk to replace and the best available
potential spare disk and issues the zpool commands to do it.
(Normally we only do one resilver at a time, but sanspares can
be told to do several in parallel if we want to. This trades off
worse fileserver performance for a faster return to full redundancy;
we typically do this if we have a significant iSCSI backend failure.)
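The overall shape of one such pass can be sketched as below. This is a simplified illustration under my own assumptions, not the real sanspares code: the Replacement record, plan_pass(), and zpool_replace_argv() are hypothetical names, the priority is left as an opaque sortable value, and taking the first free spare stands in for the real 'best available spare' selection. The only real command assumed is 'zpool replace POOL OLD-DISK NEW-DISK'.

```python
from dataclasses import dataclass


@dataclass
class Replacement:
    pool: str
    bad_disk: str
    priority: tuple  # lower values sort first, ie are more urgent


def plan_pass(needed, spares, running_resilvers, max_resilvers=1):
    """Decide which (replacement, spare) pairs to start in this pass."""
    # How many new resilvers we may start without exceeding the limit.
    budget = max(0, max_resilvers - running_resilvers)
    free = list(spares)
    plan = []
    for rep in sorted(needed, key=lambda r: r.priority):
        if len(plan) >= budget or not free:
            break
        # Placeholder: the real choice uses topology heuristics.
        plan.append((rep, free.pop(0)))
    return plan


def zpool_replace_argv(rep, spare):
    """Build the argument list for replacing the bad disk with the spare."""
    return ["zpool", "replace", rep.pool, rep.bad_disk, spare]


needed = [Replacement("poolA", "disk7", (2, 0)),
          Replacement("poolB", "disk3", (1, 0))]
for rep, spare in plan_pass(needed, ["spare1", "spare2"], running_resilvers=0):
    print(" ".join(zpool_replace_argv(rep, spare)))
```

Raising max_resilvers is the 'several in parallel' case; with the default of one, a pass that finds a resilver already running starts nothing.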
sanspares' goal is to get back as much pool level
redundancy as fast as possible; it prioritizes resilvering based on how
many disks are needed to make the pool fully redundant, and then on
how much data needs to be resilvered (lower is better in each case).
Because it's specific to our environment, it's aware of our SAN topology
and has a whole collection of heuristics to pick what we consider the
topologically best potential spare.
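The two-level priority described above maps naturally onto a tuple sort key. The following is a minimal sketch with made-up field names and numbers (disks_needed and resilver_bytes are my labels, not anything from the real program); it only shows the ordering rule, where fewer disks needed wins and ties are broken by less data to resilver.

```python
def resilver_priority(candidate):
    """Sort key: fewer disks needed first, then less data to resilver."""
    return (candidate["disks_needed"], candidate["resilver_bytes"])


# Invented example data.
candidates = [
    {"pool": "A", "disk": "d1", "disks_needed": 2, "resilver_bytes": 100},
    {"pool": "B", "disk": "d2", "disks_needed": 1, "resilver_bytes": 500},
    {"pool": "C", "disk": "d3", "disks_needed": 1, "resilver_bytes": 200},
]

ordered = sorted(candidates, key=resilver_priority)
print([c["pool"] for c in ordered])
```

Pool A comes last here even though it has the least data to resilver, because the number of disks needed is compared first.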
Potential spare disks are not configured as ZFS-level spares before they
are actually activated; instead, we keep track of them in files. To
avoid two machines trying to simultaneously use the same disk as a
spare, each potential spare disk can only be a potential spare disk for
a single (virtual) fileserver (on a single machine, both sanspares
and ZFS itself will make sure that a potential spare disk is only ever
used by one pool). Because each virtual fileserver has a pair of iSCSI
backends that no other fileserver uses (to
simplify slightly), we have simply decided that any unused disk on the
fileserver's pair of backends is a potential spare for that fileserver.
(The files of potential spares are automatically generated by a separate program in a way that is sufficiently complex to not fit within the margins of this entry.)
Sanspares has a number of additional options, of which two are particularly notable. First, it can be told to always consider certain disks to be bad; we use this if for some reason we no longer trust a specific backend and want to migrate all disk storage away from it. Next, it can be run in a mode where it will add mirrors to any non-redundant vdev instead of merely replacing damaged disks; this means that we don't have to keep disks from a problematic backend attached to pools in order to have them replaced.
Sanspares doesn't run all the time. Instead, we run it periodically (currently every fifteen minutes) as part of our 'frequent things' processing. Given how long resilvers typically take in our environment, we feel that this potential delay on starting a resilver is extremely unlikely to make a difference in the outcome of a failure.
Sidebar: more or less the exact SAN topology heuristics
For illustrative purposes, the current list of topology heuristics for picking a spare disk is more or less:
- pick a spare from the same iSCSI backend as the failed disk, if possible; this is not applicable if we're adding redundancy instead of replacing a failed disk.
- pick a 'symmetric' spare on a different backend from any existing disk in the vdev; a symmetric spare is one with the same physical disk number and LUN.
- pick a spare on a different backend from any existing disk in the pool (or, if every spare shares a backend with some disk in the pool, from any existing disk in the vdev).
- failing even that, pick any spare.
('Disk' here means a disk as seen by the Solaris fileservers, not a physical disk on the iSCSI backends.)
You might think that the second heuristic could never come up, but in fact it's what happens when we are replacing an entire backend.
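The list above amounts to trying a series of increasingly permissive filters and taking the first spare that passes one. The sketch below is my reading of it, not the real code: the Disk tuple, pick_spare(), and the backend names are hypothetical, vdev_disks and pool_disks are assumed to hold only the remaining good disks, and 'symmetric' is assumed to mean matching the disk number and LUN of either the failed disk or a remaining disk in the vdev.

```python
from collections import namedtuple

Disk = namedtuple("Disk", "backend disknum lun")


def pick_spare(spares, vdev_disks, pool_disks, failed=None):
    """Return the first spare accepted by the most preferred heuristic."""
    vdev_backends = {d.backend for d in vdev_disks}
    pool_backends = {d.backend for d in pool_disks}
    refs = list(vdev_disks) + ([failed] if failed is not None else [])

    def symmetric(s):
        return any((s.disknum, s.lun) == (r.disknum, r.lun) for r in refs)

    tiers = []
    if failed is not None:
        # 1: same backend as the failed disk (only when replacing).
        tiers.append(lambda s: s.backend == failed.backend)
    # 2: symmetric spare on a backend the vdev doesn't already use.
    tiers.append(lambda s: s.backend not in vdev_backends and symmetric(s))
    # 3: a backend unused by the pool, else one unused by the vdev.
    tiers.append(lambda s: s.backend not in pool_backends)
    tiers.append(lambda s: s.backend not in vdev_backends)
    # 4: anything at all.
    tiers.append(lambda s: True)

    for accept in tiers:
        for s in spares:
            if accept(s):
                return s
    return None


# Invented example: adding redundancy to a vdev with one disk on backend A.
spares = [Disk("B", 3, 0), Disk("C", 1, 0)]
vdev = [Disk("A", 1, 0)]
print(pick_spare(spares, vdev, pool_disks=vdev))
```

In this example there is no failed disk, so the first heuristic is skipped and the symmetric spare on backend C is preferred over the non-symmetric one on backend B.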
PS: in the process of writing this entry I've determined that I need to add more comments to this code. So it goes.