Why we built our own ZFS spares handling system

October 24, 2010

I mentioned recently that we've written our own system to handle ZFS spares. Before I describe it, I wanted to write up something about why we decided to go to the extreme measure of discarding all of ZFS's own spare handling and rolling our own.

First off, note that our environment is unusual. We have a lot of pools and a relatively complex SAN disk topology with at least three levels, as opposed to the more common environment of only a few pools and essentially undifferentiated disks. I expect that ZFS's current spares system works much better in the latter situation, especially if you don't have many spare disks.

Our issues with ZFS's current spare system include:

  • it has outright bugs with shared spares, some of them fixed and others not (we had our selfish pool, for example).

  • because of how ZFS handles spares, we've seen ZFS not activate spares in situations where we wanted them activated.

  • ZFS has no concept of load limits on spares activations. This presents us with an unenviable tradeoff: either we artificially limit the number of spares we configure, or we risk having our systems crushed under the load of multiple simultaneous resilvers.

    (We've seen this happen.)

  • ZFS doesn't know how we want to handle the situation where there are too few spares to replace all of the faulted disks; instead it will just deploy spares essentially randomly. (This also combines with the above issue, of course.)

  • there's no way to tell ZFS about our multi-level disk topology, where there are definitely good and bad disks to replace a given faulted disk with.

Many of these are hard problems that involve local policy decisions, so I don't expect ZFS to solve them out of the box. Instead ZFS's current spares system deals with the common case; it just happens that the common case is not a good fit for our environment.

(I do fault ZFS for having no support for this sort of local addition. I don't necessarily expect a nice modular plugin system, but it would be nice if ZFS had official interfaces for extracting information in ways that are useful for third-party programs. But that's really another entry.)

Comments on this page:

From at 2010-10-25 04:03:30:

Hi Chris,

How do you deal with notifications of dead disks in pools?

By cks at 2010-10-26 00:13:02:

The short answer is that we don't try to get notified about disk failures; instead we poll the pool state periodically to see if there are failed disks. There are issues with this but it lets us sidestep fmd entirely.
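To make the polling approach concrete, here is a minimal sketch (not our actual program; the function names are mine, and it leans on the documented behaviour that 'zpool status -x' prints a short all-clear message when nothing is wrong):

```python
import subprocess

def zpool_health_output():
    # Ask ZFS for a terse health summary. 'zpool status -x' prints
    # "all pools are healthy" (or "no pools available") when there is
    # nothing wrong, and details about problem pools otherwise.
    return subprocess.run(["zpool", "status", "-x"],
                          capture_output=True, text=True).stdout

def pools_need_attention(output):
    # Treat anything other than the known all-clear messages as a sign
    # that some pool has a failed or degraded device.
    ok = ("all pools are healthy", "no pools available")
    return output.strip() not in ok
```

Run something like this from cron every few minutes and alert when `pools_need_attention()` is true; no fmd involvement required.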

From at 2010-10-26 10:56:52:

Can you please elaborate on that? I have heard that using SNMP or grepping through messages is how people do it. However, what do you mean? Can you paste an example, please?


By cks at 2010-10-26 15:01:44:

In theory you can check the state of ZFS pools by parsing the output of 'zpool status'. In practice this has various problems, so I wrote a program to grind through the actual ZFS pool configuration structures from the kernel and dump the information in a much rawer format.

(Note that there are huge issues with this approach, especially with Oracle no longer updating the OpenSolaris source code, since the ZFS pool configuration is entirely undocumented and the format of information in it changes periodically.)
