2010-10-24
Why we built our own ZFS spares handling system
I mentioned recently that we've written our own system to handle ZFS spares. Before I describe it, I wanted to write up something about why we decided to go to the extreme measure of discarding all of ZFS's own spare handling and rolling our own.
First off, note that our environment is unusual. We have a lot of pools and a relatively complex SAN disk topology with at least three levels, as opposed to the more common environment of only a few pools and essentially undifferentiated disks. I expect that ZFS's current spares system works much better in the latter situation, especially if you don't have many spare disks.
Our issues with ZFS's current spare system include:
- it has outright bugs with shared spares, some of them fixed and others
not (we had our selfish pool, for example).
- because of how ZFS handles spares, we've seen
ZFS not activate spares in situations where we wanted them activated.
- ZFS has no concept of load limits on spare activations. This presents
us with an unenviable tradeoff: either we artificially limit the number
of spares we configure, or we risk having our systems crushed under the
load of multiple simultaneous resilvers. (We've seen this happen; a
sketch of the kind of decision we'd like to make is after this list.)
- ZFS doesn't know how we want to handle the situation where there are
too few spares to replace all of the faulted disks; instead it will
just deploy spares essentially randomly. (This also combines with the
above issue, of course.)
- there's no way to tell ZFS about our multi-level disk topology, where there are definitely good and bad disks to replace a given faulted disk with.
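To make the last few issues concrete, here is a minimal sketch of the sort of decision we want made when a disk faults. Everything in it (the names, the tier map, the load limit, and the policy itself) is a hypothetical illustration, not our real system:

    # A minimal sketch of the spare-selection policy we want; everything
    # here (names, tiers, the load limit) is a hypothetical illustration.

    MAX_CONCURRENT_RESILVERS = 1  # hypothetical site-wide load limit

    def pick_spare(faulted_disk, available_spares, active_resilvers, tier_of):
        """Pick a spare for faulted_disk, or None if we should hold off.

        tier_of maps a disk to its level in our (hypothetical) SAN
        topology; we only want spares from the same level as the
        faulted disk.
        """
        # Don't pile more resilvers onto an already loaded system.
        if active_resilvers >= MAX_CONCURRENT_RESILVERS:
            return None
        # Only consider spares on the same level of the disk topology.
        candidates = [s for s in available_spares
                      if tier_of.get(s) == tier_of.get(faulted_disk)]
        if not candidates:
            return None  # leave cross-level replacements to a human
        return sorted(candidates)[0]

    if __name__ == "__main__":
        tiers = {"c1t0d0": "fast", "c5t2d0": "fast", "c9t3d0": "slow"}
        print(pick_spare("c1t0d0", ["c5t2d0", "c9t3d0"], 0, tiers))
        # -> c5t2d0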
Many of these are hard problems that involve local policy decisions, so I don't expect ZFS to solve them out of the box. Instead ZFS's current spares system deals with the common case; it just happens that the common case is not a good fit for our environment.
(I do fault ZFS for having no support for this sort of local addition. I don't necessarily expect a nice modular plugin system, but it would be nice if ZFS had official interfaces for extracting information in ways that are useful for third-party programs. But that's really another entry.)
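As a small illustration of what I mean: today, a third-party tool that wants pool information is reduced to scraping zpool output, along the lines of this sketch (a hypothetical helper with deliberately simplified parsing):

    import subprocess

    def pool_health():
        """Map pool name -> health by scraping 'zpool list' output."""
        # -H drops headers and separates fields with tabs, which makes
        # the output semi-parseable, but it's still not an official
        # programmatic interface.
        out = subprocess.run(["zpool", "list", "-H", "-o", "name,health"],
                             capture_output=True, text=True,
                             check=True).stdout
        return dict(line.split("\t") for line in out.splitlines() if line)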
2010-10-02
ZFS resilvers are a whole-pool activity
In a conventional RAID system where an array is made up of multiple mirrors, mirror resynchronization is a single-mirror affair. Other mirrors in the array are not affected. This is not how ZFS works.
One of the consequences of ZFS scrubs and resilvers being nonlinear is that resilvers do not neatly confine their activity to only the disks of the vdev being resilvered. Instead, ZFS may need to traverse data structures that live in other vdevs in order to find out what data is live on the resilvering vdev. (However, ZFS does try to do as little extra IO as possible.)
This makes a resilver a whole pool affair (which is really what you'd expect, given that scrubs and resilvers use basically the same code). The most important consequence of this is that starting a resilver on a second vdev restarts an ongoing resilver from the beginning, no matter how close the existing resilver was to completion.
So: if you have a disk failure in one mirror vdev, activate a spare, and then have a second disk fail in another mirror and activate another spare, work on resilvering the first spare will immediately restart from scratch. Depending on how fast your resilver goes, this may cost you a significant amount of time. This is unlike traditional RAID systems, where you can start a new mirror resync on one mirror without doing anything to an almost-complete mirror resync on another mirror.
(We have pools that take hours to resilver, as we found out recently. And yes, I wound up restarting a resilver from scratch in just this way, losing a chunk of time in the process.)
This has obvious implications for how you want to deal with disk failures, whether in scripts or by hand. Also, I think that there is no universal answer on whether or not to abort an existing resilver in order to start an additional one, although with sufficient MTTDL math you can probably work out a mathematical answer (based on whatever MTTDL model and numbers you want to use). This implies that there is no single right answer that can be coded into Solaris for you; there are always local policy decisions and risk factors to be considered.
(In part the balance is between time to return a pool to total redundancy and time to return each vdev to redundancy. If you wait for your first resilver to finish, you get partial redundancy faster and total redundancy slower.)
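As a concrete sketch of the script side of this, here is roughly the kind of check a spare-handling script can make before it starts a second resilver. The exact 'zpool status' wording varies between ZFS versions, so the string match is an assumption, and always deferring is only one possible local policy:

    import subprocess

    def resilver_in_progress(pool):
        """True if 'zpool status' reports a resilver running on this pool."""
        out = subprocess.run(["zpool", "status", pool],
                             capture_output=True, text=True,
                             check=True).stdout
        return "resilver in progress" in out

    def maybe_activate_spare(pool, faulted_disk, spare_disk):
        # Starting another resilver would restart the running one from
        # scratch, so this (simplistic) policy defers the new spare.
        if resilver_in_progress(pool):
            print("resilver running on %s; deferring spare for %s"
                  % (pool, faulted_disk))
            return False
        subprocess.run(["zpool", "replace", pool, faulted_disk, spare_disk],
                       check=True)
        return True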