How we want to recover our ZFS pools from SAN outages
Last night I wrote about how I decided to sit on my hands after we had a SAN backend failure, rather than spring into sleepy action to swap in our hot spare backend. This turned out to be exactly the right decision for more than the obvious reasons.
In a SAN environment like ours it's quite possible to lose access to a whole bunch of disks without losing the disks themselves. This is what happened to us last night; the power supply on one disk shelf appears to have flaked out. We swapped out the disk shelf for another one, transplanted the disks themselves back into the new shelf, and the whole iSCSI backend was back on the air. ZFS had long since faulted all of the disks, of course (since it had spent hours being unable to talk to them), but the disks were still in their pools.
(Some RAID systems will actively eject disks from storage arrays if they are too faulted or if they disappear. ZFS doesn't do this. Those disks are in their pools until you remove them yourself.)
With the disks still in their pools, we could use 'zpool clear' to re-activate them (it's an underdocumented side effect of clearing errors). ZFS was smart enough to know that the disks already had most of the pool data and just needed relatively minimal resilvering, which is a lot faster than the full resilvering that pulling in spares needs. Once we had the disks powered up again it took perhaps an hour until all of the pools had their redundancy back (and part of that time was us being cautious about IO load).
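As a rough illustration of that step, here is a minimal shell sketch. The function name, the ZPOOL override variable, and the pool names in the usage note are all made up for the example; the only real commands are 'zpool clear' and 'zpool status -x'. It is not our actual procedure, just the general shape of one.

```shell
#!/bin/sh
# Hypothetical helper: clear errors on each named pool so that ZFS
# reopens the faulted devices, then show anything still unhealthy.
# ZPOOL can be overridden (for example ZPOOL="echo zpool") for a dry run.
ZPOOL="${ZPOOL:-zpool}"

reactivate_pools() {
    for pool in "$@"; do
        # Clearing errors is what brings the faulted disks back
        # (the side effect described above); stop at the first failure.
        $ZPOOL clear "$pool" || return 1
        # 'status -x' only reports on pools that still have problems.
        $ZPOOL status -x "$pool"
    done
}
```

You would call it as something like 'reactivate_pools tank1 tank2', probably one pool at a time if you are being cautious about IO load.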
In some environments this alone might be sufficient, but we've had prior experience that it isn't good enough; we also need to 'zpool scrub' each pool until it reports no errors (this is now in progress). Doing scrubs takes rather a while but at least all the pools have (relatively) full redundancy in the meantime.
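The 'scrub until it reports no errors' loop could be sketched roughly like this. Everything about it is an assumption made for illustration: the function name, the pass limit, the polling interval, and especially the strings it looks for in 'zpool status' output ("scrub in progress" and "with 0 errors"), which you should check against what your ZFS version actually prints. It also only looks at the scrub summary line, not the per-device error counts.

```shell
#!/bin/sh
# Hypothetical sketch: scrub a pool repeatedly until a scrub finishes
# with no errors, giving up after a maximum number of passes.
ZPOOL="${ZPOOL:-zpool}"
POLL_SECONDS="${POLL_SECONDS:-300}"

scrub_until_clean() {
    pool=$1
    max_passes=${2:-5}
    pass=1
    while [ "$pass" -le "$max_passes" ]; do
        $ZPOOL scrub "$pool" || return 1
        # Wait for this scrub to finish (assumed status wording).
        while $ZPOOL status "$pool" | grep -q "scrub in progress"; do
            sleep "$POLL_SECONDS"
        done
        # Assumed wording of a clean scrub summary line.
        if $ZPOOL status "$pool" | grep -q "with 0 errors"; then
            echo "$pool: clean after $pass scrub pass(es)"
            return 0
        fi
        pass=$((pass + 1))
    done
    echo "$pool: still reporting errors after $max_passes passes" >&2
    return 1
}
```

A call like 'scrub_until_clean tank1 3' would scrub the (made up) pool 'tank1' at most three times. In practice you would likely also want to look at the full 'zpool status' output by hand before declaring a pool healthy.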
(Part of the reason for needing to scrub our disks is that our disks probably have missing writes due to abruptly losing power.)
This sort of recovery is obviously a lot faster, less disruptive, and safer than resilvering terabytes of data by switching over to our hot spare backend (especially if we actively detach the disks from the 'failed' backend before the resilvering has finished). In the future I think we're going to want to recover failed iSCSI backends this way if at all possible. It may be somewhat more manual work (and it requires hands-on attention to swap hardware around) but it's much faster and better.
(In this specific case delaying ten hours or so probably saved us at least a couple of days of resilvering time, during which we would have had several terabytes exposed to single disk failures.)