The right way to fix ZFS disk glitches (at least for us)

May 5, 2010

Every so often in our environment of ZFS pools with mirrored vdevs, we will have an iSCSI disk drop out temporarily. When this happens, ZFS winds up faulting the disk with read and write errors, and you get to fix this after the disk is back.

In theory, this is fixed with just 'zpool clear <pool> <disk>'. In practice, our experience is that this will sometimes leave the disk with latent checksum errors (I presume from writes that somehow got lost on the way to the disk without anything noticing), so in order to completely fix up the situation we must then 'zpool scrub' the pool, possibly repeatedly, until there are no errors being reported.
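As a rough sketch of that clear-then-scrub loop, something like the following shell functions would do. The function names and the POLL_SECS and MAX_SCRUBS knobs are made up for illustration, and I'm assuming that 'zpool status' mentions "scrub in progress" while a scrub runs and that 'zpool status -x' says the pool "is healthy" once no errors are being reported; check the wording on your ZFS version before trusting it.

```shell
# Sketch only: clear a glitched disk's errors, then scrub until the pool is clean.
# POLL_SECS and MAX_SCRUBS are hypothetical knobs, not zpool settings.
POLL_SECS=${POLL_SECS:-60}
MAX_SCRUBS=${MAX_SCRUBS:-3}

# Succeeds while 'zpool status' says a scrub is still running on the pool.
# (Assumes the status text contains "scrub in progress".)
scrub_running() {
    zpool status "$1" | grep -q 'scrub in progress'
}

# Succeeds if 'zpool status -x' reports the pool as healthy.
# (Assumes the healthy message contains "is healthy".)
pool_healthy() {
    zpool status -x "$1" | grep -q 'is healthy'
}

# Usage: clear_and_scrub <pool> <disk>
clear_and_scrub() {
    pool=$1 disk=$2
    zpool clear "$pool" "$disk" || return 1
    n=0
    while [ "$n" -lt "$MAX_SCRUBS" ]; do
        n=$((n + 1))
        zpool scrub "$pool" || return 1
        while scrub_running "$pool"; do
            sleep "$POLL_SECS"
        done
        if pool_healthy "$pool"; then
            echo "pool $pool clean after $n scrub(s)"
            return 0
        fi
        # The scrub still reported errors; reset the counters and verify again.
        zpool clear "$pool" "$disk" || return 1
    done
    echo "pool $pool still reports errors after $n scrub(s)" >&2
    return 1
}
```

You would run this as, say, 'clear_and_scrub tank c1t0d0' (both names are placeholders). It gives up after MAX_SCRUBS passes rather than looping forever, on the theory that a disk that keeps turning up errors needs a human to look at it.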

This is kind of annoying, plus it puts an IO load on the entire pool (and can take ages on a big pool). So our alternative, simpler procedure has been to 'zpool detach' and then 'zpool attach' the glitched disk; once the resilver is done, the disk is guaranteed to be fully intact. Also, the IO load is much more controllable since we are effectively only 'scrubbing' one disk, instead of all disks in the pool at once.
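A minimal sketch of the detach and reattach procedure might look like this. The function names and POLL_SECS are invented for illustration, 'zpool attach' needs to be told an existing device in the mirror to attach to (the "good" disk here), and I'm assuming that 'zpool status' mentions "resilver in progress" while the resilver runs.

```shell
# Sketch only: drop a glitched mirror side and resilver it from a good side.
# POLL_SECS is a hypothetical knob, not a zpool setting.
POLL_SECS=${POLL_SECS:-60}

# Succeeds while 'zpool status' says a resilver is still running on the pool.
# (Assumes the status text contains "resilver in progress".)
resilver_running() {
    zpool status "$1" | grep -q 'resilver in progress'
}

# Usage: reattach_disk <pool> <good-disk> <glitched-disk>
reattach_disk() {
    pool=$1 good=$2 bad=$3
    # From this point the vdev has one less copy of everything.
    zpool detach "$pool" "$bad" || return 1
    # Attach the glitched disk as a new mirror of the good disk.
    zpool attach "$pool" "$good" "$bad" || return 1
    while resilver_running "$pool"; do
        sleep "$POLL_SECS"
    done
    echo "resilver of $bad in pool $pool finished"
}
```

This would be used as, for example, 'reattach_disk tank c1t0d0 c2t0d0', where the pool and disk names are placeholders for your own.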

(You might think that this is crazy, but the logic is that we can't trust the glitched disk since we're assuming that it has missed writes; until it's repaired, the vdev is not truly redundant regardless of what ZFS thinks.)

In retrospect, there is a strong (and obvious) reason to prefer the zpool clear approach, even if it takes longer and is more annoying. Even though we can't completely trust the data on the glitched disk, in most situations most of it is still intact and good. The moment we do 'zpool detach', we discard all of that good data. If the vdev is only a two-way mirror, we go from a situation where we were non-redundant on only the missing writes to a situation where we are non-redundant on an entire disk's worth of data (and where ZFS has a much worse potential failure mode).

(How much good data is left on the glitched disk depends on how fast data turns over in the pool and how long the disk was out for.)

In a multi-way mirror that's still fully redundant even without the glitched disk, we might as well use the simpler approach. But with a two-way mirror, we really do want to use the longer, more annoying approach in situations where it's feasible.
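To put that rule in a form you could drop into a script, here is a tiny sketch. The function name and the labels it prints are made up; it just encodes the reasoning above, taking the number of sides in the mirror (counting the glitched disk) as its argument.

```shell
# Sketch only: pick a repair procedure from the number of mirror sides.
# Usage: choose_repair <sides-in-mirror-including-the-glitched-disk>
choose_repair() {
    if [ "$1" -ge 3 ]; then
        # Still redundant without the glitched disk; the simple way is fine.
        echo "detach-attach"
    else
        # Two-way mirror: keep the good data that is still on the glitched disk.
        echo "clear-scrub"
    fi
}
```

So 'choose_repair 2' picks the clear and scrub approach, and 'choose_repair 3' picks detach and attach. It deliberately ignores the "where it's feasible" part; whether a long scrub is tolerable is still a judgment call.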

(This is the kind of entry that I write to convince myself that I have the logic nailed down, so I can explain it to other people.)

PS: note that our experience is that there are potentially significant IO load differences between scrubbing and resilvering that may affect this choice. Scrubbing is almost entirely reads across all pool devices; resilvering is write heavy to the new disk, and in theory only read heavy on the other mirror(s) in that particular vdev. I believe that resilvering IO may also be considered higher priority than scrub IO. Both scrubs and resilvering are at least somewhat random IO, not strictly sequential, for reasons that do not fit within the margins of this entry.

Written on 05 May 2010.

Last modified: Wed May 5 04:23:33 2010