Thoughts on when to replace disks in a ZFS pool

May 8, 2013

One of the morals you can draw from the near miss I described in yesterday's entry, where we might have lost a large pool if things had gone a bit differently, is that the right time to replace a disk with read errors is TODAY. Do not wait. Do not put it off because things are going okay and you see no ZFS-level errors after the dust settles. Replace it today because you never know what is going to happen to another disk tomorrow.

Well, maybe. Clearly the maximally cautious approach is to replace a disk any time it reports a hard read error (ie one that is seen at the ZFS layer) or any time SMART reports an error. The problem with this for us is that we'd be replacing a lot of disks, and at least some of them may be good (or at least perfectly workable). As for read errors, our experience is that some but not all of them are transient, in that they don't recur if you do something like (re)scrub the pool. And SMART error reports seem relatively uncorrelated with actual errors reported by the backend kernels or seen by ZFS.

In theory you could replace these potentially questionable disks, test them thoroughly, and return them to your spares pool if they pass your tests. In practice this would add more and more questionable disks to your spares pool and, well, do you really trust them completely? I wouldn't. This leaves either demoting them to some less important role (if you have one that can use a potentially significant number of disks, and maybe you do) or trying to return them to the vendor for a warranty claim (and I don't know if the vendor will take them back under that circumstance).

I don't have a good answer to this. Our current (new) approach is to replace disks that have persistent read errors. On the first read error we clear the error and schedule a pool scrub; if the disk then reports more read errors (before the scrub, during it, or in the next while after it), it gets replaced.
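For concreteness, the mechanics of this look roughly like the following sketch; 'tank' and the device names here are placeholders for your actual pool and disks:

    # see which device is reporting read errors
    zpool status -v tank
    # clear the per-device error counters
    zpool clear tank c0t2d0
    # force a full read of all pool data to see if the errors recur
    zpool scrub tank
    # afterwards, check whether that disk's READ count is nonzero again
    zpool status tank

If the errors come back, 'zpool replace tank c0t2d0 c0t9d0' swaps a spare in for the bad disk.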

(This updates some of our past thinking on when to replace disks. The general discussion there is still valid.)


Comments on this page:

From 74.125.59.49 at 2013-05-09 20:04:26:

It's probably worth noting that modern disks (basically everything since 1997 or thereabouts) have a pool of unreported space, reserved for the sole use of the firmware to secretly remap bad sectors behind the OS's back. AIUI, modern drives are so big that there's no reasonable way to make them perfect, so they use an ECC scheme (Reed-Solomon... or now LDPC, apparently) instead of a boring old checksum. When a sufficient amount of error correction is needed just to read back a sector, the firmware chooses between (a) an in-place read-recompute-overwrite to refresh the ECC, i.e. a bet that the original sector's physical medium is fine, or else (b) a remap of that sector's LBA to one of the hidden sectors, i.e. a bet that the hidden sector's physical medium is fine.

SMART is supposed to report whenever one of these events happens, but that's your only leading indicator for a drive failure: a healthy drive can have SMART failures, but they're usually all in the drive's infancy, so the SMART error counters should trend flat. But a positive linear (or worse) trend means the firmware is trying to keep up with progressing or ongoing damage to the disk's magnetic medium; one day, the firmware will run out of spare sectors and the drive will suddenly go from "chugging along like a champ" to "spewing errors", because the firmware will no longer be able to keep the ECCs fresh fast enough to hide the damage from the OS.
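If you want to watch for this trend yourself, smartmontools can show the relevant counters. A sketch, with /dev/sda as a placeholder device:

    # dump the drive's SMART attribute table
    smartctl -A /dev/sda
    # attributes worth tracking over time:
    #   5 Reallocated_Sector_Ct  - sectors the firmware has remapped
    # 197 Current_Pending_Sector - sectors waiting to be remapped
    # 198 Offline_Uncorrectable  - sectors that could not be read at all

Flat raw values here are the healthy case; steadily climbing ones are the leading indicator described above.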

From 108.60.100.203 at 2013-05-10 15:07:34:

Perhaps adding another mirror drive to the vdev with the suspicious disk could make sense. If things continue to deteriorate, you would still have some redundancy in the vdev. If it turns out to be a false alarm, the extra disk could eventually be removed.
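As an illustration of what this looks like in practice, assuming the suspect disk is in a plain two-way mirror (pool and device names are again placeholders):

    # grow the mirror containing the suspect disk to a three-way mirror
    zpool attach tank c0t1d0 c0t9d0
    # wait for the resilver to complete, then keep an eye on the suspect disk
    zpool status tank
    # if it was a false alarm, drop back to a two-way mirror
    zpool detach tank c0t9d0

Note that zpool attach only works on mirror or single-disk vdevs, not raidz ones.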
