When we replace disks in our ZFS fileserver environment

April 26, 2012

Recently, someone came here as the result of a Google search for [zfs chksum non zero when to replace disk]. As it happens, this is an issue that we've faced repeatedly, so I can give you our answer. I don't claim that it's the right one, but it's mostly worked for us.

First off, we have yet to replace a disk due to ZFS checksum errors. Our experience so far is that checksum errors are always transient and don't reappear after scrubs, so for us they've been a sign of (presumed) software weirdness instead of slowly failing disk drives. If we ever have a disk that repeatedly gets checksum errors we might consider it a sign of slow failure and preemptively replace the disk, but that hasn't happened so far.

The usual sign of a problematic disk here has been one or more persistent read errors. The cautious thing to do when this happens is to immediately replace the disk; for various reasons we don't usually do this if there are only a handful of read errors. Instead we mostly wait until one of three things happens: there are more than a handful of read errors, the read error count is increasing, or handling the read errors seems to be causing performance issues. For us, this balances the disruption of disk replacement (and the cost of disks) against the risk of serious data loss (and it hasn't blown up in our faces yet).

(Because ZFS doesn't make any attempt to rewrite read errors (although I wish it would), they are basically permanent when they crop up. We do check reported read errors to see if the iSCSI backends are also reporting hard read errors, or if things look like transient problems.)
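
Telling read errors apart from checksum errors starts with the per-device counters in 'zpool status'. What follows is a minimal sketch of pulling those counters out, assuming the usual NAME / STATE / READ / WRITE / CKSUM column layout and plain integer counts (some ZFS versions abbreviate large counts, which this does not handle). The sample text, the device names and the parse_error_counts helper are all made up for illustration.

```python
import re
from typing import List, NamedTuple


class DeviceErrors(NamedTuple):
    name: str
    state: str
    read: int
    write: int
    cksum: int


# Matches config lines such as "    disk-a  ONLINE  3  0  0".
# The column header line is skipped because its last three fields are not digits.
_DEVICE_LINE = re.compile(r"^\s*(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")

# Hypothetical 'zpool status' output, invented for this example.
SAMPLE = """\
  pool: tank
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            disk-a  ONLINE       3     0     0
            disk-b  ONLINE       0     0     2

errors: No known data errors
"""


def parse_error_counts(status_text: str) -> List[DeviceErrors]:
    """Return one entry per pool, vdev, or disk line that carries error counters."""
    devices = []
    for line in status_text.splitlines():
        match = _DEVICE_LINE.match(line)
        if match:
            name, state, read, write, cksum = match.groups()
            devices.append(
                DeviceErrors(name, state, int(read), int(write), int(cksum))
            )
    return devices


if __name__ == "__main__":
    for dev in parse_error_counts(SAMPLE):
        if dev.read or dev.write or dev.cksum:
            print(dev)
```

Saving these counts from one check to the next is what makes it possible to see whether a read error count is increasing or holding steady.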

So that's my answer: don't replace on ZFS checksum errors unless there's something unusual or persistent about them and only replace on small numbers of read errors if you're cautious (and even then you should check to make sure that the actual disks are reporting persistent read errors). If we ever have hard write errors I expect that we'll replace the disk right away, but that hasn't happened yet.
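
The rules of thumb above can be written down as a small decision function. This is only a sketch of the policy as described in this entry, not a tool we run. The HANDFUL threshold of 5 is an assumption (the entry never puts a number on "a handful"), and DiskObservation and replacement_advice are hypothetical names.

```python
from dataclasses import dataclass
from typing import Sequence

# Assumption: the entry does not say how many errors count as "a handful".
HANDFUL = 5


@dataclass
class DiskObservation:
    # Read error counts from successive checks, oldest first.
    read_error_history: Sequence[int]
    write_errors: int = 0
    # True if checksum errors keep coming back after scrubs.
    checksum_errors_persist: bool = False
    # True if the backend disk itself reports hard read errors.
    backend_confirms_read_errors: bool = True
    causing_performance_issues: bool = False


def replacement_advice(obs: DiskObservation) -> str:
    """Map one disk's observed errors to a rough course of action."""
    if obs.write_errors > 0:
        return "replace now"

    history = list(obs.read_error_history)
    latest = history[-1] if history else 0

    # Read errors that the backend does not confirm look transient.
    if latest > 0 and not obs.backend_confirms_read_errors:
        return "investigate"

    increasing = len(history) >= 2 and history[-1] > history[-2]
    if latest > 0 and (
        latest > HANDFUL or increasing or obs.causing_performance_issues
    ):
        return "replace"

    if obs.checksum_errors_persist:
        return "consider replacing"
    if latest > 0:
        return "watch"
    return "leave alone"


if __name__ == "__main__":
    print(replacement_advice(DiskObservation([2, 2])))
    print(replacement_advice(DiskObservation([2, 4])))
```

A more cautious site would simply set HANDFUL to 0, which turns any confirmed read error into a replacement.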

(Based on our lack of write errors, you can probably guess that we have yet to have a disk die completely on us.)

We never reuse disks that we've pulled and replaced, even if they only had a few read errors. They are always either returned under the warranty or discarded. Yes, in theory they might be fine once those few bad sectors were remapped by being rewritten, but in practice the risk is not worth it.

Sidebar: why disk replacement is disruptive for us

Replacing disks is disruptive both to the sysadmins and to some degree to our users. Partly this is because our pools resilver slowly and with visible IO impact (note that ZFS resilvering is effectively seek limited in many cases and affects the whole pool). In our environment, replacing a physical disk the fully safe way can require up to six resilvers; if we restrict ourselves to one resilver at a time to keep the IO load down, that by itself can easily take all day. Another part of this is because pulling and replacing a disk is a manual procedure that takes a bunch of care and attention; for instance you need to make absolutely sure that you have matched up the iSCSI disk name with the disk that is reporting real errors on the iSCSI backend (despite a confusing mess of Linux names for disks) and then correctly mapped it to a physical disk slot and disk. This is not work that can be delegated (or scripted), so one of the core sysadmins is going to wind up babysitting any disk replacement.

(I'm sure that more upscale environments can just tell the software to turn on the fault light on the right disk drive enclosure and then send a minion to do a swap.)

Comments on this page:

From at 2012-04-26 20:47:24:

Our experience so far is that checksum errors are always transient and don't reappear after scrubs, so for us they've been a sign of (presumed) software weirdness instead of slowly failing disk drives.

Or there was some bit rot and it was fixed by copying good data from another mirrored drive (or re-creating it via RAIDZ) and replacing the bad data. Isn't that the whole point of checksums and scrubs: go over all the bits to make sure things match?

By cks at 2012-04-30 09:50:41:

Belatedly: I had enough to say about this (and I think it's an important enough issue) that I turned my reply into an entry, ZFSReadErrorTypes.

Written on 26 April 2012.


Last modified: Thu Apr 26 01:28:25 2012