ZFS and various sorts of read errors
After I wrote about our experience of transient checksum errors in ZFS here, a commentator wrote (quoting me):
Our experience so far is that checksum errors are always transient and don't reappear after scrubs, so for us they've been a sign of (presumed) software weirdness instead of slowly failing disk drives.
Or there was some bit rot and it was fixed by copying good data from another mirrored drive (or re-creating it via RAIDZ) and replacing the bad data. Isn't that the whole point of checksums and scrubs: go over all the bits to make sure things match?
My view is that the ice is dangerously thin here and that it's safer for us to assume that the checksum failures are not from disk bit rot.
As far as ZFS is concerned there are two sorts of read errors, hard read errors (where the underlying device or storage system reports an error and returns no data) and checksum errors (where the underlying storage claims to succeed but returns data that ZFS can see is incorrect). ZFS covers up both sorts of errors using whatever redundancy the pool (well, the vdev) has, but otherwise it treats them differently; it never attempts to repair read errors (although it's willing to try the read again later) while it immediately repairs bad checksums by rewriting the data in place.
My understanding of modern disks is that on-disk bit rot rarely goes undetected, since the actual on-disk data is protected by pretty good ECC (although it's not as strong as ZFS's checksums). When a disk detects an ECC failure that it cannot repair, it returns a hard read error for that sector. You can still have various forms of in-flight corruption (sometimes as the data is being written, which means that the on-disk data is bad but probably passes the drive's ECC); all of these (broadly construed) read errors will result in nominally successful reads but ZFS checksum errors, which ZFS will then fix.
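The asymmetry described above can be sketched as a toy Python model. This is not real ZFS code; the `MirrorVdev` class and the use of SHA-256 as the block checksum are invented for illustration (real ZFS defaults to fletcher4 and stores checksums in block pointers, not alongside the data). The point is the shape of the logic: copies that fail the checksum get rewritten in place from a good copy, while copies that hard-error (modeled here as `None`) are left alone.

```python
import hashlib


class MirrorVdev:
    """Toy n-way mirror. Each entry in `copies` is the block's bytes as
    read from one disk, or None to model a hard read error from that disk."""

    def __init__(self, copies):
        self.copies = copies

    def read_with_repair(self, expected_sum):
        # Find any copy that passes its checksum, skipping hard errors.
        good = None
        for blk in self.copies:
            if blk is not None and hashlib.sha256(blk).digest() == expected_sum:
                good = blk
                break
        if good is None:
            raise IOError("unrecoverable: no copy passed its checksum")
        # Repair pass: rewrite copies whose checksum failed, but leave
        # hard-errored copies (None) untouched, mirroring ZFS's behavior.
        for i, blk in enumerate(self.copies):
            if blk is not None and hashlib.sha256(blk).digest() != expected_sum:
                self.copies[i] = good
        return good
```

After a read that hits a checksum error, the bad copy has been silently replaced with good data; after a read that hits a hard error, the erroring copy is still erroring and only a disk replacement (or a later successful retry) will fix it.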
So the important question is: how many of the checksum errors that one sees are actually real read errors that were not recognized as such, either on-disk bit rot that still passed the drive's ECC checks or in-flight corruption inside the drive, and how many of them are from something else?
I don't know the answer to this, which is why I think the ice is thin. Right now my default assumption is that most or all of the actual drive bit rot is being detected as hard read errors; I make this assumption partly because it's the safer one (since it means accepting that we don't understand the causes of our checksum failures).
PS: ZFS's treatment of read errors means that in some ways you would be better off if you could tell your storage system to lie about them, so that instead of returning an actual error it would just log it and return random data. This would force a checksum error, causing ZFS to rewrite the data, which would force the sector to be rewritten and perhaps spared out.
(Yes, this is kind of a crazy idea.)
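For what it's worth, the crazy idea is only a small shim at the read path. This hypothetical Python sketch (the `dev_read` callable and everything around it is invented, not any real block-layer API) shows the shape of it:

```python
import logging
import os


def lying_read(dev_read, offset, length):
    """Call dev_read(offset, length); on a hard read error, log it and
    return random bytes instead of failing. The garbage is all but
    guaranteed to fail ZFS's checksum, so ZFS repairs the block from
    redundancy and rewrites it, which gives the drive a chance to spare
    out the bad sector."""
    try:
        return dev_read(offset, length)
    except OSError as exc:
        logging.warning("read error at offset %d: %s; returning junk",
                        offset, exc)
        return os.urandom(length)
```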
Sidebar: the purpose of scrubs
Scrubs do three things: they uncover hard read errors, they find and repair any checksum errors, and at a high level they verify that your data is actually redundant and tell you if it isn't. Because ZFS never rewrites data that hit hard read errors, scrubs do not necessarily restore full redundancy. But at least you know (via read errors that persist over repeated scrubs) that you have a potential problem that you need to do something about (ie you need to replace the disk with read errors).
(Because a ZFS scrub only reads live data, you know that any read error is in a spot that is actually being used for current data.)
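In the same toy terms as before, a scrub is just a pass over all live blocks that repairs what it can and reports what it can't. This sketch is invented for illustration (`None` again models a hard read error from one disk, SHA-256 stands in for the block checksum); it separates the three outcomes:

```python
import hashlib


def scrub(blocks):
    """Toy scrub. `blocks` is a list of (copies, expected_checksum) pairs,
    where copies is a mutable list of per-disk block contents (None for a
    hard read error). Repairs checksum mismatches from a good copy; merely
    counts hard read errors, since ZFS doesn't rewrite those."""
    report = {"cksum_repaired": 0, "hard_errors": 0, "unrecoverable": 0}
    for copies, want in blocks:
        good = next((c for c in copies
                     if c is not None and hashlib.sha256(c).digest() == want),
                    None)
        for i, c in enumerate(copies):
            if c is None:
                report["hard_errors"] += 1
            elif hashlib.sha256(c).digest() != want:
                if good is not None:
                    copies[i] = good          # repaired in place
                    report["cksum_repaired"] += 1
                else:
                    report["unrecoverable"] += 1
    return report
```

A non-zero `hard_errors` count that persists across scrubs is the "replace this disk" signal; `cksum_repaired` is the case where the scrub itself restored redundancy.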
Sidebar: the redundancy effects of read errors
If your vdevs are only single-redundant, a read error means that that particular piece of data is not redundant at all. If you have multi-way redundancy, eg from raidz2, and you have read errors on multiple disks, I don't know whether there's any way to tell how much redundancy any particular piece of data has left. Note that ZFS does not always write a piece of data to the same offset on all disks, although it usually does.
(If you have multi-way redundancy and read errors on only a single disk, all of your data is still redundant although some of it is more exposed than it used to be.)