ZFS and various sorts of read errors

April 28, 2012

After I wrote about our experience of transient checksum errors in ZFS here, a commentator wrote (quoting me):

Our experience so far is that checksum errors are always transient and don't reappear after scrubs, so for us they've been a sign of (presumed) software weirdness instead of slowly failing disk drives.

Or there was some bit rot and it was fixed by copying good data from another mirrored drive (or re-creating it via RAIDZ) and replacing the bad data. Isn't that the whole point of checksums and scrubs: go over all the bits to make sure things match?

My view is that the ice is dangerously thin here and that it's safer for us to assume that the checksum failures are not from disk bit rot.

As far as ZFS is concerned there are two sorts of read errors: hard read errors (where the underlying device or storage system reports an error and returns no data) and checksum errors (where the underlying storage claims to succeed but returns data that ZFS can see is incorrect). ZFS covers up both sorts of errors using whatever redundancy the pool (well, the vdev) has, but otherwise it treats them differently; it never attempts to repair read errors (although it's willing to try the read again later) while it immediately repairs bad checksums by rewriting the data in place.
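To make that asymmetry concrete, here is a minimal sketch in C of how a read from a two-way mirror might treat the two cases differently. Everything here (mirror_read, toy_checksum, the struct) is invented for illustration; it is not ZFS source code.

  /* Hypothetical sketch, not ZFS code: one block read from a 2-way mirror. */
  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  #define BLKSZ 4096

  struct mirror_side {
      int     io_error;        /* nonzero: device reported a hard read error */
      uint8_t data[BLKSZ];     /* whatever the device handed back */
  };

  /* Stand-in for ZFS's real checksums (fletcher4, sha256, etc). */
  static uint64_t toy_checksum(const uint8_t *buf, size_t len)
  {
      uint64_t sum = 0;
      for (size_t i = 0; i < len; i++)
          sum = sum * 31 + buf[i];
      return sum;
  }

  /* 'expected' is the checksum that the parent block pointer recorded. */
  static int mirror_read(struct mirror_side side[2], uint64_t expected,
                         uint8_t *out)
  {
      int bad_cksum[2] = { 0, 0 };
      int good = -1;

      for (int i = 0; i < 2; i++) {
          if (side[i].io_error) {
              /* Hard read error: covered by the other copy but never
                 rewritten; it can come back on the next scrub. */
              fprintf(stderr, "side %d: hard read error (not repaired)\n", i);
          } else if (toy_checksum(side[i].data, BLKSZ) != expected) {
              /* Checksum error: the device claimed success but returned
                 bad data; this copy gets rewritten from a good one. */
              fprintf(stderr, "side %d: checksum error (will repair)\n", i);
              bad_cksum[i] = 1;
          } else if (good < 0) {
              good = i;
          }
      }
      if (good < 0)
          return -1;               /* no intact copy left: unrecoverable */

      memcpy(out, side[good].data, BLKSZ);
      for (int i = 0; i < 2; i++)
          if (bad_cksum[i])        /* the 'self-healing' rewrite */
              memcpy(side[i].data, side[good].data, BLKSZ);
      return 0;
  }

  int main(void)
  {
      struct mirror_side m[2] = { { 0 }, { 0 } };
      uint8_t buf[BLKSZ];
      uint64_t expected = toy_checksum(m[0].data, BLKSZ);

      m[1].data[17] ^= 0x40;       /* corrupt side 1 'in flight' */
      return mirror_read(m, expected, buf) != 0;
  }

The only point of the sketch is the asymmetry: the copy with the checksum error gets overwritten with good data on the spot, while the copy with the hard read error is simply skipped.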

My understanding of modern disks is that on-disk bit rot rarely goes undetected, since the actual on-disk data is protected by pretty good ECC checks (although they're not as strong as ZFS's checksums). When a disk detects a failed ECC (and cannot repair the damage), it returns a hard read error for that sector. You can still have various forms of in-flight corruption (sometimes as the data is being written, which means that the on-disk data is bad but probably passes the drive's ECC); all of these (broadly construed) read errors will result in nominally successful reads but ZFS checksum errors, which ZFS will then fix.

So the important question is: how many of the checksum errors that one sees are actually real read errors that were not recognized as such, either on-disk bit rot that still passed the drive's ECC checks or in-flight corruption inside the drive, and how many of them are from something else?

I don't know the answer to this, which is why I think the ice is thin. Right now my default assumption is that most or all of the actual drive bit rot is being detected as hard read errors; I make this assumption partly because it's the safer one (since it means that we don't understand the causes of our checksum failures and so can't dismiss them as harmless).

PS: ZFS's treatment of read errors means that in some ways you would be better off if you could tell your storage system to lie about them, so that instead of returning an actual error it would just log it and return random data. This would force a checksum error, causing ZFS to rewrite the data, which would force the sector to be rewritten and perhaps spared out.

(Yes, this is kind of a crazy idea.)

Sidebar: the purpose of scrubs

Scrubs do three things: they uncover hard read errors, they find and repair any checksum errors, and at a high level they verify that your data is actually redundant and tell you if it isn't. Because ZFS never rewrites data in response to hard read errors, scrubs do not necessarily restore full redundancy. But at least you know (via read errors that persist over repeated scrubs) that you have a potential problem that you need to do something about (ie you need to replace the disk with read errors).

(Because a ZFS scrub only reads live data, you know that any read error is in a spot that is actually being used for current data.)
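Sticking with the same invented names from the mirror sketch above (struct mirror_side, BLKSZ, and toy_checksum), a scrub amounts to doing roughly the following for every copy of every live block:

  struct scrub_stats {
      long cksum_errors;           /* found, and repaired where possible */
      long read_errors;            /* reported but never rewritten */
  };

  static void scrub_block(struct mirror_side side[2], uint64_t expected,
                          struct scrub_stats *st)
  {
      int good = -1;

      /* First find a copy that reads cleanly and checksums correctly. */
      for (int i = 0; i < 2; i++)
          if (!side[i].io_error &&
              toy_checksum(side[i].data, BLKSZ) == expected) {
              good = i;
              break;
          }

      for (int i = 0; i < 2; i++) {
          if (side[i].io_error) {
              st->read_errors++;
          } else if (toy_checksum(side[i].data, BLKSZ) != expected) {
              st->cksum_errors++;
              if (good >= 0)       /* repair by rewriting in place */
                  memcpy(side[i].data, side[good].data, BLKSZ);
          }
      }
      /* If good is still -1, every copy of this live block is bad, which
         is exactly the 'your data is not redundant' news you want a
         scrub to give you. */
  }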

Sidebar: the redundancy effects of read errors

If your vdevs are only single-redundant, a read error means that that particular piece of data is not redundant at all. If you have multi-way redundancy, eg from raidz2, and you have read errors on multiple disks, I don't know if there's any way to know how much redundancy any particular piece of data has left. Note that ZFS does not always write a piece of data to the same offset on all disks, although it usually does.

(If you have multi-way redundancy and read errors on only a single disk, all of your data is still redundant although some of it is more exposed than it used to be.)
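What you can still do, as a purely illustrative bit of arithmetic (an invented model, not something ZFS reports), is bound the worst case for any one stripe:

  /* Remaining redundancy of one stripe if 'bad' of its member disks have
     an unreadable sector inside that stripe; raidz2 has parity == 2. */
  static int stripe_redundancy_left(int parity, int bad)
  {
      int left = parity - bad;
      return left > 0 ? left : 0;  /* 0 means no margin left at all */
  }

With read errors on two different disks in a raidz2 vdev, the worst case for a given stripe is stripe_redundancy_left(2, 2) == 0 and the best case is 2; since you can't tell which stripes the bad sectors actually intersect, the pessimistic bound is all you really have.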


Comments on this page:

From 85.250.215.77 at 2012-04-28 05:00:48:

NL-SAS drives do have the option to return junk data in case of some read errors (medium read errors). I don't quite remember how to set it, but I've seen it in the SCSI docs and didn't quite figure out what it's good for.

Baruch

From 70.30.88.85 at 2012-04-28 11:22:40:

My understanding of modern disks is that on-disk bit rot rarely goes undetected, since the actual on-disk data is protected by pretty good ECC checks (although they're not as strong as ZFS's checksums).

Before going live and collecting data with the LHC, CERN did a bunch of testing of their data storage system. They found single-bit errors, single-sector errors (512B), and even discovered a bug caused by the interaction of 3Ware controllers with WD disks:

http://storagemojo.com/2007/09/19/cerns-data-corruption-research/

While bit rot has (perhaps) rarely gone undetected for the last few decades, we are approaching the point where it may become more likely (statistically speaking). This is because while drives do have internal ECC, the algorithms have a limit. SATA drives are typically rated at one bit error per 10^14 bits transferred; NL-SAS (e.g., Seagate Constellation), 10^15; high-RPM drives (Seagate Savvio), 10^16. These are the bit error rates (BERs) listed on the spec sheet of each drive.

10^14 bits is only about 12.5 TB, which, if you have a moderately sized array, isn't a lot of data to move around before you start tripping over bit errors. I currently help run an HPC cluster with about 1 PB of online storage, and we're quite small really.

Supposedly it's also one of the reasons for moving to the "Advanced Format" disk drives. With current 512B drives, there is 40B of ECC (7% overhead); with future 4K sectors there will be a 100B ECC, which will help catch more errors and reduce overhead.
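As a rough, purely illustrative back-of-the-envelope check on those numbers:

  #include <stdio.h>

  int main(void)
  {
      /* bits read per error, per the spec-sheet BERs quoted above */
      double rates[] = { 1e14, 1e15, 1e16 };

      for (int i = 0; i < 3; i++)
          printf("1 error per %.0e bits = about %.1f TB read per error\n",
                 rates[i], rates[i] / 8 / 1e12);

      /* ECC overhead: 40 bytes per 512-byte sector vs 100 bytes per 4KiB */
      printf("512B sectors: %4.1f%% ECC overhead\n", 100.0 * 40 / 512);
      printf("4KiB sectors: %4.1f%% ECC overhead\n", 100.0 * 100 / 4096);
      return 0;
  }

That works out to roughly 12.5 TB, 125 TB, and 1250 TB read per error, and about 7.8% versus 2.4% ECC overhead.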

Most people won't see these things, but more and more data centre arrays will probably be running into them going forward just because of the volume of data that's becoming common nowadays.

[...] it never attempts to repair read errors (although it's willing to try the read again later) while it immediately repairs bad checksums by rewriting the data in place.

I was under the impression that it wasn't in-place, but rather COW, just like (almost) all other types of I/O issued by ZFS. In vdev_raidz.c and vdev_mirror.c, a zio_nowait(zio_vdev_child_io()) is issued with the ZIO_FLAG_IO_REPAIR flag; this in turn calls zio_create(), which constructs the zio_t data structure. There doesn't seem to be any special flag set to indicate an in-place overwrite.

http://src.illumos.org/source/

As I understand ZFS, only the uberblock is regularly overwritten, or blocks that are marked as 'free'.

By cks at 2012-04-28 15:36:51:

You bring up a good point about whether checksum errors are repaired in place; I haven't checked the source or instrumented a system, so I don't know for sure. However, I expect it for two reasons. First, there's no reason not to; unlike other in-place overwrites, ZFS knows that there is no good data at the target location. Second, writing the repaired data somewhere else would cause a cascade of additional updates (since you now have to update block pointers to point to the new location of the repaired data). This seems undesirable and I'm not sure it'd always be possible.

(In fact it seems like ZFS would have to do some rather twisted things to update old snapshots that needed checksum repairs.)
