We may have seen a ZFS checksum error be an early signal for later disk failure

November 23, 2016

I recently said some things on Twitter about our experience with ZFS checksums, and it turns out I have to partly take one bit of it back. And in that lies an interesting story about something that may or may not be a coincidence.

A couple of weeks ago, we had our first disk failure in our new fileserver environment; everything went about as smoothly as we expected and our automatic spares system fixed things up in the short term. Specifically, what failed was one of the SSDs in our all-SSD fileserver, and it went off the cliff abruptly, going from all being fine to reporting some problems to having so many issues that ZFS faulted it within a few hours. And that SSD hadn't reported any previous problems, with no one-off read errors or the like.

Well, sort of. Which is where the interesting part comes in. Today, when I was checking our records for another reason, I discovered that a single ZFS checksum error had been reported against that disk back at the end of August. There were no IO errors reported on either the fileserver or the iSCSI backend, and the checksum error didn't repeat on a scrub, so I wrote it off as a weird one-off glitch.

(And I do mean 'one checksum error', as in ZFS's checksum error count was '1'. And ZFS didn't report that any bytes of data had been fixed.)
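(As an illustration of what noticing this programmatically might look like, here is a minimal Python sketch that reports devices with a nonzero CKSUM counter by parsing 'zpool status' output. The five-column device-line format it assumes, and the decision to skip abbreviated counts, are assumptions for illustration, not anything our actual monitoring does.)

    #!/usr/bin/env python3
    # Minimal sketch: report vdevs whose CKSUM counter in 'zpool status'
    # output is nonzero. Assumes the usual five-column device lines
    # (NAME STATE READ WRITE CKSUM); abbreviated counts such as '1.2K'
    # are skipped rather than parsed.
    import subprocess

    def devices_with_cksum_errors():
        out = subprocess.run(["zpool", "status"],
                             capture_output=True, text=True, check=True).stdout
        hits = []
        for line in out.splitlines():
            fields = line.split()
            if len(fields) != 5:
                continue
            try:
                cksum = int(fields[4])
            except ValueError:
                continue  # the column header line or non-numeric counts
            if cksum > 0:
                hits.append((fields[0], cksum))
        return hits

    if __name__ == "__main__":
        for dev, count in devices_with_cksum_errors():
            print(f"{dev}: {count} checksum error(s)")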

This could be a complete coincidence. Or it could be that this SSD checksum error was actually an early warning signal that something was going wrong deep in the SSD. I have no answers, just a data point.

(We've now had another disk failure, this time an HD, and it didn't have any checksum errors in advance of the failure. Also, I have to admit that although I would like this to be an early warning signal because it would be quite handy, I suspect it's more likely to be pure happenstance. The checksum error being an early warning signal makes a really attractive story, which is one reason I reflexively distrust it.)

PS: We don't have SMART data from the SSD, either at the time of the checksum error or at the time of its failure. Next time around I'll be recording SMART data from any disk that has checksum errors reported against it, just in case something can be gleaned from it.
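(As a sketch of what such recording could look like, the following Python fragment snapshots 'smartctl -a' output for a given device into a timestamped file. The log directory and the idea of running it directly against a local device name are assumptions for illustration; in our environment the disks sit behind iSCSI backends, so the real thing would have to run there rather than on the fileserver.)

    #!/usr/bin/env python3
    # Hedged sketch: snapshot SMART data for a device so there is
    # something to compare against if the disk later dies.
    import datetime
    import pathlib
    import subprocess
    import sys

    LOGDIR = pathlib.Path("/var/log/disk-smart")   # hypothetical location

    def record_smart(device):
        LOGDIR.mkdir(parents=True, exist_ok=True)
        stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
        outfile = LOGDIR / f"{pathlib.Path(device).name}-{stamp}.txt"
        # smartctl exits nonzero for some warning conditions, so don't
        # treat that as fatal; just keep whatever output we got.
        result = subprocess.run(["smartctl", "-a", device],
                                capture_output=True, text=True)
        outfile.write_text(result.stdout + result.stderr)
        return outfile

    if __name__ == "__main__":
        for dev in sys.argv[1:]:
            print("wrote", record_smart(dev))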


Comments on this page:

From 54.240.193.1 at 2016-11-23 06:41:57:

Sadly, we would all like a magic flag to indicate when our HDDs/SSDs are going to drop off a cliff.

When you only have a few hundred drives, it is very hard to separate correlation from causation. One SSD and one checksum failure; I am sorry, but it could be anything.

Realistically we need to see a trend across 10,000 drives to draw a conclusion. Yes, this would be very useful...

I take it you're monitoring SMART stats through Nagios or similar.

By Miksa at 2016-11-24 08:01:56:

That's a good point. Google and Backblaze have released statistical studies about hard drive failures, but it would be useful if they also released more in-depth data about possible signs of failure.

By Aneurin Price at 2016-11-24 09:24:46:

My experience has been pretty much the same as you tweeted: checksums have been very useful for reassuring me, after something has gone wrong, that it's been fixed correctly, but they've never alerted me to something that wasn't already very loudly obvious.

I'm not using ZFS at high volume like some people - I'd estimate that I've probably seen no more than a few dozen TBW across all the systems that use it - but so far I've seen no sign of silent corruption happening. That's not to say I'd be happy to disable checksums, but fear of silent data corruption is not the reason.

By cks at 2016-11-24 10:56:31:

Miksa: Backblaze has written a couple of blog entries on what SMART stats they find have predictive value, one in 2014 and one in 2016. I have a vague memory that Google put at least some information about this in a paper or two, with some surprising results (e.g. I think Google found that high drive temperatures didn't predict failure in their population), but I don't have references handy.
