We may have seen a ZFS checksum error be an early signal for later disk failure
I recently said some things about our experience with ZFS checksums on Twitter, and it turns out I have to take one bit of it back a bit. And in that lies an interesting story about what may be a coincidence and may not be.
A couple of weeks ago, we had our first disk failure in our new fileserver environment; everything went about as smoothly as we expected and our automatic spares system fixed things up in the short term. Specifically, what failed was one of the SSDs in our all-SSD fileserver, and it went off the cliff abruptly, going from all being fine to reporting some problems to having so many issues that ZFS faulted it within a few hours. And that SSD hadn't reported any previous problems, with no one-off read errors or the like.
Well, sort of. Which is where the interesting part comes in. Today, when I was checking our records for another reason, I discovered that a single ZFS checksum error had been reported against that disk back at the end of August. There were no IO errors reported on either the fileserver or the iSCSI backend, and the checksum error didn't repeat on a scrub, so I wrote it off as a weird one-off glitch.
(And I do mean 'one checksum error', as in ZFS's checksum error count was '1'. And ZFS didn't report that any bytes of data had been fixed.)
This could be a complete coincidence. Or it could be that this SSD checksum error was actually an early warning signal that something was going wrong deep in the SSD. I have no answers, just a data point.
(We've now had another disk failure, this time a HD, and it didn't have any checksum errors in advance of the failure. Also, I have to admit that although I would like this to be an early warning signal because it would be quite handy, I suspect it's more likely to be pure happenstance. The checksum error being an early warning signal makes a really attractive story, which is one reason I reflexively distrust it.)
PS: We don't have SMART data from the SSD, either at the time of the checksum error or at the time of its failure. Next time around I'll be recording SMART data from any disk that has checksum errors reported against it, just in case something can be gleamed from it.