2022-03-13
We do see ZFS checksum failures, but only infrequently
One of the questions hovering behind ZFS is how often, in practice, you actually see data corruption issues that are caught by checksums and other measures, especially on modern solid state disks. In our old OmniOS and iSCSI fileserver environment we saw somewhat regular ZFS checksum failures, but that environment had a lot of moving parts, ranging from iSCSI through spinning rust. Our current fileserver environment uses local SSDs, and initially it seemed we were simply not experiencing checksum failures any more. Over time, though, we have experienced some (at least some that weren't simply from SSDs that failed completely minutes later).
Because there's no in-pool persistent count of errors, I have to extract this information from our worklog reports of clearing checksum errors, which means that I may well have missed some. Our current fileserver infrastructure has been running since around September of 2018, so many pools are now coming up on three and a half years old.
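(For illustration, here's a rough sketch of what keeping such a record outside the pool could look like. This isn't something we actually run; the log path and the parsing of 'zpool status' output are assumptions, and it just snapshots non-zero per-device CKSUM counters before they can be wiped by a 'zpool clear' or a reboot.)

    #!/usr/bin/env python3
    # Sketch: record non-zero per-device ZFS checksum counters to a log,
    # since the in-pool counters are reset by 'zpool clear' and reboots.
    # The log path is hypothetical and the parsing is deliberately naive
    # (it assumes the usual NAME/STATE/READ/WRITE/CKSUM table layout).
    import datetime
    import subprocess

    LOGFILE = "/var/log/zfs-cksum-history.log"  # hypothetical location

    def snapshot_checksum_errors():
        # 'zpool status' prints a per-vdev table with READ/WRITE/CKSUM columns.
        out = subprocess.run(["zpool", "status"], capture_output=True,
                             text=True, check=True).stdout
        now = datetime.datetime.now().isoformat(timespec="seconds")
        records = []
        for line in out.splitlines():
            fields = line.split()
            # Device rows look like: NAME STATE READ WRITE CKSUM [notes...]
            # Non-numeric rows (headers, 'scan:', 'errors:') are skipped.
            if len(fields) >= 5 and fields[2].isdigit() and fields[4].isdigit():
                name, state, cksum = fields[0], fields[1], int(fields[4])
                if cksum > 0:
                    records.append(f"{now} {name} state={state} cksum={cksum}")
        if records:
            with open(LOGFILE, "a") as fp:
                fp.write("\n".join(records) + "\n")
        return records

    if __name__ == "__main__":
        for rec in snapshot_checksum_errors():
            print(rec)

(Run from cron, something like this would leave a timestamped trail of checksum counts per device, instead of making you reconstruct the history from worklog messages.)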
- In early 2019, an SSD experienced an escalating series of checksum
failures over multiple days that eventually caused ZFS to fault the
disk out. We replaced the SSD. No I/O errors were ever reported for
it.
- In mid-2019, an SSD with no I/O errors had a single checksum failure
found in a scrub, which might have come from a NAND block failing and
being reallocated (based on SMART data). The disk is still in service
as far as I can tell, with no other problems.
- At the end of August 2019, an otherwise problem-free SSD had one
checksum error found in a scrub. Again, SMART data suggests it
may have been some sort of NAND block failure that resulted in a
reallocation. The disk is still in service with no other problems.
- In mid-2021, an SSD reported six checksum errors during a scrub. As in all the other cases, SMART data suggests there was a NAND block failure and reallocation, and the disk didn't report any I/O errors. The disk is still in service with no other problems.
(We also had an SSD report a genuine read failure at the end of 2019. ZFS repaired 128 KB and the pool scrubbed fine afterward.)
So we've seen three incidents of checksum failures (two of which were only for a single ZFS block) on disks that have otherwise been completely fine, and one case where an escalating series of checksum failures was an early warning that got the disk faulted out and replaced. We started out with six fileservers, each with 16 ZFS data disks, and added a seventh fileserver later (none of these SSD checksum reports are from the newest fileserver). Conservatively, this means that our four incidents are spread across 96 disks.
(At the same time, this means four out of 96 or so SSDs had a checksum problem at some point, which is about a 4% rate.)
We have actually had a number of SSD failures on these fileservers. I'm not going to try to count how many, but I'm pretty certain that there have been more than four. This means that in our fileserver environment, SSDs seem to fail outright more often than they experience checksum failures. Having written this entry, I'm actually surprised by how infrequent checksum failures seem to be.
(Counting SSD failures exactly would require going back through worklog messages too, which is why I'm not doing it.)