Our pragmatic experiences with (ZFS) disk errors in our infrastructure
I wrote before about when we replace disks based on errors (and then more on ZFS read errors). Today I want to talk about our pragmatic experiences in our fileserver infrastructure. The first and most important thing to understand about our experiences is that in our environment disk errors are indirect things. Because we are using iSCSI backends, ZFS does not have access to the actual SATA disk status; instead, all it gets is whatever the iSCSI backends report.
(I find it plausible and indeed likely that ZFS could behave somewhat differently if it was dealing directly with SATA disks and had better error reports available to it.)
On the backends themselves we see two levels of read errors, what I will call soft read errors and hard read errors. Both soft and hard read errors seem to generally result in SATA channel resets (which affect all disks on the channel); the difference between the two is that at the end of a soft error the read appears to succeed, while at the end of a hard error we see the Linux kernel log an actual read error (and then iSCSI relays the read error to Solaris and ZFS). On the backends, soft disk errors only report the ATA device name for the disk involved, which can make finding it a little bit interesting; hard read errors report the full name. Handling soft read errors can sometimes take long enough that Solaris sees an IO timeout and retries the IO (and logs a message about it), but usually the only sign on the fileservers themselves is slow IO.
(It's possible that some reads from soft errors are actually returning corrupted data and this is the cause of some of our checksum errors. However, I don't think we've seen a strong correlation between reported checksum errors in ZFS and soft read errors on the backends.)
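To make the soft versus hard distinction concrete, here is a minimal Python sketch of how backend kernel logs could be tallied. It is not our actual tooling. The message patterns and sample lines are assumptions that illustrate the general shape of libata reset messages and block layer IO errors, and the exact wording varies by kernel version. The function name `summarize` is hypothetical. Note that "hard resetting link" is the kernel's term for the kind of link reset and has nothing to do with "hard read errors" in the sense used here.

```python
import re
from collections import Counter

# Assumed message shapes; check them against your own kernel's logs.
# Channel resets name only the ATA port (eg 'ata3').
RESET_RE = re.compile(r"\b(ata\d+): (?:hard|soft) resetting link")
# Actual read errors name the block device (eg 'sdc').
IOERR_RE = re.compile(r"I/O error, dev (\w+), sector \d+")

def summarize(lines):
    """Count channel resets per ATA port and IO errors per block device."""
    resets = Counter()
    io_errors = Counter()
    for line in lines:
        m = RESET_RE.search(line)
        if m:
            resets[m.group(1)] += 1
            continue
        m = IOERR_RE.search(line)
        if m:
            io_errors[m.group(1)] += 1
    return resets, io_errors

# Illustrative sample lines, not real logs from our backends.
SAMPLE = [
    "kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen",
    "kernel: ata3: hard resetting link",
    "kernel: ata5: hard resetting link",
    "kernel: end_request: I/O error, dev sdc, sector 1234567",
]

resets, io_errors = summarize(SAMPLE)
print("resets by port:", dict(resets))
print("IO errors by device:", dict(io_errors))
```

A reset with no IO error logged afterwards is a candidate soft read error. Tying an `ataN` port back to a particular disk is a separate mapping step that this sketch does not attempt.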
Our experience is that SMART error reports (on the backends) are all but
useless. We do not always see SMART errors for hard read errors (much
less soft ones) and we see SMART errors reported on disks that have no
observable problems. At this point SMART reports are mostly useful for
catastrophic things like 'the disk disappeared'; however, we've seen
spurious reports even for those (our current theory is that a
check at the wrong time during a SATA channel reset can fail to see
the disk).
As far as we've been able to see, hard read errors do get reported to Solaris and ZFS and do result in ZFS read errors. However, I admit that we haven't generally done forward checks here (noticing hard read errors on the backends and then seeing that the Solaris fileservers reported read errors at the same time); instead, we have tended to work backwards from ZFS read errors on the fileservers to see that they are mostly hard read errors on the backends.
(Offhand, I'm not sure if we've seen ZFS read errors without hard read errors on the backends. It's a good question and we have some records, but I'm going to defer carefully checking them to a potential future entry.)
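The backward and forward checks described above amount to matching two lists of timestamps against each other. A minimal Python sketch of that idea follows. The function name `match_events`, the five minute window, and the sample timestamps are all hypothetical, and it assumes the fileserver and backend clocks are reasonably in sync.

```python
from datetime import datetime, timedelta

def match_events(events, others, window=timedelta(minutes=5)):
    """Split 'events' into those with and without a nearby entry in 'others'."""
    matched, unmatched = [], []
    for t in events:
        if any(abs(t - o) <= window for o in others):
            matched.append(t)
        else:
            unmatched.append(t)
    return matched, unmatched

# Hypothetical timestamps, for illustration only.
zfs_read_errors = [datetime(2011, 1, 10, 3, 15), datetime(2011, 1, 12, 9, 0)]
backend_hard_errors = [datetime(2011, 1, 10, 3, 13), datetime(2011, 1, 11, 22, 40)]

# Backward check: ZFS read errors with no hard read error on a backend.
_, zfs_only = match_events(zfs_read_errors, backend_hard_errors)
# Forward check: backend hard read errors that ZFS never reported.
_, backend_only = match_events(backend_hard_errors, zfs_read_errors)

print("ZFS read errors without backend errors:", zfs_only)
print("backend errors without ZFS read errors:", backend_only)
```

Calling the same function with the arguments swapped gives the forward check, which is the direction we have generally not done.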
We haven't seen ZFS write errors unless the actual disks go away entirely (eg, if we pull a live disk ZFS will light up with write errors in short order). I don't think we've noticed any backend reports about write errors on running disks.
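On the fileserver side, read and write errors show up in the READ and WRITE columns of 'zpool status'. The following Python sketch pulls out the devices with non-zero counters. The pool name, device names, and counts in the sample output are made up for illustration, the function name `devices_with_errors` is hypothetical, and the parsing assumes the usual NAME/STATE/READ/WRITE/CKSUM column layout.

```python
def devices_with_errors(status_text):
    """Return (name, read, write, cksum) for rows with any non-zero counter."""
    rows = []
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_config = True  # the device table starts after this header
            continue
        if not in_config:
            continue
        if fields and fields[0] == "errors:":
            break  # end of the config section
        if len(fields) < 5:
            continue  # blank lines and section labels such as 'spares'
        name, _state, read, write, cksum = fields[:5]
        # Counters are compared as strings, since large counts may be abbreviated.
        if (read, write, cksum) != ("0", "0", "0"):
            rows.append((name, read, write, cksum))
    return rows

# Made-up sample output, for illustration only.
SAMPLE_STATUS = """\
  pool: tank
 state: DEGRADED
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          mirror    DEGRADED     0     0     0
            c4t0d0  ONLINE       3     0     0
            c4t1d0  UNAVAIL      0    57     0  cannot open

errors: No known data errors
"""

bad = devices_with_errors(SAMPLE_STATUS)
for name, read, write, cksum in bad:
    print(name, "read:", read, "write:", write, "cksum:", cksum)
```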
Our old version of Solaris is generally okay with both soft and hard
read errors; soft errors sometimes cause IO timeouts and hard read
errors wind up with actual ZFS-visible read errors (sometimes after
timeouts), but that has mostly been it. The one exception is a single
Solaris fileserver install that got itself into an odd state that we
don't understand. Although it was theoretically identical to all of
our other fileservers, this single fileserver had a very bad reaction
to read errors at the ZFS level; after a while NFS became very slow or
non-responsive and all ZFS operations would usually eventually start
locking up entirely (even things like 'zpool status' for a pool not
experiencing IO problems). Once we identified the cause of its lockups,
we started aggressively replacing its backend disks the moment they
reported hard read errors. This machine had other iSCSI anomalies (eg,
it established iSCSI connections at boot very slowly) and we eventually
replaced its Solaris install, which seems to have made the problem go
away.
(Our troubleshooting was complicated by the fact that this is our only fileserver that uses 1.5 TB disks instead of 750 GB disks on the backends, and almost all of our problem disks have been 1.5 TB disks. We weren't clear whether it was just how ZFS reacted to this sort of slow hard read error over iSCSI, something different about the disks, some hardware problem on the fileserver itself, something different about the iSCSI backends it used, and so on.)