An alarming ZFS status message and what is usually going on with it
Suppose that you have a ZFS pool with redundancy (mirroring or ZFS's
version of RAID 5 or RAID 6), and that someday you run 'zpool status'
and see the alarming output:
status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'.
(This has been re-linewrapped for my convenience.)
The rest of the 'zpool status' output should have one or more disks with non-zero CKSUM fields and a final line that reports 'errors: No known data errors'.
What this usually means is something like this:
ZFS has detected repairable checksum errors and has repaired them by rewriting the affected disk blocks. If the errors are from a slowly failing disk, replace the disk with 'zpool replace'; if they are instead from temporary problems in the storage system, clear this message and the error counts with 'zpool clear'. You may wish to check this pool for other latent errors with 'zpool scrub'.
(I have to admit that Sun's own error explanation page for this is pretty good, too. This is unfortunately somewhat novel, which explains why I didn't look at it before now.)
I assume that ZFS throws up this alarming status message even though it automatically handled the issue because it doesn't want to hide that a problem happened from you. While the problem might just be a temporary glitch (we've seen this a few times on our iSCSI based fileservers), it might instead be an indication of a more serious issue that you should look into, so at least you need to know that something happened.
(And even temporary glitches shouldn't happen all that often, or ideally at all; if they do, you have a problem somewhere.)
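If you want to spot this situation without eyeballing the output, the check described above (some rows with a non-zero CKSUM field, plus a final 'errors: No known data errors' line) can be sketched in a few lines of Python. This is only a rough illustration: the function names and the sample output are made up for the example, the pool and device names are hypothetical, and the parsing assumes the usual 'NAME STATE READ WRITE CKSUM' table layout. It reports any row with a non-zero CKSUM field, which can include vdev rows such as a mirror as well as individual disks.

```python
# Illustrative sample only; pool name, device names and counts are made up.
SAMPLE = """
  pool: tank
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c0t0d0  ONLINE       0     0     0
            c0t1d0  ONLINE       0     0     3

errors: No known data errors
"""

def rows_with_cksum_errors(status_text):
    """Return {name: cksum_field} for table rows whose CKSUM field is not "0"."""
    bad = {}
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_table = True          # the device table starts after this header
            continue
        if in_table:
            if not fields:
                in_table = False     # a blank line ends the table in this sketch
                continue
            # Compare as text: some versions abbreviate large counts.
            if len(fields) >= 5 and fields[4] != "0":
                bad[fields[0]] = fields[4]
    return bad

def no_known_data_errors(status_text):
    """True if the output contains the reassuring final line."""
    return any(line.strip() == "errors: No known data errors"
               for line in status_text.splitlines())

print(rows_with_cksum_errors(SAMPLE))
print(no_known_data_errors(SAMPLE))
```

In real use you would feed it the text from 'zpool status' for the pool rather than a canned string. A pool with separate log or spare sections has more structure than this sketch handles.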
Sidebar: Our experience with these errors
We've seen a few of these temporary glitches with our iSCSI based
fileservers. So far our procedure to deal with
this is to note down at least which disk had the checksum errors
(sometimes we save the full 'zpool status' output for the pool),
'zpool clear' the errors on that specific disk, and then
'zpool scrub' the pool. This should normally turn up a clean bill of health;
if it doesn't, I would re-clear and re-scrub and then panic if the
second scrub did not come back clean. (Okay, I wouldn't panic, but I
would replace the disk as fast as possible.)
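As a rough sketch of that procedure, here is the same sequence written out in Python. The function names, the pool name 'tank' and the disk name 'c0t1d0' are all hypothetical, and this is an illustration of the steps rather than the script we actually use. It defaults to a dry run that only prints the commands, so nothing touches a pool unless you ask for it.

```python
import subprocess

def glitch_commands(pool, device):
    """Return the note / clear / scrub commands, in the order we run them."""
    return [
        ["zpool", "status", pool],         # record which disk had the errors
        ["zpool", "clear", pool, device],  # reset the counts on that disk only
        ["zpool", "scrub", pool],          # start a scrub of the whole pool
    ]

def run_procedure(pool, device, dry_run=True):
    """Print each command; only execute it when dry_run is False."""
    for argv in glitch_commands(pool, device):
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)

# Dry run with made-up names: prints the three commands and runs nothing.
run_procedure("tank", "c0t1d0")
```

Note that 'zpool scrub' only starts the scrub and returns; you still have to look at 'zpool status' later to see whether it came back clean.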
On our fileservers, my suspicions are on the hardware or driver for the
onboard nVidia Ethernet ports. The fileservers periodically report that
they lost and then immediately regained the link on nge0, which is one
of the iSCSI networks, and usually report
at the same time. Unfortunately, the ever so verbose Solaris fault
manager system does not log when the ZFS checksum errors are detected,
so we can't correlate them to nge0 link resets.
(In contributing evidence, the Linux iSCSI backends, running on very similar hardware, also had problems with their onboard nVidia Ethernet ports under sufficient load.)