An alarming ZFS status message and what is usually going on with it

February 4, 2009

Suppose that you have a ZFS pool with redundancy (mirroring, or raidz and raidz2, ZFS's versions of RAID 5 and RAID 6), and that one day you run 'zpool status' and see the alarming output:

status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.

action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'.

(This has been re-linewrapped for my convenience.)

The rest of the zpool status output should have one or more disks with non-zero CKSUM fields and a final line that reports 'errors: No known data errors'.
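
(For illustration, with a hypothetical mirrored pool named 'tank' and made-up disk names, the relevant part of that output looks something like this:

	NAME        STATE     READ WRITE CKSUM
	tank        ONLINE       0     0     0
	  mirror    ONLINE       0     0     0
	    c1t2d0  ONLINE       0     0     0
	    c1t3d0  ONLINE       0     0     2

errors: No known data errors

Here c1t3d0 is the disk that had the checksum errors.)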

What this really means is generally something like this:

ZFS has detected repairable checksum errors and has repaired them by rewriting the affected disk blocks. If the errors are from a slowly failing disk, replace the disk with 'zpool replace'; if they are instead from temporary problems in the storage system, clear this message and the error counts with 'zpool clear'. You may wish to check this pool for other latent errors with 'zpool scrub'.
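
(In concrete command terms, using the same hypothetical pool 'tank' and disk 'c1t3d0', the two choices look roughly like this:

	zpool clear tank c1t3d0      # the errors were a temporary glitch; forget them
	zpool replace tank c1t3d0    # the disk is failing; resilver onto a replacement
	zpool scrub tank             # either way, check the pool for other latent errors

'zpool replace' also accepts a second device argument if the replacement disk is at a different location than the old one.)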

(I have to admit that Sun's own error explanation page for this is pretty good too. Unfortunately this is somewhat novel, which explains why I hadn't looked at it before now.)

I assume that ZFS throws up this alarming status message even though it automatically handled the issue because it doesn't want to hide from you that a problem happened. While the problem might just be a temporary glitch (we've seen this a few times on our iSCSI-based fileservers), it might instead be an indication of a more serious issue that you should look into, so at the very least you need to know that something happened.

(And even temporary glitches shouldn't happen all that often, or ideally at all; if they do, you have a problem somewhere.)

Sidebar: Our experience with these errors

We've seen a few of these temporary glitches with our iSCSI-based fileservers. So far, our procedure for dealing with this is to note down at least which disk had the checksum errors (sometimes we save the full 'zpool status' output for the pool), 'zpool clear' the errors on that specific disk, and then 'zpool scrub' the pool. This should normally turn up a clean bill of health; if it doesn't, I would re-clear and re-scrub, and then panic if the second scrub did not come back clean. (Okay, I wouldn't panic, but I would replace the disk as fast as possible.)
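
(As concrete commands, again with the hypothetical pool 'tank' and disk 'c1t3d0', this is roughly:

	zpool status -v tank     # save this output somewhere if you want a record
	zpool clear tank c1t3d0  # clear the errors on that specific disk
	zpool scrub tank         # start a scrub of the whole pool
	zpool status tank        # check back later to see how the scrub went

The scrub runs in the background, so you have to come back to 'zpool status' afterwards to see whether it found anything.)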

On our fileservers, my suspicions are on the hardware or the driver for the onboard nVidia Ethernet ports. The fileservers periodically report that they lost and then immediately regained the link on nge0, the interface on one of the iSCSI networks, and they usually report vhci_scsi_reset warnings at the same time. Unfortunately, the ever so verbose Solaris fault manager system does not log when the ZFS checksum errors are detected, so we can't correlate them with the nge0 link resets.

(As contributing evidence, the Linux iSCSI backends, running on very similar hardware, also had problems with their onboard nVidia Ethernet ports under sufficient load.)
