== An alarming ZFS status message and what is usually going on with it

Suppose that you have a ZFS pool with redundancy (mirroring or ZFS's version of RAID 5 or RAID 6), and that someday you run '_zpool status_' and see the alarming output:

> _status: One or more devices has experienced an unrecoverable error. An
> attempt was made to correct the error. Applications are unaffected._
>
> _action: Determine if the device needs to be replaced, and clear the
> errors using 'zpool clear' or replace the device with 'zpool replace'._

(This has been re-linewrapped for my convenience.)

The rest of the _zpool status_ output should have one or more disks with non-zero _CKSUM_ fields and a final line that reports '_errors: No known data errors_'.

What this generally really means is something like this:

> ZFS has detected repairable checksum errors and has repaired them by
> rewriting the affected disk blocks. If the errors are from a slowly
> failing disk, replace the disk with '_zpool replace_'; if they are
> instead from temporary problems in the storage system, clear this
> message and the error counts with '_zpool clear_'. You may wish to check
> this pool for other latent errors with '_zpool scrub_'.

(I have to admit that Sun's own [[error explanation page http://www.sun.com/msg/ZFS-8000-9P]] for this is pretty good, too. Such usefulness is unfortunately somewhat novel for these pages, which explains why I didn't look at it before now.)

I assume that ZFS throws up this alarming status message even though it automatically handled the issue because it doesn't want to hide from you the fact that a problem happened. While the problem might just be a temporary glitch (we've seen this a few times on our [[iSCSI based fileservers ZFSFileserverSetup]]), it might instead be an indication of a more serious issue that you should look into, so at the very least you need to know that something happened. (And even temporary glitches shouldn't happen all that often, or ideally at all; if they do, you have a problem somewhere.)

=== Sidebar: Our experience with these errors

We've seen a few of these temporary glitches on our [[iSCSI based fileservers ZFSFileserverSetup]]. So far our procedure for dealing with them is to note down at least which disk had the checksum errors (sometimes we save the full '_zpool status_' output for the pool), '_zpool clear_' the errors on that specific disk, and then '_zpool scrub_' the pool. This should normally turn up a clean bill of health; if it doesn't, I would re-clear and re-scrub and then panic if the second scrub did not come back clean. (Okay, I wouldn't panic, but I would replace the disk as fast as possible.) A sketch of this command sequence is at the end of this entry.

On our fileservers, my suspicions fall on the hardware or driver for the onboard nVidia Ethernet ports. The fileservers periodically report that they lost and then immediately regained the link on _nge0_, which is one of the iSCSI networks, and usually report ((vhci_scsi_reset)) warnings at the same time. Unfortunately, the ever so verbose Solaris fault manager system does not log when the ZFS checksum errors are detected, so we can't correlate them with the _nge0_ link resets. (As contributing evidence, the [[Linux iSCSI backends ../linux/LinuxISCSITargets]], running on very similar hardware, also had problems with their onboard nVidia Ethernet ports under sufficient load.)
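
As a concrete illustration of the clear-and-scrub procedure described above, here is a hedged sketch of the command sequence. The pool name 'tank' and the disk names 'c1t2d0' and 'c1t3d0' are made-up placeholders, not anything from our actual fileservers:

  # See which disk has the non-zero CKSUM count and save the output for later.
  zpool status -v tank > /root/tank-status-$(date +%Y%m%d)

  # Clear the error counts on just that disk.
  zpool clear tank c1t2d0

  # Scrub the pool, then check on its progress and eventual results.
  zpool scrub tank
  zpool status tank

  # If a second clear-and-scrub cycle still turns up errors, replace the
  # disk (here with a hypothetical spare disk, c1t3d0).
  zpool replace tank c1t2d0 c1t3d0

A clean scrub is what lets you write the whole thing off as a temporary glitch; anything else points back at the disk (or the path to it).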
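
For what it's worth, the _nge0_ link flaps and ((vhci_scsi_reset)) warnings do get syslogged, so you can at least see when those happened even if the fault manager won't tell you when the checksum errors were detected. This is a hedged sketch that assumes the messages land in the usual Solaris _/var/adm/messages_ location and that the exact message text may vary:

  # When did nge0 bounce, and when did the SCSI vHCI reset warnings happen?
  grep -i nge0 /var/adm/messages | grep -i link
  grep -i vhci_scsi_reset /var/adm/messages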