ZFS's problem with generic messages

July 14, 2012

One of our ZFS pools just experienced an error, and this time around I actually read the status message that 'zpool status' prints out. Here, let me show it to you:

status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.

The actual error was a checksum error (a single one) with no 'repaired' count; the pool is mirrored and reported no known data errors.

The problem here is that checksum errors are normally fixed automatically; after reconstructing the affected data from the pool's redundancy, ZFS rewrites the blocks that had a bad checksum (in place). This is what those 'repaired' sizes are reporting on. Thus normally a checksum error should not be reported as 'unrecoverable', because by the time you see the message ZFS has already recovered from the error.
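To make the normal self-healing behavior concrete, here is a toy Python model of a mirrored read. It is only a sketch of the idea, not how ZFS is implemented; the ToyMirror class and its cksum_errors and repaired_bytes counters are names made up for this illustration. The one real ZFS property it leans on is that the expected checksum is stored separately from the data it covers, so a damaged copy can be told apart from a good one.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Checksum of a block (SHA-256 here; the choice is arbitrary for the toy)."""
    return hashlib.sha256(data).hexdigest()


class ToyMirror:
    """Hypothetical n-way mirror holding a single block. Illustration only."""

    def __init__(self, data: bytes, copies: int = 2):
        # The expected checksum lives apart from the copies themselves.
        self.expected = checksum(data)
        self.sides = [bytearray(data) for _ in range(copies)]
        self.cksum_errors = [0] * copies    # per-side checksum error counts
        self.repaired_bytes = [0] * copies  # per-side 'repaired' sizes

    def read(self) -> bytes:
        bad = []
        good = None
        for i, side in enumerate(self.sides):
            if checksum(bytes(side)) == self.expected:
                good = bytes(side)
                break
            # This copy failed verification; count it and try the next side.
            self.cksum_errors[i] += 1
            bad.append(i)
        if good is None:
            # Only this case deserves the word 'unrecoverable'.
            raise IOError("no copy matches the checksum")
        for i in bad:
            # Self-heal: rewrite the damaged copy in place from the good one.
            self.sides[i][:] = good
            self.repaired_bytes[i] += len(good)
        return good


if __name__ == "__main__":
    m = ToyMirror(b"hello zfs")
    m.sides[0][0] ^= 0xFF  # simulate corruption on one side of the mirror
    print(m.read(), m.cksum_errors, m.repaired_bytes)
```

In this model a checksum error on one side always shows up together with a non-zero repaired size, and the application still gets good data. An error count with nothing repaired, which is what we saw, is the case the model cannot produce unless every copy is bad.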

This leaves you with a dilemma: is ZFS actually telling the truth here and for some reason the checksum error was not recoverable and the data with the bad checksum is still there? This would explain why there is no count of the data repaired. Or is ZFS simply using some sort of generic message about the status, despite parts of it being completely wrong, and the lack of a 'repaired' count is for other reasons?

(One reason I can think of for having nothing to repair is that the checksum error was noticed in something being deleted. ZFS could be smart enough to not bother wasting time with a checksum repair in this situation.)

Trawling through my mail archive suggests that it's the second case (we have saved status reports with the same message and 'see:' error code that do show a repaired count). This is a problem. 'Unrecoverable error' is very strong language to a sysadmin, and using it when the error is not actually unrecoverable means that this message is an alarming lie.

(Apparently I was correct when in the past I've mostly ignored the status: bits that 'zpool status' prints out.)

Oh well. Solaris doesn't surprise me any more, and anyways our version of Solaris is very out of date so this may have been fixed since then.
