solaris/ZFSGenericMsgProblem written at 02:14:39
ZFS's problem with generic messages
One of our ZFS pools just experienced an error, and this time around
I actually read the status message that '
status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.
The actual error was a checksum error (a single one) with no 'repaired' count; the pool is mirrored and reported no known data errors.
The problem here is that checksum errors are normally fixed automatically; after reconstructing the affected data from other redundancy, ZFS will rewrite the blocks with a bad checksum (in place). This is what those 'repaired' sizes are reporting on. Thus normally a checksum error should not be reported as 'unrecoverable', because by the time you see the message ZFS has already recovered the error.
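(To make the normal sequence concrete, here is a toy sketch in Python of a mirror that self-heals on read. This is not ZFS's actual code or data layout; `ToyMirror`, its fields, and the use of SHA-256 as the checksum are all invented for illustration. It only models the idea described above: verify the checksum, fall back to the other copy, rewrite the bad copy, and count what was repaired.)

```python
import hashlib
from dataclasses import dataclass, field


def checksum(data: bytes) -> str:
    # SHA-256 is only a stand-in here, not a claim about what ZFS uses.
    return hashlib.sha256(data).hexdigest()


@dataclass
class ToyMirror:
    """Hypothetical model: N copies of each block plus one stored checksum."""
    sides: list                                   # one dict (blkno -> bytes) per side
    sums: dict = field(default_factory=dict)      # blkno -> expected checksum
    cksum_errors: list = field(default_factory=list)
    repaired_bytes: int = 0

    def __post_init__(self):
        # One checksum error counter per mirror side.
        self.cksum_errors = [0] * len(self.sides)

    def write(self, blkno: int, data: bytes) -> None:
        for side in self.sides:
            side[blkno] = data
        self.sums[blkno] = checksum(data)

    def read(self, blkno: int) -> bytes:
        bad = []
        for i, side in enumerate(self.sides):
            data = side[blkno]
            if checksum(data) == self.sums[blkno]:
                # Found a good copy: rewrite every bad copy seen so far.
                for j in bad:
                    self.sides[j][blkno] = data
                    self.repaired_bytes += len(data)
                return data
            self.cksum_errors[i] += 1
            bad.append(i)
        # Only here, with no good copy left, is the error truly unrecoverable.
        raise IOError("no copy of block %d passes its checksum" % blkno)


m = ToyMirror(sides=[{}, {}])
m.write(0, b"hello world")
m.sides[0][0] = b"hellp world"          # silently corrupt one side
assert m.read(0) == b"hello world"      # read succeeds from the other side
print(m.cksum_errors, m.repaired_bytes)  # [1, 0] 11
```

In this toy model, a checksum error on a redundant pool always comes with a non-zero repaired count unless every copy is bad, which is exactly why one checksum error with nothing repaired and no known data errors looks odd.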
This leaves you with a dilemma: is ZFS actually telling the truth here and for some reason the checksum error was not recoverable and the data with the bad checksum is still there? This would explain why there is no count of the data repaired. Or is ZFS simply using some sort of generic message about the status, despite parts of it being completely wrong, and the lack of a 'repaired' count is for other reasons?
(One reason I can think of for having nothing to repair is that the checksum error was noticed in something being deleted. ZFS could be smart enough to not bother wasting time with a checksum repair in this situation.)
Trawling through my mail archive suggests that it's the second case (we
have messages with the same message and '
(Apparently I was correct when in the past I've mostly ignored the
Oh well. Solaris doesn't surprise me any more, and anyways our version of Solaris is very out of date so this may have been fixed since then.