An unpleasant surprise about ZFS scrubbing in Solaris 10 U6

July 10, 2009

Here is something that we discovered recently: ZFS will refuse to scrub a pool that is DEGRADED, even if the degraded state is harmless to actual pool redundancy and there is no resilvering going on. In the usual ZFS manner, it doesn't give you any actual errors, it just doesn't do anything when you ask for a pool scrub.

(Now, I can't be completely and utterly sure that it was the DEGRADED state that blocked the scrub and not coincidence or something unrelated. But I do know that the moment we zpool detach'd the faulted device, restoring its vdev and thus the entire pool to the normal ONLINE state, we could start a scrub that did something.)

Regardless of what exactly is causing this, this behavior is bad (and a number of other words). When your pool is experiencing problems is exactly when you most want to scrub it, so you have the best information possible about how bad the problem is (and where it is) and you don't take actions in haste that actually make your problems worse.

I don't know what you could do if you couldn't detach the device. It's possible that ZFS somehow thought that the pool was still resilvering and thus that either exporting and importing the pool or rebooting the server would have fixed the problem (both of these often reset information about scrubs in zpool status output).

(Neither was an option on a production fileserver, so we didn't try them; this is pure speculation.)

Sidebar: exactly what happened

Last week, one side of one mirror in a pool on one of our ZFS fileservers started reporting read errors (and the iSCSI backend started reporting drive errors to go with it). Since we were shorthanded due to vacations, we opted to not immediately replace the disk; instead we added a third device to that vdev to make it a three-way mirror, so that we would still have a two-way mirror even if the disk failed completely. That evening, the disk started throwing up enough read errors that ZFS declared it degraded, pulled in one of the configured spares for that pool, and reconstructed the mirror, exactly as it was supposed to.

However, this put that vdev and thus the entire pool into a DEGRADED state, which is reasonably logical from the right perspective (the pool is still fully redundant, but it has degraded from the configuration you set up). And, as mentioned, we couldn't scrub the pool; attempts to do so did nothing except reset the time at which the 'add-the-spare' resilver nominally completed to the current time in the output of zpool status.
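Since zpool scrub gives no error when it silently does nothing, the only way to tell is to look at zpool status afterwards. Here is a minimal sketch of automating that check in Python. It assumes that zpool status reports scrub state on a line starting with 'scrub:' and that a running scrub shows up as 'scrub in progress', which is how I remember the output of this era looking; the function names and the pool name 'tank' are made up for illustration, and it needs Python 3.7 or later for subprocess.run's capture_output.

```python
import subprocess


def scrub_line(status_text):
    """Return the text after 'scrub:' in zpool status output, or None.

    Assumes the status output has a line of the form ' scrub: ...'.
    """
    for line in status_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("scrub:"):
            return stripped[len("scrub:"):].strip()
    return None


def scrub_is_running(status_text):
    """True only if the scrub line says a scrub is in progress."""
    line = scrub_line(status_text)
    return line is not None and line.startswith("scrub in progress")


def start_scrub_and_verify(pool):
    """Ask for a scrub, then check whether one actually started.

    'pool' is whatever your pool is called (e.g. the hypothetical 'tank').
    Returns False if zpool status does not show a scrub in progress.
    """
    subprocess.run(["zpool", "scrub", pool], check=True)
    status = subprocess.run(
        ["zpool", "status", pool],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return scrub_is_running(status)
```

A False result here is only a hint, not proof of this problem; a scrub of a very small pool could plausibly finish before the status check runs, in which case the line would report a completed scrub instead.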

Comments on this page:

From at 2009-07-10 08:30:44:

Remind me again why people keep saying this FS is better than sliced bread?

I can understand the idea that it's zomg awesome when everything is peachy, but any highly redundant system that rolls over at the slightest unexpected error isn't highly redundant. I just don't get what it is about this system that makes everyone so excited.

Matt Simmons

By cks at 2009-07-10 14:14:16:

ZFS's features are very attractive and in some areas (areas that do matter) ZFS does a significantly better job at avoiding data loss than a normal RAID can.

(And per an earlier entry (well, its comments), I don't think that Sun has ever claimed that ZFS is high availability or highly resilient, just that it won't let your files get silently corrupted. And ZFS is very good at that.)

The good news is that all of these issues I keep running into are quality of implementation problems, not fundamental issues with ZFS. So in the long run they can (in theory) be repaired.

Written on 10 July 2009.

Last modified: Fri Jul 10 01:40:18 2009