How ZFS lets you recover from damaged metadata, and what the limitations are

May 11, 2011

Current versions of Solaris have a '-F' option for zpool import and zpool clear, which is documented like this:

Initiates recovery mode for an unopenable pool. Attempts to discard the last few transactions in the pool to return it to an openable state. Not all damaged pools can be recovered by using this option. If successful, the data from the discarded transactions is irretrievably lost.

The current Oracle documentation has a somewhat more detailed explanation, which mentions that this is specifically and only for recovering from damaged pool metadata (which you may have wound up with because your disks lied to you).

In a nutshell, what this does is revert to a previous uberblock for the pool and with it the previous top-level metadata (since more or less the only reason you update the uberblock is to use new top-level metadata). This loses all changes that were committed in transaction groups between the uberblock you're reverting to and the current uberblock. How many changes and how much data that is depends at least on how active your pool is.

(Under normal circumstances the transaction group commit interval is 30 seconds, but things can force transaction groups to commit more frequently.)
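
To make the cost of reversion concrete, here is a minimal Python sketch of the idea, not the actual ZFS code. The Uberblock class, its metadata_ok flag, and all three function names are hypothetical stand-ins of mine; the only number taken from the text is the 30 second interval, and since transaction groups can commit faster than that, the time figure is only a rough estimate.

```python
from dataclasses import dataclass

TXG_INTERVAL_SECONDS = 30  # normal commit interval mentioned in the text


@dataclass(frozen=True)
class Uberblock:
    txg: int           # transaction group that wrote this uberblock
    metadata_ok: bool  # hypothetical: does its top-level metadata read back cleanly?


def pick_uberblock(uberblocks):
    """Return the newest uberblock whose metadata is readable, or None."""
    usable = [ub for ub in uberblocks if ub.metadata_ok]
    return max(usable, key=lambda ub: ub.txg, default=None)


def discarded_txgs(uberblocks, chosen):
    """Transaction groups whose changes are lost by reverting to `chosen`."""
    newest = max(ub.txg for ub in uberblocks)
    return list(range(chosen.txg + 1, newest + 1))


def rough_window_seconds(txg_count, interval=TXG_INTERVAL_SECONDS):
    """Rough time span covered by that many txgs at the normal interval."""
    return txg_count * interval
```

If the two most recent uberblocks point to damaged metadata, pick_uberblock() returns the third newest, discarded_txgs() lists the two transaction groups that are thrown away, and rough_window_seconds(2) puts that at about a minute of changes on a pool that commits at the normal rate.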

In theory, you have a lot of potential uberblocks to revert to; pretty much any pool will have had at least 128 transaction groups commit and thus will have 128 old uberblocks. A pool with sufficiently many vdevs can dredge up even more than that. In practice, there are at least two things that can limit how successful your uberblock reversion is.
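
One way to picture where the 128 figure comes from is to treat the on-disk uberblocks as a fixed-size ring. The following Python sketch assumes that each transaction group writes exactly one uberblock and that the slot is picked as the transaction group number modulo the slot count; both are simplifying assumptions of mine, and the function names are made up for illustration.

```python
UBERBLOCK_SLOTS = 128  # slot count from the text; treated here as a fixed ring size


def slot_for_txg(txg, slots=UBERBLOCK_SLOTS):
    """Ring slot a txg's uberblock lands in (assumed: txg modulo slot count)."""
    return txg % slots


def surviving_txgs(current_txg, slots=UBERBLOCK_SLOTS):
    """Txgs whose uberblocks have not yet been overwritten in the ring,
    assuming one uberblock is written per txg starting from txg 1."""
    oldest = max(1, current_txg - slots + 1)
    return range(oldest, current_txg + 1)
```

In this model the uberblock for transaction group N is overwritten when transaction group N + 128 commits, which is why the ring holds the last 128 of them at most.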

First, you need the damaged metadata to have been changed between the current uberblock and the uberblock you're reverting to. This is certain for the very top-level metadata, but I believe it's possible for lower-level metadata to not have changed and so not have had a new version written. If this lower-level metadata gets damaged, a whole run of uberblocks can be made useless at once.

Second, once a transaction group commits, the disk space of the previous uberblock's top-level metadata is now generally unused and so is eligible to be written over by new data. ZFS will not reuse just-freed blocks for three transaction groups (per Eric Schrock's comments here), but beyond that there is no guarantee that the metadata an old uberblock points to is still intact and hasn't been overwritten. As Eric Schrock mentions, ZFS's uberblock reversion will thus normally not go back further than three transaction groups (you can override this limit with the undocumented '-X' command line switch if you really need to).

(The top level metadata is not always unused, due to snapshots.)
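
The delayed reuse can be modeled with a toy allocator. This is a Python sketch of the behavior as described above, not of the real metaslab code; ToyAllocator, safe_to_rewind() and the DEFER_TXGS constant are my own names, the value 3 is simply taken from the text, and the exact off-by-one behavior of the real implementation is not something this model tries to capture.

```python
DEFER_TXGS = 3  # "three transaction groups", taken from the text, not from the ZFS source


class ToyAllocator:
    """Holds freed blocks back for `defer` txgs before they can be reused."""

    def __init__(self, defer=DEFER_TXGS):
        self.defer = defer
        self.pending = {}   # txg in which blocks were freed -> set of block ids
        self.free = set()   # blocks that new writes may now overwrite

    def free_blocks(self, txg, blocks):
        """Record that `blocks` were freed in transaction group `txg`."""
        self.pending.setdefault(txg, set()).update(blocks)

    def commit(self, txg):
        """On commit of `txg`, release blocks freed at least `defer` txgs ago."""
        for freed_txg in [t for t in self.pending if txg - t >= self.defer]:
            self.free |= self.pending.pop(freed_txg)


def safe_to_rewind(current_txg, target_txg, defer=DEFER_TXGS):
    """True if the target's metadata cannot yet have been reused in this model."""
    return current_txg - target_txg <= defer
```

In this model a block freed in transaction group 10 only shows up in the free set once transaction group 13 commits, and rewinding more than three transaction groups is the point where you start gambling on old metadata not having been overwritten.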

In the days before Solaris explicitly supported uberblock reversion, you could do it by hand on an exported pool by the brute force mechanism of invalidating or destroying uberblocks via raw disk writes. This did require you to get all of the uberblock copies, which could be quite a lot. People wrote some tools to help with this; I'm not going to link to any of them (since I don't know if they still work well) but you can find them with web searches if you really need them.

PS: since I just found it in the OpenSolaris source, the delayed reuse is controlled by TXG_DEFER_SIZE in uts/common/fs/zfs/sys/txg.h and the code that uses it in uts/common/fs/zfs/metaslab.c. It appears to work by not adding the theoretically freed blocks to the in-memory free space maps.

Sidebar: synchronous operations, the ZIL, and transaction groups

You might wonder if forced filesystem syncs from applications or the environment cause transaction groups to commit more frequently. My understanding is that they do not; instead, the necessary change information is written to the ZIL (the ZFS Intent Log) and then later rolled into a normally scheduled transaction group commit. This happens whether or not you have a separate log device for the ZIL.

(These ZIL writes do cause low level disk flushes if disk caches are enabled, but that's a separate issue.)
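
As a toy illustration of that understanding (and only that; this Python sketch is not derived from the ZFS source, and ToyPool and its methods are hypothetical), a synchronous write adds a record to the intent log but leaves the open transaction group alone:

```python
class ToyPool:
    """Toy model: sync writes go to the ZIL, txgs still commit on schedule."""

    def __init__(self):
        self.txg = 1            # currently open transaction group
        self.open_changes = []  # changes waiting for the next txg commit
        self.zil = []           # intent log records for synchronous changes
        self.committed = []     # changes that are part of a committed txg

    def write(self, change, sync=False):
        self.open_changes.append(change)
        if sync:
            # Made durable through the intent log, without forcing
            # the open txg to commit early.
            self.zil.append(change)

    def commit_txg(self):
        # The regularly scheduled commit picks up everything, after which
        # the log records for those changes are no longer needed.
        self.committed.extend(self.open_changes)
        self.open_changes.clear()
        self.zil.clear()
        self.txg += 1
```

Calling write() with sync=True leaves the txg counter where it was; only commit_txg() advances it.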

I don't know all of the things that do cause transaction groups to commit faster than usual, but large amounts of data getting written is definitely one of them.
