== How ZFS lets you recover from damaged metadata, and what the limitations are

Current versions of Solaris have a '_-F_' option for _zpool import_ and _zpool clear_, which is documented like this:

> Initiates recovery mode for an unopenable pool. Attempts to discard
> the last few transactions in the pool to return it to an openable
> state. Not all damaged pools can be recovered by using this option. If
> successful, the data from the discarded transactions is irretrievably
> lost.

The [[current Oracle documentation http://download.oracle.com/docs/cd/E19253-01/819-5461/gavwg]] has a somewhat more detailed explanation, which mentions that this is specifically (and only) for recovering from [[damaged pool metadata ZFSLosingPoolsWays]] (which you may have wound up with because [[your disks lied to you ZFSLosingPoolsWaysII]]).

What this does, in a nutshell, is revert the pool back to [[a previous uberblock ZFSHowMetadataUpdates]] and with it the previous top-level metadata (since more or less the only reason you update the uberblock is to use new top-level metadata). This loses all changes that were committed in transaction groups between the uberblock you're reverting to and the current uberblock. How many changes and how much data that is depends on, among other things, how active your pool is. (Under normal circumstances the transaction group commit interval is 30 seconds, but various things can force transaction groups to commit more frequently.)

In theory you have a lot of potential uberblocks to revert back to; pretty much any pool will have had at least 128 transaction groups commit and thus will have 128 old uberblocks. A pool with sufficiently many vdevs can dredge up even more than that. In practice, there are at least two things that can limit how successful your uberblock reversion is.

First, you need the damaged metadata to have actually changed between the uberblock you're reverting to and the current one. This is certain for the very top-level metadata, but I believe it's possible for lower-level metadata to not have changed and so not have had a new version written. If that lower-level metadata gets damaged, a whole run of uberblocks can be made useless at once.

Second, once a transaction group commits, the disk space of the previous uberblock's top-level metadata is generally unused and so is eligible to be written over by new data. ZFS will not reuse just-freed blocks for three transaction groups (per Eric Schrock's comments [[here http://opensolaris.org/jive/thread.jspa?messageID=503970]]), but beyond that there is no guarantee that the metadata an old uberblock points to is still intact and hasn't been overwritten. As Eric Schrock notes, ZFS's uberblock reversion thus normally won't go back further than three transaction groups (you can override this limit with the undocumented '_-X_' command line switch if you really need to).

(The previous top-level metadata is not always unused, due to snapshots.)

In the days before Solaris explicitly supported uberblock reversion, you could do it by hand on an exported pool through the brute force mechanism of invalidating or destroying uberblocks via raw disk writes. This did require you to get all of the uberblock copies, [[which could be quite a lot ZFSUberblockWrites]]. People wrote some tools to help with this; I'm not going to link to any of them (since I don't know if they still work well), but you can find them with web searches if you really need them.
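To make the reversion step a bit more concrete, here is a minimal C sketch of the selection it conceptually performs. This is _not_ the actual ZFS import code; the ((fake_uberblock)) type, its fields, and ((pick_rewind_uberblock)) are all made up for illustration, and the real code also has to verify each uberblock's magic number and checksum and then check that the metadata tree it points to is actually readable.

    /*
     * Illustrative sketch only: out of the ring of saved uberblocks,
     * pick the newest valid one whose transaction group is at or below
     * the txg we're willing to rewind to; recovery then tries to open
     * the pool using that uberblock's top-level metadata.
     */
    #include <stddef.h>
    #include <stdint.h>

    #define UB_RING_SIZE 128        /* uberblock copies kept in a label */

    struct fake_uberblock {         /* hypothetical stand-in, not uberblock_t */
            uint64_t ub_txg;        /* transaction group that wrote it */
            int      ub_valid;      /* magic number and checksum check out */
    };

    /* Return the newest valid uberblock with ub_txg <= txg_limit, or NULL. */
    struct fake_uberblock *
    pick_rewind_uberblock(struct fake_uberblock ring[UB_RING_SIZE],
                          uint64_t txg_limit)
    {
            struct fake_uberblock *best = NULL;

            for (size_t i = 0; i < UB_RING_SIZE; i++) {
                    if (!ring[i].ub_valid || ring[i].ub_txg > txg_limit)
                            continue;
                    if (best == NULL || ring[i].ub_txg > best->ub_txg)
                            best = &ring[i];
            }
            return best;
    }

In the terms of this sketch, plain '_-F_' recovery keeps the rewind target within a few transaction groups of the current one, while '_-X_' is what lets it go back much further.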
PS: since I just found it in [[the OpenSolaris source PokingOpenSolarisSource]], the delayed reuse is controlled by ((TXG_DEFER_SIZE)) in uts/common/fs/zfs/sys/txg.h and the code that uses it in uts/common/fs/zfs/metaslab.c. It appears to work by not adding the theoretically freed blocks to the in-memory free space maps.

=== Sidebar: synchronous operations, the ZIL, and transaction groups

You might wonder if forced filesystem syncs from applications or [[the environment SlowNFSWritesToZFS]] cause transaction groups to commit more frequently. My understanding is that they do not; instead, the necessary change information is written to the ZIL (the ZFS Intent Log) and then later rolled into a normally scheduled transaction group commit. This happens whether or not you have a separate ZIL log device. (These ZIL writes do cause low-level disk flushes if disk caches are enabled, but that's a separate issue.)

I don't know all of the things that do cause transaction groups to commit faster than usual, but large amounts of data getting written is definitely one of them.
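=== Sidebar: a simplified sketch of deferred block reuse

To make the deferred reuse mechanism from the PS concrete, here is a heavily simplified, self-contained C sketch of the idea as I understand it. This is not the actual code in metaslab.c, which works with space maps rather than little arrays; the names, ((sync_txg)), and the ((DEFER_DEPTH)) value here are all made up for illustration, with the real depth coming from ((TXG_DEFER_SIZE)).

    /*
     * Illustrative sketch only: blocks freed in a transaction group are
     * parked in a small per-txg ring and only become allocatable again
     * some txgs later, so rewinding the uberblock by up to that many
     * txgs still finds the old metadata unoverwritten.
     */
    #include <stdint.h>
    #include <stdio.h>

    #define DEFER_DEPTH 3     /* illustrative; the real depth is TXG_DEFER_SIZE */
    #define MAX_PARKED  64    /* toy limit; real ZFS uses space maps, not arrays */

    struct parked_frees {
            uint64_t blk[MAX_PARKED];
            int      count;
    };

    static struct parked_frees deferred[DEFER_DEPTH]; /* per-txg parked frees */

    /*
     * Called as transaction group 'txg' commits, with the blocks freed in
     * it.  The ring slot being recycled holds the frees from DEFER_DEPTH
     * txgs ago; only now do those blocks become allocatable again.
     */
    void
    sync_txg(uint64_t txg, const uint64_t *freed, int nfreed)
    {
            struct parked_frees *slot = &deferred[txg % DEFER_DEPTH];

            for (int i = 0; i < slot->count; i++)
                    printf("txg %llu: block %llu is allocatable again\n",
                        (unsigned long long)txg,
                        (unsigned long long)slot->blk[i]);
            slot->count = 0;

            /* Park this txg's frees in the slot we just drained. */
            for (int i = 0; i < nfreed && slot->count < MAX_PARKED; i++)
                    slot->blk[slot->count++] = freed[i];
    }

    int
    main(void)
    {
            uint64_t freed_in_100[] = { 12345 };

            sync_txg(100, freed_in_100, 1);  /* block freed in txg 100 ... */
            sync_txg(101, NULL, 0);
            sync_txg(102, NULL, 0);
            sync_txg(103, NULL, 0);          /* ... and reusable only now */
            return 0;
    }

Run as is, the block freed in transaction group 100 only shows up as allocatable again when transaction group 103 commits, which is the 'unavailable for three transaction groups' behavior described in the main entry.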