ZFS uberblock rollback and the top level metadata change rate

October 18, 2013

ZFS keeps many copies of a pool's uberblock; on a standard pool on disks with 512-byte sectors, you will have at least 127 old uberblocks. In an emergency ZFS will let you roll back to a previous uberblock. So clearly you have a lot of possibilities for rollback, right? Actually, no; you have far fewer than you might think. The root problem is a misconception about the rate of change of pool and filesystem metadata.
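Where the 127 comes from: each ZFS label reserves a 128 KB ring for uberblocks, and with 512-byte sectors each uberblock slot is 1 KB, so the ring holds 128 uberblocks, one of which is the current one. A quick back-of-the-envelope check (the constants here are from the ZFS on-disk format; treat this as illustrative arithmetic, not authoritative code):

```python
# Uberblock ring arithmetic for a pool on 512-byte-sector disks.
LABEL_UBERBLOCK_RING = 128 * 1024   # bytes reserved per label for uberblocks
UBERBLOCK_SLOT_SIZE = 1024          # minimum uberblock slot size in bytes

slots = LABEL_UBERBLOCK_RING // UBERBLOCK_SLOT_SIZE
old_uberblocks = slots - 1          # one slot holds the current uberblock

print(slots, old_uberblocks)        # 128 128-slot ring -> 127 old uberblocks
```

On disks with larger sectors (bigger ashift) each slot is correspondingly larger and you get fewer old uberblocks, which is why "at least 127" only holds for 512-byte sectors.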

In a conventional filesystem implementation, top level metadata changes infrequently for most filesystems; generally things like the contents of the filesystem's root directory are basically static. Even if you know that your filesystem is copy-on-write (as ZFS is), you might expect that since the root directory changes rarely it won't be copied very often. This feeds the idea that most of those 127 old uberblocks will be pointing to things that haven't been freed and reused yet, often in fact the same thing.

This is incorrect. Instead, top level ZFS metadata is the most frequently changing thing in your ZFS pool, and as a result old top level metadata gets freed all the time (although it may not get reused immediately, depending on pool free space, allocation patterns, and so on). What causes this metadata churn is block pointers combined with the copy-on-write nature of ZFS. Every piece of metadata that refers to something else (including all directories and filesystem roots) does so by block address. Because ZFS never updates anything in place, changing one thing (say a data block in a file) changes its block address. This forces a change in the file's metadata to point to the new block address, which in turn changes the block address of the file's metadata, which requires a change in the metadata of the directory the file is in, which forces a change in the parent directory, and so on up the tree. The corollary is that any change in a ZFS pool changes the top level metadata.
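The ripple-up effect can be sketched with a toy copy-on-write tree (hypothetical code, not ZFS's actual data structures): every node is addressed by a hash of its contents, so rewriting one leaf gives every ancestor, all the way up to the root, a new address.

```python
import hashlib

def addr(data: bytes) -> str:
    """A toy 'block address': a hash of the block's contents."""
    return hashlib.sha256(data).hexdigest()[:12]

def build(node):
    """Build a copy-on-write tree from nested tuples; return the root address.

    A leaf is bytes; an interior node is a tuple of children. An interior
    node's contents are its children's addresses, so changing any child
    changes the parent's address too, recursively up to the root."""
    if isinstance(node, bytes):
        return addr(node)
    child_addrs = [build(c) for c in node]
    return addr(b"|".join(a.encode() for a in child_addrs))

# Tree shape: root -> directory -> file data.
root1 = build(((b"file contents v1",),))
# 'Rewrite' only the leaf; nothing is updated in place.
root2 = build(((b"file contents v2",),))
print(root1 != root2)  # True: a one-leaf change gave the root a new address
```

This is exactly why a one-byte write to a file deep in a ZFS filesystem ends up producing a new set of top level metadata.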

The result is that every new uberblock written has a new set of top level metadata written with it, the meta-object set (MOS). And the moment a new uberblock is written the previous uberblock's MOS becomes free and its blocks become candidates to be reused (although not right away). When any of the MOS blocks do get reused, the associated uberblock becomes useless. How fast this happens depends on many things, but don't count on it not happening. ZFS snapshots of filesystems below the pool's root definitely don't preserve any particular MOS, although they do preserve a part of the old metadata that MOS(es) point to. I'm not sure that any snapshot operation (even on the pool root) will preserve a MOS itself, although some might.
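A toy simulation of this churn (hypothetical code with made-up sizes, not ZFS's actual allocator): each transaction group allocates blocks for a new MOS from the oldest-freed blocks and frees the previous MOS, and then we count how many of the 127 old uberblocks in the ring still point at an intact MOS.

```python
from collections import deque

SLOTS = 128       # uberblock ring size with 512-byte sectors
MOS_BLOCKS = 4    # blocks per MOS copy (made-up number)
FREE_POOL = 64    # blocks available for MOS copies (made-up number)

free = deque(range(FREE_POOL))   # freed blocks, oldest-freed first
mos = {}                         # txg -> blocks its MOS was written to
owner = {}                       # block -> most recent txg written to it

last_txg = 1000
for txg in range(1, last_txg + 1):
    # Worst case for rollback: always reuse the oldest-freed blocks.
    blocks = [free.popleft() for _ in range(MOS_BLOCKS)]
    mos[txg] = blocks
    for b in blocks:
        owner[b] = txg
    if txg - 1 in mos:
        free.extend(mos[txg - 1])   # the previous MOS is freed immediately

# An old uberblock is only usable if none of its MOS blocks have been
# reused by a later txg.
ring = range(last_txg - SLOTS + 1, last_txg + 1)
intact = [t for t in ring if all(owner[b] == t for b in mos[t])]
print(len(intact))   # far fewer than 127
```

With these made-up numbers only the last dozen or so transaction groups survive; the real number depends on pool free space and allocation patterns, but the shape of the result is the point: most of the ring's old uberblocks point at a MOS that is already partly overwritten.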

(It would be an interesting experiment to export a non-test ZFS pool and then attempt to see how many of its uberblocks still had valid MOSes. My suspicion is that on an active pool, a lot would not. For bonus points you could try to determine how intact the metadata below the MOS was too and roughly how much of the resulting pool you'd lose if you imported it with that uberblock.)

PS: I've alluded to this metadata churn before in previous entries but I've never spelled it out explicitly (partly because I assumed it was obvious, which is probably a bad idea).
