A brief summary of how ZFS updates (top-level) metadata

May 10, 2011

As is common in filesystems, a ZFS pool's metadata and data live in what is essentially a tree; at the top of the tree are the ZFS uberblock and the actual root metaobject set (which the uberblock points to). Because ZFS is a copy-on-write filesystem, none of this metadata is overwritten in place. Instead, all metadata is written to a new location, all the way up to the uberblock. This is simple for everything except the uberblock: you write the new version of the metadata to some suitable bit of free space, then update its parent to point to the new location. However, the uberblock is the root, with no parent to update and no ability to rove randomly around the free space.

(I mentioned this in passing before, but I never described the details.)
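To make that concrete, here is a minimal, purely illustrative sketch (in Python) of copy-on-write tree updates. Nothing here is ZFS's actual code or data structures, and it ignores block allocation entirely; the point is just that rewriting a leaf forces new copies of every ancestor, which is why the root needs special handling.

    # Purely illustrative copy-on-write tree update; not ZFS code.
    class Block:
        def __init__(self, data=None, children=None):
            self.data = data
            self.children = children or []

    def cow_update(path, new_data):
        """Rewrite the leaf at the end of 'path' (blocks from root down to
        leaf) without overwriting anything in place; returns a new root."""
        new_block = Block(data=new_data)                # new copy of the leaf
        for i in range(len(path) - 2, -1, -1):          # leaf's parent up to root
            parent, old_child = path[i], path[i + 1]
            kids = [new_block if c is old_child else c for c in parent.children]
            new_block = Block(data=parent.data, children=kids)
        return new_block    # a brand-new root; something must now point at it

    leaf = Block(data="old")
    mid = Block(children=[leaf])
    root = Block(children=[mid])
    new_root = cow_update([root, mid, leaf], "new")     # old tree is untouched

In a real on-disk tree, the 'something' that has to point at the new root is exactly the uberblock, which is what the rest of this entry is about.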

ZFS does metadata updates, or at least uberblock updates, in what it calls a 'transaction group', which is often abbreviated as 'txg' in ZFS lingo. Each transaction group is numbered in an increasing sequence (which I believe is strictly monotonic, but I don't know for sure), and the ZFS uberblock has the transaction group number as well as a pointer to the current root metaobject set (well, to the transaction group's root metaobject set; it becomes the current root MOS when the transaction group commits).
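As a rough illustration, an uberblock is essentially a small fixed-size record along these lines (the field names here are simplified and illustrative, not the exact on-disk layout):

    from dataclasses import dataclass

    # Simplified, illustrative sketch of what an uberblock records; the real
    # thing is a small fixed-size C structure with more fields than this.
    @dataclass
    class Uberblock:
        magic: int        # marks the block as an uberblock at all
        version: int      # on-disk format version
        txg: int          # transaction group that wrote this uberblock
        timestamp: int    # when that transaction group committed
        root_bp: object   # block pointer to that txg's root metaobject set
        checksum: bytes   # lets damaged or torn copies be detected and ignored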

To get copy-on-write behavior for uberblocks, they are stored not in a single spot but in an array of 128 uberblock slots. Each successive transaction group writes its uberblock to the next slot (wrapping around at the end). When ZFS is bringing up a pool, it locates the current uberblock by scanning all 128 slots to find the valid uberblock with the highest transaction group number, ignoring invalid uberblocks entirely (uberblocks have both magic numbers and checksums).

('Scanning all 128 slots' is a little imprecise as a description, since uberblocks are highly replicated. Each disk in the pool has four copies of this 128 slot array, and I believe that bringing up a pool scans all possible uberblock copies and picks the best one. As I discovered earlier, it's valid for uberblock copies on some devices to be out of date.)
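Putting the last two paragraphs together, finding the active uberblock at pool import time amounts to something like the following sketch (again illustrative Python, not ZFS's actual code; 'uberblock_arrays' and 'checksum_ok' are hypothetical stand-ins):

    # Illustrative sketch of picking the active uberblock when bringing up a
    # pool; not ZFS's actual code.
    UBERBLOCK_MAGIC = 0x00bab10c     # the real ZFS uberblock magic number
    SLOTS_PER_ARRAY = 128            # each array has 128 uberblock slots
    COPIES_PER_DISK = 4              # four copies of the array per disk

    # Each successive transaction group writes its uberblock into the next
    # slot, wrapping around at the end of the array.

    def checksum_ok(ub):
        # Stand-in for the real embedded-checksum verification.
        return True

    def pick_active_uberblock(disks):
        """Scan every candidate uberblock on every disk and return the valid
        one with the highest transaction group number (None if none)."""
        best = None
        for disk in disks:
            for array in disk.uberblock_arrays:   # hypothetical accessor
                for ub in array:
                    if ub.magic != UBERBLOCK_MAGIC:
                        continue                  # not an uberblock at all
                    if not checksum_ok(ub):
                        continue                  # damaged or torn copy
                    if best is None or ub.txg > best.txg:
                        best = ub
        return best

Out-of-date copies on some devices fall out of this naturally: they are still valid uberblocks, they just lose to copies with higher transaction group numbers.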

(This is all reasonably well known in the heavy-duty ZFS community, but I wanted to write it down in one easy to find spot for my future reference.)

Sidebar: the uberblock versus the root metaobject set

You might wonder why the uberblock is separate from the root metaobject set, since the two are intimately tied together. My understanding is that the split exists so that the uberblock can be a straightforward fixed-size structure while the root MOS can be variable sized and flexible. Uberblocks are also quite small; the official size is 1 Kbyte (although not all of that space is used). I believe that a root MOS is often much larger than that.

(This 1 Kb size is somewhat unfortunate given that disks are moving to a 4 Kb sector size. You'd like there to be only one uberblock in a given sector, because disks only really write in full sectors. With 1k uberblocks on a 4k-sector disk, a write that goes wrong could destroy not just the uberblock you're writing but also up to three neighboring ones, depending on where in the real 4k sector your uberblock sits.)
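The arithmetic here is simple; a tiny sketch (assuming the uberblock array starts on a 4 KB boundary, which is an assumption on my part) of which slots share a physical sector:

    # Why 1 KB uberblocks are awkward on 4 KB-sector disks: several slots
    # share a physical sector, so a bad sector write can hit all of them.
    UB_SIZE = 1024                        # official uberblock size in bytes
    SECTOR_SIZE = 4096                    # "advanced format" physical sector
    PER_SECTOR = SECTOR_SIZE // UB_SIZE   # 4 uberblock slots per sector

    def slots_sharing_sector(slot):
        """Return the uberblock slots in the same physical 4 KB sector as
        'slot', assuming the slot array starts on a 4 KB boundary."""
        first = (slot // PER_SECTOR) * PER_SECTOR
        return list(range(first, first + PER_SECTOR))

    # slots_sharing_sector(5) == [4, 5, 6, 7]: a botched write to slot 5 can
    # also take out slots 4, 6, and 7.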

Sidebar: some useful references for this stuff
