How many uberblock copies ZFS writes on transaction group commits

August 29, 2010

The ZFS uberblock is the root of the tree of ZFS objects in a pool (technically it is more of a pointer to the root, but it's how ZFS finds everything). Because it is the root, it has to be updated on every ZFS update. Also, because it is so important, ZFS keeps quite a lot of copies of it (and has a clever 'copy on write' scheme for it, so that it is not quite updated in place).

Given the impact of writes on mirrors, it becomes interesting to ask how many copies of the uberblock ZFS actually writes out. First off, the raw numbers: there are four copies of the uberblock on every physical disk in the pool, stored as part of the four ZFS label areas (two at the start of the disk and two at the end).
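
As a concrete illustration of where those four copies live, here's a little standalone C program that works out the label offsets for a disk of a given size. The 256 KiB label size and the exact placement are my reading of the on-disk format, not something pulled from the real ZFS headers, so treat it as a sketch:

    /* Sketch: where the four ZFS label copies (and thus the four uberblock
     * copies) sit on a device. The 256 KiB label size and the alignment of
     * the end-of-disk labels are my assumptions about the on-disk format,
     * not values taken from the real ZFS headers. */
    #include <stdio.h>
    #include <stdint.h>

    #define LABEL_SIZE (256ULL * 1024)   /* assumed size of one label */

    int main(void)
    {
        uint64_t devsize = 500ULL * 1024 * 1024 * 1024;  /* example: 500 GB disk */

        /* Assume the end-of-disk labels are aligned to a label boundary. */
        uint64_t end = (devsize / LABEL_SIZE) * LABEL_SIZE;

        uint64_t offsets[4] = {
            0,                        /* L0: at the start of the disk */
            LABEL_SIZE,               /* L1: right after L0 */
            end - 2 * LABEL_SIZE,     /* L2: near the end */
            end - LABEL_SIZE,         /* L3: at the very end */
        };

        for (int i = 0; i < 4; i++)
            printf("label L%d at byte offset %llu\n", i,
                   (unsigned long long)offsets[i]);
        return 0;
    }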

At the top level, there are two cases for uberblock updates. If the vdev or pool configuration is considered 'dirty', the uberblock is synced to all top level vdevs. If the pool configuration is unchanged, the uberblock is synced to at most three vdevs, chosen by picking a random start point in the list of top level vdevs. If your pool only has three (or fewer) top level vdevs, the two cases are equivalent.
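
To make the second case concrete, here is a small standalone C sketch of the 'random start point, at most three vdevs' selection. The names and the hard-coded limit are mine; the real logic lives in spa_sync() and is more involved:

    /* Sketch of picking which top level vdevs get the uberblock when the
     * pool configuration is clean: start at a random vdev and take at most
     * three, wrapping around the list. Names and structure are invented
     * for illustration; the real selection is done in spa_sync(). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define MAX_SYNC_VDEVS 3    /* at most three vdevs get the uberblock */

    /* Fill picked[] with up to MAX_SYNC_VDEVS vdev indexes, starting at a
     * random point in a list of nvdevs top level vdevs; return the count. */
    static int pick_sync_vdevs(int nvdevs, int picked[MAX_SYNC_VDEVS])
    {
        int count = (nvdevs < MAX_SYNC_VDEVS) ? nvdevs : MAX_SYNC_VDEVS;
        int start = rand() % nvdevs;

        for (int i = 0; i < count; i++)
            picked[i] = (start + i) % nvdevs;
        return count;
    }

    int main(void)
    {
        int picked[MAX_SYNC_VDEVS];

        srand((unsigned)time(NULL));
        int n = pick_sync_vdevs(5, picked);  /* eg a pool with 5 top level vdevs */

        printf("syncing the uberblock to %d vdevs:", n);
        for (int i = 0; i < n; i++)
            printf(" %d", picked[i]);
        printf("\n");
        return 0;
    }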

(My powers of reading the OpenSolaris source code are not quite up to determining exactly what makes the pool configuration dirty. Certainly adding or removing devices does it, and I think that devices changing their state does as well. There are probably other triggers too.)

When ZFS syncs the uberblock to a top level vdev, it writes copies of the new uberblock to all physical devices in the vdev (well, all currently live devices): to every side of a mirrored vdev and to every disk in a raidzN vdev. Syncing the uberblock to a physical device takes four separate writes, one per label area. But wait, we're not done. To actually activate the new uberblock, the ZFS labels themselves must be updated, which is done separately from writing the uberblocks. This takes another four writes for each physical disk that is having its uberblock(s) updated.

So the simple answer is that if your pool has three or fewer top level vdevs, you will do eight writes per physical device every time a transaction group commits, in order to write the uberblocks out. Fortunately transaction groups don't commit very often.
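
If you want to play with the arithmetic, here is a trivial C sketch that adds it up. It assumes the simple case where every top level vdev gets the uberblock (three or fewer of them) and the 4 + 4 writes per physical disk described above; the pool shape is just an example:

    /* Sketch: total writes to commit new uberblocks, assuming every top
     * level vdev gets the uberblock (ie three or fewer of them). Per
     * physical disk: 4 uberblock writes plus 4 label writes. */
    #include <stdio.h>

    #define UBERBLOCK_WRITES_PER_DISK 4   /* one per label area */
    #define LABEL_WRITES_PER_DISK     4   /* labels rewritten in two halves */

    int main(void)
    {
        int top_level_vdevs = 3;   /* example pool: three mirrored vdevs */
        int disks_per_vdev  = 2;   /* two-way mirrors */

        int per_disk = UBERBLOCK_WRITES_PER_DISK + LABEL_WRITES_PER_DISK;
        int total    = top_level_vdevs * disks_per_vdev * per_disk;

        printf("%d writes per disk, %d writes per transaction group commit\n",
               per_disk, total);
        return 0;
    }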

Sidebar: the uberblock update sequence

The best documentation for the logic of the uberblock update sequence is in vdev_config_sync() in uts/common/fs/zfs/vdev_label.c. The short version is that it goes:

  • update half of the labels (the even ones, L0 and L2). They now do not match any uberblock and are thus invalid, but the other half (the odd labels, L1 and L3) are still good.
  • write out all of the new uberblocks. If this works, the even labels are valid; if it doesn't work, the odd labels are still pointing to valid older uberblocks.
  • update the odd labels (L1 and L3).

(ZFS labels record which transaction group (aka 'txg') they are valid for; they do not explicitly point to an uberblock. Uberblocks contain a transaction group number and a pointer to that txg's metaobject set (MOS), which is the real root of the ZFS pool.)
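
Put together, the whole commit sequence looks roughly like the following standalone C sketch. The function names and the print-only stubs are invented for illustration; the real thing, with error handling and disk cache flushes, is vdev_config_sync():

    /* Sketch of the uberblock commit sequence for one transaction group.
     * Everything here is invented for illustration (the stubs just print
     * what they would do); the real sequence, with error handling and
     * cache flushing, is vdev_config_sync() in vdev_label.c. */
    #include <stdio.h>

    /* Stub: 'which' names the even labels (L0, L2) or the odd ones (L1, L3). */
    static void write_labels(const char *which, unsigned long long txg)
    {
        printf("write %s labels for txg %llu\n", which, txg);
    }

    static void write_uberblocks(unsigned long long txg)
    {
        printf("write txg %llu uberblock into all four label areas\n", txg);
    }

    static void flush_disk_caches(void)
    {
        printf("flush disk write caches\n");
    }

    static void commit_txg(unsigned long long txg)
    {
        /* Step 1: the even labels (L0, L2). If we die after this, the odd
         * labels still describe the previous, valid uberblock. */
        write_labels("even (L0, L2)", txg);
        flush_disk_caches();

        /* Step 2: the new uberblock, written into all four label areas of
         * every disk being synced. Once any of these land, the even labels
         * become the valid ones. */
        write_uberblocks(txg);
        flush_disk_caches();

        /* Step 3: bring the odd labels (L1, L3) up to date. */
        write_labels("odd (L1, L3)", txg);
        flush_disk_caches();
    }

    int main(void)
    {
        commit_txg(12345);
        return 0;
    }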
