Wandering Thoughts archives

2013-10-18

ZFS uberblock rollback and the top level metadata change rate

ZFS keeps lots of copies of a pool's uberblock; on a standard pool on disks with 512-byte sectors, you will have at least 127 old uberblocks. In an emergency ZFS will let you roll back to a previous uberblock. So clearly you have a lot of possibilities for rollback, right? Actually, no. You have far fewer than you might think. The root problem is a misconception about the rate of change in pool and filesystem metadata.
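
(As a rough sketch of where that number comes from, assuming I have the on-disk format right: each vdev label reserves 128 KB for the uberblock ring, and each uberblock slot is at least 1 KB, or one sector on larger-sector disks. The constants below are my reading of the format, not something authoritative.)

    /* Toy arithmetic for how many uberblock slots a vdev label holds.
     * Assumes a 128 KB uberblock ring per label and a 1 KB minimum slot
     * size; larger-sector vdevs use one sector per slot.
     */
    #include <stdio.h>

    #define UB_RING_BYTES   (128 * 1024)    /* ring reserved per label */
    #define UB_MIN_SHIFT    10              /* 1 KB minimum slot size */

    static int
    uberblock_slots(int ashift)
    {
        int shift = (ashift > UB_MIN_SHIFT) ? ashift : UB_MIN_SHIFT;
        return UB_RING_BYTES >> shift;
    }

    int
    main(void)
    {
        printf("ashift=9 (512-byte sectors): %d slots\n", uberblock_slots(9));
        printf("ashift=12 (4 KB sectors):    %d slots\n", uberblock_slots(12));
        return 0;
    }

With 512-byte sectors that works out to 128 slots, that is, the current uberblock plus 127 old ones; 4 KB sector vdevs get only 32 slots.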

In a conventional filesystem implementation, top level metadata changes infrequently or rarely for most filesystems; generally things like the contents of the filesystem's root directory are basically static. Even if you know that your filesystem is copy-on-write (as ZFS is), you might expect that since the root directory changes rarely it won't be copied very often. This feeds the idea that most of those 127 uberblocks will be pointing to things that haven't been freed and reused yet, in fact perhaps often to the very same metadata.

This is incorrect. Instead, top level ZFS metadata is the most frequently changing thing in your ZFS pool, and as a result old top level metadata gets freed all the time (although it may not get reused immediately, depending on pool free space, allocation patterns, and so on). What causes this metadata churn is block pointers combined with the copy-on-write nature of ZFS. Every piece of metadata that refers to something else (including all directories and filesystem roots) does so by block address. Because ZFS never updates anything in place, changing one thing (say, a data block in a file) changes its block address. That forces the file's metadata to be rewritten to point to the new block address, which changes the block address of the file's metadata, which requires a change in the metadata of the directory the file is in, which forces a change in the parent directory, and so on up the tree. The corollary of this is that any change in a ZFS pool changes the top level metadata.
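
(Here's a toy model of that ripple effect. This is not ZFS code; it just treats a 'block address' as something that must change whenever a block's contents change, which is what copy-on-write allocation gives you.)

    /* Toy model of copy-on-write churn: a parent's contents include the
     * addresses of its children, so giving a child a new address forces
     * the parent to be rewritten at a new address too, and so on up.
     */
    #include <stdio.h>

    struct node {
        const char   *name;
        struct node  *parent;
        unsigned long addr;
    };

    static unsigned long next_free_addr = 1000;  /* pretend allocator */

    static void
    rewrite(struct node *n)
    {
        /* Rewriting a node gives it a new address, which changes the
         * contents of its parent, which must therefore be rewritten as
         * well, all the way up to the pool-level metadata.
         */
        while (n != NULL) {
            n->addr = next_free_addr++;
            printf("  %-25s now lives at block %lu\n", n->name, n->addr);
            n = n->parent;
        }
    }

    int
    main(void)
    {
        struct node mos  = { "pool metadata (MOS)", NULL,  1 };
        struct node dir  = { "directory",           &mos,  2 };
        struct node file = { "file metadata",       &dir,  3 };
        struct node data = { "file data block",     &file, 4 };

        printf("overwrite one data block in one file:\n");
        rewrite(&data);
        return 0;
    }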

The result is that every new uberblock written has a new set of top level metadata written with it, the meta-object set (MOS). And the moment a new uberblock is written, the previous uberblock's MOS becomes free and its blocks become candidates to be reused (although not right away). When any of the MOS blocks do get reused, the associated uberblock becomes useless. How fast this happens depends on many things, but don't count on it not happening. ZFS snapshots of filesystems below the pool's root definitely don't preserve any particular MOS, although they do preserve a part of the old metadata that MOS(es) point to. I'm not sure that any snapshot operation (even on the pool root) will preserve a MOS itself, although some might.

(It would be an interesting experiment to export a non-test ZFS pool and then attempt to see how many of its uberblocks still had valid MOSes. My suspicion is that on an active pool, a lot would not. For bonus points you could try to determine how intact the metadata below the MOS was too and roughly how much of the resulting pool you'd lose if you imported it with that uberblock.)
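
(If you want to poke at this yourself, zdb is the tool; this is from memory of its manpage, so double-check the flags, and the device path is just a placeholder.)

    # dump a vdev's labels, including the uberblock ring (txg and
    # timestamp for each slot)
    zdb -lu /dev/dsk/c0t0d0s0

    # with the pool exported, try traversing it as of an older txg,
    # verifying checksums as you go
    zdb -e -t <txg> -bcc poolname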

PS: I've alluded to this metadata churn in previous entries but I've never spelled it out explicitly (partly because I assumed it was obvious, which was probably a bad assumption).

ZFSMetadataChangeRate written at 01:04:39

2013-10-13

Revisiting some bits of ZFS's ZIL with separate log devices

Back in this entry I described how ZFS may or may not put large writes into the ZIL itself depending on various factors (how big they are, whether you're using a separate ZIL device, and so on). It turns out that I missed one potentially important factor, in fact one that affects more than just large writes.

If you're using a separate log device, ZFS will normally put all write data into the ZIL (on the presumption that flushing data to the SLOG is faster than flushing it to the regular pool) and will then put the ZIL on your separate log device (unless you've turned this off with the logbias property). However, this only applies if the log is not 'too big'.

What's 'too big'? That's the tunable zil_slog_limit, expressed in bytes, but how it gets used is a little bit obscure. First, let's backtrack to the overall ZIL structure. Each on-disk ZIL is made up of some number of ZIL commits; these commits get cleaned out over time as transaction groups push their contents into stable storage in the main pool. This gives us two sizes: the size of the current ZIL commit that's being prepared and the total size of the (active) on-disk ZIL at the moment.

What zil_slog_limit does is turn off use of the SLOG for large ZIL commits or large total ZIL log sizes. If the current ZIL commit is over zil_slog_limit or the current total ZIL log size is over twice zil_slog_limit, the ZIL commit is not written to your SLOG device but instead is written into the main pool. The default value of this tunable appears to be only one megabyte, which really startles me.
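
(Here is my understanding of the check, sketched out in byte terms. The function and variable names here are mine, not the real zil.c identifiers, and the exact 'over' versus 'over or equal' boundaries may be off; the point is the shape of the check.)

    /* Sketch of the 'is this too big for the slog?' check as I
     * understand it; a stand-in for the real zil.c logic, not a copy.
     */
    #include <stdint.h>
    #include <stdbool.h>

    static uint64_t zil_slog_limit = 1024 * 1024;   /* default: 1 MB */

    static bool
    commit_goes_to_slog(uint64_t commit_bytes, uint64_t total_zil_bytes)
    {
        /* Use the separate log device only if both the current ZIL
         * commit and the total active on-disk ZIL are small enough.
         */
        return commit_bytes < zil_slog_limit &&
            total_zil_bytes < 2 * zil_slog_limit;
    }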

But wait, things get more fun. In ZFSWritesAndZIL I described how large writes are put directly into the ZIL if you have a separate log device, on the presumption that your SLOG is much faster than your actual disks. That decision is independent of the decision about whether your ZIL commit will be written to the SLOG or to your real disks (really, the code only checks 'does this have a SLOG?'). It appears to be quite possible to have a SLOG, have relatively large writes put into a ZIL commit, and then have this ZIL commit written (relatively slowly) to your real disks instead of to your SLOG. You probably don't want this.
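
(To make the mismatch concrete, here's that other decision sketched the same way. Again the names are mine, the 'large write' cutoff value is a stand-in for the tunable that ZFSWritesAndZIL discussed, and the real code has more cases.)

    /* Sketch of the 'does the write's data get copied into the ZIL
     * record?' decision; identifiers and the cutoff are illustrative,
     * not the real ones.
     */
    #include <stdint.h>
    #include <stdbool.h>

    static uint64_t large_write_cutoff = 32 * 1024;  /* hypothetical */

    static bool
    write_data_copied_into_zil(bool pool_has_slog, uint64_t write_bytes)
    {
        /* With a slog present the size isn't consulted at all; without
         * one, large writes stay where they are and the ZIL only
         * records a pointer to them.
         */
        if (pool_has_slog)
            return true;
        return write_bytes <= large_write_cutoff;
    }

Nothing in this decision consults zil_slog_limit, which is how you can wind up with big writes copied into a ZIL commit that then gets written to your regular disks.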

In a world where SLOG SSDs were tiny and precious, this may have made some sense. In a world where 60 GB SSDs are common as grass, it's my opinion that it no longer does in most environments. Most ZFS environments with SLOG SSDs will never come close to filling the SSD with active ZIL log entries, because almost no one writes and fsync()s that much data that fast (you can and should measure this for yourself, of course, but this is the typical result). Raising zil_slog_limit substantially seems like a good idea to me (we'll probably tune it up to at least a gigabyte).
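
(On an illumos or Solaris style system, one way to do that persistently is via /etc/system; the value here is just an illustration and I'm writing the syntax from memory, so check it before you trust it.)

    * Raise zil_slog_limit to 1 GB (the value is in bytes); takes
    * effect on the next reboot.
    set zfs:zil_slog_limit = 0x40000000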

(See here for a nice overview of what gets written where and when and also some discussions about what may be faster under various circumstances.)

ZFSWritesAndZILII written at 00:35:26

