How ZFS bookmarks can work their magic with reasonable efficiency

February 24, 2017

My description of ZFS bookmarks covered what they're good for, but it didn't talk about what they are at a mechanical level. It's all very well to say 'bookmarks mark the point in time when [a] snapshot was created', but how does that actually work, and how does it allow you to use them for incremental ZFS send streams?

The succinct version is that a bookmark is basically a transaction group (txg) number. In ZFS, everything is created as part of a transaction group and gets tagged with the TXG of when it was created. Since things in ZFS are also immutable once written, we know that an object created in a given TXG can't have anything under it that was created in a more recent TXG (although it may well point to things created in older transaction groups). If you have an old directory with an old file and you change a block in the old file, the immutability of ZFS means that you need to write a new version of the data block, a new version of the file metadata that points to the new data block, a new version of the directory metadata that points to the new file metadata, and so on all the way up the tree, and all of those new versions will get a new birth TXG.

This means that given a TXG, it's reasonably efficient to walk down an entire ZFS filesystem (or snapshot) to find everything that was changed since that TXG. When you hit an object with a birth TXG before (or at) your target TXG, you know that you don't have to visit the object's children because they can't have been changed more recently than the object itself. If you bundle up all of the changed objects that you find in a suitable order, you have an incremental send stream. Many of the changed objects you're sending will contain references to older unchanged objects that you're not sending, but if your target has your starting TXG, you know it has all of those unchanged objects already.

To put it succinctly, I'll quote a code comment from libzfs_core.c (via):

If "from" is a bookmark, the indirect blocks in the destination snapshot are traversed, looking for blocks with a birth time since the creation TXG of the snapshot this bookmark was created from. This will result in significantly more I/O and be less efficient than a send space estimation on an equivalent snapshot.

(This is a comment about getting a space estimate for incremental sends, not about doing the send itself, but it's a good summary and it describes the actual process of generating the send as far as I can see.)

Yesterday I said that ZFS bookmarks could in theory be used for an imprecise version of 'zfs diff'. What makes this necessarily imprecise is that while scanning forward from a TXG this way can tell you all of the new objects and it can tell you what is the same, it can't explicitly tell you what has disappeared. Suppose we delete a file. This will necessarily create a new version of the directory the file was in and this new version will have a recent TXG, so we'll find the new version of the directory in our tree scan. But without the original version of the directory to compare against we can't tell what changed, just that something did.

(Similarly, we can't entirely tell the difference between 'a new file was added to this directory' and 'an existing file had all its contents changed or rewritten'. Both will create new file metadata that will have a new TXG. We can tell the case of a file being partially updated, because then some of the file's data blocks will have old TXGs.)

Bookmarks specifically don't preserve the original versions of things; that's why they take no space. Snapshots do preserve the original versions, but they take up space to do that. We can't get something for nothing here.

(More useful sources on the details of bookmarks are this reddit ZFS entry and a slide deck by Matthew Ahrens. Illumos issue 4369 is the original ZFS bookmarks issue.)

Sidebar: Space estimates versus actually creating the incremental send

Creating the actual incremental send stream works exactly the same for sends based on snapshots and sends based on bookmarks. If you look at dmu_send in dmu_send.c, you can see that in the case of a snapshot it basically creates a synthetic bookmark from snapshot's creation information; with a real bookmark, it retrieves the data through dsl_bookmark_lookup. In both cases, the important piece of data is zmb_creation_txg, the TXG to start from.

This means that contrary to what I said yesterday, using bookmarks as the origin for an incremental send stream is just as fast as using snapshots.

What is different is if you ask for something that requires estimating the size of the incremental sends. Space estimates for snapshots are pretty efficient because they can be made using information about space usage in each snapshot. For details, see the comment before dsl_dataset_space_written in dsl_dataset.c. Estimating the space of a bookmark based incremental send requires basically doing the same walk over the ZFS object tree that will be done to generate the send data.

(The walk over the tree will be somewhat faster than the actual send, because in the actual send you have to read the data blocks too; in the tree walk, you only need to read metadata.)

So, you might wonder how you ask for something that requires a space estimate. If you're sending from a snapshot, you use 'zfs send -v ...'. If you're sending from a bookmark or a resume token, well, apparently you just don't; sending from a bookmark doesn't accept -v and -v on resume tokens means something different from what it does on snapshots. So this performance difference is kind of a shaggy dog story right now, since it seems that you can never actually use the slow path of space estimates on bookmarks.

Written on 24 February 2017.
« ZFS bookmarks and what they're good for
What an actual assessment of Ubuntu kernel security updates looks like »

Page tools: View Source, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Fri Feb 24 00:26:44 2017
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.