How ZFS bookmarks can work their magic with reasonable efficiency
My description of ZFS bookmarks covered what they're good for, but it didn't talk about what they are at a mechanical level. It's all very well to say 'bookmarks mark the point in time when [a] snapshot was created', but how does that actually work, and how does it allow you to use them for incremental ZFS send streams?
The succinct version is that a bookmark is basically a transaction group (txg) number. In ZFS, everything is created as part of a transaction group and gets tagged with the TXG of when it was created. Since things in ZFS are also immutable once written, we know that an object created in a given TXG can't have anything under it that was created in a more recent TXG (although it may well point to things created in older transaction groups). If you have an old directory with an old file and you change a block in the old file, the immutability of ZFS means that you need to write a new version of the data block, a new version of the file metadata that points to the new data block, a new version of the directory metadata that points to the new file metadata, and so on all the way up the tree, and all of those new versions will get a new birth TXG.
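To make the 'new versions all the way up the tree' part concrete, here is a minimal toy sketch in Python. It is not ZFS code; `Node`, `cow_update`, and the TXG numbers are all invented for illustration, and it assumes a simple tree where each node carries the TXG it was written in.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Node:
    """A hypothetical immutable tree node stamped with its birth TXG."""
    name: str
    birth_txg: int
    children: Tuple["Node", ...] = ()   # empty for data blocks
    data: bytes = b""


def cow_update(node: Node, path: Sequence[str], new_data: bytes, txg: int) -> Node:
    """Return a new version of `node` with the block at `path` rewritten.

    Every node along the path gets a fresh copy with birth_txg=txg;
    anything off the path is shared with the old tree, untouched.
    """
    if not path:
        return Node(node.name, txg, node.children, new_data)
    head, rest = path[0], path[1:]
    new_children = tuple(
        cow_update(c, rest, new_data, txg) if c.name == head else c
        for c in node.children
    )
    return Node(node.name, txg, new_children, node.data)


# An old directory with an old file, everything written in TXG 10.
old_root = Node("root", 10, (
    Node("dir", 10, (
        Node("file", 10, (Node("blk0", 10, data=b"a"), Node("blk1", 10, data=b"b"))),
        Node("other", 10, (Node("blk0", 10, data=b"x"),)),
    )),
))

# Change one block of the old file in TXG 42.
new_root = cow_update(old_root, ["dir", "file", "blk1"], b"B", txg=42)
```

After the update, `root`, `dir`, `file`, and `blk1` in `new_root` all have birth TXG 42, while `blk0` and the whole `other` subtree are the very same objects as in `old_root` and keep TXG 10.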
This means that given a TXG, it's reasonably efficient to walk down an entire ZFS filesystem (or snapshot) to find everything that was changed since that TXG. When you hit an object with a birth TXG before (or at) your target TXG, you know that you don't have to visit the object's children because they can't have been changed more recently than the object itself. If you bundle up all of the changed objects that you find in a suitable order, you have an incremental send stream. Many of the changed objects you're sending will contain references to older unchanged objects that you're not sending, but if your target has your starting TXG, you know it has all of those unchanged objects already.
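The pruned walk described above can be sketched in a few lines of Python. This is a toy model under my own assumptions, not how ZFS is actually structured internally: `Block` and `changed_since` are made-up names, and a real send stream has to order and encode what it finds rather than just list it.

```python
from dataclasses import dataclass
from typing import Iterator, Tuple


@dataclass(frozen=True)
class Block:
    """A hypothetical tree object tagged with the TXG it was born in."""
    name: str
    birth_txg: int
    children: Tuple["Block", ...] = ()


def changed_since(block: Block, from_txg: int) -> Iterator[Block]:
    """Yield every block born after from_txg, skipping old subtrees."""
    if block.birth_txg <= from_txg:
        return  # nothing underneath can be newer, so don't descend
    yield block
    for child in block.children:
        yield from changed_since(child, from_txg)


tree = Block("root", 42, (
    Block("dir", 42, (
        Block("file", 42, (Block("blk0", 10), Block("blk1", 42))),
        Block("other", 10, (Block("blk0", 10),)),
    )),
    Block("untouched-dir", 7, (Block("old-file", 7),)),
))

names = [b.name for b in changed_since(tree, from_txg=10)]
print(names)  # ['root', 'dir', 'file', 'blk1']
```

The walk never looks inside `other` or `untouched-dir`, which is where the efficiency comes from: the cost is proportional to what changed (plus the path down to it), not to the size of the filesystem.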
If "from" is a bookmark, the indirect blocks in the destination snapshot are traversed, looking for blocks with a birth time since the creation TXG of the snapshot this bookmark was created from. This will result in significantly more I/O and be less efficient than a send space estimation on an equivalent snapshot.
(This is a comment about getting a space estimate for incremental sends, not about doing the send itself, but it's a good summary and it describes the actual process of generating the send as far as I can see.)
Yesterday I said that ZFS bookmarks could in theory be used for an imprecise version of 'zfs diff'. What makes this necessarily imprecise is that while scanning forward from a TXG this way can tell you all of the new objects and it can tell you what is the same, it can't explicitly tell you what has disappeared. Suppose we delete a file. This will necessarily create a new version of the directory the file was in and this new version will have a recent TXG, so we'll find the new version of the directory in our tree scan. But without the original version of the directory to compare against we can't tell what changed, just that something did.
(Similarly, we can't entirely tell the difference between 'a new file was added to this directory' and 'an existing file had all its contents changed or rewritten'. Both will create new file metadata that will have a new TXG. We can tell the case of a file being partially updated, because then some of the file's data blocks will have old TXGs.)
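Here is a small Python illustration of why the bookmark-style scan is blind to deletions. It is a deliberately simplified sketch: directories are modelled as plain dicts of name to birth TXG, and `scan_from_txg` and `diff_with_snapshot` are hypothetical helpers, not anything in ZFS.

```python
# Directory contents as {name: birth TXG of that file's metadata}.
# The old version only exists if a snapshot preserved it.
old_dir = {"a.txt": 10, "b.txt": 10}   # as of TXG 10
new_dir = {"a.txt": 10, "c.txt": 42}   # current; b.txt deleted, c.txt written at TXG 42


def scan_from_txg(directory, from_txg):
    """What a scan from a bare TXG can report: entries newer than from_txg."""
    return sorted(name for name, txg in directory.items() if txg > from_txg)


def diff_with_snapshot(old, new):
    """What comparing against the preserved old version adds: removed names."""
    return sorted(set(old) - set(new))


print(scan_from_txg(new_dir, 10))            # ['c.txt']
print(diff_with_snapshot(old_dir, new_dir))  # ['b.txt']
```

With only the TXG, the scan sees `c.txt` and nothing else; `b.txt` simply isn't there to be found. It also can't say whether `c.txt` is brand new or an old file that was completely rewritten. Spotting the deletion needs `old_dir`, which is exactly the thing a bookmark doesn't keep.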
Bookmarks specifically don't preserve the original versions of things; that's why they take no space. Snapshots do preserve the original versions, but they take up space to do that. We can't get something for nothing here.
Sidebar: Space estimates versus actually creating the incremental send
Creating the actual incremental send stream works exactly the same for sends based on snapshots and sends based on bookmarks. If you look at dmu_send in dmu_send.c, you can see that in the case of a snapshot it basically creates a synthetic bookmark from the snapshot's creation information; with a real bookmark, it looks up the bookmark's stored data instead. In both cases, the important piece of data is the TXG to start from.
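As a rough Python analogy for this (not a translation of the real C code; `Snapshot`, `Bookmark`, `as_bookmark`, and `send_start_txg` are all names I've made up for this sketch), both kinds of origin can be reduced to the same small piece of information:

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Snapshot:
    name: str
    creation_txg: int   # a real snapshot also pins old blocks; not modelled here


@dataclass(frozen=True)
class Bookmark:
    name: str
    creation_txg: int   # more or less all a bookmark has to remember


def as_bookmark(origin: Union[Snapshot, Bookmark]) -> Bookmark:
    """Reduce either kind of origin to bookmark-like data."""
    if isinstance(origin, Snapshot):
        # Synthesize a bookmark from the snapshot's creation information.
        return Bookmark(origin.name.replace("@", "#"), origin.creation_txg)
    return origin


def send_start_txg(origin: Union[Snapshot, Bookmark]) -> int:
    """The only thing the incremental walk needs: the TXG to start from."""
    return as_bookmark(origin).creation_txg


from_snap = send_start_txg(Snapshot("pool/fs@monday", 1000))
from_mark = send_start_txg(Bookmark("pool/fs#monday", 1000))
```

Both calls produce the same starting TXG, so from that point on the walk that generates the send has no reason to care which kind of origin it was given.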
This means that contrary to what I said yesterday, using bookmarks as the origin for an incremental send stream is just as fast as using snapshots.
What is different is if you ask for something that requires estimating the size of the incremental sends. Space estimates for snapshots are pretty efficient because they can be made using information about space usage in each snapshot. For details, see the comment for dsl_dataset_space_written in dsl_dataset.c. Estimating the space of a bookmark-based incremental send requires basically doing the same walk over the ZFS object tree that will be done to generate the send data.
(The walk over the tree will be somewhat faster than the actual send, because in the actual send you have to read the data blocks too; in the tree walk, you only need to read metadata.)
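To illustrate the difference between the estimating walk and the real send, here is a toy Python sketch. It assumes (as a simplification of mine) that each pointer to a block records the block's birth TXG and size, so a data block's size can be counted without reading the data block itself; `Ptr` and `walk` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Ptr:
    """A hypothetical block pointer: birth TXG and size are known up front."""
    birth_txg: int
    size: int
    children: Tuple["Ptr", ...] = ()   # empty means this points at a data block


def walk(ptr: Ptr, from_txg: int, read_data: bool) -> Tuple[int, int]:
    """Return (data bytes changed, blocks read) for blocks born after from_txg."""
    if ptr.birth_txg <= from_txg:
        return 0, 0                    # old subtree, pruned
    if not ptr.children:               # data block
        return ptr.size, (1 if read_data else 0)
    total, reads = 0, 1                # a metadata block must be read to see its pointers
    for child in ptr.children:
        size, child_reads = walk(child, from_txg, read_data)
        total += size
        reads += child_reads
    return total, reads


root = Ptr(42, 512, (
    Ptr(42, 512, (Ptr(10, 4096), Ptr(42, 4096), Ptr(42, 4096))),
    Ptr(10, 512, (Ptr(10, 4096),)),
))

estimate = walk(root, from_txg=10, read_data=False)
send = walk(root, from_txg=10, read_data=True)
print(estimate)  # (8192, 2)
print(send)      # (8192, 4)
```

Both walks arrive at the same 8192 bytes of changed data, but the estimate only reads the two metadata blocks, while the send also has to read the two changed data blocks.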
So, you might wonder how you ask for something that requires a space estimate. If you're sending from a snapshot, you use 'zfs send -v ...'. If you're sending from a bookmark or a resume token, well, apparently you just don't; sending from a bookmark doesn't accept -v, and -v on resume tokens means something different from what it does on snapshots. So this performance difference is kind of a shaggy dog story right now, since it seems that you can never actually use the slow path of space estimates on bookmarks.