Wandering Thoughts

2017-06-14

The difference between ZFS scrubs and resilvers

At one level, asking what the difference is between scrubs and resilvers sounds silly; resilvering is replacing disks, while scrubbing is checking disks. But if you look at the ZFS code in the kernel, things become much less clear, because both scrubs and resilvers use a common set of code to do all their work and it's actually not at all easy to tell what happens differently between them. Since I have actually dug through this section of ZFS code just the other day, I want to write down what the differences are while I remember them.

Both scrubs and resilvers traverse all of the metadata in the pool (in a non-sequential order), and both wind up reading all data. However, scrubs do this more thoroughly for mirrored vdevs; scrubs read all sides of a mirror, while resilvers only read one copy of the data (well, one intact copy). On raidz vdevs there is no difference here, as both scrubs and resilvers read both the data blocks and the parity blocks. This implies that a scrub on mirrored vdevs does more IO and (importantly) more checking than a resilver does. After a resilver of mirrored vdevs, you know that you have at least one intact copy of every piece of the pool, while after an error-free scrub of mirrored vdevs, you know that all ZFS metadata and data on all disks is fully intact.

For resilvers but not scrubs (at least normally), ZFS will sort of attempt to write everything back to the disks again, as I covered in the sidebar of the last entry. As far as I can tell from the code, ZFS always skips even trying to write things back to disks that are either known to have good data (for example they were read from) or that are believed to be good because their DTL (dirty time log) says that the data is clean on them (I believe that this case only really applies for mirrored vdevs). For scrubs, the only time ZFS submits writes to the disks is if it detects actual errors.

(Although I don't entirely understand how you get into this situation, it appears that a scrub of a pool with a 'replacing' disk behaves a lot like a resilver as far as that disk is concerned. As you would expect and want, the scrub doesn't try to read from the new disk and, like resilvers, it tries to write everything back to the disk.)

Since we only have mirrored vdevs on our fileservers, what really matters to us here is the difference in what gets read between scrubs and resilvers. On the one hand, resilvers put less read load on the disks, which is good for reducing their impact. On the other hand, resilvering isn't quite as thorough a check of the pool's total health as a scrub is.

PS: I'm not sure if either a scrub or a resilver reads the ZIL. Based on some code in zil.c, I suspect that it's checked only during scrubs, which would make sense. Alternately, zil.c is reusing ZIO_FLAG_SCRUB for non-scrub IO for some of its side effects, but that would be a bit weird.

ZFSResilversVsScrubs written at 00:44:26

2017-06-12

Resilvering multiple disks at once in a ZFS pool adds no real extra overhead

Suppose, not entirely hypothetically, that you have a multi-disk, multi-vdev ZFS pool (in our case using mirrored vdevs) and you need to migrate this pool from one set of disks to another set of disks. If you care about doing relatively little damage to pool performance for the duration of the resilvering, are you better off replacing one disk at a time (with repeated resilvers of the pool), or doing a whole bunch of disk replacements at once in one big resilver?

As far as we can tell from both the Illumos ZFS source code and our experience, the answer is that replacing multiple disks at once in a single pool is basically free (apart from the extra IO implied by writing to multiple new disks at once). In particular, a ZFS resilver on mirrored vdevs appears to always read all of the metadata and data in the pool, regardless of how many replacement drives there are and where they are. This means that replacing (or resilvering) multiple drives at once doesn't add any extra read IO; you do the same amount of reads whether you're replacing one drive or ten.

(This is unlike conventional RAID10, where replacing multiple drives at once will probably add additional read IO, which will affect array performance.)
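In concrete terms, replacing several disks at once is just a series of 'zpool replace' commands, and ZFS folds them all into a single resilver pass over the pool; the pool and device names here are made up:

zpool replace tank c1t0d0 c3t0d0
zpool replace tank c1t1d0 c3t1d0
zpool status tank

(As far as I know, a 'zpool replace' issued while a resilver is already running generally restarts the resilver from the beginning, so you want to issue all of your replacements up front.)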

For metadata, this is not particularly surprising. Since metadata is stored all across the pool, metadata located in one vdev can easily point to things located on another one. Given this, you have to read all metadata in the pool in order to even find out what is on a disk that's being resilvered. In theory I think that ZFS could optimize handling leaf data to skip stuff that's known to be entirely on unaffected vdevs; in practice, I can't find any sign in the code that it attempts this optimization for resilvers, and there are rational reasons to skip it anyway.

(As things are now, after a successful resilver you know that the pool has no permanent damage anywhere. If ZFS optimized resilvers by skipping reading and thus checking data on theoretically unaffected vdevs, you wouldn't have this assurance; you'd have to resilver and then scrub to know for sure. Resilvers are somewhat different from scrubs, but they're close.)

This doesn't mean that replacing multiple disks at once won't have any impact on your overall system, because your system may have overall IO capacity limits that are affected by adding more writes. For example, our ZFS fileservers each have a total write bandwidth of 200 Mbytes/sec across all their ZFS pool disks (since we have two 1G iSCSI networks). At least in theory we could saturate this total limit with resilver write traffic alone, and certainly enough resilver write traffic might delay normal user write traffic (especially at times of high write volume). Of course this is what ZFS scrub and resilver tunables are about, so maybe you want to keep an eye on that.

(This also ignores any potential multi-tenancy issues, which definitely affect us at least some of the time.)
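For reference, the tunables I have in mind here are things like the Illumos zfs_resilver_delay and zfs_scrub_delay knobs, which insert a delay (in clock ticks) into each scan IO when the pool is seeing other activity. A sketch of setting them through /etc/system, with purely illustrative values:

* throttle resilver and scrub IO harder than the defaults
set zfs:zfs_resilver_delay = 4
set zfs:zfs_scrub_delay = 6

(The exact names, defaults, and behavior vary between Illumos versions, so check your version's dsl_scan.c before relying on these.)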

ZFS does optimize resilvering disks that were only temporarily offline, using what ZFS calls a dirty time log (DTL). The DTL can be used to optimize walking the ZFS metadata tree in much the same way as how ZFS bookmarks work.

Sidebar: How ZFS resilvers (seem to) repair data, especially in RAIDZ

If you look at the resilvering-related code in vdev_mirror.c and especially vdev_raidz.c, how things actually get repaired seems pretty mysterious. Especially in the RAIDZ case, ZFS appears to just go off and issue a bunch of writes to everything without paying any attention to new disks versus old, existing disks. It turns out that the important magic seems to be in zio_vdev_io_start in zio.c, where ZFS quietly discards resilver write IOs unless the target disk is known to require them. The best detailed explanation is in the code's comments:

/*
 * If this is a repair I/O, and there's no self-healing involved --
 * that is, we're just resilvering what we expect to resilver --
 * then don't do the I/O unless zio's txg is actually in vd's DTL.
 * This prevents spurious resilvering with nested replication.
 * For example, given a mirror of mirrors, (A+B)+(C+D), if only
 * A is out of date, we'll read from C+D, then use the data to
 * resilver A+B -- but we don't actually want to resilver B, just A.
 * The top-level mirror has no way to know this, so instead we just
 * discard unnecessary repairs as we work our way down the vdev tree.
 * The same logic applies to any form of nested replication:
 * ditto + mirror, RAID-Z + replacing, etc.  This covers them all.
 */

It appears that preparing and queueing all of these IOs that will then be discarded does involve some amount of CPU, memory allocation for internal data structures, and so on. Presumably this is not a performance issue in practice, especially if you assume that resilvers are uncommon in the first place.

(And certainly the code is better structured this way, especially in the RAIDZ case.)

ZFSMultidiskResilversFree written at 22:29:35

2017-05-17

Unfortunately I don't feel personally passionate about OmniOS

In light of recent events, OmniOS badly needs people who really care about it, the kind of people who are willing and eager to contribute their own time and efforts to the generally thankless work of keeping a distribution running: integrating upstream changes, backporting security and bug updates, maintaining various boring pieces of infrastructure, and so on. There is an idealistic portion of me that would like me to be such a person, and thus wants me to step forward to volunteer for things here (on the OmniOS mailing list, some people have recently gotten the ball rolling to line up volunteers for various jobs).

Unfortunately, I have to be realistic here (and it would be irresponsible not to be). Perhaps some past version of me is someone who might have had that sort of passion for OmniOS at the right time in his life, but present-day me does not (for various reasons, including that I have other things going on in my life). Without this passion, if I made the mistake of volunteering to do OmniOS-related things in an excess of foolishness I would probably burn out almost immediately; it would be all slogging work and no particular feeling of reward. And I wouldn't even be getting paid for it.

Can I contribute as part of my job? Unfortunately, this is only a maybe, because of two unfortunate realities here. Reality number one is that while we really like our OmniOS fileservers, OmniOS and even ZFS are not absolutely essential to us; although it would be sort of painful, we could get by without OmniOS or even ZFS. Thus there is only so much money, staff time, or other resource that it's sensible for work to 'spend' on OmniOS. Reality number two is that it's only sensible for work to devote resources to supporting OmniOS if we actually continue to use OmniOS over the long term, and that is definitely up in the air right now. Volunteering long-term ongoing resources for OmniOS now (whether that's some staff time or some servers) is premature. Unfortunately there's somewhat of a chicken and egg problem; we might be more inclined to volunteer resources if it was clear that OmniOS was going to be viable going forward, but volunteering resources now is part of what OmniOS needs in order to be viable.

(Given our 10G problem, it's not clear that even a fully viable OmniOS would be our OS of choice for our next generation of fileservers.)

(This echoes what I said in my sidebar on the original entry at the time, but I've been thinking over this recently in light of volunteers starting to step forward on the OmniOS mailing list. To be honest, it makes me feel a bit bad to be someone who isn't stepping forward here, when there's a clear need for people. So I have to remind myself that stepping forward without passion to sustain me would be a terrible mistake.)

OmniOSNoPersonalPassion written at 02:41:25

2017-05-07

ZFS's zfs receive has no error recovery and what that implies

Generally, 'zfs send' and 'zfs receive' are a great thing about ZFS. They give you easy, efficient, and reliable replication of ZFS filesystems, which is good for both backups and transfers (and testing). The combination is among my favorite ways of shuffling filesystems around, right up there with dump and restore and ahead of rsync. But there is an important caveat about zfs receive that deserves to be more widely known, and that is that zfs receive makes no attempt to recover from a partially damaged stream of data.

Formats and programs like tar, restore, and so on generally have provisions for attempting to skip over damaged sections of the stream of data they're supposed to restore. Sometimes this is explicitly designed into the stream format; other times the programs just use heuristics to try to re-synchronize a damaged stream at the next recognizable object. All of these have the goal of letting you recover at least something from a stream that's been stored and damaged in that storage, whether it's a tarball that's experienced some problems on disk or a dump image written to tape and now the tape has a glitch.

zfs receive does not do any of this. If your stream is damaged in any way, zfs receive aborts completely. In order to recover anything at all from the stream, it must be perfect and completely undamaged. This is generally okay if you're directly feeding a 'zfs send' to 'zfs receive'; an error means something bad is going on, and you can retry the send immediately. This is not good if you are storing the 'zfs send' output in a file before using 'zfs receive'; any damage or problem to the file, and it's completely unrecoverable and the file is useless. If the 'zfs send' file is your only copy of the data, the data is completely lost.

There are some situations where this lack of resilience for saved send streams is merely annoying. If you're writing the 'zfs send' output to a file on media as a way of transporting it around and the file gets damaged, you can always redo the whole process (just as you could redo a 'zfs send | zfs receive' pipeline). But in other situations this is actively dangerous, for example if the file is the only form of backup you have or the only copy of the data. Given this issue I now strongly discourage people from storing 'zfs send' output unless they're sure they know what they're doing and they can recover from restore failures.

(See eg this discussion or this one.)
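If you do wind up storing 'zfs send' output in a file anyway, about the best you can do is verify that the stream still parses before you actually need it; on Illumos, zstreamdump will read through a stream and summarize its records, and I believe it verifies the stream's internal checksums as it goes. A sketch, with a made-up path:

zfs send -i pool/fs@prev pool/fs@current > /backup/fs-incr.zfs
zstreamdump < /backup/fs-incr.zfs

Of course this only tells you that the file is intact right now; it does nothing to make a later-damaged file recoverable.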

I doubt that ZFS will ever change this behavior, partly because it would probably require a change in the format of the send/receive data stream. My impression is that ZFS people have never considered it a good idea to store 'zfs send' streams, even if they never said this very strongly in things like documentation.

(You also wouldn't want continuing on despite errors to be the default behavior for 'zfs receive', at least in normal use. If you have a corrupt stream and you can simply retry it, you definitely want to do that.)

ZFSReceiveNoErrorRecovery written at 19:36:26

2017-04-24

What we need from an Illumos-based distribution

It started on Twitter:

@thatcks: Current status: depressing myself by looking at the status of Illumos distributions. There's nothing else like OmniOS out there.

@ptribble: So what are the values of OmniOS that are important to you and you feel aren't present in other illumos distributions?

This is an excellent question from Peter Tribble of Tribblix and it deserves a more elaborate answer than I could really give it on Twitter. So here's my take on what we need from an Illumos distribution.

  • A traditional Unix server environment that will do NFS fileservice with ZFS, because we already have a lot of tooling and expertise in running such an Illumos-based environment and our current environment has a very particular setup due to local needs. If we have to completely change how we design and operate NFS fileservice (for example to move to a more 'storage appliance' environment), the advantages of continuing with Illumos mostly go away. If we're going to have to drastically redesign our environment no matter what, maybe the simpler redesign is 'move to ZFS on FreeBSD doing NFS service as a traditional Unix server'.

  • A distribution that is actively 'developed', by which I mean that it incorporates upstream changes from Illumos and other sources on a regular basis and the installer is updated as necessary and so on. Illumos is not a static thing; there's important evolution in ZFS fixes and improvements, hardware drivers, and various other components.

  • Periodic stable releases that are supported with security updates and important bug fixes for a few years but that do not get wholesale rolling updates from upstream Illumos. We need this because testing, qualifying, and updating to a whole new release (with a wide assortment of changes) is a very large amount of work and risk. We can't possibly do it on a regular basis, such as every six months; even once a year is too much. Two years of support is probably our practical minimum.

    We could probably live with 'security updates only', although it'd be nice to be able to get high priority bug fixes as well (here I'm thinking of things that will crash your kernel or cause ZFS corruption, which are very close to 'must fix somehow' issues for people running ZFS fileservers).

    We don't need the ability to upgrade from stable release to stable release. For various reasons we're probably always going to upgrade by installing from scratch on new system disks and then swapping things around in downtimes.

  • An installer that lets us install systems from USB sticks or DVD images and doesn't require complicated backend infrastructure. We're not interested in big complex environments where we have to PXE-boot initial installs of servers, possibly with various additional systems to auto-configure them and tell them what to do.

In a world with pkgsrc and similar sources of pre-built and generally maintained packages, I don't think we need the distribution itself to have very many packages (I'm using the very specific 'we' here, meaning my group and our fileservers). Sure, it would be convenient for us if the distribution shipped with Postfix and Python and a few other important things, but it's not essential the way the core of Illumos is. While it would be ideal if the distribution owned everything important the way that Debian, FreeBSD, and so on do, it doesn't seem like Illumos distributions are going to have that kind of person-hours available, even for a relatively small package collection.

With that said I don't need for all packages to come from pkgsrc or whatever; I'm perfectly happy to have a mix of maintained packages from several sources, including the distribution in either their main source or an 'additional packages we use' repository. Since there's probably not going to be a plain-server NFS fileserver focused Illumos distribution, I'd expect any traditional Unix style distribution to have additional interests that lead to them packaging and maintaining some extra packages, whether that's for web service or X desktops or whatever.

(I also don't care about the package format that the distribution uses. If sticking with IPS is the easy way, that's fine. Neither IPS nor pkgsrc are among my favorite package management systems, but I can live with them.)

Out of all of our needs, I expect the 'stable releases' part to be the biggest problem. Stable releases are a hassle for distribution maintainers (or really for maintainers of anything); you have to maintain multiple versions and you may have to backport a steadily increasing amount of things over time. The amount of pain involved in them is why we're probably willing to live with only security updates for a relatively minimal set of packages and not demand backported bugfixes.

(In practice we don't expect to hit new bugs once our fileservers have gone into production and been stable, although it does happen every once in a while.)

Although 10G Ethernet support is very important to us in general, I'm not putting it on this list because I consider it a general Illumos issue, not something that's going to be specific to any particular distribution. If Illumos as a whole has viable 10G Ethernet for us, any reasonably up to date distribution should pick it up, and we don't want to use a distribution that's not picking those kind of things up.

Sidebar: My current short views on other Illumos distributions

Peter Tribble also asked what was missing in existing Illumos distributions. Based on an inspection of the Illumos wiki's options, I can split the available distributions into three sets:

  • OpenIndiana and Tribblix are developed and more or less traditional Unix systems, but don't appear to have stable releases that are maintained for a couple of years; instead there are periodic new releases with all changes included.

  • SmartOS, Nexenta, and napp-it are Illumos based but as far as I can tell aren't in the form of a traditional Unix system. (I'm not sure if napp-it is actively updated, but the other two are.)

  • the remaining distributions don't seem to be actively developed and may not have maintained stable releases either (I didn't look deeply).

Hopefully you can see why OmniOS hit a sweet spot for us; it is (or was) actively maintained, it has 'long term stable' releases that are supported for a few years, and you get a traditional Unix OS environment and a straightforward installation system.

IllumosDistributionNeeds written at 22:18:16

2017-04-21

OmniOS's (suddenly) changed future and us

Today, Robert Treat, the CEO of OmniTI, sent a message to the OmniOS user discussion list entitled The Future of OmniOS. The important core of that email is in two sentences:

Therefore, going forward, while some of our staff may continue contributing, OmniTI will be suspending active development of OmniOS. Our next release, currently in beta, will become the final release from OmniTI. [...]

As Treat's message explains, OmniTI has been basically the only major contributor and organizing force behind OmniOS, and no real outside development community has materialized around it. OmniTI is no longer interested in having things continue this way, and so they are opting for what Treat aptly calls 'a radical approach'.

(There is much more to Treat's email and if you use OmniOS you should read it in full.)

For various reasons, I don't expect this attempt to get a community to materialize to be successful. This obviously means that OmniOS will probably wither away as a viable and usable Illumos distribution, especially if you want to more or less freeze a version and then only have important bug fixes or security fixes ported into it (which is what we want). This presents an obvious problem for us, since we use OmniOS on our current generation fileservers. I think that we'll be okay for the remaining lifetime of the current generation, but that's mostly because we're approaching the point where we should start work on the next generation (which I have somewhat optimistically put at some time in 2018 in previous entries).

(We ran our first generation of fileservers for longer than that, but it wasn't entirely a great experience so we'd like to not push things so far this time around.)

One option for our next generation of fileservers is another Illumos distribution. I have no current views here since I haven't looked at what's available since we picked OmniOS several years ago, but investigating the current Illumos landscape is clearly now a priority. However, I have some broad concerns about Illumos in general and 10G Ethernet support on Illumos in specific, because I think 10G is going to be absolutely essential in our next generation of fileservers (living without 10G in our current generation is already less than ideal). Another option is ZFS on a non-Illumos platform, either Linux or FreeBSD, which seems to be a viable option now and should likely be even more viable in 2018.

(Oracle Solaris 11 remains very much not an option for all sorts of reasons, including my complete lack of faith and trust in Oracle.)

Fortunately we probably don't have to make any sort of decision any time soon. We'll probably have at least until the end of 2017 to see how the OmniOS situation develops, watch changes in other Illumos distributions, and so on. And maybe I will be surprised by what happens with OmniOS, as I actually was with Illumos flourishing outside of Sun (now Oracle).

(The one thing that could accelerate our plans is some need to buy new fileserver hardware sooner than expected because we have a limited time funding source available for it. But given a choice we'd like to defer hardware selection for a relatively long time, partly because perhaps a SSD-based future can come to pass.)

Sidebar: On us (not) contributing to OmniOS development

The short version is that we don't have the budget (in money or in free staff time) to contribute to OmniOS as an official work thing and I'm not at all interested in doing it on a purely personal basis on my own time.

OmniOSChangedFuture written at 23:08:06

2017-03-27

We're probably going to upgrade our OmniOS servers by reinstalling them

We're currently running OmniOS r151014 on our fileservers, which is the current long term support release (although we're behind on updates, because we avoid them for stability reasons). However, per the OmniOS release cycle, there's a new LTS release coming this summer and about six to nine months later, our current r151014 version will stop being supported at all. Despite what I wrote not quite a year ago about how we might not upgrade at all, we seem to be broadly in support of the idea of upgrading when the next LTS release is out in order to retain at least the option of applying updates for security issues and so on.

This raises the question of how we do it, because there are two possible options; we could reinstall (what we did the last time around), or upgrade the existing systems through the normal process with a new boot environment. Having thought about it, I think that I'm likely to argue for upgrading via full reinstalls (on new system disks). There's two reasons for this, one specific to this particular version change and one more general one.

The specific issue is that OmniOS is in the process of transitioning to a new bootloader; they're moving from an old version of Grub to a version of the BSD bootloader (which OmniOS calls the 'BSD Loader'). While it's apparently going to be possible to stick with Grub or switch bootloaders over the transition, the current OmniOS Bloody directions make this sound pretty intricate. Installing a new OmniOS from scratch on new disks seems to be the cleanest and best way to get the new bootloader for the new OmniOS while preserving Grub for the old OmniOS (on the old disks).

The broader issue is that reinstalling from scratch on new disks every time is more certain for rollbacks (since we can keep the old disks) and means that any hypothetical future systems we install wind up the same as the current ones without making us go through extra work. If we did in-place upgrades, to get identical new installs we would actually have to install r151014 and then immediately upgrade it to the new LTS. If we just installed the new LTS, there are various sorts of subtle differences and incompatibilities that could sneak in.

(This is of course not specific to OmniOS. It's very hard to make sure that upgraded systems are exactly the same as newly installed systems, especially if you've upgraded the systems over significant version boundaries.)

I like the idea of upgrading between OmniOS versions using boot environments in theory (partly because it's neat if it works); it would probably be faster and less of a hassle, and I may yet change my mind here. But I suspect that we're going to do it the tedious way just because it's easier on us in the long run.

OmniOSUpgradesViaReinstalls written at 01:45:33

2017-03-03

Some notes on ZFS per-user quotas and their interactions with NFS

In addition to quotas on filesystems themselves (refquota) and quotas on entire trees (plain quota), ZFS also supports per-filesystem quotas on how much space users (or groups) can use. We haven't previously used these for various reasons, but today we had a situation with an inaccessible runaway user process eating up all the free space in one pool on our fileservers and we decided to (try to) stop it by sticking a quota on the user. The result was reasonably educational and led to some additional educational experimentation, so now it's time for notes.

User quotas for a user on a filesystem are created by setting the userquota@<user> property of the filesystem to some appropriate value. Unlike overall filesystem and tree quotas, you can set a user quota that is below the user's current space usage. To see the user's current space usage, you look at userused@<user> (which will have its disk space number rounded unless you use 'zfs get -p userused@<user> ...'). To clear the user's quota limit after you don't need it any more, set it to none instead of a size.
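In concrete terms, the sequence looks something like this (with a made-up user and filesystem):

zfs get -p userused@someuser tank/homes
zfs set userquota@someuser=25G tank/homes
zfs set userquota@someuser=none tank/homes

The first command shows the user's exact current usage in bytes, the second imposes a limit, and the third removes it again once the emergency is over.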

(The current Illumos zfs manpage has an annoying mistake, where its section on the userquota@<user> property talks about finding out space by looking at the 'userspace@<user>' property, which is the wrong property name. I suppose I should file a bug report.)

Since user quotas are per-filesystem only (as mentioned), you need to know which filesystem or filesystems your errant user is using space on in your pool in order to block a runaway space consumer. In our case we already have some tools for this and had localized the space growth to a single filesystem; otherwise, you may want to write a script in advance so you can freeze someone's space usage at its current level on a collection of filesystems.

(The mechanics are pretty simple; you set the userquota@<user> value to the value of the userused@<user> property, if it exists. I'd use the precise value unless you're sure no user will ever use enough space on a filesystem to make the rounding errors significant.)
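A minimal sketch of such a script, assuming a pool called tank and skipping error handling:

#!/bin/sh
# Freeze a user's space usage at its current level on every
# filesystem in the pool. Usage: freezeuser <login>
user="$1"
for fs in $(zfs list -H -o name -r tank); do
    used=$(zfs get -H -p -o value "userused@$user" "$fs")
    # skip datasets where the user has no recorded usage
    [ "$used" = "-" ] && continue
    [ "$used" -eq 0 ] && continue
    zfs set "userquota@$user=$used" "$fs"
done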

Then we have the issue of how firmly and how fast quotas are enforced. The zfs manpage warns you explicitly:

Enforcement of user quotas may be delayed by several seconds. This delay means that a user might exceed their quota before the system notices that they are over quota and begins to refuse additional writes with the EDQUOT error message.

This is especially the case over NFS (at least NFS v3), where NFS clients may not start flushing writes to the NFS server for some time. In my testing, I saw the NFS client's kernel happily accept a couple of GB of writes before it started forcing them out to the fileserver.

The behavior of an OmniOS NFS server here is somewhat variable. On the one hand, we saw space usage for our quota'd user keep increasing over the quota for a certain amount of time after we applied the quota (unfortunately I was too busy to time it or carefully track it). On the other hand, in testing, if I started to write to an existing but empty file (on the NFS client) once I was over quota, the NFS server refused all writes and didn't put any data in the file. My conclusion is that at least for NFS servers, the user may be able to go over your quota limit by a few hundred megabytes under the right circumstances. However, once ZFS knows that you're over the quota limit a lot of things shut down immediately; you can't make new files, for example (and NFS clients helpfully get an immediate error about this).

(I took a quick look at the kernel code but I couldn't spot where ZFS updates the space usage information in order to see what sort of lag there is in the process.)

I haven't tested what happens to fileserver performance if an NFS client keeps trying to write data after it has hit the quota limit and has started getting EDQUOT errors. You'd think that the fileserver should be unaffected, but we've seen issues when pools hit overall quota size limits.

(It's not clear if this came up today when the user hit the quota limit and whatever process(es) they were running started to get those EDQUOT errors.)

ZFSUserQuotaNotes written at 01:01:22

2017-02-24

How ZFS bookmarks can work their magic with reasonable efficiency

My description of ZFS bookmarks covered what they're good for, but it didn't talk about what they are at a mechanical level. It's all very well to say 'bookmarks mark the point in time when [a] snapshot was created', but how does that actually work, and how does it allow you to use them for incremental ZFS send streams?

The succinct version is that a bookmark is basically a transaction group (txg) number. In ZFS, everything is created as part of a transaction group and gets tagged with the TXG of when it was created. Since things in ZFS are also immutable once written, we know that an object created in a given TXG can't have anything under it that was created in a more recent TXG (although it may well point to things created in older transaction groups). If you have an old directory with an old file and you change a block in the old file, the immutability of ZFS means that you need to write a new version of the data block, a new version of the file metadata that points to the new data block, a new version of the directory metadata that points to the new file metadata, and so on all the way up the tree, and all of those new versions will get a new birth TXG.

This means that given a TXG, it's reasonably efficient to walk down an entire ZFS filesystem (or snapshot) to find everything that was changed since that TXG. When you hit an object with a birth TXG before (or at) your target TXG, you know that you don't have to visit the object's children because they can't have been changed more recently than the object itself. If you bundle up all of the changed objects that you find in a suitable order, you have an incremental send stream. Many of the changed objects you're sending will contain references to older unchanged objects that you're not sending, but if your target has your starting TXG, you know it has all of those unchanged objects already.

To put it succinctly, I'll quote a code comment from libzfs_core.c (via):

If "from" is a bookmark, the indirect blocks in the destination snapshot are traversed, looking for blocks with a birth time since the creation TXG of the snapshot this bookmark was created from. This will result in significantly more I/O and be less efficient than a send space estimation on an equivalent snapshot.

(This is a comment about getting a space estimate for incremental sends, not about doing the send itself, but it's a good summary and it describes the actual process of generating the send as far as I can see.)

Yesterday I said that ZFS bookmarks could in theory be used for an imprecise version of 'zfs diff'. What makes this necessarily imprecise is that while scanning forward from a TXG this way can tell you all of the new objects and it can tell you what is the same, it can't explicitly tell you what has disappeared. Suppose we delete a file. This will necessarily create a new version of the directory the file was in and this new version will have a recent TXG, so we'll find the new version of the directory in our tree scan. But without the original version of the directory to compare against we can't tell what changed, just that something did.

(Similarly, we can't entirely tell the difference between 'a new file was added to this directory' and 'an existing file had all its contents changed or rewritten'. Both will create new file metadata that will have a new TXG. We can tell the case of a file being partially updated, because then some of the file's data blocks will have old TXGs.)

Bookmarks specifically don't preserve the original versions of things; that's why they take no space. Snapshots do preserve the original versions, but they take up space to do that. We can't get something for nothing here.

(More useful sources on the details of bookmarks are this reddit ZFS entry and a slide deck by Matthew Ahrens. Illumos issue 4369 is the original ZFS bookmarks issue.)

Sidebar: Space estimates versus actually creating the incremental send

Creating the actual incremental send stream works exactly the same for sends based on snapshots and sends based on bookmarks. If you look at dmu_send in dmu_send.c, you can see that in the case of a snapshot it basically creates a synthetic bookmark from the snapshot's creation information; with a real bookmark, it retrieves the data through dsl_bookmark_lookup. In both cases, the important piece of data is zbm_creation_txg, the TXG to start from.

This means that contrary to what I said yesterday, using bookmarks as the origin for an incremental send stream is just as fast as using snapshots.

What is different is if you ask for something that requires estimating the size of the incremental sends. Space estimates for snapshots are pretty efficient because they can be made using information about space usage in each snapshot. For details, see the comment before dsl_dataset_space_written in dsl_dataset.c. Estimating the space of a bookmark based incremental send requires basically doing the same walk over the ZFS object tree that will be done to generate the send data.

(The walk over the tree will be somewhat faster than the actual send, because in the actual send you have to read the data blocks too; in the tree walk, you only need to read metadata.)

So, you might wonder how you ask for something that requires a space estimate. If you're sending from a snapshot, you use 'zfs send -v ...'. If you're sending from a bookmark or a resume token, well, apparently you just don't; sending from a bookmark doesn't accept -v and -v on resume tokens means something different from what it does on snapshots. So this performance difference is kind of a shaggy dog story right now, since it seems that you can never actually use the slow path of space estimates on bookmarks.
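For the snapshot case, the usual way to ask for just the estimate is a dry run (the dataset names here are made up):

zfs send -nv -i pool/fs@prev pool/fs@current

This prints the estimated size of the incremental stream without actually sending any data.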

ZFSBookmarksMechanism written at 00:26:44

2017-02-22

ZFS bookmarks and what they're good for

Regular old fashioned ZFS has filesystems and snapshots. Recent versions of ZFS add a third object, called bookmarks. Bookmarks are described like this in the zfs manpage (for the 'zfs bookmark' command):

Creates a bookmark of the given snapshot. Bookmarks mark the point in time when the snapshot was created, and can be used as the incremental source for a zfs send command.

ZFS on Linux has an additional explanation here:

A bookmark is like a snapshot, a read-only copy of a file system or volume. Bookmarks can be created extremely quickly, compared to snapshots, and they consume no additional space within the pool. Bookmarks can also have arbitrary names, much like snapshots.

Unlike snapshots, bookmarks can not be accessed through the filesystem in any way. From a storage standpoint a bookmark just provides a way to reference when a snapshot was created as a distinct object. [...]

The first question is why you would want bookmarks at all. Right now bookmarks have one use, which is saving space on the source of a stream of incremental backups. Suppose that you want to use zfs send and zfs receive to periodically update a backup. At one level, this is no problem:

zfs snapshot pool/fs@current
zfs send -Ri previous pool/fs@current | ...

The problem with this is that you have to keep the previous snapshot around on the source filesystem, pool/fs. If space is tight and there is enough data changing on pool/fs, this can be annoying; it means, for example, that if people delete some files to free up space for other people, they actually haven't done so because the space is being held down by that snapshot.

The purpose of bookmarks is to allow you to do these incremental sends without consuming extra space on the source filesystem. Instead of having to keep the previous snapshot around, you instead make a bookmark based on it, delete the snapshot, and then do the incremental zfs send using the bookmark:

zfs snapshot pool/fs@current
zfs send -i '#previous' pool/fs@current | ...
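For completeness, the bookmark itself has to be created from the snapshot before that snapshot is destroyed; the sequence is something like:

zfs bookmark pool/fs@previous pool/fs#previous
zfs destroy pool/fs@previous

(Once the new send has finished, you would typically bookmark pool/fs@current the same way and then destroy it, ready for the next incremental.)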

This is apparently not quite as fast as using a snapshot, but if you're using bookmarks here it's because the space saving is worth it, possibly in combination with not having to worry about unpredictable fluctuations in how much space a snapshot is holding down as the amount of churn in the filesystem varies.

(We have a few filesystems that get frequent snapshots for fast recovery of user-deleted files, and we live in a certain amount of concern that someday, someone will dump a bunch of data on the filesystem, wait just long enough for a scheduled snapshot to happen, and then either move the data elsewhere or delete it. Sorting that one out to actually get the space back would require deleting at least some snapshots.)

Using bookmarks does require you to keep the previous snapshot on the destination (aka backup) filesystem, although the manpage only tells you this by implication. I believe that this implies that while you're receiving a new incremental, you may need extra space over and above what the current snapshot requires, since you won't be able to delete previous and recover its space until the incremental receive finishes. The relevant bit from the manpage is:

If an incremental stream is received, then the destination file system must already exist, and its most recent snapshot must match the incremental stream's source. [...]

This means that the destination filesystem must have a snapshot. This snapshot will and must match a bookmark made from it, since otherwise incremental send streams from bookmarks wouldn't work.
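As a sketch of what the receiving side looks like (host and dataset names made up), with backuppool/fs@previous already the most recent snapshot on the destination:

zfs send -i pool/fs#previous pool/fs@current | ssh backuphost zfs receive backuppool/fs

Here pool/fs#previous is the bookmark on the source, and backuppool/fs@previous is the matching snapshot that was kept on the destination.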

(In theory bookmarks could also be used to generate an imprecise 'zfs diff' without having to keep the origin snapshot around. In practice I doubt anyone is going to implement this, and why it's necessarily imprecise requires an explanation of why and how bookmarks work.)

ZFSBookmarksWhatFor written at 23:58:39
