Wandering Thoughts archives

2013-07-11

The ZFS ZIL's optimizations for data writes

In yesterday's entry on the ZIL I mentioned that the ZIL has some clever optimizations for large write()s. To understand these (and some related ZFS filesystem properties), let's start with the fundamental problem.

A simple, straightforward filesystem journal includes a full copy of each operation or transaction that it's recording. Many of these full copies will be small (for metadata operations like file renames), but for data writes you need to include the data being written. Now suppose that you are routinely writing a lot of data and then fsync()'ing it. This winds up with the filesystem writing two copies of that data: one copy recorded in the journal and then a second copy written to the actual live filesystem. This is inefficient and, worse, it costs you both disk seeks (between the location of the journal and the final location of the data) and write bandwidth.

Because ZFS is a copy-on-write filesystem where old data is never overwritten in place, it can optimize this process in a straightforward way. Rather than putting the new data into the journal, it can write the new data directly to its final (new) location in the filesystem and then simply record that new location in the journal. However, this is now a tradeoff; in exchange for not writing the data twice, you're forcing the journal commit to wait for a separate (and full) data write, complete with an extra seek between the journal and the final location of the data. For sufficiently small amounts of data this tradeoff is not worth it and you're better off just writing an extra copy of the data to the journal without waiting. In ZFS, this division point is set by the global tuneable zfs_immediate_write_sz. Data writes larger than this size are pushed directly to their final location and the ZIL only records a pointer to them.

Actually that's a lie. The real situation is rather more complicated.

First, if the data write is larger than the file's blocksize it is always put into the on-disk ZIL (possibly because otherwise the ZIL would have to record multiple pointers to its final location, since the data will be split across multiple blocks, which could get complicated). Next, you can set filesystems to 'logbias=throughput'; such a filesystem writes all data blocks to their final locations (among other effects). Finally, if you have a separate log device (with a normal logbias), data writes always go into the log regardless of their size, even very large ones.

So in summary, zfs_immediate_write_sz only makes a difference if you are using logbias=latency and do not have a separate log device, which basically means 'a normal pool without any sort of special setup'. If you are using logbias=throughput it is effectively 0; if you have a separate log device it is effectively infinite.

Update (October 13 2013): It turns out that this description is not quite complete. See part 2 for an important qualification.
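(For reference: logbias is an ordinary per-filesystem property, a slog shows up as a 'logs' vdev in 'zpool status', and as far as I know zfs_immediate_write_sz is a module parameter on ZFS on Linux and a kernel tunable on illumos. So working out which case applies to you looks roughly like this, with made-up pool and filesystem names.)

    # is there a separate log (slog) device in this pool?
    zpool status tank

    # what logbias is this filesystem using? (latency is the default)
    zfs get logbias tank/fs
    zfs set logbias=throughput tank/fs

    # ZFS on Linux: the tuneable's value, in bytes (I believe the default is 32768)
    cat /sys/module/zfs/parameters/zfs_immediate_write_sz
    # illumos: set it as a zfs module tunable, for example in /etc/system:
    #   set zfs:zfs_immediate_write_sz = 65536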

Sidebar: slogs and logbias=throughput

Note that there is no point in having a separate log device and setting logbias=throughput on all of your filesystems, because the latter makes the filesystems not use your slog. This is implicit in the description of throughput's behavior but may not be clear enough. 'Throughput' is apparently intended for situations where you want to preserve your slog bandwidth and latency for filesystems where ZIL commit latency is very important; you set everything else to logbias=throughput so that they don't touch the slog.

If you have an all-SSD pool with no slog, it may make sense to set logbias=throughput on everything in it. Seeks are basically free on SSDs and you'll probably wind up using less overall write bandwidth to them, since you're writing less data. Note that I haven't measured or researched this.
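(Since ZFS properties are inherited, one way to do 'everything in the pool' is to set it once on the pool's top-level filesystem; again the pool name here is made up.)

    # descendants inherit logbias unless they set it locally
    zfs set logbias=throughput tank
    zfs get -r logbias tank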

ZFSWritesAndZIL written at 15:06:48

ZFS transaction groups and the ZFS Intent Log

I've just been digging around in the depths of the ZIL and of ZFS transaction groups, so before I forget everything I've figured out I'm going to write it down (partly because when I went looking I couldn't find any really detailed information on this stuff). The necessary disclaimer is that all of this is as far as I can tell from my own research and code reading and thus I could be wrong about some of it.

Let's start with transaction groups. All write operations in ZFS are part of a transaction and every transaction is part of a 'transaction group' (a TXG in ZFS jargon). TXGs are numbered sequentially and always commit in sequential order, and there is only one open TXG at any given time. Because ZFS immediately attaches all write IO to a transaction and thus to a TXG, ZFS-level write operations cannot cross each other at the TXG level; if two writes are issued in order, either they are both part of the same TXG or the second write is in a later TXG (and TXGs commit atomically, which is the core of ZFS's consistency guarantees).

(An excellent long discussion of how ZFS transactions work is at the start of txg.c in the ZFS source.)
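(If you want to watch TXGs go by: I believe ZFS on Linux keeps a per-pool txgs kstat, and on any platform zdb can show you the last synced TXG from the pool's uberblock. The pool name here is made up.)

    # ZFS on Linux: recent TXG history for the pool (birth times, states, timings)
    cat /proc/spl/kstat/zfs/tank/txgs

    # any platform: the uberblock records the most recently synced TXG
    zdb -u tank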

ZFS also has the ZIL aka the ZFS Intent Log. The ZIL exists because of the journaling fsync() problem: you don't want to have to flush out a huge file just because someone wanted to fsync() a small one (that gets you slow fsync()s and unhappy people). Without some sort of separate log all ZFS could do to force things to disk would be to immediately commit the entire current transaction group, which drags all uncommitted write operations with it whether or not they have anything to do with the file being fsync()'d.

One of the confusing things about the ZIL is that it's common to talk about 'the ZIL' when there is really no single ZIL. Each filesystem and zvol actually has its own separate ZIL; these are all written to and recovered separately from each other (although if you have separate log devices, the ZILs are all normally stored on the slog devices). We also need to draw a distinction between the on-disk ZIL and the in-memory 'ZIL' structure (implicitly for a particular dataset). The on-disk ZIL holds committed records, while the in-memory ZIL holds records that have not yet been committed (or that have expired because their TXG committed). A ZIL commit is the process of taking some or all of the in-memory ZIL records and flushing them to disk.

Because ZFS doesn't know in advance what's going to be fsync()'d, the in-memory ZIL holds a record of all write operations done to the dataset. The ZIL has the concept of two sorts of write operations, 'synchronous' and 'asynchronous', and two sorts of ZIL commits, general and file-specific. Sync writes are always committed when the ZIL is committed; async writes are not committed if the ZIL is doing a file-specific commit and they are for a different file. ZFS metadata operations like creating or renaming files are synchronous while data writes are generally but not always asynchronous. For obvious reasons fsync() does a file-specific ZIL commit, as do the other ways of forcing synchronous write IO.

If the ZIL is active for a dataset the dataset no longer has strong write ordering properties for data that is not explicitly flushed to disk via fsync() or the like. Because of a performance hack for fsync() this currently extends well beyond the obvious case of writing one file, writing a second file, and fsync()'ing the second file; in some cases write data will be included in a ZIL commit even though it has not been explicitly flushed.

(If you want the gory details, see: 1, 2, 3. This applies to all versions of ZFS, not just ZFS on Linux.)
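(A crude way to actually see ZIL commits happening, assuming GNU dd and a made-up pool with a log device: generate fsync()s and watch the log vdev or the ZIL counters.)

    # write a little data and fsync() it at the end, forcing a ZIL commit
    dd if=/dev/zero of=/tank/fs/testfile bs=8k count=1 conv=fsync

    # per-vdev IO; ZIL commits show up as writes to the 'logs' vdev
    zpool iostat -v tank 1

    # ZFS on Linux also keeps global ZIL commit counters
    cat /proc/spl/kstat/zfs/zil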

ZIL records, both in memory and on disk, are completely separate from the transactions that are part of transaction groups and they're not read from either memory or disk in the process of committing a transaction group. In fact under normal operation on-disk ZIL records are never read at all. This can sometimes be a problem if you have separate ZIL log devices because nothing will notice if your log device is throwing away writes (or corrupting them) or can't actually read them back.

(I believe that pool scrubs do read the on-disk ZIL as a check but I'm not entirely sure.)
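(If you're curious about what's actually sitting in a dataset's on-disk ZIL, I believe zdb can dump it, reading the ZIL chains much the way log replay would; the pool name is made up.)

    # show intent log (ZIL) entries for the datasets in the pool
    zdb -i tank
    # more i's give more per-record detail
    zdb -ii tank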

Modern versions of ZFS support a per-filesystem 'sync=' property. What I've described above is the behavior of its default setting, 'standard'. A setting of 'always' forces a ZIL commit on every write operation (and as a result gives strong write ordering guarantees). A setting of 'disabled' disables ZIL commits but not the in-memory ZIL, which continues to accumulate records between TXG commits and then drops them when a TXG commits. A filesystem with 'sync=disabled' actually has stronger write ordering guarantees than a filesystem with the ZIL enabled, at the cost of lying to applications about whether their data is actually solidly on disk (in some cases this may be okay).

(Presumably one reason for keeping the in-memory ZIL active for sync=disabled is so that you can change this property back and have fsync() immediately start doing the right thing.)
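(The property can be inspected and changed on the fly; the filesystem name here is made up.)

    # see the current setting: standard, always, or disabled
    zfs get sync tank/fs

    # force a ZIL commit on every write, or turn ZIL commits off entirely
    zfs set sync=always tank/fs
    zfs set sync=disabled tank/fs

    # back to the normal fsync()-driven behaviour
    zfs set sync=standard tank/fs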

Under some circumstances the on-disk ZIL uses clever optimizations so that it doesn't have to write out two copies of large write()s (one to the ZIL log for a ZIL commit and then a second to the regular ZFS pool data structures as part of the TXG commit). A discussion of exactly how this works is beyond the scope of this entry, which is already long enough as it is.

(There is a decent comment discussing some more details of the ZIL at the start of zil.c.)

ZFSTXGsAndZILs written at 00:44:49

2013-07-09

How we want to recover our ZFS pools from SAN outages

Last night I wrote about how I decided to sit on my hands after we had a SAN backend failure, rather than spring into sleepy action to swap in our hot spare backend. This turned out to be exactly the right decision for more than the obvious reasons.

In a SAN environment like ours it's quite possible to lose access to a whole bunch of disks without losing the disks themselves. This is what happened to us last night; the power supply on one disk shelf appears to have flaked out. We swapped the disk shelf for another one, transplanted the disks themselves into the new shelf, and the whole iSCSI backend was back on the air. ZFS had long since faulted all of the disks, of course (since it had spent hours being unable to talk to them), but the disks were still in their pools.

(Some RAID systems will actively eject disks from storage arrays if they are too faulted or if they disappear. ZFS doesn't do this. Those disks are in their pools until you remove them yourself.)

With the disks still in their pools, we could use 'zpool clear' to re-activate them (an underdocumented side effect of clearing errors). ZFS was smart enough to know that the disks already had most of the pool data and just needed relatively minimal resilvering, which is a lot faster than the full resilvering that pulling in spares requires. Once we had the disks powered up again, it took perhaps an hour until all of the pools had their redundancy back (and part of that time was us being cautious about IO load). In some environments this alone might be sufficient, but we've had prior experience that it isn't good enough; we also need to 'zpool scrub' each pool until it reports no errors (this is now in progress). Doing the scrubs takes a while, but at least all the pools have (relatively) full redundancy in the meantime.

(Part of the reason for needing to scrub our disks is that our disks probably have missing writes due to abruptly losing power.)
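(In command terms the recovery was essentially the following sequence, with a made-up pool name standing in for our actual pools.)

    # see what ZFS thinks is faulted
    zpool status -x

    # re-activate the faulted but still-present disks; this kicks off the
    # (relatively minimal) resilver of whatever writes they missed
    zpool clear tank

    # watch the resilver finish, then scrub until it reports no errors
    zpool status tank
    zpool scrub tank
    zpool status -v tank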

This sort of recovery is obviously a lot faster, less disruptive, and safer than resilvering terabytes of data by switching over to our hot spare backend (especially if we actively detach the disks from the 'failed' backend before the resilvering has finished). In the future I think we're going to want to recover failed iSCSI backends this way if at all possible. It may be somewhat more manual work (and it requires hands-on attention to swap hardware around) but it's much faster and better.

(In this specific case delaying ten hours or so probably saved us at least a couple of days of resilvering time, during which we would have had several terabytes exposed to single disk failures.)

ZFSRecoveringDisks written at 02:02:18

2013-07-05

ZFS deduplication is terribly documented

One of the things that makes ZFS deduplication so dangerous and so infuriating is that it is terribly documented. My example today is what should be a simple question: does turning ZFS deduplication on irreversibly taint the pool and/or the filesystem(s) involved, such that you'll have performance issues even if you later delete all of the deduplicated data, or can you turn deduplication off and, with enough work, return the pool to its pre-dedup state of good performance?

You can find sources on the Internet that will give you both answers. Oracle's own online documentation is cheerfully silent about this (at least the full documentation contains warnings about the downsides of dedup, although the zfs(1) manpage still doesn't). The only ways to know for sure are to read the kernel source or to find a serious ZFS expert and ask them.

(I don't know the answer, although I'd like to.)
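(For what it's worth, the mechanics of switching dedup on and off and of looking at the dedup table are simple even if their long-term consequences aren't clear; the names here are made up, and note that 'dedup=off' only affects data written afterwards.)

    # turn dedup on or off for a filesystem; existing blocks are not rewritten
    zfs set dedup=on tank/fs
    zfs set dedup=off tank/fs

    # look at the pool-wide dedup ratio and the dedup table (DDT) statistics
    zpool get dedupratio tank
    zpool status -D tank
    zdb -DD tank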

This should not be how you find answers to important questions about ZFS dedup. That it is the only way demonstrates how bad the ZFS dedup documentation is, both the official Oracle documentation and most especially the Illumos manpages (because with Illumos, the manpages are mostly it).

By the way, I'm picking on ZFS dedup because ZFS dedup is both a really attractive sounding feature (who doesn't want space savings basically for free, or at least what sounds like free) and probably the single biggest way to have a terrible experience with ZFS. The current state of affairs virtually guarantees a never-ending stream of people blowing their feet off with it and leaving angry.

(The specific question here is very important if you find that dedup is causing you problems. The answer is the difference between having a reasonably graceful and gradual way out or finding yourself facing a potentially major dislocation. And if there is a graceful way out then it's much safer to experiment with dedup.)

ZFSDedupBadDocumentation written at 01:40:36
