The ZFS ZIL's optimizations for data writes
In yesterday's entry on the ZIL I mentioned that the
ZIL has some clever optimizations for large
write()s. To understand
these (and some related ZFS filesystem properties), let's start with the
basics of how an ordinary filesystem journal handles data writes.
A simple, straightforward filesystem journal includes a full copy
of each operation or transaction that it's recording. Many of these full
copies will be small (for metadata operations like file renames), but
for data writes you need to include the data being written. Now suppose
that you are routinely writing a lot of data and then fsync()'ing it.
This will wind up with the filesystem writing two copies of that large
data, one copy recorded with the journal and then a second copy written
to the actual live filesystem. This is inefficient and, worse, it costs
you both disk seeks (between the location of the journal and the final
location of data) and write bandwidth.
Because ZFS is a copy-on-write filesystem where old data is never
overwritten in place, it can optimize this process in a straightforward
way. Rather than putting the new data into the journal it can directly
write the new data to its final (new) location in the filesystem and
then simply record that new location in the journal. However, this
is now a tradeoff; in exchange for not writing the data twice you're
forcing the journal commit to wait for a separate (and full) data write,
complete with an extra seek between the journal and the final location
of the data. For sufficiently small amounts of data this tradeoff is not
worth it and you're better off just writing an extra copy of the data to
the journal without waiting.
In ZFS, this division point is set by the global tuneable variable
zfs_immediate_write_sz. Data writes larger than this size will be
pushed directly to their final location and the ZIL will only include a
pointer to it.
Actually that's a lie. The real situation is rather more complicated.
First, if the data write is larger than the file's blocksize it is
always put into the on-disk ZIL (possibly because otherwise the ZIL
would have to record multiple pointers to its final location since it
will be split across multiple blocks, which could get complicated). Next,
you can set filesystems to have 'logbias=throughput'; such a
filesystem writes all data blocks to their final locations (among other
effects). Finally, if you have a separate log device (with a normal
logbias) data writes will always go into the log regardless of their
size, even for very large writes.
So in summary zfs_immediate_write_sz only makes a difference if you
have logbias=latency and do not have a separate log device,
which can basically be summarized as 'if you have a normal pool without
any sort of special setup'. If you are using logbias=throughput it
is effectively 0; if you have a separate log device it is effectively
infinite.
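The decision logic described above can be sketched as a small model. This is a simplified illustration of the rules as I understand them, not the actual ZFS code; the constant's value and the function name are made up.

```python
# Simplified model of how ZFS decides whether a data write is copied into
# the on-disk ZIL in full ("immediate") or written straight to its final
# location with only a pointer logged ("indirect"). Illustrative only.

ZFS_IMMEDIATE_WRITE_SZ = 32 * 1024  # hypothetical value of the tuneable


def zil_write_mode(size, blocksize, logbias, has_slog):
    """Return 'immediate' or 'indirect' for a data write of `size` bytes."""
    if logbias == "throughput":
        # throughput filesystems write all data blocks to final locations
        return "indirect"
    if has_slog:
        # with a separate log device, data always goes into the log
        return "immediate"
    if size > blocksize:
        # writes larger than the file's blocksize always go into the ZIL
        return "immediate"
    if size > ZFS_IMMEDIATE_WRITE_SZ:
        return "indirect"
    return "immediate"
```

For example, with the default logbias and no slog, a 64 KB write to a file with a 128 KB blocksize would be logged indirectly, while a 256 KB write (larger than the blocksize) would be copied into the ZIL in full.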
Update (October 13 2013): It turns out that this description is not quite complete. See part 2 for an important qualification.
Sidebar: slogs and logbias=throughput
Note that there is no point in having a separate log device and setting
logbias=throughput on all of your filesystems, because the latter
makes the filesystems not use your slog. This is implicit in
logbias=throughput's behavior but may not be obvious.
'Throughput' is apparently intended for situations where you want
to preserve your slog bandwidth and latency for filesystems where
ZIL commit latency is very important; you set everything else to
logbias=throughput so that they don't touch the slog.
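As a concrete illustration of this setup (the pool and filesystem names here are made up):

```shell
# Reserve the slog for datasets where ZIL commit latency matters;
# send everything else straight to the main pool disks.
zfs set logbias=throughput tank/scratch
zfs set logbias=latency tank/mail    # latency is the default anyway

# Check what you've set:
zfs get -r logbias tank
```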
If you have an all-SSD pool with no slogs it may make sense to set
logbias=throughput on everything in it. Seeks are basically free on
the SSDs and you'll probably wind up using less overall write bandwidth
to the SSDs since you're writing less data. Note that I haven't measured
or researched this.
ZFS transaction groups and the ZFS Intent Log
I've just been digging around in the depths of the ZIL and of ZFS transaction groups, so before I forget everything I've figured out I'm going to write it down (partly because when I went looking I couldn't find any really detailed information on this stuff). The necessary disclaimer is that all of this is as far as I can tell from my own research and code reading and thus I could be wrong about some of it.
Let's start with transaction groups. All write operations in ZFS are part of a transaction and every transaction is part of a 'transaction group' (a TXG in the ZFS jargon). TXGs are numbered sequentially and always commit in sequential order, and there is only one open TXG at any given time. Because ZFS immediately attaches all write IO to a transaction and thus a TXG, ZFS-level write operations cannot cross each other at the TXG level; if two writes are issued in order either they are both part of the same TXG or the second write is in a later TXG (and TXGs are atomic, which is the core of ZFS's consistency guarantees).
(An excellent long discussion of how ZFS transactions work is at the start of txg.c in the ZFS source.)
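The TXG ordering property can be illustrated with a toy model. The class and method names here are illustrative, not from the real ZFS code; this just captures the rules that writes attach to the currently open TXG and TXGs commit atomically in sequential order.

```python
# Toy model of ZFS transaction groups: every write joins the currently
# open TXG, and TXGs commit atomically in sequential order.

class TxgModel:
    def __init__(self):
        self.open_txg = 1          # TXG numbers are sequential
        self.pending = {1: []}     # writes attached to each open TXG
        self.committed = []        # (txg, writes) pairs, in commit order

    def write(self, op):
        """Attach a write operation to the currently open TXG."""
        self.pending[self.open_txg].append(op)
        return self.open_txg

    def commit(self):
        """Atomically commit the open TXG and open the next one."""
        txg = self.open_txg
        self.committed.append((txg, self.pending.pop(txg)))
        self.open_txg = txg + 1
        self.pending[self.open_txg] = []
```

Two writes issued in order either land in the same TXG or the second lands in a later one; the model has no way to reorder them, which is the point.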
ZFS also has the ZIL aka the ZFS Intent Log. The ZIL exists because
of the fsync() problem that journaling filesystems face:
you don't want to have to flush out a huge file just because someone
fsync()'d a small one (that gets you slow fsync()s and
unhappy people). Without some sort of separate log all ZFS could do to
force things to disk would be to immediately commit the entire current
transaction group, which drags all uncommitted write operations with it
whether or not they have anything to do with the file being fsync()'d.
One of the confusing things about the ZIL is that it's common to talk about 'the ZIL' as if there was only one, when this is not really the case. Each filesystem and zvol actually has its own separate ZIL, all of which are written to and recovered separately from each other (although if you have separate log devices the ZILs are all normally stored on the slog devices). We also need to draw a distinction between the on-disk ZIL and the in-memory 'ZIL' structure (implicitly for a particular dataset). The on-disk ZIL has committed records while the in-memory ZIL holds records that have not yet been committed (or that have expired because their TXG committed). A ZIL commit is the process of taking some or all of the in-memory ZIL records and flushing them to disk.
Because ZFS doesn't know in advance what's going to be fsync()'d,
the in-memory ZIL holds a record of all write operations done to
the dataset. The ZIL has the concept of two sorts of write operations,
'synchronous' and 'asynchronous', and two sorts of ZIL commits,
general and file-specific. Sync writes are always committed when
the ZIL is committed; async writes are not committed if the ZIL is
doing a file-specific commit and they are for a different file. ZFS
metadata operations like creating or renaming files are synchronous
while data writes are generally but not always asynchronous. For
example, fsync() does a file-specific ZIL commit, as do
the other ways of forcing synchronous write IO.
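The commit rules above can be sketched as a small function. This is a simplified model of the behavior described, not the actual ZFS code; the record representation and function name are made up.

```python
# Sketch of which in-memory ZIL records a ZIL commit flushes to disk:
# synchronous records always commit, while asynchronous records are
# skipped by a file-specific commit when they belong to another file.

def zil_commit(records, target_file=None):
    """records: list of (file, kind) pairs, kind being 'sync' or 'async'.
    target_file=None models a general commit; a file name models a
    file-specific commit. Returns the records flushed to the on-disk ZIL."""
    flushed = []
    for file, kind in records:
        if kind == "sync":
            # sync writes are always committed
            flushed.append((file, kind))
        elif target_file is None or file == target_file:
            # async writes commit only in a general commit or when
            # they belong to the file being committed
            flushed.append((file, kind))
    return flushed
```

So a file-specific commit for one file still drags along every pending sync record, but leaves other files' async data writes sitting in the in-memory ZIL.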
If the ZIL is active for a dataset the dataset no longer has strong
write ordering properties for data that is not explicitly flushed
to disk via
fsync() or the like. Because of a performance hack
in fsync() this currently extends well beyond the obvious case
of writing one file, writing a second file, and then fsync()'ing the
second file; in some cases write data will be included in a ZIL
commit even though it has not been explicitly flushed.
ZIL records, both in memory and on disk, are completely separate from the transactions that are part of transaction groups and they're not read from either memory or disk in the process of committing a transaction group. In fact under normal operation on-disk ZIL records are never read at all. This can sometimes be a problem if you have separate ZIL log devices because nothing will notice if your log device is throwing away writes (or corrupting them) or can't actually read them back.
(I believe that pool scrubs do read the on-disk ZIL as a check but I'm not entirely sure.)
Modern versions of ZFS support a per-filesystem 'sync' property.
What I've described above is the behavior of the 'default' setting for
it. A setting of 'always' forces a ZIL commit on every write operation
(and as a result has a strong write order guarantee). A setting of
'disabled' disables ZIL commits but not the in-memory ZIL, which
will continue to accumulate records between TXG commits and then drop
the records when a TXG commits. A filesystem with 'sync=disabled'
actually has stronger write ordering guarantees than a filesystem with
the ZIL enabled, at the cost of lying to applications about whether data
actually is solidly on disk at all (in some cases this may be okay).
(Presumably one reason for keeping the in-memory ZIL active for
sync=disabled is so that you can change this property back and have
fsync() immediately start doing the right thing.)
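For reference, current OpenZFS spells the default setting 'standard'; something like this sets the three behaviors described above (the dataset names here are made up):

```shell
zfs set sync=standard tank/data     # the default behavior described above
zfs set sync=always tank/db         # force a ZIL commit on every write
zfs set sync=disabled tank/scratch  # no ZIL commits; fsync() lies

# Check the settings:
zfs get sync tank/data tank/db tank/scratch
```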
Under some circumstances the on-disk ZIL uses clever optimizations so
that it doesn't have to write out two copies of large
write()s (one to
the ZIL log for a ZIL commit and then a second to the regular ZFS pool
data structures as part of the TXG commit). A discussion of exactly
how this works is beyond the scope of this entry,
which is already long enough as it is.
(There is a decent comment discussing some more details of the ZIL at the start of zil.c.)