ZFS transaction groups and the ZFS Intent Log

July 11, 2013

I've just been digging around in the depths of the ZIL and of ZFS transaction groups, so before I forget everything I've figured out I'm going to write it down (partly because when I went looking I couldn't find any really detailed information on this stuff). The necessary disclaimer is that all of this is as far as I can tell from my own research and code reading and thus I could be wrong about some of it.

Let's start with transaction groups. All write operations in ZFS are part of a transaction and every transaction is part of a 'transaction group' (a TXG in the ZFS jargon). TXGs are numbered sequentially and always commit in sequential order, and there is only one open TXG at any given time. Because ZFS immediately attaches all write IO to a transaction and thus a TXG, ZFS-level write operations cannot cross each other at the TXG level; if two writes are issued in order either they are both part of the same TXG or the second write is in a later TXG (and TXGs are atomic, which is the core of ZFS's consistency guarantees).

(An excellent long discussion of how ZFS transactions work is at the start of txg.c in the ZFS source.)
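
As a rough illustration of the ordering rule above, here is a toy Python model. It is only a sketch: the class and method names are made up for this example, and it collapses whatever internal states a real TXG passes through into a single "commit" step. All it tries to show is that a write issued later can never land in an earlier TXG than a write issued before it.

```python
class TxgModel:
    """Toy model: every write joins the single open TXG; TXGs commit in order."""

    def __init__(self):
        self.open_txg = 1          # number of the one open TXG
        self.pending = {1: []}     # txg number -> writes attached to it
        self.committed = []        # (txg number, writes), in commit order

    def write(self, name):
        """Attach a write to the currently open TXG and return its number."""
        self.pending[self.open_txg].append(name)
        return self.open_txg

    def commit_open_txg(self):
        """Commit the open TXG as a unit and open the next one in sequence."""
        txg = self.open_txg
        self.committed.append((txg, self.pending.pop(txg)))
        self.open_txg = txg + 1
        self.pending[self.open_txg] = []
        return txg


model = TxgModel()
first = model.write("A")      # joins TXG 1
model.commit_open_txg()       # TXG 1 commits, TXG 2 opens
second = model.write("B")     # joins TXG 2
third = model.write("C")      # same TXG as B, never an earlier one
print(first, second, third)
```

Two writes issued in order end up either in the same TXG (B and C) or with the second one in a later TXG (A and B); there is no way in the model to get the reverse.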

ZFS also has the ZIL aka the ZFS Intent Log. The ZIL exists because of the journaling fsync() problem: you don't want to have to flush out a huge file just because someone wanted to fsync() a small one (that gets you slow fsync()s and unhappy people). Without some sort of separate log all ZFS could do to force things to disk would be to immediately commit the entire current transaction group, which drags all uncommitted write operations with it whether or not they have anything to do with the file being fsync()'d.

One of the confusing things about the ZIL is that it's common to talk about 'the ZIL' as if there were only one, when this is not really the case. Each filesystem and zvol actually has its own separate ZIL, and these are all written to and recovered separately from each other (although if you have separate log devices the ZILs are all normally stored on the slog devices). We also need to draw a distinction between the on-disk ZIL and the in-memory 'ZIL' structure (implicitly for a particular dataset). The on-disk ZIL has committed records while the in-memory ZIL holds records that have neither been committed to disk yet nor expired because their TXG committed. A ZIL commit is the process of taking some or all of the in-memory ZIL records and flushing them to disk.
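
The split between the in-memory and on-disk ZIL can be sketched in a few lines of Python. This is a simplified model, not the real data structures: `DatasetZil` and the dataset names are hypothetical, and the model does not try to track when on-disk records are reclaimed.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetZil:
    """Toy per-dataset ZIL: an in-memory list plus an 'on-disk' list."""
    in_memory: list = field(default_factory=list)   # (txg, description)
    on_disk: list = field(default_factory=list)

    def log(self, txg, description):
        """Record a write operation in the in-memory ZIL."""
        self.in_memory.append((txg, description))

    def commit(self):
        """ZIL commit: flush the in-memory records to the on-disk log."""
        self.on_disk.extend(self.in_memory)
        self.in_memory.clear()

    def txg_committed(self, txg):
        """Drop in-memory records whose TXG has now committed."""
        self.in_memory = [r for r in self.in_memory if r[0] > txg]


# Each dataset (filesystem or zvol) gets its own, independent ZIL.
zils = {"tank/home": DatasetZil(), "tank/vol0": DatasetZil()}
zils["tank/home"].log(5, "write file a")
zils["tank/vol0"].log(5, "write block 12")

zils["tank/home"].commit()       # only this dataset's records are flushed
for zil in zils.values():
    zil.txg_committed(5)         # TXG 5 commits; leftover records expire
```

In this run the record for the hypothetical `tank/vol0` never reaches the on-disk log at all; it simply expires when its TXG commits, which is the normal fate of writes that nobody asks to have flushed.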

Because ZFS doesn't know in advance what's going to be fsync()'d, the in-memory ZIL holds a record of all write operations done to the dataset. The ZIL has the concept of two sorts of write operations, 'synchronous' and 'asynchronous', and two sorts of ZIL commits, general and file-specific. Sync writes are always committed when the ZIL is committed; async writes are not committed if the ZIL is doing a file-specific commit and they are for a different file. ZFS metadata operations like creating or renaming files are synchronous while data writes are generally but not always asynchronous. For obvious reasons fsync() does a file-specific ZIL commit, as do the other ways of forcing synchronous write IO.
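
To make the sync/async and general/file-specific distinction concrete, here is a minimal sketch of the selection rule as described above. The `Record` type, the function name, and the file names are all invented for the example, and it deliberately ignores any special cases or performance hacks in the real code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    file: str    # file (or directory) the operation applies to
    op: str      # what kind of operation it was
    sync: bool   # True for 'synchronous' records, False for 'asynchronous'


def select_for_commit(records, only_file=None):
    """Split records into (flushed, left_in_memory) for a ZIL commit.

    only_file=None models a general commit. A file name models a
    file-specific commit such as the one fsync() triggers.
    """
    flush, keep = [], []
    for rec in records:
        # Only async records for *other* files are skipped, and only
        # when the commit is file-specific.
        skip = only_file is not None and not rec.sync and rec.file != only_file
        (keep if skip else flush).append(rec)
    return flush, keep


records = [
    Record("big.dat", "write", sync=False),
    Record("newdir", "mkdir", sync=True),
    Record("small.txt", "write", sync=False),
]
flush, keep = select_for_commit(records, only_file="small.txt")
print([r.file for r in flush])   # ['newdir', 'small.txt']
print([r.file for r in keep])    # ['big.dat']
```

The fsync() of `small.txt` drags the metadata operation along with it but leaves the async data write to `big.dat` in memory, which is the whole point of having a file-specific commit.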

If the ZIL is active for a dataset the dataset no longer has strong write ordering properties for data that is not explicitly flushed to disk via fsync() or the like. Because of a performance hack for fsync() this currently extends well beyond the obvious case of writing one file, writing a second file, and fsync()'ing the second file; in some cases write data will be included in a ZIL commit even though it has not been explicitly flushed.

(If you want the gory details, see: 1, 2, 3. This applies to all versions of ZFS, not just ZFS on Linux.)

ZIL records, both in memory and on disk, are completely separate from the transactions that are part of transaction groups and they're not read from either memory or disk in the process of committing a transaction group. In fact under normal operation on-disk ZIL records are never read at all. This can sometimes be a problem if you have separate log devices, because nothing will notice if your log device is throwing away writes (or corrupting them) or if what it wrote can't actually be read back.

(I believe that pool scrubs do read the on-disk ZIL as a check but I'm not entirely sure.)

Modern versions of ZFS support a per-filesystem 'sync=' property. What I've described above is the behavior of the default setting for it, 'standard'. A setting of 'always' forces a ZIL commit on every write operation (and as a result has a strong write order guarantee). A setting of 'disabled' disables ZIL commits but not the in-memory ZIL, which will continue to accumulate records between TXG commits and then drop the records when a TXG commits. A filesystem with 'sync=disabled' actually has stronger write ordering guarantees than a filesystem with the ZIL enabled, at the cost of lying to applications about whether data actually is solidly on disk at all (in some cases this may be okay).

(Presumably one reason for keeping the in-memory ZIL active for sync=disabled is so that you can change this property back and have fsync() immediately start doing the right thing.)
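
The three settings boil down to a small decision rule, sketched below in Python. The function name and its arguments are hypothetical; the only real names are the property values themselves, where the default behavior is spelled 'standard'.

```python
def zil_commit_needed(sync_setting, app_requested_flush):
    """Decide whether a write operation triggers a ZIL commit (toy model).

    sync_setting is a value of the 'sync' property. app_requested_flush
    is True for fsync() and the other ways of forcing synchronous IO.
    """
    if sync_setting == "always":
        return True                  # every write operation commits the ZIL
    if sync_setting == "disabled":
        return False                 # no ZIL commits, even for fsync()
    if sync_setting == "standard":
        return app_requested_flush   # commit only when the application asks
    raise ValueError(f"unknown sync setting: {sync_setting!r}")


for setting in ("standard", "always", "disabled"):
    print(setting,
          zil_commit_needed(setting, app_requested_flush=False),
          zil_commit_needed(setting, app_requested_flush=True))
```

Note that with 'disabled' the application's fsync() still returns successfully in practice; the model only captures that nothing is forced to disk before the next TXG commit.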

Under some circumstances the on-disk ZIL uses clever optimizations so that it doesn't have to write out two copies of large write()s (one to the ZIL log for a ZIL commit and then a second to the regular ZFS pool data structures as part of the TXG commit). A discussion of exactly how this works is beyond the scope of this entry, which is already long enough as it is.

(There is a decent comment discussing some more details of the ZIL at the start of zil.c.)

Comments on this page:

Thank you! This post helps a lot!

Written on 11 July 2013.


Last modified: Thu Jul 11 00:44:49 2013