Wandering Thoughts archives

2024-02-19

The flow of activity in the ZFS Intent Log (as I understand it)

The ZFS Intent Log (ZIL) is a confusing thing once you get into the details, and for reasons beyond the scope of this entry I recently needed to sort out the details of some aspects of how it works. So here is what I know about how things flow into the ZIL, both in memory and then on to disk.

(As always, there is no single 'ZFS Intent Log' in a ZFS pool. Each dataset (a filesystem or a zvol) has its own logically separate ZIL. We talk about 'the ZIL' as a convenience.)

When you perform activities that modify a ZFS dataset, each activity creates its own ZIL log record (a transaction in ZIL jargon, sometimes called an 'itx', probably short for 'intent transaction') that is put into that dataset's in-memory ZIL log. This includes both straightforward data writes and metadata activity like creating or renaming files. You can see a big list of all of the possible transaction types in zil.h as all of the TX_* definitions (which have brief useful comments). In-memory ZIL transactions aren't necessarily immediately flushed to disk, especially for things like simply doing a write() to a file. The reason that plain write()s to a file are (still) given ZIL transactions is that you may call fsync() on the file later. If you don't call fsync() and the regular ZFS transaction group commits with your write()s, those ZIL transactions will be quietly cleaned out of the in-memory ZIL log (along with all of the other now unneeded ZIL transactions).

(All of this assumes that your dataset doesn't have 'sync=disabled' set, which turns off the in-memory ZIL as one of its effects.)

When you perform an action such as fsync() or sync() that requests that in-memory ZFS state be made durable on disk, ZFS gathers up some or all of those in-memory ZIL transactions and writes them to disk in one go, as a sequence of log (write) blocks ('lwb' or 'lwbs' in ZFS source code), which pack together those ZIL transaction records. This is called a ZIL commit. Depending on various factors, the flushed out data you write() may or may not be included in the log (write) blocks committed to the (dataset's) ZIL. Sometimes your file data will be written directly into its future permanent location in the pool's free space (which is safe) and the ZIL commit will have only a pointer to this location (its DVA).

(For a discussion of this, see the comments about the WR_* constants in zil.h. Also, while in memory, ZFS transactions are classified as either 'synchronous' or 'asynchronous'. Sync transactions are always part of a ZIL commit, but async transactions are only included as necessary. See zil_impl.h and also my entry discussing this.)

It's possible for several processes (or threads) to all call sync() or fsync() at once (well, before the first one finishes committing the ZIL). In this case, their requests can all be merged together into one ZIL commit that covers all of them. This means that fsync() and sync() calls don't necessarily match up one to one with ZIL commits. I believe it's also possible for a fsync() or sync() to not result in a ZIL commit if all of the relevant data has already been written out as part of a regular ZFS transaction group (or a previous request).

Because of all of this, there are various different ZIL related metrics that you may be interested in, sometimes with picky but important differences between them. For example, there is a difference between 'the number of bytes written to the ZIL' and 'the number of bytes written as part of ZIL commits', since the latter would include data written directly to its final space in the main pool. You might care about the latter when you're investigating the overall IO impact of ZIL commits but the former if you're looking at sizing a separate log device (a 'slog' in ZFS terminology).

solaris/ZFSZILActivityFlow written at 21:58:13;


Page tools: See As Normal.
Search:
Login: Password:

This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.