2024-02-19
The flow of activity in the ZFS Intent Log (as I understand it)
The ZFS Intent Log (ZIL) is a confusing thing once you get into the details, and for reasons beyond the scope of this entry I recently needed to sort out the details of some aspects of how it works. So here is what I know about how things flow into the ZIL, both in memory and then on to disk.
(As always, there is no single 'ZFS Intent Log' in a ZFS pool. Each dataset (a filesystem or a zvol) has its own logically separate ZIL. We talk about 'the ZIL' as a convenience.)
When you perform activities that modify a ZFS dataset, each activity creates its own ZIL log record (a transaction in ZIL jargon, sometimes called an 'itx', probably short for 'intent transaction') that is put into that dataset's in-memory ZIL log. This includes both straightforward data writes and metadata activity like creating or renaming files. You can see a big list of all of the possible transaction types in zil.h as all of the TX_* definitions (which have brief useful comments).
In-memory ZIL transactions aren't necessarily immediately flushed to disk, especially for things like simply doing a write() to a file. The reason that plain write()s to a file are (still) given ZIL transactions is that you may call fsync() on the file later. If you don't call fsync() and the regular ZFS transaction group commits with your write()s, those ZIL transactions will be quietly cleaned out of the in-memory ZIL log (along with all of the other now unneeded ZIL transactions).
(All of this assumes that your dataset doesn't have 'sync=disabled' set, which turns off the in-memory ZIL as one of its effects.)
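To make the application side of this concrete, here is a minimal sketch in Python, whose os.write() and os.fsync() are thin wrappers over the system calls of the same names. The function name and the file name are made up for illustration and nothing in it is ZFS specific; on a ZFS dataset, the os.fsync() call is the point where the in-memory ZIL transactions for the file would need to be made durable.

```python
import os
import tempfile


def write_durably(path, data):
    """Write data to path and fsync() it; return the number of bytes written."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        written = 0
        # os.write() may write fewer bytes than asked for, so loop.
        while written < len(data):
            written += os.write(fd, data[written:])
        # Without this call the data is only made durable whenever the
        # filesystem gets around to it on its own.
        os.fsync(fd)
        return written
    finally:
        os.close(fd)


# Example use, against a throwaway directory.
with tempfile.TemporaryDirectory() as tmpdir:
    write_durably(os.path.join(tmpdir, "example.dat"), b"hello ZIL\n")
```

If the os.fsync() line is removed, the program behaves the same from the outside; the difference is only in when the data is guaranteed to be on disk.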
When you perform an action such as fsync() or sync() that requests that in-memory ZFS state be made durable on disk, ZFS gathers up some or all of those in-memory ZIL transactions and writes them to disk in one go, as a sequence of log (write) blocks ('lwb' or 'lwbs' in ZFS source code), which pack together those ZIL transaction records. This is called a ZIL commit. Depending on various factors, the data from your write()s may or may not be included in the log (write) blocks committed to the (dataset's) ZIL. Sometimes your file data will be written directly into its future permanent location in the pool's free space (which is safe) and the ZIL commit will have only a pointer to this location (its DVA).
(For a discussion of this, see the comments about the WR_* constants in zil.h. Also, while in memory, ZIL transactions are classified as either 'synchronous' or 'asynchronous'. Sync transactions are always part of a ZIL commit, but async transactions are only included as necessary. See zil_impl.h and also my entry discussing this.)
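The flow described so far can be sketched as a toy model. This is not ZFS code and every name in it (Itx, ToyZil, log, txg_synced, commit) is hypothetical. It assumes that 'as necessary' means 'the async transaction is for the object being fsync()'d', which is a simplification, and which records are marked sync in the example is an arbitrary choice.

```python
from dataclasses import dataclass, field


@dataclass
class Itx:
    obj: int     # object (file) the record applies to
    txg: int     # transaction group the change is part of
    sync: bool   # True for 'synchronous' transactions
    desc: str


@dataclass
class ToyZil:
    itxs: list = field(default_factory=list)

    def log(self, itx):
        """Every modification gets a record in the in-memory log."""
        self.itxs.append(itx)

    def txg_synced(self, txg):
        """A regular txg commit makes records up to that txg unneeded."""
        self.itxs = [i for i in self.itxs if i.txg > txg]

    def commit(self, obj):
        """Return the records a ZIL commit for obj would write out.

        Assumption: sync records always go, async ones only if they
        are for the object being fsync()'d.
        """
        chosen = [i for i in self.itxs if i.sync or i.obj == obj]
        self.itxs = [i for i in self.itxs if not (i.sync or i.obj == obj)]
        return chosen


zil = ToyZil()
zil.log(Itx(obj=1, txg=10, sync=False, desc="write to file 1"))
zil.log(Itx(obj=2, txg=10, sync=False, desc="write to file 2"))
zil.log(Itx(obj=3, txg=10, sync=True, desc="create file 3"))

committed = zil.commit(obj=1)   # fsync() on file 1
zil.txg_synced(10)              # later, txg 10 commits normally
```

In this run the commit picks up the write to file 1 and the sync record for file 3, while the write to file 2 stays in memory until the transaction group commit cleans it out.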
It's possible for several processes (or threads) to all call sync() or fsync() at once (well, before the first one finishes committing the ZIL). In this case, their requests can all be merged together into one ZIL commit that covers all of them. This means that fsync() and sync() calls don't necessarily match up one to one with ZIL commits. I believe it's also possible for a fsync() or sync() to not result in a ZIL commit if all of the relevant data has already been written out as part of a regular ZFS transaction group (or a previous request).
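As a rough illustration of why the counts diverge, here is a generic 'group commit' simulation. It is not how ZFS actually schedules its log blocks; the function name, the arrival times and the fixed commit duration are all assumptions invented for the example. Requests that arrive while a commit is in progress wait and are then all covered by the next commit.

```python
def count_commits(arrivals, commit_time):
    """Count commits needed for fsync() calls arriving at the given times.

    Toy rule: a commit starts as soon as the log is free and at least
    one request is waiting, and it covers every request that has
    arrived by the time it starts.
    """
    pending = sorted(arrivals)
    commits = 0
    free_at = 0.0   # when the log is next free to start a commit
    i = 0
    while i < len(pending):
        start = max(pending[i], free_at)
        # Everything that has arrived by 'start' rides along.
        while i < len(pending) and pending[i] <= start:
            i += 1
        commits += 1
        free_at = start + commit_time
    return commits


slow = count_commits([0, 1, 2, 3], commit_time=5)    # commits overlap arrivals
fast = count_commits([0, 1, 2, 3], commit_time=0.5)  # each call gets its own
```

With slow commits, the four calls here are covered by two commits; with fast commits, each call gets its own.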
Because of all of this, there are various different ZIL related metrics that you may be interested in, sometimes with picky but important differences between them. For example, there is a difference between 'the number of bytes written to the ZIL' and 'the number of bytes written as part of ZIL commits', since the latter would include data written directly to its final space in the main pool. You might care about the latter when you're investigating the overall IO impact of ZIL commits but the former if you're looking at sizing a separate log device (a 'slog' in ZFS terminology).
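The difference between those two byte counts can be shown with a little arithmetic. This sketch borrows the WR_* state names from zil.h but is otherwise invented: the function name and the write sizes are assumptions, and it ignores the size of the log record headers and block pointers themselves.

```python
WR_INDIRECT = "WR_INDIRECT"     # data written to its final location in the pool
WR_COPIED = "WR_COPIED"         # data already copied into the log record
WR_NEED_COPY = "WR_NEED_COPY"   # data copied into the log record at commit time


def commit_byte_counts(writes):
    """Return (bytes written to the ZIL, bytes written as part of ZIL commits).

    writes is a list of (state, nbytes) pairs, one per write record.
    """
    zil_bytes = 0
    commit_bytes = 0
    for state, nbytes in writes:
        commit_bytes += nbytes
        if state != WR_INDIRECT:
            # Only data carried inside log blocks counts as 'in the ZIL'.
            zil_bytes += nbytes
    return zil_bytes, commit_bytes


zil_bytes, commit_bytes = commit_byte_counts([
    (WR_COPIED, 4096),
    (WR_NEED_COPY, 8192),
    (WR_INDIRECT, 131072),
])
```

For these made-up writes, 12288 bytes go into the ZIL itself (the number that matters for a slog) while 143360 bytes are written in total as part of the commit.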