ZFS transaction groups and the ZFS Intent Log
I've just been digging around in the depths of the ZIL and of ZFS transaction groups, so before I forget everything I've figured out I'm going to write it down (partly because when I went looking I couldn't find any really detailed information on this stuff). The necessary disclaimer is that all of this is as far as I can tell from my own research and code reading and thus I could be wrong about some of it.
Let's start with transaction groups. All write operations in ZFS are part of a transaction and every transaction is part of a 'transaction group' (a TXG in the ZFS jargon). TXGs are numbered sequentially and always commit in sequential order, and there is only one open TXG at any given time. Because ZFS immediately attaches all write IO to a transaction and thus a TXG, ZFS-level write operations cannot cross each other at the TXG level; if two writes are issued in order either they are both part of the same TXG or the second write is in a later TXG (and TXGs are atomic, which is the core of ZFS's consistency guarantees).
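To make the ordering property concrete, here is a minimal toy model in Python. It is only a sketch of the rule described above, not how ZFS implements it: `ToyTxgPool` and its methods are made-up names, and the model ignores everything else that happens to a TXG between being open and being on disk.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTxgPool:
    """Toy model: one open TXG at a time, TXGs commit in numeric order."""

    open_txg: int = 1
    last_committed: int = 0
    pending: dict = field(default_factory=dict)  # TXG number -> list of writes

    def write(self, what: str) -> int:
        # A write is attached to the currently open TXG immediately.
        self.pending.setdefault(self.open_txg, []).append(what)
        return self.open_txg

    def commit_open_txg(self) -> list:
        # Commit the open TXG as a unit, then open the next one.
        txg = self.open_txg
        committed = self.pending.pop(txg, [])
        self.last_committed = txg
        self.open_txg = txg + 1
        return committed


pool = ToyTxgPool()
first = pool.write("write A")
second = pool.write("write B")   # same TXG as A
committed = pool.commit_open_txg()
third = pool.write("write C")    # lands in a later TXG

# Writes issued in order never end up in an earlier TXG than their predecessor.
assert first <= second <= third
```

In this model a later write can share a TXG with an earlier one or land in a later TXG, but never an earlier one, which is all the ordering guarantee amounts to.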
(An excellent long discussion of how ZFS transactions work is at the start of txg.c in the ZFS source.)
ZFS also has the ZIL, aka the ZFS Intent Log. The ZIL exists because of the journaling fsync() problem: you don't want to have to flush out a huge file just because someone wanted to fsync() a small one (that gets you slow fsync()s and unhappy people). Without some sort of separate log, all ZFS could do to force things to disk would be to immediately commit the entire current transaction group, which drags all uncommitted write operations with it whether or not they have anything to do with the file being fsync()'d.
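From the application side, the situation looks something like the following sketch. It uses only standard Python calls (`os.fsync()` and friends); the file names and sizes are arbitrary illustrations, and nothing in it is ZFS-specific.

```python
import os
import tempfile


def write_and_fsync(path: str, data: bytes) -> None:
    """Write data to path and return only once fsync() has returned."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # push Python's userspace buffer to the kernel
        os.fsync(f.fileno())  # ask the kernel to force this one file to disk


with tempfile.TemporaryDirectory() as tmp:
    big = os.path.join(tmp, "big.dat")
    small = os.path.join(tmp, "small.txt")

    # A large write that nobody has asked to have flushed yet.
    with open(big, "wb") as f:
        f.write(b"\0" * (8 * 1024 * 1024))

    # Only the small file is explicitly flushed.
    write_and_fsync(small, b"commit marker\n")
    small_size = os.path.getsize(small)
```

The program only cares about how long the fsync() of the small file takes. Whether the filesystem also has to push out the 8 MB of unrelated data to satisfy it is exactly the problem a separate log is there to avoid.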
One of the confusing things about the ZIL is that it's common to talk about 'the ZIL' as if there were only one, when this is not really the case. Each filesystem and zvol actually has its own separate ZIL, and these are all written to and recovered separately from each other (although if you have separate log devices the ZILs are all normally stored on the slog devices). We also need to draw a distinction between the on-disk ZIL and the in-memory 'ZIL' structure (implicitly for a particular dataset). The on-disk ZIL has committed records while the in-memory ZIL holds records that have not yet been committed (or expired because their TXG committed). A ZIL commit is the process of taking some or all of the in-memory ZIL records and flushing them to disk.
Because ZFS doesn't know in advance what's going to be fsync()'d, the in-memory ZIL holds a record of all write operations done to the dataset. The ZIL has the concept of two sorts of write operations, 'synchronous' and 'asynchronous', and two sorts of ZIL commits, general and file-specific. Sync writes are always committed when the ZIL is committed; async writes are not committed if the ZIL is doing a file-specific commit and they are for a different file. ZFS metadata operations like creating or renaming files are synchronous while data writes are generally but not always asynchronous. For obvious reasons fsync() does a file-specific ZIL commit, as do the other ways of forcing synchronous write IO.
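The selection rule for what a ZIL commit flushes can be sketched as a small toy model. This is an illustration of the rule as stated above and nothing more: `ToyRecord` and `ToyInMemoryZil` are invented names, not ZFS structures, and the model deliberately ignores every other thing the real code does.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ToyRecord:
    file: str
    op: str
    sync: bool  # True for 'synchronous' records, False for 'asynchronous'


class ToyInMemoryZil:
    """Toy per-dataset in-memory ZIL holding not-yet-committed records."""

    def __init__(self) -> None:
        self.records: List[ToyRecord] = []

    def log(self, file: str, op: str, sync: bool) -> None:
        self.records.append(ToyRecord(file, op, sync))

    def commit(self, file: Optional[str] = None) -> List[ToyRecord]:
        """General commit if file is None, otherwise a file-specific one."""
        flushed, kept = [], []
        for rec in self.records:
            # Sync records always go out; async records go out on a general
            # commit or when they belong to the file being committed.
            if file is None or rec.sync or rec.file == file:
                flushed.append(rec)
            else:
                kept.append(rec)
        self.records = kept
        return flushed


zil = ToyInMemoryZil()
zil.log("a", "create", sync=True)    # metadata operation: synchronous
zil.log("a", "write", sync=False)    # ordinary data write: asynchronous
zil.log("b", "write", sync=False)

flushed = zil.commit(file="b")       # what an fsync() of file b would trigger
left_behind = list(zil.records)
```

Here the file-specific commit for `b` flushes the creation of `a` and the write to `b`, while the async write to `a` stays in memory until a later commit (or until its TXG commits and the record is no longer needed).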
If the ZIL is active for a dataset the dataset no longer has strong write ordering properties for data that is not explicitly flushed to disk via fsync() or the like. Because of a performance hack for fsync() this currently extends well beyond the obvious case of writing one file, writing a second file, and fsync()'ing the second file; in some cases write data will be included in a ZIL commit even though it has not been explicitly flushed.
(If you want the gory details, see: 1, 2, 3. This applies to all versions of ZFS, not just ZFS on Linux.)
ZIL records, both in memory and on disk, are completely separate from the transactions that are part of transaction groups and they're not read from either memory or disk in the process of committing a transaction group. In fact under normal operation on-disk ZIL records are never read at all. This can sometimes be a problem if you have separate ZIL log devices because nothing will notice if your log device is throwing away writes (or corrupting them) or can't actually read them back.
(I believe that pool scrubs do read the on-disk ZIL as a check but I'm not entirely sure.)
Modern versions of ZFS support a per-filesystem 'sync=' property. What I've described above is the behavior of the default setting for it ('standard'). A setting of 'always' forces a ZIL commit on every write operation (and as a result has a strong write order guarantee). A setting of 'disabled' disables ZIL commits but not the in-memory ZIL, which will continue to accumulate records between TXG commits and then drop the records when a TXG commits. A filesystem with 'sync=disabled' actually has stronger write ordering guarantees than a filesystem with the ZIL enabled, at the cost of lying to applications about whether data actually is solidly on disk at all (in some cases this may be okay).
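The three settings can be summarized in a few lines of Python. This is a toy decision table rather than anything taken from ZFS: the function name and the returned keys are made up for illustration, and 'standard' is the property value for the default behavior.

```python
def toy_sync_behavior(sync_setting: str, app_asked_for_sync: bool) -> dict:
    """Toy summary of what a write operation triggers under each setting."""
    if sync_setting not in ("standard", "always", "disabled"):
        raise ValueError(f"unknown sync setting: {sync_setting!r}")

    if sync_setting == "always":
        zil_commit = True                 # every write forces a ZIL commit
    elif sync_setting == "disabled":
        zil_commit = False                # ZIL commits never happen
    else:
        zil_commit = app_asked_for_sync   # only on fsync() and the like

    return {
        "zil_commit": zil_commit,
        # The application believes its data is on disk when it is not.
        "app_misled": app_asked_for_sync and not zil_commit,
    }


standard_fsync = toy_sync_behavior("standard", app_asked_for_sync=True)
disabled_fsync = toy_sync_behavior("disabled", app_asked_for_sync=True)
always_plain = toy_sync_behavior("always", app_asked_for_sync=False)
```

Only the 'disabled' case ever sets `app_misled`, which is the 'lying to applications' cost in compact form.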
(Presumably one reason for keeping the in-memory ZIL active for sync=disabled is so that you can change this property back and have fsync() immediately start doing the right thing.)
Under some circumstances the on-disk ZIL uses clever optimizations so that it doesn't have to write out two copies of large write()s (one to the ZIL log for a ZIL commit and then a second to the regular ZFS pool data structures as part of the TXG commit). A discussion of exactly how this works is beyond the scope of this entry, which is already long enough as it is.
(There is a decent comment discussing some more details of the ZIL at the start of zil.c.)