2013-07-11
The ZFS ZIL's optimizations for data writes
In yesterday's entry on the ZIL I mentioned that the
ZIL has some clever optimizations for large write()s. To understand
these (and some related ZFS filesystem properties), let's start with the
fundamental problem.
A straightforward filesystem journal simply includes a full copy
of each operation or transaction that it's recording. Many of these full
copies will be small (for metadata operations like file renames), but
for data writes you need to include the data being written. Now suppose
that you are routinely writing a lot of data and then fsync()'ing it.
This will wind up with the filesystem writing two copies of that large
data, one copy recorded in the journal and then a second copy written
to the actual live filesystem. This is inefficient and, worse, it costs
you both disk seeks (between the location of the journal and the final
location of the data) and write bandwidth.
Because ZFS is a copy-on-write filesystem where old data is never
overwritten in place, it can optimize this process in a straightforward
way. Rather than putting the new data into the journal it can directly
write the new data to its final (new) location in the filesystem and
then simply record that new location in the journal. However, this
is now a tradeoff; in exchange for not writing the data twice you're
forcing the journal commit to wait for a separate (and full) data write,
complete with an extra seek between the journal and the final location
of the data. For sufficiently small amounts of data this tradeoff is not
worth it and you're better off just writing an extra copy of the data to
the journal without waiting.
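(To put rough, made-up numbers on this: on a disk with 8 ms seeks and
100 MB/s of write bandwidth, copying an extra 4 KB into the journal
costs well under a tenth of a millisecond of writing, far less than the
seek you avoid, while copying an extra 100 MB into the journal costs
about a second of pure writing, far more than a seek. Somewhere in
between, the two approaches cross over.)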
In ZFS, this division point is set by the global tuneable variable
zfs_immediate_write_sz. Data writes larger than this size will be
pushed directly to their final location and the ZIL will only include a
pointer to it.
Actually that's a lie. The real situation is rather more complicated.
First, if the data write is larger than the file's blocksize it is
always put into the on-disk ZIL (possibly because otherwise the ZIL
would have to record multiple pointers to its final locations since it
will be split across multiple blocks, which could get complicated). Next,
you can set filesystems to have 'logbias=throughput'; such a
filesystem writes all data blocks to their final locations (among other
effects). Finally, if you have a separate log device (with a normal
logbias) data writes will always go into the log regardless of their
size, even for very large writes.
So in summary zfs_immediate_write_sz only makes a difference if you
are using logbias=latency and do not have a separate log device,
which basically amounts to 'if you have a normal pool without any
sort of special setup'. If you are using logbias=throughput it is
effectively 0; if you have a separate log device it is effectively
infinite.
Update (October 13 2013): It turns out that this description is not quite complete. See part 2 for an important qualification.
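To make these rules concrete, here is a little C sketch of the decision
as I understand it (ignoring the qualification in the update). The names
are made up for illustration and this is not the real ZFS code, which as
far as I can tell lives in zfs_log_write().

/*
 * Toy illustration of the rules above; these names are invented and
 * this is not the actual ZFS code. The precedence between the rules
 * is my best guess.
 */
#include <stdbool.h>
#include <stddef.h>

/* The global tuneable; 32 KB is the default I've seen. */
size_t zfs_immediate_write_sz = 32 * 1024;

enum wr_policy {
	WR_INDIRECT,	/* data goes straight to its final location and
			   the ZIL records only a pointer to it */
	WR_INTO_LOG,	/* a copy of the data goes into the on-disk ZIL */
};

enum logbias { LOGBIAS_LATENCY, LOGBIAS_THROUGHPUT };

enum wr_policy
zil_data_write_policy(size_t wsize, size_t file_blocksize,
                      enum logbias bias, bool have_slog)
{
	if (bias == LOGBIAS_THROUGHPUT)
		return WR_INDIRECT;	/* threshold is effectively 0 */
	if (have_slog)
		return WR_INTO_LOG;	/* threshold is effectively infinite */
	if (wsize > file_blocksize)
		return WR_INTO_LOG;	/* would span several blocks */
	if (wsize > zfs_immediate_write_sz)
		return WR_INDIRECT;
	return WR_INTO_LOG;
}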
Sidebar: slogs and logbias=throughput
Note that there is no point in having a separate log device and setting
logbias=throughput on all of your filesystems, because the latter
makes the filesystems not use your slog. This is implicit in the
description of throughput's behavior but may not be clear enough.
'Throughput' is apparently intended for situations where you want
to preserve your slog bandwidth and latency for filesystems where
ZIL commit latency is very important; you set everything else to
logbias=throughput so that they don't touch the slog.
If you have an all-SSD pool with no slogs it may make sense to set
logbias=throughput on everything in it. Seeks are basically free on
SSDs and you'll probably wind up using less overall bandwidth to the
SSDs since you're writing less data. Note that I haven't measured
or researched this.
ZFS transaction groups and the ZFS Intent Log
I've just been digging around in the depths of the ZIL and of ZFS transaction groups, so before I forget everything I've figured out I'm going to write it down (partly because when I went looking I couldn't find any really detailed information on this stuff). The necessary disclaimer is that all of this is as far as I can tell from my own research and code reading and thus I could be wrong about some of it.
Let's start with transaction groups. All write operations in ZFS are part of a transaction and every transaction is part of a 'transaction group' (a TXG in the ZFS jargon). TXGs are numbered sequentially and always commit in sequential order, and there is only one open TXG at any given time. Because ZFS immediately attaches all write IO to a transaction and thus a TXG, ZFS-level write operations cannot cross each other at the TXG level; if two writes are issued in order either they are both part of the same TXG or the second write is in a later TXG (and TXGs are atomic, which is the core of ZFS's consistency guarantees).
(An excellent long discussion of how ZFS transactions work is at the start of txg.c in the ZFS source.)
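As a way of thinking about this ordering property, here is a trivial C
model of it; the names are made up and this is not real ZFS code (the
real machinery is in txg.c).

#include <stdint.h>

/* There is exactly one open TXG at any given time. */
uint64_t open_txg = 1;

/* Every write operation is attached to whatever TXG is open right now. */
uint64_t
assign_write_to_txg(void)
{
	return open_txg;
}

/*
 * TXGs commit atomically and strictly in numeric order, so if two
 * writes are issued in order the second one is attached to either the
 * same TXG or a numerically later one; the open TXG number only ever
 * goes up, so the second write can never land 'behind' the first.
 */
void
close_and_commit_open_txg(void)
{
	/* ... atomically write out everything attached to open_txg ... */
	open_txg++;
}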
ZFS also has the ZIL aka the ZFS Intent Log. The ZIL exists because
of the journaling fsync() problem: you don't want to have to flush
out a huge file just because someone wanted to fsync() a small one
(that gets you slow fsync()s and unhappy people). Without some sort
of separate log, all ZFS could do to force things to disk would be to
immediately commit the entire current transaction group, which drags
all uncommitted write operations with it whether or not they have
anything to do with the file being fsync()'d.
One of the confusing things about the ZIL is that it's common to talk about 'the ZIL' as if there were only one, when this is not really the case. Each filesystem and zvol actually has its own separate ZIL, and these are all written to and recovered separately from each other (although if you have separate log devices, the ZILs are all normally stored on the slog devices). We also need to draw a distinction between the on-disk ZIL and the in-memory 'ZIL' structure (implicitly for a particular dataset). The on-disk ZIL has committed records while the in-memory ZIL holds records that have not yet been committed (or expired because their TXG committed). A ZIL commit is the process of taking some or all of the in-memory ZIL records and flushing them to disk.
Because ZFS doesn't know in advance what's going to be fsync()'d,
the in-memory ZIL holds a record of all write operations done to
the dataset. The ZIL has the concept of two sorts of write operations,
'synchronous' and 'asynchronous', and two sorts of ZIL commits,
general and file-specific. Sync writes are always committed when
the ZIL is committed; async writes are not committed if the ZIL is
doing a file-specific commit and they are for a different file. ZFS
metadata operations like creating or renaming files are synchronous
while data writes are generally but not always asynchronous. For
obvious reasons fsync() does a file-specific ZIL commit, as do
the other ways of forcing synchronous write IO.
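Here is a little C sketch of how I think of this, again with made-up
names; as far as I can tell the real in-memory ZIL is built out of
itx_t records in zil.c and is rather more involved than this.

#include <stdbool.h>
#include <stdint.h>

/* One uncommitted operation in a dataset's in-memory ZIL. */
struct zil_record {
	uint64_t object;		/* which file (object) it applies to */
	bool is_sync;			/* metadata ops, some data writes */
	struct zil_record *next;	/* simple list of uncommitted records */
};

/* Does this record get flushed out as part of this ZIL commit? */
bool
record_in_commit(const struct zil_record *rec, bool file_specific,
                 uint64_t target_object)
{
	if (rec->is_sync)
		return true;		/* sync records always go out */
	if (!file_specific)
		return true;		/* a general commit takes everything */
	/* async records only go out if they're for the fsync()'d file */
	return rec->object == target_object;
}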
If the ZIL is active for a dataset the dataset no longer has strong
write ordering properties for data that is not explicitly flushed
to disk via fsync() or the like. Because of a performance hack
for fsync() this currently extends well beyond the obvious case
of writing one file, writing a second file, and fsync()'ing the
second file; in some cases write data will be included in a ZIL
commit even though it has not been explicitly flushed.
(If you want the gory details, see: 1, 2, 3. This applies to all versions of ZFS, not just ZFS on Linux.)
ZIL records, both in memory and on disk, are completely separate from the transactions that are part of transaction groups and they're not read from either memory or disk in the process of committing a transaction group. In fact under normal operation on-disk ZIL records are never read at all. This can sometimes be a problem if you have separate ZIL log devices because nothing will notice if your log device is throwing away writes (or corrupting them) or can't actually read them back.
(I believe that pool scrubs do read the on-disk ZIL as a check but I'm not entirely sure.)
Modern versions of ZFS support a per-filesystem 'sync=' property.
What I've described above is the behavior of the default ('standard')
setting for it. A setting of 'always' forces a ZIL commit on every
write operation (and as a result has a strong write order guarantee).
A setting of 'disabled' disables ZIL commits but not the in-memory
ZIL, which will continue to accumulate records between TXG commits
and then drop the records when a TXG commits. A filesystem with
'sync=disabled' actually has stronger write ordering guarantees than
a filesystem with the ZIL enabled, at the cost of lying to applications
about whether data actually is solidly on disk at all (in some cases
this may be okay).
(Presumably one reason for keeping the in-memory ZIL active for
sync=disabled is so that you can change this property back and have
fsync() immediately start doing the right thing.)
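If you prefer it in code form, here is a toy C sketch of my reading of
what sync= does; the names are invented and this is not actual ZFS code.

#include <stdbool.h>

enum sync_setting { SYNC_STANDARD, SYNC_ALWAYS, SYNC_DISABLED };

/* Does this write operation trigger an immediate ZIL commit? */
bool
zil_commit_now(enum sync_setting sync, bool caller_wants_sync)
{
	switch (sync) {
	case SYNC_ALWAYS:
		return true;	/* every write is pushed out via the ZIL */
	case SYNC_DISABLED:
		return false;	/* fsync() et al return without a ZIL commit */
	case SYNC_STANDARD:
	default:
		/* only fsync(), O_SYNC/O_DSYNC writes, and so on */
		return caller_wants_sync;
	}
}

/*
 * Even with SYNC_DISABLED the in-memory ZIL keeps accumulating records;
 * they are simply dropped when their TXG commits instead of ever being
 * written to the on-disk ZIL.
 */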
Under some circumstances the on-disk ZIL uses clever optimizations so
that it doesn't have to write out two copies of large write()s (one to
the ZIL log for a ZIL commit and then a second to the regular ZFS pool
data structures as part of the TXG commit). A discussion of exactly
how this works is beyond the scope of this entry,
which is already long enough as it is.
(There is a decent comment discussing some more details of the ZIL at the start of zil.c.)