The ZFS ZIL's optimizations for data writes
In yesterday's entry on the ZIL I mentioned that the
ZIL has some clever optimizations for large write()s. To understand
these (and some related ZFS filesystem properties), let's start with the
fundamental problem.
A simple, straightforward filesystem journal includes a full copy
of each operation or transaction that it's recording. Many of these full
copies will be small (for metadata operations like file renames), but
for data writes you need to include the data being written. Now suppose
that you are routinely writing a lot of data and then fsync()'ing it.
This will wind up with the filesystem writing two copies of that large
data: one copy recorded in the journal and then a second copy written
to the actual live filesystem. This is inefficient and, worse, it costs
you both disk seeks (between the location of the journal and the final
location of the data) and write bandwidth.
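To make the double write concrete, here is a minimal sketch in C of the
naive scheme. None of this is real filesystem code; all of the names are
invented for illustration, and in-memory arrays stand in for the disk.

    #include <stddef.h>
    #include <string.h>

    /*
     * Naive journaling sketch: an fsync()'d data write puts a full copy of
     * the data into the journal, and the same data is written again (now
     * or later) to its final location in the filesystem.
     */
    static char journal[1 << 20];     /* stand-in for the on-disk journal */
    static char main_fs[1 << 20];     /* stand-in for the live filesystem */
    static size_t journal_used;

    static void naive_fsync_write(size_t final_off, const char *data, size_t len)
    {
        /* copy #1: the full data goes into the journal record */
        memcpy(journal + journal_used, data, len);
        journal_used += len;

        /* copy #2: the same data again at its final location, which on
         * spinning disks also means a seek away from the journal */
        memcpy(main_fs + final_off, data, len);
    }

For a 128 KB write() followed by an fsync(), that is 256 KB pushed at the
disks (plus record overhead) for 128 KB of new data, with a seek in between.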
Because ZFS is a copy-on-write filesystem where old data is never
overwritten in place, it can optimize this process in a straightforward
way. Rather than putting the new data into the journal it can directly
write the new data to its final (new) location in the filesystem and
then simply record that new location in the journal. However, this
is now a tradeoff; in exchange for not writing the data twice you're
forcing the journal commit to wait for a separate (and full) data write,
complete with an extra seek between the journal and the final location
of the data. For sufficiently small amounts of data this tradeoff is not
worth it and you're better off just writing an extra copy of the data to
the journal without waiting.
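One way to picture the two approaches is as two shapes of log record.
These structs are simplified inventions for this entry, not the real
on-disk format; as far as I remember, the actual ZFS record (lr_write_t)
is a single structure that gets used in either role.

    #include <stdint.h>

    /* Small write: a full copy of the data is embedded in the log record. */
    struct log_record_copied {
        uint64_t object;        /* which file */
        uint64_t offset;        /* where in the file */
        uint64_t length;
        uint8_t  data[];        /* the written data itself follows */
    };

    /* Large write: the data has already been written (copy-on-write) to its
     * final location; the log record only remembers where that is. */
    struct log_record_indirect {
        uint64_t object;
        uint64_t offset;
        uint64_t length;
        uint64_t final_block;   /* stand-in for a real ZFS block pointer */
    };

Which form you want clearly depends on how much data the write involves.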
In ZFS, this division point is set by the global tuneable variable
zfs_immediate_write_sz. Data writes larger than this size will be
pushed directly to their final location and the ZIL will only include a
pointer to that location.
Actually that's a lie. The real situation is rather more complicated.
First, if the data write is larger than the file's blocksize it is
always put into the on-disk ZIL (possibly because otherwise the ZIL
would have to record multiple pointers to its final location since it
will be split across multiple blocks, which could get complicated). Next,
you can set filesystems to have 'logbias=throughput'; such a
filesystem writes all data blocks to their final locations (among other
effects). Finally, if you have a separate log device (with a normal
logbias) data writes will always go into the log regardless of their
size, even for very large writes.
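Putting all of those rules together, the decision looks roughly like the
following C sketch. WR_INDIRECT and WR_COPIED are, if I'm remembering the
code correctly, the names the real ZFS code uses for 'log only a pointer'
and 'copy the data into the ZIL'; the function itself is my paraphrase of
the rules above, not the actual ZFS logic.

    #include <stddef.h>
    #include <stdint.h>
    #include <stdbool.h>

    extern uint64_t zfs_immediate_write_sz;    /* the global tuneable */

    enum write_state { WR_COPIED, WR_INDIRECT };

    /* A paraphrase of the rules described above, not the real ZFS code. */
    static enum write_state
    choose_write_state(size_t write_size, size_t file_blocksize,
                       bool logbias_throughput, bool have_slog)
    {
        uint64_t threshold;

        if (logbias_throughput)
            threshold = 0;                      /* effectively zero */
        else if (have_slog)
            threshold = UINT64_MAX;             /* effectively infinite */
        else
            threshold = zfs_immediate_write_sz;

        /* Writes larger than the file's blocksize always put their data
         * into the ZIL; otherwise it depends on the effective threshold. */
        if (write_size <= file_blocksize && write_size > threshold)
            return WR_INDIRECT;                 /* record only a pointer */
        return WR_COPIED;                       /* record a copy of the data */
    }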
So in summary, zfs_immediate_write_sz only makes a difference if you
are using logbias=latency and do not have a separate log device,
which can basically be summarized as 'if you have a normal pool without
any sort of special setup'. If you are using logbias=throughput it
is effectively 0; if you have a separate log device it is effectively
infinite.
Update (October 13 2013): It turns out that this description is not quite complete. See part 2 for an important qualification.
Sidebar: slogs and logbias=throughput
Note that there is no point in having a separate log device and setting
logbias=throughput on all of your filesystems, because the latter
makes the filesystems not use your slog. This is implicit in the
description of throughput's behavior but may not be clear enough.
'Throughput' is apparently intended for situations where you want
to preserve your slog bandwidth and latency for filesystems where
ZIL commit latency is very important; you set everything else to
logbias=throughput so that they don't touch the slog.
If you have an all-SSD pool with no slogs it may make sense to set
logbias=throughput on everything in it. Seeks are basically free on
the SSDs and you'll probably wind up using less overall bandwidth to
the SSDs, since you're writing less data. Note that I haven't measured
or researched this.