Revisiting some bits of ZFS's ZIL with separate log devices
Back in this entry I described how the ZFS ZIL may or may not put large writes into the ZIL itself depending on various factors (how big they are, whether you're using a separate ZIL device, and so on). It turns out that I missed one potentially important factor, in fact one that affects more than large writes.
If you're using a separate log device, ZFS will normally put all write
data into the ZIL (on the presumption that flushing data to the SLOG
is faster than flushing it to the regular pool) and will then put the
ZIL on your separate log device (unless you've turned this off with the
logbias property). However, this only applies if the log is not 'too big'.
What's 'too big'? That's the tunable zil_slog_limit,
in bytes, but how it gets used is a little bit obscure. First, let's
backtrack to the overall ZIL structure. Each on disk
ZIL is made up from some number of ZIL commits; these commits clean out
over time as transaction groups push things into stable storage on the
pool. This gives us two sizes: the size of the current ZIL commit that's
being prepared and the total size of the (active) on disk ZIL at the
moment. What zil_slog_limit does is turn off use of the SLOG for large ZIL
commits or large total ZIL log sizes. If the current ZIL commit is over
zil_slog_limit or the current total ZIL log size is over twice
zil_slog_limit, the ZIL commit is not written to your SLOG device
but instead is written into the main pool. The default value of this
tunable appears to be only one megabyte, which really startles me.
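
To make the rule concrete, here is a rough Python sketch of the decision as described above. It is a toy model of my description, not the actual ZFS code: the function and parameter names are made up for illustration, and it assumes a pool that has a SLOG and has not had it turned off with logbias.

```python
# Toy model of the SLOG decision as described in this entry.
# This is NOT the ZFS source; all names here are hypothetical.

ZIL_SLOG_LIMIT = 1024 * 1024  # the apparent default: one megabyte, in bytes


def commit_goes_to_slog(commit_size, total_zil_size, limit=ZIL_SLOG_LIMIT):
    """Return True if this ZIL commit would be written to the SLOG.

    commit_size:    bytes in the ZIL commit currently being prepared
    total_zil_size: bytes in the whole active on-disk ZIL
    """
    if commit_size > limit:
        # A single large commit goes to the main pool instead.
        return False
    if total_zil_size > 2 * limit:
        # So does any commit made while the total ZIL is over twice the limit.
        return False
    return True


KB = 1024
MB = 1024 * KB

print(commit_goes_to_slog(64 * KB, 512 * KB))  # small commit, small ZIL
print(commit_goes_to_slog(2 * MB, 2 * MB))     # commit over the limit
print(commit_goes_to_slog(64 * KB, 3 * MB))    # total ZIL over twice the limit
```

With the one megabyte default, only the first case stays on the SLOG. The other two are sent to the main pool even though neither is large by modern standards.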
But wait, things get more fun. In ZFSWritesAndZIL I described how large writes are put directly into the ZIL if you have a separate log device, on the presumption that your SLOG is much faster than your actual disks. That decision is independent from the decision of whether your ZIL commit will be written to the SLOG or to your real disks (really, the code only checks 'does this have a SLOG?'). It appears to be quite possible to have a SLOG, have relatively large writes be put into a ZIL commit, and then have this ZIL commit written (relatively slowly) to your real disks instead of to your SLOG. You probably don't want this.
In a world where SLOG SSDs were tiny and precious, this may have made
some sense. In a world where 60 GB SSDs are common as grass it's my
opinion that this no longer really does in most environments. Most ZFS
environments with SLOG SSDs will never come close to filling the SSD
with active ZIL log entries because almost no one writes and syncs
that much data that fast (you can and should measure this for yourself,
of course, but this is the typical result). Raising zil_slog_limit
substantially seems like a good idea to me (we'll probably tune it up to
at least a gigabyte).
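
As a back-of-the-envelope check on the 'never come close to filling the SSD' claim, here is a small Python sketch. The numbers in it are assumptions picked for illustration, not measurements: a sustained 500 MB/s of synchronous writes and ZIL entries staying active for about 10 seconds before a transaction group makes them unnecessary. Substitute figures measured on your own systems.

```python
# Rough estimate of how much active ZIL data could exist at once.
# The rate and lifetime below are hypothetical; measure your own.

MB = 1000 * 1000
GB = 1000 * MB


def active_zil_bytes(sync_write_rate, seconds_until_flushed):
    """Bytes of ZIL data active at once, for a steady sync write rate
    (bytes per second) and the time entries stay active (seconds)."""
    return sync_write_rate * seconds_until_flushed


rate = 500 * MB   # assumed sustained synchronous write rate
lifetime = 10     # assumed seconds before a txg commit retires the entries
slog_size = 60 * GB

active = active_zil_bytes(rate, lifetime)
print(f"{active / GB:.1f} GB active, "
      f"{100 * active / slog_size:.1f}% of a 60 GB SSD")
```

Even with these fairly aggressive assumptions, the active ZIL comes to about 5 GB, a small fraction of a 60 GB SSD and thousands of times the one megabyte default limit.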
(See here for a nice overview of what gets written where and when and also some discussions about what may be faster under various circumstances.)