Revisiting some bits of ZFS's ZIL with separate log devices

October 13, 2013

Back in ZFSWritesAndZIL I described how ZFS may or may not put large writes into the ZIL itself depending on various factors (how big they are, whether you're using a separate log device, and so on). It turns out that I missed one potentially important factor, one that in fact affects more than just large writes.

If you're using a separate log device (a SLOG), ZFS will normally put all write data into the ZIL (on the presumption that flushing data to the SLOG is faster than flushing it to the regular pool) and will then put the ZIL on your separate log device (unless you've turned this off with the logbias property). However, this only applies if the log is not 'too big'.

What's 'too big'? That's the tunable zil_slog_limit, expressed in bytes, but how it gets used is a little bit obscure. First, let's backtrack to the overall ZIL structure. Each on-disk ZIL is made up of some number of ZIL commits; these commits get cleaned out over time as transaction groups push things into stable storage on the pool. This gives us two sizes: the size of the current ZIL commit that's being prepared and the total size of the (active) on-disk ZIL at the moment.

What zil_slog_limit does is turn off use of the SLOG for large ZIL commits or large total ZIL log sizes. If the current ZIL commit is over zil_slog_limit or the current total ZIL log size is over twice zil_slog_limit, the ZIL commit is not written to your SLOG device but instead is written into the main pool. The default value of this tunable appears to be only one megabyte, which really startles me.
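Here's a minimal sketch of that rule in Python. It models the rule as this entry describes it, not the actual ZFS source, and the function and variable names are made up for illustration; only zil_slog_limit and its apparent one megabyte default come from above.

```python
# A model of the rule as described in this entry, not the real ZFS code.
# All names here are hypothetical except the zil_slog_limit tunable itself.

KIB = 1024
MIB = 1024 * KIB

# The default this entry reports for zil_slog_limit: one megabyte, in bytes.
ZIL_SLOG_LIMIT = 1 * MIB


def commit_goes_to_slog(commit_bytes, total_zil_bytes, slog_limit=ZIL_SLOG_LIMIT):
    """Return True if this ZIL commit would be written to the SLOG.

    commit_bytes:    size of the ZIL commit currently being prepared
    total_zil_bytes: total size of the active on-disk ZIL
    """
    commit_too_big = commit_bytes > slog_limit
    log_too_big = total_zil_bytes > 2 * slog_limit
    # Either condition sends the commit to the main pool instead.
    return not (commit_too_big or log_too_big)


if __name__ == "__main__":
    cases = [
        (64 * KIB, 512 * KIB),  # small commit, small log
        (2 * MIB, 2 * MIB),     # the commit alone is over the limit
        (64 * KIB, 3 * MIB),    # small commit, but the log is over twice the limit
    ]
    for commit, total in cases:
        where = "SLOG" if commit_goes_to_slog(commit, total) else "main pool"
        print(f"commit={commit} total={total} -> {where}")
```

With the default limit, only the first case lands on the SLOG. Passing a larger slog_limit (say a gigabyte) makes all three go to the SLOG.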

But wait, things get more fun. In ZFSWritesAndZIL I described how large writes are put directly into the ZIL if you have a separate log device, on the presumption that your SLOG is much faster than your actual disks. That decision is independent from the decision of whether your ZIL commit will be written to the SLOG or to your real disks (really, the code only checks 'does this have a SLOG?'). It appears to be quite possible to have a SLOG, have relatively large writes be put into a ZIL commit, and then have this ZIL commit written (relatively slowly) to your real disks instead of to your SLOG. You probably don't want this.

In a world where SLOG SSDs were tiny and precious, this may have made some sense. In a world where 60 GB SSDs are as common as grass, my opinion is that it no longer does in most environments. Most ZFS environments with SLOG SSDs will never come close to filling the SSD with active ZIL log entries, because almost no one writes and fsync()s that much data that fast (you can and should measure this for yourself, of course, but this is the typical result). Raising zil_slog_limit substantially seems like a good idea to me (we'll probably tune it up to at least a gigabyte).
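To put rough numbers on this, here's a back-of-the-envelope sketch in Python. It assumes that ZIL entries stay active only until a transaction group commits them to the pool, so the active ZIL is about the synchronous write rate times that delay. The write rate and delay below are invented for illustration, not measurements, and the function name is hypothetical; substitute your own measured numbers.

```python
# Back-of-the-envelope estimate; every workload number here is an assumption.

MB = 10**6
GB = 10**9


def peak_active_zil_bytes(sync_write_bytes_per_sec, seconds_until_txg_commit):
    """Estimate the active ZIL size: sync'd data not yet pushed out by a txg."""
    return sync_write_bytes_per_sec * seconds_until_txg_commit


# Hypothetical, fairly heavy workload: 100 MB/s of fsync()'d writes, with
# entries staying live in the ZIL for 10 seconds before a txg commits them.
estimate = peak_active_zil_bytes(100 * MB, 10)

slog_ssd_bytes = 60 * GB      # the 60 GB SSD mentioned above
default_limit = 1024 * 1024   # the apparent default zil_slog_limit

print(f"estimated active ZIL:    {estimate / GB:.2f} GB")
print(f"fraction of 60 GB SSD:   {estimate / slog_ssd_bytes:.1%}")
print(f"multiple of 1 MB limit:  {estimate / default_limit:.0f}x")
```

Under these made-up numbers the active ZIL is a small fraction of the SSD but hundreds of times the default limit, which is the gap that raising zil_slog_limit is meant to close.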

(See here for a nice overview of what gets written where and when and also some discussions about what may be faster under various circumstances.)


Last modified: Sun Oct 13 00:35:26 2013