Revisiting ZFS's ZIL, separate log devices, and writes

June 14, 2025

Many years ago I wrote a couple of entries about ZFS's ZIL optimizations for writes and then an update for separate log devices. In completely unsurprising news, OpenZFS's behavior has changed since then and gotten simpler. The basic background for this entry is the flow of activity in the ZIL (ZFS Intent Log).

When you write data to a ZFS filesystem, your write will be classified as 'indirect', 'copied', or 'needcopy'. A 'copied' write is immediately put into the in-memory ZIL, even before the ZIL is flushed to disk. A 'needcopy' write is only put into the in-memory ZIL if a (filesystem) sync() or fsync() happens, and is then written to disk as part of the ZIL flush. An 'indirect' write is always written to its final place in the filesystem even if the ZIL is flushed to disk; the ZIL just contains a pointer to the regular location (although at that point the ZIL flush depends on those regular writes). ZFS keeps metrics on how much you have of each of these, and they're potentially relevant in various situations.
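As a rough illustration of looking at those metrics, here is a small Python sketch. It assumes Linux OpenZFS, where (as far as I know) the global ZIL kstats live in /proc/spl/kstat/zfs/zil and use names like zil_itx_copied_count and zil_itx_copied_bytes. The function names are my own invention and the sample numbers are made up purely to show the format.

```python
# Assumed location of the global ZIL kstats on Linux OpenZFS.
ZIL_KSTAT_PATH = "/proc/spl/kstat/zfs/zil"

# Made-up sample in the usual kstat layout: a header line, a column
# title line, then 'name type data' rows. The numbers are not real.
SAMPLE = """\
22 1 0x01 23 6256 8764531840 1234567890123
name                            type data
zil_commit_count                4    1000
zil_itx_count                   4    500
zil_itx_indirect_count          4    20
zil_itx_indirect_bytes          4    2621440
zil_itx_copied_count            4    300
zil_itx_copied_bytes            4    1228800
zil_itx_needcopy_count          4    180
zil_itx_needcopy_bytes          4    737280
"""


def parse_zil_itx_stats(text):
    """Return {name: value} for the zil_itx_* rows in kstat text."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have exactly three fields; skip headers and others.
        if len(fields) != 3 or not fields[0].startswith("zil_itx_"):
            continue
        name, _type, value = fields
        if value.isdigit():
            stats[name] = int(value)
    return stats


def summarize(stats):
    """Map each write class to a (count, bytes) pair, 0 if missing."""
    return {
        kind: (
            stats.get(f"zil_itx_{kind}_count", 0),
            stats.get(f"zil_itx_{kind}_bytes", 0),
        )
        for kind in ("indirect", "copied", "needcopy")
    }


def read_zil_summary(path=ZIL_KSTAT_PATH):
    """Read and summarize the live kstats (needs a system with ZFS)."""
    with open(path) as f:
        return summarize(parse_zil_itx_stats(f.read()))
```

On a machine with ZFS loaded, calling read_zil_summary() should give you the running totals since the module was loaded; to see rates you would have to sample it twice and take the difference.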

As of the current development version of OpenZFS (and I believe for some time in released versions), how writes are classified is like this, in order:

  1. If you have 'logbias=throughput' set or the write is an O_DIRECT write, it is an indirect write.
  2. If you don't have a separate log device and the write is equal to or larger than zfs_immediate_write_sz (32 KBytes by default), it is an indirect write.
  3. If this is a synchronous write, it is a 'copied' write, including if your filesystem has 'sync=always' set.
  4. Otherwise it's a 'needcopy' write.
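To make the ordering concrete, here is a minimal Python sketch of those four rules. This is my own model of the decision, not OpenZFS's actual C code; the Write class, classify_write(), and its parameters are all hypothetical names, and 'synchronous' is simply assumed to already be true for writes on a 'sync=always' filesystem.

```python
from dataclasses import dataclass

# Default zfs_immediate_write_sz as described above (32 KBytes).
ZFS_IMMEDIATE_WRITE_SZ = 32 * 1024


@dataclass
class Write:
    """Hypothetical description of a single write, for illustration."""
    size: int                  # bytes being written
    synchronous: bool = False  # sync write, or filesystem has sync=always
    o_direct: bool = False     # write was done with O_DIRECT


def classify_write(w, *, has_slog, logbias_throughput=False,
                   immediate_write_sz=ZFS_IMMEDIATE_WRITE_SZ):
    """Apply the four rules in order and return the write's class."""
    # Rule 1: logbias=throughput or O_DIRECT always means indirect.
    if logbias_throughput or w.o_direct:
        return "indirect"
    # Rule 2: with no separate log device, large writes are indirect.
    if not has_slog and w.size >= immediate_write_sz:
        return "indirect"
    # Rule 3: synchronous writes are copied into the in-memory ZIL.
    if w.synchronous:
        return "copied"
    # Rule 4: everything else is only copied if a flush needs it.
    return "needcopy"
```

In this model, classify_write(Write(size=1024 * 1024), has_slog=False) comes out as 'indirect', while the same write with has_slog=True comes out as 'needcopy', because with a separate log device the size check in rule 2 never applies.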

If your system is doing normal IO (well, normal writes) and you don't have a separate log device, large writes are indirect writes and small writes are 'needcopy' writes. This keeps both of them out of the in-memory ZIL. However, on our systems I see a certain volume of 'copied' writes, suggesting that some programs or ZFS operations force synchronous writes. This seems to be especially common on our ZFS-based NFS fileservers, but it happens to some degree even on the ZFS fileserver that mostly does local IO.

The corollary to this is that if you do have a separate log device and you don't do O_DIRECT writes (and don't set logbias=throughput), all of your writes will go to your log device during ZIL flushes, because they'll fall through the first two cases and into case three or four. If you have a sufficiently high write volume combined with ZIL flushes, this may increase the size of a separate log device that you want and also make you want one that has a high write bandwidth (and can commit things to durable storage rapidly).

(We don't use any separate log devices for various reasons and I don't have well informed views of when you should use them and what sort of device you should use.)

Once upon a time (when I wrote my old entry), there was a zil_slog_limit tunable that pushed some writes back to being indirect writes even if you had a separate log device, under somewhat complex circumstances. That was apparently removed in 2017 and was partly not working even before then.

Written on 14 June 2025.
Last modified: Sat Jun 14 22:27:51 2025