The ZFS ZIL's optimizations for data writes

July 11, 2013

In yesterday's entry on the ZIL I mentioned that the ZIL has some clever optimizations for large write()s. To understand these (and some related ZFS filesystem properties), let's start with the fundamental problem.

A simple, straightforward filesystem journal includes a full copy of each operation or transaction that it's recording. Many of these full copies will be small (for metadata operations like file renames), but for data writes you need to include the data being written. Now suppose that you are routinely writing a lot of data and then fsync()'ing it. The filesystem will wind up writing two copies of that large data, one copy recorded in the journal and then a second copy written to the actual live filesystem. This is inefficient; worse, it costs you both disk seeks (between the location of the journal and the final location of the data) and write bandwidth.
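To put rough numbers on the double write, here is a minimal sketch of the accounting. The function name is made up for illustration, and it deliberately ignores journal record headers, metadata, and seeks; it only counts data bytes under the simple 'copy everything into the journal' model described above.

```python
def simple_journal_cost(write_sizes):
    """Count data bytes written under a naive journal.

    Assumption: every fsync()'d data write is copied in full into the
    journal and then written a second time to its live location.
    Journal headers and metadata overhead are ignored.
    """
    payload = sum(write_sizes)
    journal_bytes = payload   # first copy, recorded in the journal
    live_bytes = payload      # second copy, at the final location
    return {
        "payload": payload,
        "journal": journal_bytes,
        "live": live_bytes,
        "total": journal_bytes + live_bytes,
    }


if __name__ == "__main__":
    # Two fsync()'d writes of 1 MiB and 4 MiB.
    cost = simple_journal_cost([1 << 20, 4 << 20])
    print(cost["payload"], cost["total"])
```

In this model the total is always twice the payload, which is why the cost only really hurts once the data writes get large.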

Because ZFS is a copy-on-write filesystem where old data is never overwritten in place, it can optimize this process in a straightforward way. Rather than putting the new data into the journal, it can directly write the new data to its final (new) location in the filesystem and then simply record that new location in the journal. However, this is now a tradeoff; in exchange for not writing the data twice you're forcing the journal commit to wait for a separate (and full) data write, complete with an extra seek between the journal and the final location of the data. For sufficiently small amounts of data this tradeoff is not worth it and you're better off just writing an extra copy of the data to the journal without waiting. In ZFS, this division point is set by the global tunable variable zfs_immediate_write_sz. Data writes larger than this size will be pushed directly to their final location and the ZIL will only include a pointer to the data.

Actually that's a lie. The real situation is rather more complicated.

First, if the data write is larger than the file's blocksize it is always put into the on-disk ZIL (possibly because otherwise the ZIL would have to record multiple pointers to its final location since it will be split across multiple blocks, which could get complicated). Next, you can set filesystems to have 'logbias=throughput'; such a filesystem writes all data blocks to their final locations (among other effects). Finally, if you have a separate log device (with a normal logbias) data writes will always go into the log regardless of their size, even for very large writes.

So in summary zfs_immediate_write_sz only makes a difference if you are using logbias=latency and do not have a separate log device, which can basically be summarized as 'if you have a normal pool without any sort of special setup'. If you are using logbias=throughput it is effectively 0; if you have a separate log device it is effectively infinite.
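The rules above can be sketched as a small decision function. This is only a model of the description in this entry, not the actual ZFS code: the function name, its parameters, and the return strings are all made up for illustration, the default value of 32768 for zfs_immediate_write_sz is an assumption that you should check on your own system, and the precedence between the rules is my reading of the text.

```python
import math

# Assumed default for the zfs_immediate_write_sz tunable; verify locally.
ASSUMED_IMMEDIATE_WRITE_SZ = 32768


def data_write_destination(size, file_blocksize, logbias="latency",
                           has_slog=False,
                           immediate_write_sz=ASSUMED_IMMEDIATE_WRITE_SZ):
    """Return 'zil-copy' if the data is copied into the ZIL, or
    'pointer' if it is written to its final location and the ZIL
    only records where it went.

    Hypothetical model of the rules described in this entry.
    """
    if logbias not in ("latency", "throughput"):
        raise ValueError("logbias must be 'latency' or 'throughput'")

    # Writes larger than the file's blocksize always go into the ZIL.
    if size > file_blocksize:
        return "zil-copy"

    if logbias == "throughput":
        threshold = 0            # effectively zero: never copy
    elif has_slog:
        threshold = math.inf     # effectively infinite: always copy
    else:
        threshold = immediate_write_sz

    return "pointer" if size > threshold else "zil-copy"


if __name__ == "__main__":
    for args in [(4096, 131072), (65536, 131072)]:
        print(args, data_write_destination(*args))
```

Writing it this way makes the summary visible in the code: the tunable only appears in the final branch, which is the one you reach with logbias=latency and no separate log device.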

Update (October 13 2013): It turns out that this description is not quite complete. See part 2 for an important qualification.

Sidebar: slogs and logbias=throughput

Note that there is no point in having a separate log device and setting logbias=throughput on all of your filesystems, because the latter makes the filesystems not use your slog. This is implicit in the description of throughput's behavior but may not be clear enough. 'Throughput' is apparently intended for situations where you want to preserve your slog bandwidth and latency for filesystems where ZIL commit latency is very important; you set everything else to logbias=throughput so that they don't touch the slog.

If you have an all-SSD pool with no slogs it may make sense to set logbias=throughput on everything in it. Seeks are basically free on SSDs and you'll probably wind up using less overall write bandwidth to the SSDs, since you're writing less data. Note that I haven't measured or researched this.

Written on 11 July 2013.
