== The ZFS ZIL's optimizations for data writes

In [[yesterday's entry on the ZIL ZFSTXGsAndZILs]] I mentioned that the ZIL has some clever optimizations for large _write()_s. To understand these (and some related ZFS filesystem properties), let's start with the fundamental problem.

A simple, straightforward filesystem journal includes a full copy of each operation or transaction that it's recording. Many of these full copies will be small (for metadata operations like file renames), but for data writes you need to include the data being written. Now suppose that you are routinely writing a lot of data and then _fsync()_'ing it. The filesystem winds up writing two copies of that large data: one copy recorded in the journal and then a second copy written to the actual live filesystem. This is inefficient and, worse, it costs you both disk seeks (between the location of the journal and the final location of the data) and write bandwidth.

Because ZFS is a copy-on-write filesystem where old data is never overwritten in place, it can optimize this process in a straightforward way. Rather than putting the new data into the journal, it can write the new data directly to its final (new) location in the filesystem and then simply record that new location in the journal. However, this is a tradeoff; in exchange for not writing the data twice, you're forcing the journal commit to wait for a separate (and full) data write, complete with an extra seek between the journal and the final location of the data. For sufficiently small amounts of data this tradeoff is not worth it and you're better off just writing an extra copy of the data to the journal without waiting.

In ZFS, this division point is set by the global tuneable variable ((zfs_immediate_write_sz)). Data writes larger than this size will be pushed directly to their final location and the ZIL will only include a pointer to them.

Actually, that's a lie. The real situation is rather more complicated. First, if the data write is larger than the file's blocksize it is always put into the on-disk ZIL (possibly because otherwise the ZIL would have to record multiple pointers to the data's final locations, since the write will be split across multiple blocks, which could get complicated). Next, you can set filesystems to have '_logbias=throughput_'; such a filesystem writes all data blocks to their final locations (among other effects). Finally, if you have a separate log device (with a normal _logbias_), data writes will always go into the log regardless of their size, even for very large writes.

So in summary, ((zfs_immediate_write_sz)) only makes a difference if you are using _logbias=latency_ and do not have a separate log device, which can basically be summarized as 'if you have a normal pool without any sort of special setup'. If you are using _logbias=throughput_ it is effectively 0; if you have a separate log device it is effectively infinite. (There's a sketch of the overall decision in the second sidebar at the end of this entry.)

~~Update (October 13 2013)~~: It turns out that this description is not quite complete. See [[part 2 ZFSWritesAndZILII]] for an important qualification.

=== Sidebar: slogs and _logbias=throughput_

Note that there is no point in having a separate log device and setting _logbias=throughput_ on all of your filesystems, because the latter makes the filesystems not use your slog. This is implicit in the description of _throughput_'s behavior but may not be clear enough.
_logbias=throughput_ is apparently intended for situations where you want to reserve your slog's bandwidth and low latency for the filesystems where ZIL commit latency is very important; you set everything else to _logbias=throughput_ so that those filesystems don't touch the slog.

If you have an all-SSD pool with no slogs it may make sense to set _logbias=throughput_ on everything in it. Seeks are basically free on SSDs and you'll probably wind up using less overall write bandwidth to them, since you're writing less data. Note that I haven't measured or researched this.
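=== Sidebar: a sketch of the overall write decision

To pull all of the rules above together, here is a rough hand-written sketch in C of the decision I believe ZFS makes for a synchronous data write. This is a paraphrase of my understanding of the logic (which I believe lives in illumos's _zfs_log_write()_), not the real code; the function and parameter names here are made up for illustration, and the 'copied' case has a further wrinkle that [[part 2 ZFSWritesAndZILII]] goes into.

  /*
   * A simplified sketch of the decision described above; this is not
   * the actual ZFS source and the parameter names are made up for
   * illustration. The WR_* names are borrowed from the real ZIL
   * write states.
   */
  #include <stdbool.h>
  #include <stddef.h>

  enum zil_write_state {
      WR_INDIRECT,  /* data goes to its final location; ZIL records a pointer */
      WR_COPIED     /* a copy of the data goes into the ZIL record itself */
  };

  enum zil_write_state
  choose_write_state(size_t write_size, size_t file_blocksize,
                     bool logbias_throughput, bool pool_has_slog,
                     size_t zfs_immediate_write_sz)
  {
      /* logbias=throughput behaves as if zfs_immediate_write_sz were 0 */
      size_t effective_sz = logbias_throughput ? 0 : zfs_immediate_write_sz;

      /* a slog only captures data writes for logbias=latency filesystems */
      bool slog_in_use = pool_has_slog && !logbias_throughput;

      /*
       * The data is written directly to its final location (with the ZIL
       * recording only a pointer) when the write is over the effective
       * size threshold, no slog is in use, and the write fits within a
       * single file block.
       */
      if (write_size > effective_sz && !slog_in_use &&
          write_size <= file_blocksize)
          return WR_INDIRECT;

      /*
       * Otherwise the data is copied into the ZIL record (part 2 has a
       * qualification about exactly when that copy happens).
       */
      return WR_COPIED;
  }

You can see how the summary falls out of the first two lines of the function: _logbias=throughput_ zeroes the size threshold so almost everything goes direct, while a slog (with _logbias=latency_) forces the copied path no matter how large the write is.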