2022-09-20
Why the ZFS ZIL's "in-place" direct writes of large data are safe
I recently read ZFS sync/async + ZIL/SLOG, explained (via), which reminded me that there's a clever but unsafe-seeming thing that ZFS does here that's actually safe because of how ZFS works. Today, I'm going to talk about why ZFS's "in-place" direct writes to main storage for large synchronous writes are safe, despite that perhaps sounding dangerous.
ZFS periodically flushes writes to disk as part of a ZFS transaction group; these days a transaction group commit happens every five seconds by default. However, sometimes programs want data to be sent to disk sooner than that (for example, your editor saving a file; it will sync the file in one way or another at the end, so that you don't lose it if there's a system crash immediately afterward). To do this, ZFS has the ZFS Intent Log (ZIL), which is (to simplify a bit) a log of all of the write operations since the last transaction group commit that ZFS has promised programs are durably on disk. If the system crashes before the writes can be sent to disk normally as part of a transaction group, ZFS can replay the ZIL to recreate them.
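To make that flow concrete, here's a toy model of the idea (in Python, with entirely invented names and structures; this is nothing like the actual ZFS code): an intent log that records synchronous writes made between transaction group commits and gets replayed after a crash.

    # Toy model of an intent log; all names and structures here are
    # invented for illustration and are not ZFS's real ones.
    class ToyFilesystem:
        def __init__(self):
            self.on_disk = {}      # state as of the last transaction group commit
            self.pending = {}      # writes not yet in a committed transaction group
            self.intent_log = []   # durable log of synchronous writes (the 'ZIL')

        def write(self, filename, data, sync=False):
            self.pending[filename] = data
            if sync:
                # A synchronous write gets a log record (flushed to disk)
                # before we tell the program that it is durable.
                self.intent_log.append((filename, data))

        def txg_commit(self):
            # Every few seconds all pending writes become part of the
            # on-disk state, and the intent log can be discarded.
            self.on_disk.update(self.pending)
            self.pending = {}
            self.intent_log = []

        def crash_and_recover(self):
            # A crash loses everything that wasn't committed; replaying
            # the intent log recreates the synchronous writes we promised.
            self.pending = {}
            for filename, data in self.intent_log:
                self.on_disk[filename] = data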
Taken by itself, this means that ZFS does synchronous writes twice, once to the ZIL as part of making them durable and then a second time as part of a regular transaction group. As an optimization, under the right circumstances (which are complicated, especially with a separate log device) ZFS will send those synchronous writes directly to their final destination in your ZFS pool, instead of to the ZIL, and then simply record a pointer to the destination in the ZIL. This sounds dangerous, since you're writing data directly into the filesystem (well, the pool) instead of into a separate log, and in a different filesystem it might be. What makes it safe is that in ZFS, all writes go to unused (free) disk space, because ZFS is what we generally call a copy-on-write system. Even if you're rewriting bits of an existing file, ZFS writes the new data to free space, not over the existing file contents (and it does this whether or not you're doing a synchronous write).
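One way to picture the optimization is the following sketch, again with invented names and an invented size threshold (the real decision in ZFS depends on tunables and on whether you have a separate log device). Small synchronous writes have their data copied into the log record itself; large ones are written directly to freshly allocated free space, and the log records only a pointer to that space.

    # Hypothetical sketch of 'copy the data into the log' versus 'write
    # it in place and log a pointer'; the threshold, names, and
    # structures are invented and are not ZFS's actual tunables or code.
    LARGE_WRITE = 32 * 1024

    class ToyPool:
        def __init__(self, nblocks):
            self.blocks = [None] * nblocks    # the pool's disk blocks
            self.free = set(range(nblocks))   # blocks not holding live data

        def allocate(self):
            # Copy-on-write: new data always goes into currently free
            # space, never on top of live data.
            return self.free.pop()

        def write_block(self, block, data):
            self.blocks[block] = data

    def log_sync_write(pool, zil, data):
        if len(data) < LARGE_WRITE:
            # Small write: the data itself goes into the ZIL record.
            zil.append(('copied', data))
        else:
            # Large write: write the data directly to freshly allocated
            # (previously free) space and record only a pointer in the ZIL.
            block = pool.allocate()
            pool.write_block(block, data)
            zil.append(('indirect', block))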
(ZFS does have to update some metadata in place, but it's a small amount of metadata and it's carefully ordered to make transaction group commits atomic. When doing these direct writes, ZFS also flushes your data to disk before it writes the ZIL that points to your data.)
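The ordering is the important part here. As a standalone illustration of 'flush the data before the log record that points to it', here is the same idea expressed with ordinary files and fsync(), nothing ZFS specific:

    # Standalone sketch of the ordering: the data is flushed to disk
    # before the log record that points at it. Not ZFS code.
    import json
    import os

    def durable_indirect_write(data_path, log_path, data):
        # 1: write the data to its (new) location and flush it to disk.
        with open(data_path, 'wb') as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        # 2: only then append a log record that points at the data.
        with open(log_path, 'a') as log:
            log.write(json.dumps({'indirect': data_path, 'length': len(data)}) + '\n')
            log.flush()
            os.fsync(log.fileno())
        # 3: now it's safe to tell the caller the write is durable.

If the log record could reach disk before the data it points at, a crash in between would leave a log claiming that data exists when it doesn't; the ordering is what rules that out.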
Obviously, ZFS makes no assumptions about the contents of free disk space. This means that if your system crashes after ZFS has written your synchronous data into its final destination in what was free space until ZFS used it just now, but before it writes out a ZIL entry for it (and tells your editor or database that the data is safely on disk), no harm is done. No live data has been overwritten, and the change to what's in free space is unimportant (well, unimportant to ZFS; you may care a lot about the contents of the file that you were just a little bit late to save as power died).
Similarly, if your system crashes after the ZIL is written out but before the regular transaction group commits, the space your new data is written to is still marked as free at the regular ZFS level, but the ZIL knows better. When the ZIL is replayed to apply all of the changes it records, your new data will be correctly connected to the overall ZFS pool (meta)data structures, the space will be marked as used, and so on.
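Both crash cases can be sketched as a toy replay of 'indirect' log records (once more with made up names and data structures, not ZFS's real ones): data that landed in free space without a log record is simply ignored, while a logged pointer gets wired into the filesystem and its space marked as used.

    # Toy sketch of replaying 'indirect' ZIL records after a crash.
    def replay_zil(files, space_map, zil):
        # 'files' maps names to block numbers and 'space_map' holds the
        # allocated blocks, both as of the last transaction group commit;
        # the new data is already sitting in blocks that this space map
        # still considers free.
        for kind, name, block in zil:
            if kind == 'indirect':
                files[name] = block     # connect the data to the file
                space_map.add(block)    # and mark its space as used

    # Crash case 1: data was written into free space but no ZIL record
    # made it out; nothing references the data, so nothing is undone.
    files, space_map, zil = {}, set(), []
    replay_zil(files, space_map, zil)
    assert files == {} and space_map == set()

    # Crash case 2: the ZIL record was written; replay connects the
    # already written block to the file and marks the space as used.
    zil = [('indirect', 'saved-file', 42)]
    replay_zil(files, space_map, zil)
    assert files == {'saved-file': 42} and space_map == {42}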
(I've mentioned this area in the past when I wrote about the ZIL's optimizations for data writes, but at the time I explained its safety more concisely and somewhat in passing. And the ZFS settings and specific behavior I mentioned in that entry may now be out of date, since it's from almost a decade ago.)