2024-05-28
ZFS's transactional guarantees from a user perspective
I said recently on the Fediverse that ZFS's transactional guarantees were rather complicated both with and without fsync(). I've written about these before in terms of transaction groups and the ZFS Intent Log (ZIL), but that obscured the user-visible behavior under the technical details. So here's an attempt at describing just the visible behavior, hopefully in a way that people can follow despite how complicated it gets.
ZFS has two levels of transactional behavior. The basic layer is what happens when you don't use fsync() (or the filesystem is ignoring it). At this level, all changes to a ZFS filesystem are strongly ordered by the time they happened. ZFS may lose some activity at the end, but if you did operation A before operation B and there is a crash, the possible outcomes afterward are nothing, A, or A and B; you can never have B without A. This strictly time-ordered view of filesystem changes is periodically flushed to disk by ZFS; in modern ZFS, such a flush is typically started every five seconds (although completing a flush can take some time). This is generally called a transaction group (txg) commit.
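One way to picture the basic layer is that what survives a crash is always some prefix of the time-ordered list of operations. Here is a toy model of that in Python; it only enumerates the outcomes and doesn't touch a real filesystem, and the function name possible_crash_states is made up for this illustration.

```python
def possible_crash_states(ops):
    """Return every outcome a crash can leave behind when no fsync() is used.

    'ops' is a list of operations in the order they happened. Because changes
    are strictly time ordered, only prefixes of that list can survive.
    """
    return [tuple(ops[:n]) for n in range(len(ops) + 1)]


states = possible_crash_states(["A", "B"])
print(states)  # [(), ('A',), ('A', 'B')]

# B without A is never one of the outcomes.
assert ("B",) not in states
```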
The second layer of transactional behavior comes in if you fsync() something. When you fsync() something (and fsync is enabled on the filesystem, which is the default), all uncommitted metadata changes are immediately flushed to disk along with whatever uncommitted file data changes you requested a fsync() for (if you fsync'd a file instead of a directory). If several processes request fsync()s at once, all of their requests will be merged together, so a single immediate flush may include data for multiple files. Uncommitted file changes that no one requested a fsync() for will not be immediately flushed and will instead wait for the next regular non-fsync() flush (the next txg commit).
(This is relatively normal behavior for fsync(), except that on most filesystems a fsync() doesn't immediately flush all metadata changes. Metadata changes include things like creating, renaming, or removing files.)
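For concreteness, here is a minimal sketch of what "fsync'ing a file" and "fsync'ing a directory" look like from a program, using Python's standard os.fsync(). Nothing here is ZFS specific, and the helper names write_and_fsync and fsync_directory are hypothetical ones for this example. It assumes a Unix system, where a directory can be opened read-only and fsync()'d.

```python
import os
import tempfile


def write_and_fsync(path, data):
    """Write 'data' to 'path' and request an immediate flush of that file."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # move Python's userspace buffer into the kernel
        os.fsync(f.fileno())  # ask the filesystem to put this file on disk now


def fsync_directory(dirpath):
    """fsync() a directory instead of a file (Unix only)."""
    fd = os.open(dirpath, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


# Small demonstration in a scratch directory.
with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "example.dat")
    write_and_fsync(target, b"hello")
    fsync_directory(d)
    with open(target, "rb") as f:
        readback = f.read()
```

Any file that is written but never passed to os.fsync() gets no immediate flush from this; per the description above, its data waits for the next txg commit.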
A fsync() can break the strict time order of ZFS changes that exists in the basic layer. If you write data to A, write data to B, fsync() B but not A, and ZFS crashes immediately, the data for B will still be there but the change to A may have been lost. In some situations this can result in zero length files even though they were intended to have data. However, if enough time goes by, everything from before the fsync() will have been flushed out as part of the non-fsync() flush process.
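The A and B scenario can be sketched as a toy simulation. This is a deliberately simplified model and not how ZFS is implemented: it tracks only file data (not metadata), it assumes the worst case where anything not yet flushed is lost in the crash, and the function name and event format are invented for the example.

```python
def surviving_after_crash(events):
    """Return the set of files whose data survives a crash after the last event.

    Each event is ("write", name), ("fsync", name), or ("txg_commit",).
    Worst-case model: data that is neither fsync()'d nor covered by a txg
    commit is treated as lost.
    """
    durable = set()
    pending = []  # written but not yet on disk, in time order
    for event in events:
        kind = event[0]
        if kind == "write":
            pending.append(event[1])
        elif kind == "fsync":
            name = event[1]
            if name in pending:
                durable.add(name)  # only the fsync()'d file's data is flushed
                pending = [p for p in pending if p != name]
        elif kind == "txg_commit":
            durable.update(pending)  # the periodic flush takes everything
            pending = []
    return durable


ops = [("write", "A"), ("write", "B"), ("fsync", "B")]
crash_now = surviving_after_crash(ops)
crash_later = surviving_after_crash(ops + [("txg_commit",)])
print(sorted(crash_now))    # ['B']
print(sorted(crash_later))  # ['A', 'B']
```

The first result is B without A, which the basic layer alone would never produce; the second shows that once a regular flush has happened, the earlier write to A is there as well.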
As a technical detail, ZFS makes it so that all of the changes that are part of a particular periodic flush are tied to each other (if there have been no fsyncs to meddle with the ordering); either all of them will appear after a crash or none of them will. This can be used to create atomic groups of changes that will always appear together (or be lost together), by making sure that all changes are part of the same periodic flush (in ZFS jargon, they are part of the same transaction group (txg)). However, ZFS doesn't give programs any explicit way to do this, and this atomic grouping can be messed up if someone fsync()s at an inconvenient time.