An interesting yet ordinary consequence of ZFS using the ZIL

March 28, 2023

On the Fediverse, Alan Coopersmith recently shared this:

@bsmaalders @cks writing a temp file and renaming it also avoids the failure-to-truncate issues found in screenshot cropping tools recently (#aCropalypse), but as some folks at work recently discovered, you need to be sure to fsync() before the rename, or a failure at the wrong moment can leave you with a zero-length file instead of the old one as the directory metadata can get written before the file contents data on ZFS.

On the one hand, this is perfectly ordinary behavior for a modern filesystem; renames are often synchronous and durable, but if you create a file, write it, and then rename it to something else, you haven't ensured that the data you wrote is on disk, just that the rename is. On the other hand, as someone who's somewhat immersed in ZFS, this initially felt surprising to me, because ZFS is one of the rare filesystems that enforces a strict temporal order on all IO operations in its core IO model of ZFS transaction groups.
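As a concrete illustration of the pattern Alan Coopersmith describes, here is a minimal Python sketch of 'write a temporary file, fsync() it, then rename it into place'. The function name replace_file_durably and the '.tmp-' prefix are made up for this example, and the sketch assumes a POSIX-like system where os.replace() is an atomic rename within a single filesystem. It is not a complete treatment of durability (for instance, it doesn't fsync() the containing directory).

```python
import os
import tempfile


def replace_file_durably(path: str, data: bytes) -> None:
    """Replace path with data so a crash leaves the old or the new contents."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so that the rename
    # stays within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # push Python's own buffer to the kernel
            os.fsync(tmp.fileno())  # ask for the data to reach stable storage
        # Only rename once the new contents have been fsync()'d.
        os.replace(tmp_path, path)
    except BaseException:
        # Best-effort cleanup of the temporary file if anything failed.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

The important part is the ordering: the os.fsync() call comes before os.replace(), so the new name should never point at data that hasn't been flushed.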

How this works is that everything that happens in a ZFS filesystem goes into a transaction group (TXG). At any given time there's only one open TXG and TXGs commit in order, so if B is issued after A, either it's in the same TXG as A and the two happen together, or it's in a TXG after A and so A has already happened. In transaction groups, you can never have B happen but A not happen. In the TXG mental model of ZFS IO, this data loss is impossible, since the rename happened after the data write.
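The ordering guarantee can be sketched with a toy model. This is purely illustrative Python and has nothing to do with how ZFS is actually implemented; ToyTxgPool and crash_outcomes are hypothetical names invented here. Operations go into a single open TXG, a commit moves them to 'disk' in order, and a crash throws away whatever is still in the open TXG.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTxgPool:
    committed: list = field(default_factory=list)  # operations that reached disk
    open_txg: list = field(default_factory=list)   # operations in the one open TXG

    def issue(self, op: str) -> None:
        self.open_txg.append(op)

    def commit_txg(self) -> None:
        # Everything in the open TXG becomes durable together, in order.
        self.committed.extend(self.open_txg)
        self.open_txg = []

    def crash(self) -> list:
        # Whatever was still in the open TXG is simply lost.
        return list(self.committed)


def crash_outcomes(ops: list) -> list:
    """Try every possible TXG commit point and collect what survives a crash."""
    outcomes = []
    for commit_point in range(len(ops) + 1):
        pool = ToyTxgPool()
        for i, op in enumerate(ops, start=1):
            pool.issue(op)
            if i == commit_point:
                pool.commit_txg()
        outcomes.append(pool.crash())
    return outcomes


ops = ["create tmp", "write tmp", "rename tmp -> final"]
for on_disk in crash_outcomes(ops):
    # What survives is always a prefix of what was issued, so the rename
    # can never survive without the write that came before it.
    assert on_disk == ops[:len(on_disk)]
```

In this model there is no outcome where the rename is on disk and the write is not, which is why the zero-length file looked impossible from a pure TXG point of view.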

However, all of this strict TXG ordering goes out the window once you introduce the ZFS Intent Log (ZIL), because the ZIL's entire purpose is to persist selected operations to disk before they're committed as part of a transaction group. Renames and file creations always go in the ZIL (along with various other metadata operations), but file data only goes in the ZIL if you fsync() it (this is a slight simplification, and file data isn't necessarily directly in the ZIL).
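To see how this produces a zero-length file, here is a toy Python model of the simplified description above. It is an illustration only, not ZFS's real replay logic; state_after_crash and the file names are hypothetical. It assumes the crash happens before the open TXG commits, so the only things that survive are the operations that were recorded in the ZIL.

```python
def state_after_crash(fsync_before_rename: bool) -> dict:
    """Return {filename: contents} after a crash that hits before the TXG commits."""
    # State as of the last committed TXG: only the old file exists.
    files = {"config": b"old contents"}

    # Operations issued in the still-open TXG, in program order.
    # Each entry is (operation, arguments, recorded_in_zil). In this toy
    # model, creates and renames are always recorded; file data is only
    # recorded if the program fsync()'d it.
    issued = [
        ("create", ("config.tmp",), True),
        ("write", ("config.tmp", b"new contents"), fsync_before_rename),
        ("rename", ("config.tmp", "config"), True),
    ]

    # The crash discards the open TXG; only the ZIL records are replayed.
    for op, args, in_zil in issued:
        if not in_zil:
            continue
        if op == "create":
            files[args[0]] = b""
        elif op == "write":
            files[args[0]] = args[1]
        elif op == "rename":
            files[args[1]] = files.pop(args[0])
    return files


print(state_after_crash(fsync_before_rename=False))  # {'config': b''}
print(state_after_crash(fsync_before_rename=True))   # {'config': b'new contents'}
```

Without the fsync(), the replayed create and rename leave an empty file under the final name; with it, the data is replayed too and the new contents survive.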

So once the ZIL was in my mental model I could understand what had happened. In effect the presence of the ZIL had changed ZFS from a filesystem with very strong data ordering properties to one with more ordinary ones, and in such a more ordinary filesystem you do need to fsync() your newly written file data to make it durable.

(And under normal circumstances ZFS always has the ZIL, so I was engaging in a bit of skewed system programmer thinking.)


Last modified: Tue Mar 28 22:48:43 2023