ZFS pushes file renamings and other metadata changes to disk quite promptly

May 27, 2018

One of the general open questions on Unix is when changes like renaming or creating files are actually durably on disk. Famously, some filesystems on some Unixes have been willing to delay this for an unpredictable amount of time unless you did things like fsync() the containing directory of your renamed file, not just fsync() the file itself. As it happens, ZFS's design means that it offers some surprisingly strong guarantees about this; specifically, ZFS persists all metadata changes to disk no later than the next transaction group commit. In ZFS today, a transaction group commit generally happens every five seconds, so if you do something like rename a file, your rename will be fully durable quite soon even if you do nothing special.

However, this doesn't mean that if you create a file, write data to the file, and then rename it (with no other special operations) that in five or ten seconds your new file is guaranteed to be present under its new name with all the data you wrote. Although metadata operations like creating and renaming files go to ZFS right away and then become part of the next txg commit, the kernel generally holds on to written file data for a while before pushing it out. You need some sort of fsync() in there to force the kernel to commit your data, not just your file creation and renaming. Because of how the ZFS intent log works, you don't need to do anything more than fsync() your file here; when you fsync() a file, all pending metadata changes are flushed out to disk along with the file data.

(In a 'create new version, write, rename to overwrite current version' setup, I think you want to fsync() the file twice, once after the write and then once after the rename. Otherwise you haven't necessarily forced the rename itself to be written out. You don't want to do the rename before a fsync(), because then I think that a crash at just the wrong time could give you an empty new file. But the ice is thin here in portable code, including code that wants to be portable to different filesystem types.)

My impression is that ZFS is one of the few filesystems with such a regular schedule for committing metadata changes to disk. Others may be much more unpredictable, and possibly may reorder the commits of some metadata operations in the process (although by now, it would be nice if everyone avoided that particular trick). In ZFS, not only do metadata changes commit regularly, but there is a strict time order to them such that they can never cross over each other that way.

Written on 27 May 2018.
« Most modern web spiders are parasites
An incomplete list of reasons why I force-quit iOS apps »

Page tools: View Source, Add Comment.
Search:
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Sun May 27 22:47:51 2018
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.