Unix's fsync(), write ahead logs, and durability versus integrity

July 2, 2024

I recently read Phil Eaton's A write-ahead log is not a universal part of durability (via), which is about what it says it's about. In the process it discusses using Unix's fsync() to achieve durability, which woke up a little twitch I have about this general area: the difference between durability and integrity (which I'm sure Phil Eaton is fully aware of; their article was only about the durability side).

The core integrity issue of simple uses of fsync() is that while fsync() forces the filesystem to make things durable on disk, the filesystem doesn't promise not to write anything to disk until you call fsync(). Once you write() something to the filesystem, it may write it to disk without warning at any time, and even during an fsync() the filesystem makes no promises about what order data will be written in. If you start an fsync() and the system crashes part way through, some of your data will be on disk and some won't be, and you have no control over which part is which.

This means that if you overwrite data in place and use fsync(), the only time you are guaranteed that your data has both durability and integrity is in the time after fsync() completes and before you write any more data. Once you start (over)writing data again, that data could be partially written to disk even before you call fsync(), and your integrity could be gone. To retain integrity, you can't overwrite more than a tiny bit of data in place. Instead, you need to write data to a new place, fsync() it, and then overwrite one tiny piece of existing data to activate your new data (and fsync() that write too).

(Filesystems can use similar two-stage approaches to make and then activate changes, such as ZFS's slight variation on this. ZFS does not quite overwrite anything in place, but it does require multiple disk flushes, possibly more than two.)

The simplest version of this condenses things down to one fsync() (or its equivalent) at the cost of having an append-only data structure, which we usually call a log. Logs need their own internal integrity protection, so that they can tell whether or not a segment of the log had all of its data flushed to disk and so is fully valid. Once your single fsync() of a log append finishes, all of the data is on disk and that segment is valid; before the fsync() finishes, it's not necessarily so. Only some of the data might have been written, and it might have been written out of order (so that the last block made it to disk but an earlier block did not).

A write-ahead log normally increases the amount of data written to disk; you write data once to the WAL and once to the main database. However, a WAL may well reduce the number of fsync()s (and thus disk flushes) that you have to do in order to have both durability and integrity. In modern solid state storage systems, synchronous disk flushes can be the slowest operation and (asynchronous) write bandwidth relatively plentiful, so trading off more data written for fewer disk flushes can be a net performance win in practice for plenty of workloads.

(Again, I'm sure Phil Eaton knows all of this; their article was specifically about the durability side of things. I'm using it as a springboard for additional thoughts. I'm not sure I'd realized how a WAL can reduce the number of fsync()s required before now.)

Written on 02 July 2024.
