Always sync your log or journal files when you open them
Today I learned of a new way to accidentally lose data 'written' to disk, courtesy of this Fediverse post summarizing a longer article about CouchDB and this issue. Because the issue was so nifty and startling when I encountered it, yet so simple, I'm going to re-explain it in my own words and explain how it leads to the title of this entry.
Suppose that you have a program that makes data it writes to disk durable through some form of journal, write ahead log (WAL), or the like. As we all know, data that you simply write() to the operating system isn't yet on disk; the operating system is likely buffering the data in memory before writing it out at the OS's own convenience. To make the data durable, you must explicitly flush it to disk (well, ask the OS to), for example with fsync(). Your program is a good program, so of course it does this; when it updates the WAL, it write()s then fsync()s.
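As a rough illustration of that write-then-fsync pattern, here is a minimal Python sketch. The function name append_record and the idea of passing in an already-open file descriptor are my own assumptions for the example, not anything taken from CouchDB or the articles mentioned here.

```python
import os


def append_record(fd: int, record: bytes) -> None:
    """Append one record to the log and ask the OS to flush it to disk.

    append_record is a hypothetical helper name used only for illustration.
    """
    view = memoryview(record)
    while len(view) > 0:
        # os.write() may write fewer bytes than requested, so keep going
        # until the whole record has been handed to the OS.
        written = os.write(fd, view)
        view = view[written:]
    # Until fsync() returns, the record may exist only in the OS's cache.
    os.fsync(fd)
```

The window this entry is about sits between the last os.write() returning and os.fsync() returning; if the program is killed there, the record is visible to later readers but not necessarily on disk.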
Now suppose that your program is terminated after the write but before the fsync. At this point you have a theoretically incomplete and improperly written journal or WAL, since it hasn't been fsync'd. However, when your program restarts and goes through its crash recovery process, it has no way to discover this. Since the data was written (into the OS's disk cache), the OS will happily give the data back to you even though it's not yet on disk. Now assume that your program takes further actions (such as updating its main files) based on the belief that the WAL is fully intact, and then the system crashes, losing that buffered and not yet written WAL data. Oops. You (potentially) have a problem.
(These days, programs can get terminated for all sorts of reasons other than a program bug that causes a crash. If you're operating in a modern containerized environment, your management system can decide that your program or its entire container ought to shut down abruptly right now. Or something else might have run the entire system out of memory and now some OOM handler is killing your program.)
To avoid the possibility of this problem, you need to always force a disk flush when you open your journal, WAL, or whatever; on Unix, you'd immediately fsync() it. If there's no unwritten data, this will generally be more or less instant. If there is unwritten data because you're restarting after the program was terminated by surprise, this might take a bit of time but ensures that the on-disk state matches the state that you're about to observe through the OS.
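A minimal sketch of what this might look like on Unix, in Python. The names open_journal and read_all are hypothetical, and the open flags and file mode are just plausible choices for an append-only log; a real program would have its own recovery logic after this point.

```python
import os


def open_journal(path: str) -> int:
    """Open (or create) the journal and flush it before anyone reads it.

    open_journal is a hypothetical helper name used only for illustration.
    """
    fd = os.open(path, os.O_RDWR | os.O_APPEND | os.O_CREAT, 0o600)
    try:
        # Flush anything a previous, abruptly terminated run wrote but
        # never fsync()'d, so that what we read next is also on disk.
        os.fsync(fd)
    except OSError:
        os.close(fd)
        raise
    return fd


def read_all(fd: int) -> bytes:
    """Read the whole journal from the start without moving the file offset."""
    chunks = []
    offset = 0
    while True:
        chunk = os.pread(fd, 65536, offset)
        if not chunk:
            break
        chunks.append(chunk)
        offset += len(chunk)
    return b"".join(chunks)
```

Crash recovery would then call open_journal() first and only afterward read and act on the journal's contents, so any decisions it makes are based on data that has been flushed.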
(CouchDB's article points to another article, Justin Jaffray’s NULL BITMAP Builds a Database #2: Enter the Memtable, which has a somewhat different way for this failure to bite you. I'm not going to try to summarize it here but you might find the article interesting reading.)