Unix's file durability problem

April 14, 2016

The core Unix API is overall a reasonably well put together programming environment, one where you can do what you need and your questions have straightforward answers. It's not complete by any means and some of the practical edges are rough as a result of that, but the basics are solid. Well. Most of the basics.

One area where the Unix API really falls down on is the simple question of how to make your file writes be durable. Unix will famously hold your writes in RAM for an arbitrary length of time in the interests of performance. Often this is not quite what you want, as there are plenty of files that you rather want to survive a power loss, abrupt system crash, or the like. Unfortunately, how you make Unix put your writes on disk is what can charitably be called 'underspecified'. The uncharitable would call it a swamp.

The current state of affairs is that it's rather difficult to know how to reliably and portably flush data to disk. Both superstition and uncertainty abound. Do you fsync() or fdatasync() the file? Do you need to fsync() the directory? Are there any extra steps? Do you maybe need to fsync() the parent of the directory too? Who knows for sure.

One issue is that unlike many other Unix API issues, it's impossible to test to see if you got it all correct and complete. If your steps are incomplete, you don't get any errors; your data is just silently sometimes at risk. Even with a test setup to create system crashes or abrupt power loss (which VMs make much easier), you need uncommon instrumentation to know things like if your OS actually issued disk flushes or just did normal buffered writes. And straightforward testing can't tell you if what you're doing will work all the time, because what is required varies by Unix, kernel version, and the specific filesystem involved.

Part of the problem is that any number of filesystem authors have taken advantage of POSIX's weak wording and how nothing usually goes wrong in order to make their filesystems perform faster (most of the time). It's clear why they do this; the standard is underspecified, people run filesystems against each other and reward the fastest ones, and testing actual durability is fiendishly hard so no one bothers. When actual users lose data, filesystem authors have historically behaved a great deal like the implementors of C compiler optimizations; they find some wording that justifies their practice of not flushing, explain how it makes their filesystem faster for almost everyone, and then blame the software authors for not doing the right magic steps to propitiate the filesystem.

(How people are supposed to know what the right steps are is left carefully out of scope for filesystem authors. That's someone else's job.)

This issue is not unsolvable at a technical level, but it probably is at a political level. Someone would have to determine and write up what is good enough now (on sane setups), and then Unix kernel people would have to say 'enough, we are not accepting changes that break this de facto standard'. You might even get this into the Single Unix Specification in some form if you tried hard, because I really do think there's a need here.

I'll admit that one reason I'm unusually grumpy about this is that I feel rather unhappy not knowing what I need to do to safeguard data that I care about. I could do my best, write code in accordance with my best understanding, and still lose data in a crash because I'd missed some corner case or some new additional requirement that filesystem people have introduced. Just the thought of it is alarming. And of course at the same time I'm selfish, because I want my filesystem activity to go as fast as it can and I'm not going to do 'crazy' things like force lots of IO to be synchronous. In this I'm implicitly one of the people pushing filesystem implementors to find those tricks that I wind up ranting about later.

Written on 14 April 2016.
« How I'm trying to do durable disk writes here on Wandering Thoughts
Unbound illustrates the Unix manpage mistake with its ratelimits documentation »

Page tools: View Source, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Thu Apr 14 00:36:46 2016
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.