2016-04-17
Why Unix needs a standard way to deal with the file durability problem
One of the reactions to my entry on Unix's file durability problem is the obvious pragmatic one. To wit, that this
isn't really a big problem because you can just look up what you
need to do in practice and do it (possibly with some debate over
whether you still need to fsync() the containing directory to
make new files truly durable or whether that's just superstition
by now). I don't disagree with this pragmatic answer and it's
certainly what you need to do today, but I think stopping there
misses why Unix as a whole should have some sort of agreed on
standard for this.
An agreed on standard would help both programmers and kernel
developers. On the side of user level programmers, it tells us not
just what we need to do in order to achieve file durability today
but also what we need to do in order to future-proof our code. A
standard amounts to a promise that no sane future Unix setup will
add an additional requirement for file durability. If our code is
working right today on Solaris UFS or Linux ext2, it will keep
working right tomorrow on Linux ext4 or Solaris ZFS. Without a
standard, we can't be sure about this and in fact some programs
have been burned by it in the past, when new filesystems added extra
requirements like fsync()'ing directories under some circumstances.
(This doesn't mean that all future Unix setups will abide by this, of course. It just means that we can say 'your system is clearly broken, this is your problem and not a fault in our code, fix your system setup'. After all, even today people can completely disable file durability through configuration choices.)
On the side of kernel people and filesystem developers, it tells both parties how far a sensible filesystem can go; it becomes a 'this far and no further' marker for filesystem write optimization. Filesystem developers can reject proposed features that break the standard as 'it breaks the standard', and if they don't the overall kernel developers can. Filesystem development can entirely avoid both a race to the bottom and strained attempts to read the POSIX specifications so as to allow ever faster but more dangerous behavior (and also the ensuing arguments over just how one group of FS developers read POSIX).
The whole situation is exacerbated because POSIX and other standards have relatively little to say on this. The people who create hyper-aggressive C optimizers are at least relying on a detailed and legalistically written C standard (even if almost no programs are fully conformant to it in practice), and so they can point users to chapter and verse on why their code is not standards conforming and so can be broken by the compiler. The filesystem people are not so much on shakier ground as on fuzzy ground, which results in much more confusion, disagreement, and arguing. It also makes it very hard for user level programmers to predict what future filesystems might require here, since they have so little to go on.
2016-04-14
Unix's file durability problem
The core Unix API is overall a reasonably well put together programming environment, one where you can do what you need and your questions have straightforward answers. It's not complete by any means and some of the practical edges are rough as a result of that, but the basics are solid. Well. Most of the basics.
One area where the Unix API really falls down is the simple question of how to make your file writes durable. Unix will famously hold your writes in RAM for an arbitrary length of time in the interests of performance. Often this is not quite what you want, as there are plenty of files that you very much want to survive a power loss, abrupt system crash, or the like. Unfortunately, how you make Unix put your writes on disk is what can charitably be called 'underspecified'. The uncharitable would call it a swamp.
The current state of affairs is that it's rather difficult to know
how to reliably and portably flush data to disk. Both superstition
and uncertainty abound. Do you fsync() or fdatasync() the file?
Do you need to fsync() the directory? Are there any extra steps?
Do you maybe need to fsync() the parent of the directory too? Who
knows for sure.
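
To make the uncertainty concrete, here is a minimal sketch in C of one commonly described sequence for durably creating a new file: write the data, fsync() the file, then fsync() the containing directory. The helper name write_durably is my own, and whether every step here is necessary (or sufficient) on a given Unix, kernel version, and filesystem is exactly the open question; treat it as an assumption, not a recipe.

/* A sketch of one commonly described durability sequence: write,
   fsync() the file, then fsync() the containing directory.  What
   is actually required varies by Unix, kernel, and filesystem. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int write_durably(const char *dir, const char *name,
                  const char *data, size_t len)
{
    char path[4096];
    snprintf(path, sizeof(path), "%s/%s", dir, name);

    /* Create the file and get the data into it. */
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, data, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    if (close(fd) != 0)
        return -1;

    /* fsync() the directory so the new name itself is (hopefully)
       durable too; this is the step people argue over. */
    int dfd = open(dir, O_RDONLY);
    if (dfd < 0)
        return -1;
    if (fsync(dfd) != 0) {
        close(dfd);
        return -1;
    }
    return close(dfd);
}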
One issue is that unlike many other Unix API issues, it's impossible to test to see if you got it all correct and complete. If your steps are incomplete, you don't get any errors; your data is just silently sometimes at risk. Even with a test setup to create system crashes or abrupt power loss (which VMs make much easier), you need uncommon instrumentation to know things like whether your OS actually issued disk flushes or just did normal buffered writes. And straightforward testing can't tell you if what you're doing will work all the time, because what is required varies by Unix, kernel version, and the specific filesystem involved.
Part of the problem is that any number of filesystem authors have taken advantage of POSIX's weak wording and how nothing usually goes wrong in order to make their filesystems perform faster (most of the time). It's clear why they do this; the standard is underspecified, people benchmark filesystems against each other and reward the fastest ones, and testing actual durability is fiendishly hard so no one bothers. When actual users lose data, filesystem authors have historically behaved a great deal like the implementors of C compiler optimizations; they find some wording that justifies their practice of not flushing, explain how it makes their filesystem faster for almost everyone, and then blame the software authors for not doing the right magic steps to propitiate the filesystem.
(How people are supposed to know what the right steps are is left carefully out of scope for filesystem authors. That's someone else's job.)
This issue is not unsolvable at a technical level, but it probably is at a political level. Someone would have to determine and write up what is good enough now (on sane setups), and then Unix kernel people would have to say 'enough, we are not accepting changes that break this de facto standard'. You might even get this into the Single Unix Specification in some form if you tried hard, because I really do think there's a need here.
I'll admit that one reason I'm unusually grumpy about this is that I feel rather unhappy not knowing what I need to do to safeguard data that I care about. I could do my best, write code in accordance with my best understanding, and still lose data in a crash because I'd missed some corner case or some new additional requirement that filesystem people have introduced. Just the thought of it is alarming. And of course at the same time I'm selfish, because I want my filesystem activity to go as fast as it can and I'm not going to do 'crazy' things like force lots of IO to be synchronous. In this I'm implicitly one of the people pushing filesystem implementors to find those tricks that I wind up ranting about later.
2016-04-06
What is behind Unix's 'Text file is busy' error
Perhaps you have seen this somewhat odd Unix error before:
# cp prog /usr/local/bin/prog
cp: cannot create regular file 'prog': Text file is busy
This is not just an unusual error message, it's also a rare instance
of Unix being friendly and not letting you blow your foot off with
a perfectly valid operation that just happens to be (highly) unwise.
To understand it, let's first work out what exact operation is
failing. I'll do this with strace on Linux, mostly because it's
what I have handy:
$ cp /usr/bin/sleep /tmp/
$ /tmp/sleep 120 &
$ strace cp /usr/bin/sleep /tmp/
[...]
open("/usr/bin/sleep", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0755, st_size=32600, ...}) = 0
open("/tmp/sleep", O_WRONLY|O_TRUNC) = -1 ETXTBSY (Text file busy)
[...]
There we go. cp is failing when it attempts to open /tmp/sleep
for writing and truncate it, which it can't do because /tmp/sleep
is the executable of a running program; the specific Unix errno
value here is ETXTBSY. If you experiment
some more you'll discover that we're allowed to remove /tmp/sleep
if we want to, just not write to it or truncate it (at least on
Linux; the specifics of what's disallowed may vary slightly on other
Unixes). This is an odd limitation for Unix, because normally there's
nothing that prevents one process from modifying a file out from
underneath another process (even in harmful ways). Unix leaves it
up to the program(s) involved to coordinate things between themselves,
rather than enforcing a policy of 'no writing if there are readers'
or something in the kernel.
But running processes are special, because really bad things usually happen if you modify the on-disk code of a running process. The problem is virtual memory, or more exactly paged virtual memory. On a system with paged virtual memory, programs aren't loaded into RAM all at once and then kept there; instead they're paged into RAM in bits and pieces as bits of code (and data) are needed. In fact, sometimes already-loaded bits and pieces are dropped from RAM in order to free up space, since they can always be loaded back in from disk.
Well, they can be loaded back in from disk if some joker hasn't gone and changed them on disk, at least. All of this paging programs into RAM in sections only works if the program's file on disk doesn't ever change while the program is running. If the kernel allowed running programs to change on disk, it could wind up loading in one page of code from version 1 of the program and another page from version 2. If you're lucky, the result would segfault. If you're unlucky, you might get silent malfunctions, data corruption, or other problems. So for once the Unix kernel does not let you blow your foot off if you really want to; instead it refuses to let you write to a program on disk if the program is running. You can truncate or overwrite any other sort of file even if programs are using it, just not things that are part of running programs. Those are special.
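
If you want to see this refusal directly from code rather than through strace, here is a small C sketch of my own (assuming the same /tmp/sleep setup as in the transcript above) that attempts the same open() and checks for ETXTBSY.

/* Try to open a (presumed) running program's file for writing;
   with /tmp/sleep still running, this should fail with ETXTBSY.
   Note that if nothing is running it, the open() succeeds and
   truncates the file, just as cp's open() would have. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/tmp/sleep", O_WRONLY | O_TRUNC);
    if (fd < 0 && errno == ETXTBSY)
        printf("got ETXTBSY: %s\n", strerror(errno));
    else if (fd < 0)
        printf("failed for another reason: %s\n", strerror(errno));
    else {
        printf("opened (and truncated) it; nothing was running it\n");
        close(fd);
    }
    return 0;
}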
Given the story I've just told, you might expect ETXTBSY to
have appeared in Unix in or around 3BSD, which is more or less the
first version of Unix with paged virtual memory. However, this is
not the case. ETXTBSY turns out to be much older than BSD Unix,
going back to at least Research V5. Research Unix through V7 didn't
have paged virtual memory (it only swapped entire programs in and
out), but apparently the Research people decided to simplify their
lives by basically locking the files for executing programs against
modification.
(In fact Research Unix was stricter than modern Unixes, as it looks
like you couldn't delete a program's file on disk if it was running.
That section of the kernel code for unlink() gets specifically
commented out no later than 3BSD, cf.)
PS: the 'text' in 'text file' here actually means 'executable code',
per, say, size's output. Of course it's not just the actual executable
code that could be dangerous if it changed out from underneath a
running program, but there you go.
Sidebar: the way around this if you're updating running programs
To get around this, all you have to do is remove the old file before writing the new file into place. This (normally) doesn't cause any problems; the kernel treats the 'removed but still being used by a running program' executable the same way it treats any 'removed but still open' file. As usual the file is only actually removed when the last reference goes away, in this case when the last process using the old executable exits.
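
In C, the pattern looks roughly like the sketch below (my own illustration, with a hypothetical install_program helper and with error handling and permissions simplified): unlink the old executable, then create and write the new one under the same name.

/* A sketch of the 'remove then recreate' update pattern: unlink
   the old executable (running processes keep the now nameless old
   inode), then write the new program under the same name. */
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

int install_program(const char *path, const char *data, size_t len)
{
    /* Removing the old name sidesteps ETXTBSY; running copies
       carry on using the old, unlinked file. */
    if (unlink(path) != 0 && errno != ENOENT)
        return -1;

    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0755);
    if (fd < 0)
        return -1;
    if (write(fd, data, len) != (ssize_t)len) {
        close(fd);
        return -1;
    }
    return close(fd);
}

(Writing the new version to a temporary name and then rename()'ing it over the old one gets the same effect for much the same reason: the old file is simply unlinked, never written to.)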
(Of course NFS throws a small monkey wrench into things, sometimes in more than one way.)