How disk write caches can corrupt filesystem metadata

June 4, 2010

It's intuitively obvious how (volatile) disk write caches can result in you losing data in files if something goes wrong; you wrote data, the disk told your program it had been written, the disk lost power, and the data vanishes. But it may be less obvious how this can result in corrupted or destroyed filesystems and thus why you need (working) cache flush operations even just to keep your filesystems intact (never mind what user level programs may want).

Consider a filesystem where you have two pieces of metadata, A and B, where A points to B; A might be a directory and B a file inode, or A might be a file inode and B a table of block pointers. Since filesystem metadata is often some sort of tree, this sort of pointing is common (nodes higher up the tree point to nodes lower down). Now suppose that you are creating a new B (say you are adding a file to a directory). In order to keep the metadata consistent, you want to write things bottom first; you want to write the new B and then the new version of A.

(It's common to have several layers of pointing; A points to B which points to C which points to D and so on. In such cases you usually don't have to write each one by one, pausing before the next. Instead you just need everything else written, in some order, before you make the change visible by writing A.)
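As a rough user-level analogy for this ordering, here is a minimal Python sketch that treats an ordinary file as the 'disk'. The block size, the block layout, and the names `write_block`, `publish`, and `demo` are all made up for illustration. It also assumes a Unix system (for `os.pwrite`) and that `os.fsync` really does reach stable storage, which depends on the OS and the disk honouring cache flushes.

```python
import os
import tempfile

BLOCK = 512  # hypothetical block size for this sketch


def write_block(fd, blockno, payload):
    """Write one fixed-size, zero-padded block at the given block number."""
    os.pwrite(fd, payload.ljust(BLOCK, b"\0"), blockno * BLOCK)


def publish(fd, children, parent_blockno, parent_payload):
    """Write all lower-level blocks, flush, and only then write the parent A."""
    for blockno, payload in children.items():
        write_block(fd, blockno, payload)   # B, C, D, ... in any order
    os.fsync(fd)                            # everything below A is written first
    write_block(fd, parent_blockno, parent_payload)  # now make it visible
    os.fsync(fd)


def demo():
    """Run the sequence on a temporary file and read back A and B."""
    fd, path = tempfile.mkstemp()
    try:
        publish(fd, {1: b"B: inode", 2: b"C: block pointers"}, 0, b"A -> block 1")
        a = os.pread(fd, BLOCK, 0).rstrip(b"\0")
        b = os.pread(fd, BLOCK, 1 * BLOCK).rstrip(b"\0")
        return a, b
    finally:
        os.close(fd)
        os.unlink(path)
```

The point of the single flush in the middle is that the lower-level blocks can go out in whatever order is convenient; only the final write of A has to wait for all of them.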

In theory disks with volatile write caches don't upset this; your metadata is still consistent if the disk loses power and neither A nor B gets written. What breaks metadata consistency is that disks with write caches don't necessarily write things in the order you issued them; it's entirely possible for a disk to cache both the B and A writes, then write A, then lose power with B unwritten. At this point you have A pointing to garbage. Boom. And disks with write caches are free to keep things unwritten for unpredictable but large amounts of time for their own inscrutable reasons (or very scrutable ones, such as 'A keeps getting written to').
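To make the failure concrete, here is a toy model of a disk with a volatile write cache. Everything in it (the `CachedDisk` class, the `destage` method, the block names) is invented for this sketch and is not how any real disk firmware works; it only captures the one property that matters here, that cached writes can reach the platter in any order.

```python
class CachedDisk:
    """Toy disk with a volatile write cache (hypothetical model)."""

    def __init__(self):
        self.platter = {}  # block name -> data that survives power loss
        self.cache = {}    # acknowledged writes that are not yet on the platter

    def write(self, block, data):
        self.cache[block] = data  # acknowledged to the host immediately

    def destage(self, block):
        """The disk writes out one cached block, in whatever order it likes."""
        self.platter[block] = self.cache.pop(block)

    def flush(self):
        for block in list(self.cache):
            self.destage(block)

    def lose_power(self):
        self.cache.clear()  # anything still cached is gone


def is_consistent(disk):
    """A is fine if it is absent or if the block it names is on the platter."""
    target = disk.platter.get("A")
    return target is None or target in disk.platter
```

If you write B, write A, let the disk destage only A, and then call `lose_power()`, `is_consistent()` comes back false. If you instead write B, call `flush()`, and only then write A, a power loss at any later point leaves either the old state or the new one.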

(Note that copy-on-write filesystems are especially exposed to this, because they almost never update things in place and so are writing a lot of new B's and changing where the A's point. And the A is generally the very root of the filesystem, so if it points to nowhere you have a very bad problem.)

In the simple case you can get away with just a disk write barrier for metadata integrity, so that you can tell the disk that it can't write A before it's written B out. However, this isn't sufficient when you're dealing with multi-disk filesystems, where A may be on an entirely different disk from B; a barrier only orders writes within a single disk. There you really do need to be able to issue a cache flush to B's disk and know that B has been written out before you queue up A's write on its disk. (Otherwise you could again have A written but not B, because B's disk lost power but A's did not.)
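The two-disk case can be sketched with the same sort of toy model. Again, the `Disk` class and the function names are hypothetical, and the sketch assumes the worst case where A's disk writes A out promptly while B's disk sits on B in its cache.

```python
class Disk:
    """Toy disk with a volatile write cache (hypothetical model)."""

    def __init__(self):
        self.platter = {}  # survives power loss
        self.cache = {}    # lost on power loss

    def write(self, block, data):
        self.cache[block] = data

    def flush(self):
        self.platter.update(self.cache)
        self.cache.clear()

    def lose_power(self):
        self.cache.clear()


def publish(disk_a, disk_b, flush_b_first):
    """Write B to its disk, optionally flush it, then write A to the other disk."""
    disk_b.write("B", "new inode")
    if flush_b_first:
        disk_b.flush()       # know that B is on the platter before touching A
    disk_a.write("A", "B")
    disk_a.flush()           # assume A's disk happens to write A right away


def survives_power_loss_on_b(flush_b_first):
    """Does A still point at something real after only B's disk loses power?"""
    disk_a, disk_b = Disk(), Disk()
    publish(disk_a, disk_b, flush_b_first)
    disk_b.lose_power()
    target = disk_a.platter.get("A")
    return target is None or target in disk_b.platter
```

Here `survives_power_loss_on_b(True)` holds and `survives_power_loss_on_b(False)` does not, and no amount of write ordering inside either individual disk changes the second result.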

The multi-disk filesystem case is a specific example of the general case where write barriers aren't good enough: where you're interacting with the outside world, not just with things on the disk itself. Since all sorts of user level programs interact with the outside world, user programs generally need real 'it is on the disk' cache flush support.
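For user-level programs, the usual shape of this is 'flush, then tell the outside world'. A minimal sketch, where `durable_append`, `handle_request`, and the `send_ack` callback are hypothetical names and where `os.fsync` is only as good as the cache flush support underneath it:

```python
import os
import tempfile


def durable_append(path, record):
    """Append a record and return only after fsync() has completed."""
    with open(path, "ab") as f:
        f.write(record)
        f.flush()             # push Python's own buffer down to the OS
        os.fsync(f.fileno())  # ask the OS to push it to stable storage


def handle_request(path, record, send_ack):
    """Acknowledge to the outside world only once the record has been flushed."""
    durable_append(path, record)
    send_ack(b"OK")


def demo():
    """Run one request against a temporary file; return the acks and the file."""
    acks = []
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        handle_request(path, b"record 1\n", acks.append)
        with open(path, "rb") as f:
            return acks, f.read()
    finally:
        os.unlink(path)
```

A write barrier would be no help here, because the acknowledgement isn't a disk write that the disk could order after the record.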

(This is the kind of entry that I write to make sure I understand the logic so that I can explain it to other people. As usual, it feels completely obvious once I've written it out.)

Sidebar: write cache exposure versus disk redundancy

I believe that in a well-implemented redundant filesystem, the filesystem's metadata consistency should survive so long as the filesystem can find a single good copy of B. For example if you have an N-way mirror, you're still okay even if N-1 disks all lose the write (such as by losing power simultaneously); you're only in trouble if all of them do. This may give you some reassurance even if you have disks that ignore or don't support cache flushes (apparently this includes many common SSDs, much to people's displeasure).

(With disk-level redundancy instead of filesystem-level redundancy you may have problems recognizing what's a good copy of B. Let's assume that you have ZFS-like checksums and so on.)
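A sketch of what 'find a single good copy of B' might look like, assuming A records a checksum of the B it expects (in the general style of ZFS, although this is not ZFS's actual on-disk format or code). The mirrors are modelled as plain dictionaries, and `checksum` and `read_good_copy` are hypothetical names.

```python
import hashlib


def checksum(data):
    """Checksum that the parent A stores alongside its pointer to B."""
    return hashlib.sha256(data).hexdigest()


def read_good_copy(mirrors, block, expected):
    """Return the first mirror copy whose checksum matches, or None."""
    for disk in mirrors:
        data = disk.get(block)
        if data is not None and checksum(data) == expected:
            return data
    return None


new_b = b"new B"
expected = checksum(new_b)

# Two of three mirrors lost the write (one has stale data, one has nothing).
one_survivor = [{"B": b"old B"}, {}, {"B": new_b}]

# Every mirror lost the write.
no_survivor = [{"B": b"old B"}, {}]
```

With `one_survivor`, `read_good_copy` returns the new B; with `no_survivor` it returns `None`, which is the 'A points to garbage' case again.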

Of course, power loss events can be highly correlated across multiple disks (to put it one way). Especially if they're all in the same server.

Written on 04 June 2010.