Caches should be safe by default

August 15, 2014

I've been looking at disk read caching systems recently. Setting aside my other issues, I've noticed something about basically all of them that makes me twitch as a sysadmin. I will phrase it this way:

Caches should be safe by default.

By 'safe' I mean that if your cache device dies, you should not lose data or lose your filesystem. Everything should be able to continue on, possibly after some manual adjustment. The broad purpose of most caches is to speed up reads; write accelerators are a different thing and should be labeled as such. When your new cache system is doing this for you, it should not be putting you at a greater risk of data loss because of some choice it's quietly made; if writes touch the cache at all, it should default to write-through mode. To do otherwise is just as dangerous as those databases that achieve great speed through quietly not really committing their data to disk.

There is a corollary that follows from this:

Caches should clearly document when and how they aren't safe.

After I've read a cache's documentation, I should not be in either doubt or ignorance about what will or can happen if the cache device dies on me. If I am, the documentation has failed. Especially it had better document the default configuration (or the default examples or both), because the default configuration is what a lot of people will wind up using. As a corollary to the corollary, the cache documentation should probably explain what I get for giving up safety. Faster than normal writes? It's just required by the cache's architecture? Avoiding a write slowdown that the caching layer would otherwise introduce? I'd like to know.

(If documenting this stuff makes the authors of the cache embarrassed, perhaps they should fix things.)

As a side note, I think it's fine to offer a cache that's also a write accelerator. But I think that this should be clearly documented, the risks clearly spelled out, and it should not be the default configuration. Certainly it should not be the silent default.

Written on 15 August 2014.
« A consequence of NFS locking and unlocking not necessarily being fast
The challenges of diagnosing slow backups »

Page tools: View Source, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Fri Aug 15 22:40:48 2014
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.