Caches should be safe by default
I've been looking at disk read caching systems recently. Setting aside my other issues, I've noticed something about basically all of them that makes me twitch as a sysadmin. I will phrase it this way:
Caches should be safe by default.
By 'safe' I mean that if your cache device dies, you should not lose data or lose your filesystem. Everything should be able to continue on, possibly after some manual adjustment. The broad purpose of most caches is to speed up reads; write accelerators are a different thing and should be labeled as such. When your new cache system is doing this for you, it should not be putting you at a greater risk of data loss because of some choice it's quietly made; if writes touch the cache at all, it should default to write-through mode. To do otherwise is just as dangerous as those databases that achieve great speed through quietly not really committing their data to disk.
There is a corollary that follows from this:
Caches should clearly document when and how they aren't safe.
After I've read a cache's documentation, I should not be in either doubt or ignorance about what will or can happen if the cache device dies on me. If I am, the documentation has failed. Especially it had better document the default configuration (or the default examples or both), because the default configuration is what a lot of people will wind up using. As a corollary to the corollary, the cache documentation should probably explain what I get for giving up safety. Faster than normal writes? It's just required by the cache's architecture? Avoiding a write slowdown that the caching layer would otherwise introduce? I'd like to know.
(If documenting this stuff makes the authors of the cache embarrassed, perhaps they should fix things.)
As a side note, I think it's fine to offer a cache that's also a write accelerator. But I think that this should be clearly documented, the risks clearly spelled out, and it should not be the default configuration. Certainly it should not be the silent default.
A consequence of NFS locking and unlocking not necessarily being fast
A while back I wrote Cross-system NFS locking and unlocking is not necessarily fast, about one drawback of using NFS locks to communicate between processes on different machines. This drawback is that it may take some time for process B on machine 2 to find out that process A on machine 1 has unlocked the shared coordination file. It turns out that this goes somewhat further than I realized at the time. Back then I looked at cross-system lock activity, but it turns out that you can see long NFS lock release delays even when the two processes are on the same machine.
If you have process A and process B on the same machine, both contending for access to the same file via file locking, you can easily see significant delays between active process A releasing the lock and waiting process B being notified that it now has the lock. I don't know enough about the NLM protocol to know if the client or server kernels can do anything to make the process go faster, but there are some client/server combinations where this delay does happen.
(If the client's kernel is responsible for periodically retrying pending locking operations until they succeed, it certainly could be smart enough to notice that another process on the machine just released a lock on the file and so now might be a good time for a another retry.)
This lock acquisition delay can have a pernicious spiraling effect on an overall system. Suppose, not entirely hypothetically, that what a bunch of processes on the same machine are locking is a shared log file. Normally a process spends very little time doing logging and most of their time doing other work. When they go to lock the log file to write a message, there's no contention, they get the lock, they basically immediately release the lock, and everyone goes on fine. But then you hit a lock collision, where processes A and B both want to write. A wins, writes its log message, and unlocks immediately. But the NFS unlock delay means that process B is then going to sit there for ten, twenty, or thirty seconds before it can do its quick write and release the lock in turn. Suppose during this time another process, C, also shows up to write a log message. Now C may be waiting too, and it too will have a big delay to acquire the lock (if locks are 'fair', eg FIFO, then it will have to wait both for B to get the lock and for the unlock delay after B is done). Pretty soon you have more and more processes piling up waiting to write to the log and things grind to a much slower pace.
I don't think that there's a really good solution to this for NFS, especially since an increasing number of Unixes are making all file locks be NFS aware (as we've found out the hard way before). It's hard to blame the Unixes that are doing this, and anyways the morally correct solution would be to make NLM unlocks wake up waiting people faster.
PS: this doesn't happen all of the time. Things are apparently somewhat variable based on the exact server and client versions involved and perhaps timing issues and so on. NFS makes life fun.
Sidebar: why the NFS server is involved here
Unless the client kernel wants to quietly transfer ownership of the lock being unlocked to another process instead of actually releasing it, the NFS server is the final authority on who has a NFS lock and it must be consulted about things. For all that any particular client machine knows, there is another process on another machine that very occasionally wakes up, grabs a lock on the log file, and does stuff.
Quietly transferring lock ownership is sleazy because it bypasses any other process trying to get the lock on another machine. One machine with a rotating set of processes could unfairly monopolize the lock if it did that.