2015-10-11
Bad news about how we detect and recover from NFS server problems
In a comment on this entry, Sandip Bhattacharya asked me:
Also, sometimes transient NFS server issues can cause the NFS mount to be wedged, where any access to the NFS mount hangs the process. How do you escape or detect such conditions?
This is a good question in general and I am afraid the bad news is that there don't seem to be any good answers. Our usual method of 'detecting' such problems is that a succession of machines start falling over with absurd load averages; generally this is our central mailer, our primary IMAP server, our web server, and our most heavily used login server. This is of course not entirely satisfactory, but doing better is hard. Client kernels will generally start spitting out 'NFS server <X> not responding, still trying' messages somewhat before they keel over from excess load and delays, but you can get temporary blips of these messages even without server problems. On top of that, you'd need a very fast response before active machines start getting into bad situations.
(A web server is an especially bad case, since it keeps getting new requests all the time. If processes are stalling on IO, it doesn't take very much time before your server is totally overwhelmed. Many other servers at least don't spawn new activity quite so fast.)
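(If you want something a bit more proactive than waiting for load averages to explode, one crude approach is a monitoring probe that stats each NFS mount point with a bounded timeout, so that the prober itself can't get wedged. Here's a minimal sketch in shell, assuming GNU coreutils' timeout and stat; the mount points are made up, and any failure, whether a hang past the timeout or an outright error, is treated as trouble:)

```shell
#!/bin/sh
# nfs_probe: check one mount point with a time-bounded stat so a wedged
# NFS mount cannot hang the checker itself. Any failure (a hang that
# exceeds the timeout, or an error) is reported as 'unresponsive'.
nfs_probe() {
    if timeout 10 stat -t "$1" >/dev/null 2>&1; then
        echo "ok $1"
    else
        echo "unresponsive $1"
    fi
}

# Hypothetical NFS mount points to watch; substitute your own.
for mnt in /nfs/homes /nfs/mail; do
    nfs_probe "$mnt"
done
```

(You'd run something like this from cron or your monitoring system and alert on any 'unresponsive' output; it tells you a mount has gone bad, though not how to fix it.)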
As far as escaping the situation goes, again we haven't found any good solutions. If we're really lucky, we catch the situation early enough to unmount currently unused and thus not-yet-wedged NFS filesystems from our clients. Unfortunately this is rare and doesn't help the really active machines. In theory, clients offer ways to force NFS unmounts; in practice this has often not worked for us (on Linux) for actively used NFS filesystems. Generally we have to either get the NFS server to start working again (perhaps by rebooting the server) or force client reboots, after which they won't NFS mount stuff from the bad server.
(If a NFS server is experiencing ongoing or repeated problems, sometimes we can reboot it and have it return to good service long enough to unmount all of its filesystems on clients.)
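(For the record, the in-theory escape hatches on a Linux client are the forced and lazy variants of umount. In our experience they frequently fail or hang on actively used mounts, but they're the obvious things to try first. A sketch, where /nfs/home is a made-up wedged mount point and this must run as root:)

```shell
#!/bin/sh
# Escalating attempts to get rid of a wedged NFS mount on a Linux client.
# /nfs/home is a hypothetical mount point; all of this requires root.
wedged=/nfs/home

umount "$wedged" 2>/dev/null     # normal: fails with EBUSY while processes use it
umount -f "$wedged" 2>/dev/null  # forced: tries to abort outstanding NFS requests
umount -l "$wedged" 2>/dev/null  # lazy: detach from the namespace now, clean up later
echo "attempted normal, forced, and lazy unmounts of $wedged"
```

(A lazy unmount at least makes the mount point disappear from new lookups, but processes already blocked inside the dead mount stay blocked, which is exactly the case that hurts on busy machines.)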
In theory, you can fake a totally lost NFS server by having another NFS server take over the IP address so that at least clients will get 'permission denied, filesystem not exported' errors instead of no replies at all. In practice, this can run into serious client issues with the handling of stale NFS mounts so you probably don't want to do this unless you've already tested the result and know it isn't going to blow up in your face.
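(The mechanics of the takeover itself are simple; the danger is entirely in how clients react afterward. A sketch of what the standby server would do, where the service IP 192.0.2.10 and the interface eth0 are invented, this must run as root, and the standby deliberately does not export the dead server's filesystems so clients get hard errors instead of silence:)

```shell
#!/bin/sh
# Sketch: a standby machine claims a dead NFS server's service IP.
# 192.0.2.10 and eth0 are invented; run as root. Because the standby
# doesn't export the filesystems, clients get 'not exported' errors
# rather than hanging forever on an address that never answers.
ip addr add 192.0.2.10/32 dev eth0 2>/dev/null   # claim the service IP
arping -c 3 -U -I eth0 192.0.2.10 2>/dev/null    # gratuitous ARP so clients update their caches
echo "took over 192.0.2.10 on eth0"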
The whole situation with unresponsive NFS servers has been a real problem for as long as NFS has existed, but so far no one seems to have come up with good client-side solutions to make detecting and managing problems easier. I suspect one reason for this is that NFS servers are generally very reliable, which doesn't give people much motive to create complicated solutions for when they aren't.
(For reasons covered here, I feel that an automounter is not the answer to this problem in most cases. Anyways, we have our own NFS mount management solution.)