2007-06-13
What NFS file-based locking problems can happen
Now that we know how NFS is unreliable, we can see what can go wrong when you attempt to do file-based locking over NFS. There are two failures, corresponding to the two ways for things to go wrong when a reply is lost:
- if you lose the server's reply to a successful attempt to acquire the lock, your replay of the operation will report a failure even though you actually own the lock. The result is effectively a deadlocked system where the lock will never get released. (This is one situation where the ln-based style of locking is better than the mkdir style, because the lock file can contain some identifying information so you can check it to see whether you actually own the lock after all; there's a sketch of this check after the list.)
- if you lose the server's reply to your unlocking of the lock and then replay it, you can actually unlock someone else's lock, if they successfully acquired the lock between when the server did your unlock the first time and when you replay your unlock.
I'm not certain if there's any way around the second problem, apart from counting on the server's request/reply cache (and TCP). It doesn't help to check the data in the lock file before you unlock it, because the fatal replay happens in your machine's kernel before you have a chance to check it again. There's no way for the NFS server to detect that you're unlinking a different version of the file than you think you are, because the NFS unlink and rmdir operations only specify a name with no generation count or the like.
Actually there is a theoretical tricky way to sidestep the second problem: do the locking in a group-writeable directory with the sticky bit on, and make every machine run the program under a different UID. That way the server won't let you remove a lock file (or directory) that you don't own. And you can use the lock file's ownership to see what machine currently owns the lock.
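Setting this up is just ordinary Unix permissions; here's a sketch, with a made-up directory and group name:

  # one-time setup of a shared lock directory (names are hypothetical)
  mkdir /nfs/locks
  chgrp locker /nfs/locks
  chmod 1770 /nfs/locks      # group-writeable, sticky bit on
  # with the sticky bit set, the server will refuse attempts to unlink
  # or rmdir a lock created by a different UID, and 'ls -l /nfs/locks'
  # shows which UID (and thus which machine) currently holds the lock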
2007-06-12
'Argument list too long' is a misleading message
If you try to run a command with too many command line arguments, you
get an 'Argument list too long' error (technically you get an E2BIG
error from the kernel, which is then mapped into this error string by
your shell). There's a bunch of workarounds for this problem, such as
using xargs.
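The classic illustration is removing a directory full of files; something like this (the path is just an example) feeds the names to rm in batches that fit:

  # 'rm /var/spool/junk/*' can fail with 'Argument list too long', but:
  find /var/spool/junk -type f | xargs rm
  # (where supported, 'find ... -print0 | xargs -0 rm' copes with
  # filenames that contain spaces or newlines)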
However, the error message is actually somewhat misleading. As old
Unix hands know (and new ones generally don't), the kernel's size
limit on exec() is on the combination of arguments and environment
variables. If you accidentally wind up with a huge environment and try
to start a program with even a few arguments, you'll fall over. (And if the
environment is big enough, you can't start any programs at all.)
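If you suspect this is what's biting you, a rough check of how much space your environment is eating (assuming things aren't yet so bad that you can't exec env itself) is:

  # total bytes of environment; the exec() limit this counts against
  # varies from Unix to Unix
  env | wc -c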
Modern shells make this a more mysterious failure since they have so many built-in commands, which means that you can get a substantial way into a shell script before you try to run an external command and fail. The net result is that you can spend a bunch of time scratching your head, trying to figure out why an innocent command with only a couple of arguments is getting this error.
(If you need to figure out what the big environment variables are in the
Bourne shell, it is useful to know that the export command without
any arguments lists all the environment variables, and it is a built-in
command so you can still use it in this situation. If you are trying to
use tcsh, you're on your own.)
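Once export has shown you the candidate variables, many Bourne-style shells will also let you measure one using only built-ins; a sketch, with a made-up variable name:

  export                   # built-in: lists the exported variables
  echo ${#SOMEBIGVAR}      # length of one suspect variable; ${#var} and
                           # a built-in echo are in most modern sh variants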
Sidebar: why I'm so aware of this issue
Most people run into this issue only very rarely, because you need both a very large shell variable or three and to have them exported to the environment. Just pulling an unexpectedly large amount of data into a variable to process it won't hit this. However, I use a shell that automatically exports all shell variables into the environment, so any time I pull a big value into a shell variable I run into this.
My workaround is simple; when I run external commands I null the variable for the command, like so:
BIGVAR='' command ...
This is normally used to add some environment variables just for a command, but it works fine to take them away too.
(This also works in the Bourne shell, although you are much less likely to need it.)
2007-06-10
How NFS is unreliable for file-based locking
You hear a lot about how NFS is unreliable for file-based locking, but you rarely hear how and why, and understanding the details helps in understanding what can go wrong. The fundamental source of NFS's unreliability issue here is what I'll call the replay issue.
The communication between NFS clients and NFS servers is unreliable; both requests and replies can be dropped. Only the client worries about this, and it uses a simple approach: if it didn't get an answer to its request, it resends it.
If what was lost was the client's request, there's no problem. But if what was lost was the server's reply to the client's request, there are two potential problems. First, because some NFS operations are not idempotent, an operation that got a successful result the first time will get a failure when retried. Second, an operation that is retried might be acting on a different version of an object than it thinks it is, because someone has modified the object in the meantime.
This is not a new or obscure issue; it was recognized quite early on in NFS's life. The general workaround is to add a request/reply cache to the NFS server, so that the NFS server can recognize when it gets a duplicate request and just send out another copy of the original reply. Since the cache has a finite size this isn't a sure cure, but in practice it works pretty well.
(NFS over TCP also helps, because the TCP layer makes things reliable unless you abort and reopen the TCP connection itself.)