What NFS file-based locking problems can happen
Now that we know how NFS is unreliable, we can see what can go wrong when you attempt to do file-based locking over NFS. There are two failures, corresponding to the two ways for things to go wrong when a reply is lost:
- if you lose the server's reply to a successful attempt to acquire the lock, your replay of the operation will report a failure even though you actually own the lock. The result is effectively a deadlocked system where the lock will never get released. (This is one situation where the ln-based style of locking is better than the mkdir-style, because the lock file can contain some identifying information, so you can check it to see whether you own the lock after all.)
- if you lose the server's reply to your unlocking of the lock and then replay it, you can actually unlock someone else's lock, if they successfully acquired the lock between when the server did your unlock the first time and when you replay your unlock.
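To make the first failure concrete, here is a minimal sketch of ln-based locking where a failed link() is not trusted on its own. The function names (make_token, acquire, owns) and the token format are made up for illustration, and the sketch assumes that reading the lock file back gives you current contents, which NFS client caching may not guarantee.

```python
import os
import socket
import tempfile
import uuid


def make_token():
    # Hypothetical identity string: host, pid, and a random part so that
    # each acquisition attempt is distinguishable from earlier ones.
    return f"{socket.gethostname()}:{os.getpid()}:{uuid.uuid4().hex}"


def owns(lock_path, token):
    """Return True if the lock file exists and contains our token."""
    try:
        with open(lock_path) as f:
            return f.read() == token
    except FileNotFoundError:
        return False


def acquire(lock_path, token):
    """Try to take the lock; return True if we hold it afterwards."""
    lock_dir = os.path.dirname(os.path.abspath(lock_path))
    # Write our token into a uniquely named file in the same directory.
    fd, tmp_path = tempfile.mkstemp(prefix=".lock-", dir=lock_dir)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(token)
        try:
            # link() is the atomic step: it fails if lock_path exists.
            os.link(tmp_path, lock_path)
            return True
        except FileExistsError:
            # Either someone else holds the lock, or our own link()
            # succeeded on the server and this error comes from a replay.
            # The contents of the lock file tell the two cases apart.
            return owns(lock_path, token)
    finally:
        os.unlink(tmp_path)
```

With a mkdir-style lock there is nothing to read back, so the same EEXIST error cannot be disambiguated this way.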
I'm not certain if there's any way around the second problem, apart from counting on the server's request/reply cache (and TCP). It doesn't help to check the data in the lock file before you unlock it, because the fatal replay happens in your machine's kernel before you have a chance to check it again. There's no way for the NFS server to detect that you're unlinking a different version of the file than you think you are, because the NFS unlink and rmdir operations only specify a name with no generation count or the like.
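The replayed-unlock race can be walked through with a toy model. This is an in-memory stand-in, not real NFS: ToyServer, replay_scenario, and the string replies are all invented for the illustration, and a real server's reply cache is bounded, so it can still forget an old request.

```python
class ToyServer:
    """In-memory stand-in for one directory on an NFS server."""

    def __init__(self, use_reply_cache):
        self.files = {}          # name -> token of whoever created it
        self.use_reply_cache = use_reply_cache
        self.reply_cache = {}    # request id -> reply already sent

    def create(self, name, owner):
        if name in self.files:
            return "EEXIST"
        self.files[name] = owner
        return "OK"

    def remove(self, xid, name):
        # The request carries only a name: no owner, no generation count.
        # A replay reuses the same request id, so a reply cache can answer
        # it without performing the removal a second time.
        if self.use_reply_cache and xid in self.reply_cache:
            return self.reply_cache[xid]
        reply = "OK" if self.files.pop(name, None) is not None else "ENOENT"
        self.reply_cache[xid] = reply
        return reply


def replay_scenario(use_reply_cache):
    server = ToyServer(use_reply_cache)
    server.create("lock", "A")
    server.remove(xid=1, name="lock")   # A unlocks; the reply is lost
    server.create("lock", "B")          # B acquires the lock in the gap
    server.remove(xid=1, name="lock")   # A's kernel replays the request
    return server.files.get("lock")     # who holds the lock now?
```

Without the reply cache, replay_scenario(False) ends with no lock file at all even though B believes it holds the lock; with it, replay_scenario(True) leaves B's lock in place.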
Actually, there is a tricky and mostly theoretical way to sidestep the second problem: do the locking in a group-writeable directory with the sticky bit on, and make every machine run the program under a different UID. That way the server won't let you remove a lock file (or directory) that you don't own, so a replayed unlock can't remove someone else's lock. And you can use the lock file's ownership to see which machine currently owns the lock.
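A rough sketch of that setup, using mkdir-style locks, might look like the following. The helper names are invented, the one-UID-per-machine assignment is assumed to be arranged elsewhere, and the lock directory should belong to a UID that none of the locking machines use, since the owner of a sticky directory can still remove entries in it.

```python
import os


def make_lock_dir(path):
    # Mode 1770: group-writeable with the sticky bit set. Group members can
    # create entries, but only an entry's owner (or the directory's owner,
    # or root) may remove it.
    os.mkdir(path)
    os.chmod(path, 0o1770)


def try_lock(lock_dir, name="lock"):
    """mkdir-style lock attempt; True if this call created the lock."""
    try:
        os.mkdir(os.path.join(lock_dir, name))
        return True
    except FileExistsError:
        return False


def lock_owner_uid(lock_dir, name="lock"):
    """UID that owns the lock, or None if nobody holds it."""
    try:
        return os.stat(os.path.join(lock_dir, name)).st_uid
    except FileNotFoundError:
        return None


def holds_lock(lock_dir, name="lock"):
    # With one UID per machine, ownership identifies the lock holder, which
    # also lets us recover from a replayed mkdir that reported EEXIST.
    return lock_owner_uid(lock_dir, name) == os.getuid()
```

A caller would treat a False from try_lock as a real failure only if holds_lock is also False.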