How NFS v3 servers and clients re-synchronize locks after reboots
NFS (v3) is usually described as 'stateless', by which we mean that the NFS clients hold all of the state and in theory all the server does is answer their requests one by one (the reality is messier). However, NFS (v3) locks are obviously not stateless, in that the server and all of the NFS clients have to agree on what is and isn't locked (and by whom). This creates a need to re-synchronize this state if something unfortunate happens to either a NFS client or the NFS server, so you don't get stuck locks and other problems. The NFS v3 locking protocol took a relatively brute force approach to the problem.
If and when a NFS v3 client boots up, it sends a 'I have just rebooted' notice to every NFS server it had locks from, or even perhaps might have had locks from. The NFS servers all react to this notice by releasing any NFS locks they believe the NFS client holds. In the traditional Unix model of locks, which NFS v3 more or less follows, locks are released no later than when the relevant processes exit, and on a reboot all processes have 'exited' (even if what really happened is that the NFS client lost power, locked up entirely, or had a kernel panic). As far as I know it's harmless for a NFS client to send this notice to a NFS server it doesn't actually have any locks from, so NFS clients can do very simple things to keep a persistent record of what NFS servers they locked things on.
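As a rough illustration of the server's side of this, here is a minimal Python sketch. It's a toy in-memory model and not the real wire protocol; the class name, method names, and file paths are all made up for the example.

```python
class ToyLockServer:
    """Toy model of the server side; not the real NLM/NSM wire protocol."""

    def __init__(self):
        # Maps a file path to the client host that currently holds its lock.
        self.locks = {}

    def lock(self, client, path):
        """Grant the lock unless a different client already holds it."""
        holder = self.locks.setdefault(path, client)
        return holder == client

    def client_rebooted(self, client):
        """Handle an 'I have just rebooted' notice by dropping that client's locks."""
        stale = [path for path, holder in self.locks.items() if holder == client]
        for path in stale:
            del self.locks[path]
        return len(stale)


server = ToyLockServer()
server.lock("clientA", "/export/mbox")
server.lock("clientB", "/export/log")

# clientA crashes and comes back up; all of its old processes are gone.
released = server.client_rebooted("clientA")

# A notice from a client that holds no locks simply releases nothing.
released_none = server.client_rebooted("clientC")
```

In this model a spurious notice is harmless because the server only looks for locks recorded under that client's name and finds none.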
Things are more complicated with NFS servers. When a NFS server boots or reboots, it sends out a special 'I have rebooted' message to all NFS clients that it gave locks to, which causes all of the NFS clients to re-acquire those locks from the NFS server. However, nothing inherently prevents NFS clients from asking for new locks in the meantime, including locks on files that were theoretically already locked by another client that hasn't yet reclaimed them. To prevent this from happening, a NFS v3 server that has rebooted enters a special reclaim locking mode for what is called a grace period. When a NFS client is reclaiming a lock in response to a server's notice, it sets a special 'this is a reclaim' flag on its lock request. While the server is in reclaim lock mode during its grace period, it only accepts these special 'reclaim' lock requests; ordinary lock requests are refused with a special result code that tells the NFS client that the server is in the reclaim grace period and it should try again later.
(As with NFS client reboot notices, I believe it's harmless for a NFS server to send such notices to a client that doesn't think it holds locks from the server.)
These NFS client reclaim requests don't necessarily succeed, for various reasons (including two NFS clients both thinking they hold a lock on the same file). And I believe it's always possible for a NFS client to simply not have gotten the server's notification; in that case it has no idea it's supposed to start reclaiming locks, and the locks it thinks it holds are no longer valid as far as the server is concerned.
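The grace period logic can be sketched in a few lines of Python. This is a toy model under some assumptions of mine: the result names, the 90 second grace period, and the choice to treat every request the same way once the grace period is over are all illustrative, not taken from the NLM specification.

```python
import enum


class Result(enum.Enum):
    GRANTED = "granted"
    DENIED = "denied"              # some other client holds the lock
    GRACE_PERIOD = "grace period"  # server is in reclaim mode, try again later


class ToyLockServer:
    """Toy model of a freshly rebooted server; all names are illustrative."""

    def __init__(self, grace_seconds, boot_time=0.0):
        self.grace_ends = boot_time + grace_seconds
        self.locks = {}  # path -> client; empty because the server just booted

    def lock(self, client, path, now, reclaim=False):
        in_grace = now < self.grace_ends
        if in_grace and not reclaim:
            # Ordinary requests must wait so they can't take unreclaimed locks.
            return Result.GRACE_PERIOD
        # Assumption of this sketch: after the grace period, a request is
        # handled the same way whether or not the reclaim flag is set.
        holder = self.locks.setdefault(path, client)
        return Result.GRANTED if holder == client else Result.DENIED


server = ToyLockServer(grace_seconds=90)  # arbitrary grace period length

# clientA reclaims the lock it held before the server rebooted.
r1 = server.lock("clientA", "/export/mbox", now=5, reclaim=True)
# clientB asks for a brand new lock during the grace period.
r2 = server.lock("clientB", "/export/mbox", now=6)
# clientB also believes it held this lock; its reclaim conflicts and fails.
r3 = server.lock("clientB", "/export/mbox", now=7, reclaim=True)
# Once the grace period is over, ordinary requests work again.
r4 = server.lock("clientB", "/export/other", now=100)
```

Here `r1` is granted, `r2` gets the grace period result, `r3` is denied because clientA reclaimed first, and `r4` is granted.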
This notification process is actually a separate protocol from locking (which in NFS v3 is separate from the NFS protocol itself). Locking is the 'NLM' (Network Lock Manager) protocol; the bidirectional notification system is 'SM' or 'NSM' ((Network) Status Monitor).
In theory a NFS v3 server could allow you to force a re-synchronization of NFS lock state between server and clients at any time by flipping into reclaim mode, marking all of its locks as 'pending reclaim', sending out an 'I have rebooted' NSM notice, and then at the end of the grace period dropping any locks that hadn't been reclaimed by some client. This could even be reasonably non-intrusive. In practice I'm not sure any NFS server actually implements this; instead, I think they all treat server lock recovery as something that's only done on boot with no existing locks that have to be tracked and maybe dropped later.
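To make this hypothetical more concrete, here is a small Python sketch of what such a forced re-synchronization might look like inside a server. It's entirely speculative; I'm not describing any real implementation, the names are invented, and it assumes the server only honors a reclaim from the client it already had recorded as the lock holder.

```python
class ToyLockServer:
    """Speculative model of a forced lock re-synchronization."""

    def __init__(self):
        self.locks = {}       # path -> client, locks the server considers valid
        self.pending = {}     # path -> client, locks awaiting reclaim
        self.in_grace = False

    def start_resync(self):
        """Enter reclaim mode with every existing lock marked 'pending reclaim'."""
        self.pending = self.locks
        self.locks = {}
        self.in_grace = True
        # These are the clients that would be sent the 'I have rebooted' notice.
        return sorted(set(self.pending.values()))

    def reclaim(self, client, path):
        """Accept a reclaim only from the client recorded as the prior holder."""
        if not self.in_grace or self.pending.get(path) != client:
            return False
        self.locks[path] = self.pending.pop(path)
        return True

    def end_grace(self):
        """Drop every lock that nobody reclaimed and report the dropped paths."""
        dropped = sorted(self.pending)
        self.pending = {}
        self.in_grace = False
        return dropped


server = ToyLockServer()
server.locks = {"/export/mbox": "clientA", "/export/log": "clientB"}

to_notify = server.start_resync()
ok = server.reclaim("clientA", "/export/mbox")
# clientB never responds (perhaps it is long gone), so its lock is dropped.
dropped = server.end_grace()
```

After this runs, only clientA's lock survives; the stuck lock that clientB never reclaimed is gone, which is the whole point of the exercise.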
NFS servers and clients typically store SM state somewhere on disk. You can read about Linux's normal approach in statd(8), and about FreeBSD's in rpc.statd(8). FreeBSD conveniently ships with the protocol definitions for NLM and (N)SM, which aren't too hard to read if you're interested.