The traditional workaround for stuck NFS(v3) locks

April 21, 2023

Up through NFS v3, advisory file locks over NFS were done through a separate protocol and set of systems, the "Network Lock Manager" (NLM) set of protocols (which I believe are best covered in File Locking over XNFS and Network Lock Manager Protocol). File locking is naturally a stateful system, where the server and the clients have to have the same state, but unfortunately the NFS v3 NLM protocol doesn't provide any way for servers or clients to explicitly check that they agree on how things are. In theory the two sides should never get out of sync; in practice, well, they do.

(NFS locking is designed to deal with server and client reboots, but either doing one or simulating it tends to be pretty disruptive.)

The most visible way for the server and clients to become de-synchronized is a stuck lock, where the server believes a client has a file locked but the client thinks it doesn't. A file with a stuck lock can never be (re-)locked by any program that tries to do so, and it will normally stay that way until you reboot the server, the client that the server thinks has the file locked, or both. As a result, working around these stuck locks has been a concern of NFS server system administrators for a long time and people have come up with a traditional brute force solution.

(In informal conversation sysadmins may talk about 'clearing' a stuck lock this way, but we're not really doing that; we're working around it with brute force.)

The traditional brute force workaround is to carefully stop everything that would ordinarily try to touch the stuck file, copy it to a new file, and then if necessary rename the new file back to the old name. Often you then remove the old stuck file. Then you let whatever programs or systems were trying to lock things start up again so they can go back to using and locking the file. Often this can be as simple as renaming the stuck file out of the way and copying it back under its original name:

; mv fileA fileA-stuck
; cp -a fileA-stuck fileA

(There are many variations of this 'copy and rename' process, depending on what you're worried about and how you want to proceed.)
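One such variation can be sketched as a small shell function. This is only an illustration, not a standard tool: the name unstick_file and the '-stuck' suffix are made up here, and it assumes you have already stopped everything that uses the file. It refuses to overwrite the leftovers of an earlier attempt, puts the original back if the copy fails, and checks that the copy matches before reporting success.

```shell
# Hypothetical helper: give the file's name a new inode so that the
# stuck lock (which is tied to the old inode) no longer applies to it.
# Assumes every program that uses the file has already been stopped.
unstick_file() {
    f=$1
    stuck="$f-stuck"
    # Refuse to clobber what an earlier attempt left behind.
    if [ -e "$stuck" ]; then
        echo "unstick_file: $stuck already exists" >&2
        return 1
    fi
    # Move the locked inode out of the way, then copy it back
    # under the original name.
    mv -- "$f" "$stuck" || return 1
    if ! cp -a -- "$stuck" "$f"; then
        # The copy failed: put the original back where it was.
        rm -f -- "$f"
        mv -- "$stuck" "$f"
        return 1
    fi
    # Only report success if the copy matches byte for byte.
    cmp -s -- "$stuck" "$f" || return 1
}
```

You would run it as 'unstick_file fileA' and remove fileA-stuck by hand once you are satisfied with the new copy.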

This works because (NFS) file locks are almost invariably attached to the file's inode instead of its name. When you rename and copy the file, the new version has the same name and the same contents (well, we hope), but a different inode, one that the NFS server doesn't consider to be locked.
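You can watch the inode change without involving NFS at all. The following is a rough illustration in a scratch directory; the inode_of helper and the file names are just for the demonstration. The renamed file should report the original inode number, while the fresh copy under the old name should report a different one.

```shell
# Illustration only, run in a scratch directory (no NFS needed).
# inode_of is a hypothetical helper that prints a file's inode number.
inode_of() { ls -i -- "$1" | awk '{print $1}'; }

dir=$(mktemp -d)
echo "some data" > "$dir/fileA"
before=$(inode_of "$dir/fileA")

mv "$dir/fileA" "$dir/fileA-stuck"      # the rename keeps the inode
cp -a "$dir/fileA-stuck" "$dir/fileA"   # the copy gets a new inode

renamed=$(inode_of "$dir/fileA-stuck")
after=$(inode_of "$dir/fileA")
echo "original inode: $before"
echo "fileA-stuck:    $renamed"
echo "new fileA:      $after"
```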

(When you delete the old file with its old inode, the server will generally drop the lock and even if it doesn't, you don't care any more; the file is inaccessible and no one will try to lock it by accident.)

These days, most people don't deal with NFS, and when they do it's with NFS v4, which has locking integrated into the core protocol (and as a result, I believe, has more reliable locking). This brute force workaround for stuck NFS v3 file locks is drifting toward cursed knowledge, if it isn't already there.

(We still have NFS v3 fileservers, so every so often this is relevant to us.)

Written on 21 April 2023.

Last modified: Fri Apr 21 21:42:34 2023