The general problem of losing network-based locks

November 5, 2024

There are many situations and protocols where you want to hold some sort of lock across a network between, generically, a client (who 'owns' the lock) and a server (who manages the locks on behalf of clients and maintains the locking rules). Because a network is involved, one of the broad problems that can happen in such a protocol is that the client can have a lock abruptly taken away from it by the server. This can happen because the server was instructed to break the lock, or the server restarted in some way and notified the clients that they had lost some or all of their locks, or perhaps there was a network partition that led to a lock timeout.

When the locking protocol and the overall environment are designed with this in mind, you can try to require clients to explicitly think about the possibility. For example, you can have an API that requires clients to register a callback for 'you lost a lock', or you can have specific error returns to signal this situation, or at the very least you can have an 'is this lock still valid' operation (or 'I'm doing this operation on something that I think I hold a lock for, give me an error if I'm wrong'). People writing clients can still ignore the possibility, just as they can ignore the possibility of other network errors, but at least you tried.
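As a rough illustration of those three API styles, here is a minimal in-memory sketch. Everything in it is hypothetical and made up for this example (LockClient, LockLostError, simulate_server_break and so on); it isn't the API of any real lock service, and there is no actual network involved.

```python
class LockLostError(Exception):
    """Raised when an operation needs a lock the client no longer holds."""


class LockClient:
    """Toy, in-memory stand-in for a client of a network lock service."""

    def __init__(self):
        self._held = set()
        self._on_lost = None

    def on_lock_lost(self, callback):
        # Style 1: the client registers a 'you lost a lock' callback.
        self._on_lost = callback

    def acquire(self, name):
        # A real client would talk to the server here; we just record it.
        self._held.add(name)

    def is_valid(self, name):
        # Style 3: an explicit 'is this lock still valid' operation.
        return name in self._held

    def write_locked(self, name, data):
        # Style 2: an operation that fails with a specific error
        # if we no longer hold the lock we think we hold.
        if name not in self._held:
            raise LockLostError(name)
        return len(data)

    def simulate_server_break(self, name):
        # Stands in for the server breaking the lock, restarting,
        # or timing the lock out after a network partition.
        if name in self._held:
            self._held.discard(name)
            if self._on_lost is not None:
                self._on_lost(name)
```

A client that never calls on_lock_lost() or is_valid() still runs into LockLostError on its next locked operation, which is about as much as an API can do to force the issue.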

However, network locking is sometimes added to things that weren't originally designed for it. One example is (network) filesystems. The basic 'filesystem API' doesn't really contemplate locking, and it especially doesn't consider that access to a 'file' can suddenly be taken away from you in mid-flight. If you add network locking you don't have a natural answer for handling lost locks and there's no obvious point in the API to add one, especially if you want to pretend that your network filesystem is the same as a local filesystem. This makes it much easier for people writing programs to not even think about the possibility of losing a network lock during operation.

(If you're designing a purely networked filesystem-like API, you have more freedom; for example, you can make locking operations turn a regular 'file descriptor' into a special 'locked file descriptor' that you have to do subsequent IO through and that will generate errors if the lock is lost.)
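A sketch of what such a 'locked file descriptor' might look like, assuming a hypothetical wrapper class (LockedFile) and a hypothetical lock_is_held callable that reports whether the lock is still valid. None of these names come from a real filesystem API.

```python
import io


class LostLockIOError(OSError):
    """IO was attempted through a locked handle whose lock is gone."""


class LockedFile:
    """Wraps an ordinary file object; all IO fails once the lock is lost."""

    def __init__(self, raw, lock_is_held):
        self._raw = raw
        # Hypothetical hook: a callable returning True while we hold the lock.
        self._lock_is_held = lock_is_held

    def _check(self):
        if not self._lock_is_held():
            raise LostLockIOError("lock lost; refusing IO")

    def write(self, data):
        self._check()
        return self._raw.write(data)

    def read(self, size=-1):
        self._check()
        return self._raw.read(size)


# Usage: IO works while the lock is held, then starts failing.
state = {"held": True}
f = LockedFile(io.BytesIO(), lambda: state["held"])
f.write(b"abc")
state["held"] = False   # the server took the lock away
# f.write(b"more") would now raise LostLockIOError
```

In this toy version the check and the IO are two separate steps, so the lock could vanish in between them. In a real protocol the server would presumably have to do the validation as part of the IO operation itself.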

One of the meta-problems with handling losing a network lock is that there's no single answer for what you should do about it. In some programs, you've violated an invariant and the only safe move for the program is to exit or crash. In some programs, you can pause operations until you can re-acquire the lock. In other programs you need to bail out to some sort of emergency handler that persists things in another way or logs what should have been done if you still held the lock. And when designing your API (or APIs) for losing locks, how likely you think each option is will influence what features you offer (and it will also influence how interested programs are in handling losing locks).
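To make the three broad responses concrete, here is a small sketch of a handler that dispatches on a policy. The names (LostLockPolicy, handle_lost_lock, the reacquire callable, the emergency_log list) are all invented for illustration, and a real program would likely do a lot more in each branch.

```python
import sys
import time
from enum import Enum


class LostLockPolicy(Enum):
    EXIT = "exit"                    # invariant violated; stop now
    WAIT_AND_REACQUIRE = "wait"      # pause until we have the lock again
    EMERGENCY_LOG = "log"            # persist what we would have done


def handle_lost_lock(policy, pending, reacquire, emergency_log, retry_delay=0.0):
    """Return the operations that should still be performed normally."""
    if policy is LostLockPolicy.EXIT:
        sys.exit("lost lock; cannot continue safely")
    if policy is LostLockPolicy.WAIT_AND_REACQUIRE:
        # Block until the (hypothetical) reacquire() call succeeds.
        while not reacquire():
            time.sleep(retry_delay)
        return pending
    if policy is LostLockPolicy.EMERGENCY_LOG:
        # Record the work elsewhere instead of doing it without the lock.
        emergency_log.extend(pending)
        return []
    raise ValueError(f"unknown policy: {policy!r}")
```

Which of these branches an API makes easy is exactly the design decision at issue; an API built around a callback fits the first and third options better than the second, for instance.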

PS: A contributing factor to programmers and programs not being interested in handling the loss of network locks is that such losses are generally somewhere between uncommon and rare. If lots of people are writing code to deal with your protocol and losing locks is uncommon enough, some of those people will just ignore the possibility, just like some programmers ignore the possibility of IO errors.



Last modified: Tue Nov 5 22:38:59 2024