How we're making updated versions of a file rapidly visible on our Linux NFS clients

April 24, 2019

Part of our automounter replacement is a file with a master list of all NFS mounts that client machines should have, which we hold in our central administrative filesystem that all clients NFS mount. When we migrate filesystems from our old fileservers to our new fileservers, one of the steps is to regenerate this list with the old filesystem mount not present, then run a mount update on all of the NFS clients to actually unmount the filesystem from the old fileserver. For a long time, we almost always had to wait a while before all of the NFS clients would reliably see the new version of the NFS mounts file, which had the unfortunate effect of slowing down filesystem migrations.

(The NFS mount list is regenerated on the NFS fileserver for our central administrative filesystem, so the update is definitely known to the server once it's finished. Any propagation delays are purely on the side of the NFS clients, who are holding on to some sort of cached information.)
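Since the delay is purely client-side, one way to observe it from a client is to poll until the file's contents match what the server just wrote. The following is only a sketch of that idea; the function names, the use of a SHA-256 checksum, and the timeout values are all assumptions made for illustration and are not part of our actual tooling.

```python
import hashlib
import time


def file_sha256(path):
    """Return the SHA-256 hex digest of the file's current contents."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def wait_for_version(path, expected_sha256, timeout=60.0, interval=1.0):
    """Poll path until its checksum matches expected_sha256.

    Returns True once the expected version is visible, or False if it
    has not appeared by the time the timeout expires. The file is always
    checked at least once, even with timeout=0.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            if file_sha256(path) == expected_sha256:
                return True
        except OSError:
            # The file may be briefly missing or unreadable; just retry.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A check like this only tells you when a client has caught up; it does nothing to make the client notice the new file any sooner.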

In the past, I've made a couple of attempts to find a way to reliably get the NFS clients to see that there was a new version of the file by doing things like flock(1)'ing it before reading it. These all failed. Recently, one of my co-workers discovered a reliable way of making this work, which was to regenerate the NFS mount list twice instead of once. You didn't have to delay between the two regenerations; running them back to back was fine. At first this struck me as pretty mysterious, but then I came up with a theory for what's probably going on and why this makes sense.

You see, we update this file in an NFS-safe way that leaves the old version of the file around under a different name, so that programs on NFS clients that are reading it at the time don't have it yanked out from underneath them. As I understand it, Linux NFS clients cache the mapping from names in the filesystem to NFS filehandles for some amount of time, to reduce various sorts of NFS lookup traffic (now that I look, there is a discussion pointing to this in the nfs(5) manpage). When we do one regeneration of our nfs-mounts file, the filehandle that clients have cached for that name is still valid (and the file's attributes are basically unchanged); it's just that it now belongs to the file called nfs-mounts.bak instead of the new file called nfs-mounts. Client kernels are apparently still perfectly happy to use it, and so they read and use the old NFS mount information. However, when we regenerate the file twice, the second regeneration replaces nfs-mounts.bak in turn, so the original file is removed outright and the cached filehandle is no longer valid. My theory is that modern Linux kernels detect this situation and trigger some kind of revalidation that winds up with them looking up and using the correct nfs-mounts file (instead of, say, failing with an error).
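To make the sequence concrete, here is a minimal Python sketch of what such an update might look like, run against a local temporary directory rather than NFS. The regenerate() helper, the '.new' temporary name, and the hard-link step are assumptions for illustration, not the actual update script; only the nfs-mounts and nfs-mounts.bak names come from the description above. An open file descriptor stands in for a client's cached filehandle, and its link count shows when the original file has lost its last name.

```python
import os
import tempfile


def regenerate(path, contents):
    """Install new contents at path, keeping the previous version as path + '.bak'.

    Hypothetical stand-in for an NFS-safe update: the main name always
    refers to some complete file, and the previous file stays reachable.
    """
    bak = path + ".bak"
    tmp = path + ".new"
    with open(tmp, "w") as f:
        f.write(contents)
    if os.path.exists(path):
        if os.path.exists(bak):
            os.unlink(bak)   # the version before last loses its only name
        os.link(path, bak)   # the current version gains a second name
    os.replace(tmp, path)    # atomically point the main name at the new file


def demo():
    """Regenerate twice while holding the original file open; report link counts."""
    results = {}
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "nfs-mounts")
        regenerate(path, "version 1\n")
        # Stand-in for the filehandle a client cached for 'nfs-mounts'.
        fd = os.open(path, os.O_RDONLY)
        try:
            regenerate(path, "version 2\n")
            st = os.fstat(fd)
            results["links_after_one"] = st.st_nlink
            results["bak_is_original"] = os.stat(path + ".bak").st_ino == st.st_ino
            regenerate(path, "version 3\n")
            results["links_after_two"] = os.fstat(fd).st_nlink
            with open(path) as f:
                results["current"] = f.read()
        finally:
            os.close(fd)
    return results


if __name__ == "__main__":
    print(demo())
```

After one regeneration the original file still has one name (nfs-mounts.bak), so a handle to it remains perfectly usable. After the second it has none. On a local filesystem the open descriptor keeps the unlinked file readable, but an NFS server generally does not keep a file alive just because a client has cached its filehandle, which is why the handle would be expected to go stale there.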

(It feels ironic that apparently the way to make this work for us here in our NFS environment is to effectively update the file in an NFS-unsafe way for once.)

PS: All of our NFS clients here run either Ubuntu 16.04 or 18.04 with their stock (non-HWE) kernels, so various versions of what Ubuntu calls '4.4.0' (16.04) and '4.15.0' (18.04). Your mileage may vary on different kernels and in different Linux environments.

Written on 24 April 2019.
