How we're making updated versions of a file rapidly visible on our Linux NFS clients
Part of our automounter replacement is a file with a master list of all NFS mounts that client machines should have, which we hold in our central administrative filesystem that all clients NFS mount. When we migrate filesystems from our old fileservers to our new fileservers, one of the steps is to regenerate this list with the old filesystem mount not present, then run a mount update on all of the NFS clients to actually unmount the filesystem from the old fileserver. For a long time, we almost always had to wait a bit of time before all of the NFS clients would reliably see the new version of the NFS mounts file, which had the unfortunate effect of slowing down filesystem migrations.
(The NFS mount list is regenerated on the NFS fileserver for our central administrative filesystem, so the update is definitely known to the server once it's finished. Any propagation delays are purely on the side of the NFS clients, which are holding on to some sort of cached information.)
In the past, I've made a couple of attempts to find a way to reliably get the NFS clients to see that there was a new version of the file by doing things like flock(1)'ing it before reading it. These all failed. Recently, one of my co-workers discovered a reliable way of making this work, which was to regenerate the NFS mount list twice instead of once. You didn't have to delay between the two regenerations; running them back to back was fine. At first this struck me as pretty mysterious, but then I came up with a theory for what's probably going on and why this makes sense.
You see, we update this file in an NFS-safe way that leaves the old version of the file around under a different name, so that programs on NFS clients that are reading it at the time don't have it yanked out from underneath them.
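To make the update scheme concrete, here is a minimal sketch of one way such an NFS-safe replacement could be done. This is an illustration under assumptions, not our actual script: the function name regenerate, the temporary file prefix, and the exact sequence of link and rename calls are all hypothetical, and only the '.bak' naming comes from how the file is described here.

```python
import os
import tempfile


def regenerate(path, new_contents):
    """Hypothetical sketch: install new_contents at path, keeping the
    previous version reachable as path + '.bak'."""
    backup = path + ".bak"
    directory = os.path.dirname(os.path.abspath(path))

    # Write the new version under a temporary name in the same directory,
    # so the final rename stays within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".new-")
    with os.fdopen(fd, "w") as f:
        f.write(new_contents)
        f.flush()
        os.fsync(f.fileno())
    os.chmod(tmp_path, 0o644)  # mkstemp creates the file as 0600

    if os.path.exists(path):
        # Drop the previous backup. If this was its last name, that file
        # is now removed outright.
        if os.path.exists(backup):
            os.unlink(backup)
        # Give the current version a second name, so the file itself
        # survives the rename below.
        os.link(path, backup)

    # Atomically point the real name at the new version.
    os.replace(tmp_path, path)
```

The point of the hard link is that the file a reader already has open (or has a filehandle for) still exists after one update. It only goes away on the following update, when its '.bak' name is removed.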
As I understand it, Linux NFS clients cache the mapping from file names to NFS filehandles for some amount of time, to reduce various sorts of NFS lookup traffic (now that I look, there is a discussion pointing to this in the nfs(5) manpage). When we do one regeneration of our nfs-mounts file, the cached filehandle that clients have for that name mapping is still valid (and the file's attributes are basically unchanged); it's just that it's for the file that is now nfs-mounts.bak instead of the new file that is nfs-mounts. Client kernels are apparently still perfectly happy to use it, and so they read and use the old NFS mount information. However, when we regenerate the file twice, the old file is removed outright (the second regeneration replaces nfs-mounts.bak) and the cached filehandle is no longer valid. My theory and assumption is that modern Linux kernels detect this situation and trigger some kind of revalidation that winds up with them looking up and using the correct nfs-mounts file (instead of, say, failing with an error).
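The theory is easier to follow as a toy model. The sketch below is not how the kernel NFS client works internally, and it ignores the fact that real caches expire after a while. It only models the behaviour I'm assuming: a client that keeps using a cached name-to-filehandle mapping while the filehandle is still valid, and looks the name up again once the server reports it stale. All of the class and method names are made up for illustration.

```python
class StaleHandle(Exception):
    """Stand-in for the server reporting a stale filehandle."""


class ToyServer:
    def __init__(self):
        self.names = {}   # file name -> filehandle
        self.files = {}   # filehandle -> contents
        self._next_fh = 1

    def regenerate(self, name, contents):
        backup = name + ".bak"
        if name in self.names:
            old_backup = self.names.get(backup)
            # The current version lives on under the backup name.
            self.names[backup] = self.names[name]
            if old_backup is not None:
                # The previous backup has lost its last name: it is gone.
                del self.files[old_backup]
        fh = self._next_fh
        self._next_fh += 1
        self.files[fh] = contents
        self.names[name] = fh

    def lookup(self, name):
        return self.names[name]

    def read(self, fh):
        if fh not in self.files:
            raise StaleHandle(fh)
        return self.files[fh]


class ToyClient:
    def __init__(self, server):
        self.server = server
        self.cache = {}   # file name -> cached filehandle

    def read(self, name):
        if name not in self.cache:
            self.cache[name] = self.server.lookup(name)
        try:
            return self.server.read(self.cache[name])
        except StaleHandle:
            # Assumed behaviour: revalidate the name and retry.
            self.cache[name] = self.server.lookup(name)
            return self.server.read(self.cache[name])


server = ToyServer()
client = ToyClient(server)
server.regenerate("nfs-mounts", "old list")
print(client.read("nfs-mounts"))   # old list (and the handle is now cached)
server.regenerate("nfs-mounts", "new list")
print(client.read("nfs-mounts"))   # old list: cached handle is now the .bak file
server.regenerate("nfs-mounts", "new list")
print(client.read("nfs-mounts"))   # new list: cached handle went stale
```

In this model, a single regeneration never invalidates what the client has cached, which matches what we saw, while the second one always does.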
(It feels ironic that apparently the way to make this work for us here in our NFS environment is to effectively update the file in an NFS-unsafe way for once.)
PS: All of our NFS clients here are using either Ubuntu 16.04 or 18.04, using their stock (non-HWE) kernels, so various versions of what Ubuntu calls '4.4.0' (16.04) and '4.15.0' (18.04). Your mileage may vary on different kernels and in different Linux environments.