2021-04-09
Why NFS servers generally have a 'reply cache'
In the beginning, NFS operated over UDP, with each NFS request and
each NFS reply in a separate UDP packet (possibly fragmented). UDP
has the charming property that it can randomly drop arbitrary packets
(and also reorder them). If UDP drops an NFS client's request to the
server, the NFS client will resend it (a
'retransmit' in the jargon of NFS). If UDP drops the server's reply
to a client's request, the client will also resend the request,
because it can't really tell why it didn't get a reply; it just
knows that it didn't.
(Since clients couldn't tell the difference between a sufficiently slow server and packet loss, they also reacted to slow servers by retransmitting their requests.)
A lot of NFS operations are harmless to repeat when the server's response is lost. For instance, repeating any operation that reads or looks up things simply gives the client the current version of the state of things; if this state is different than it was before, it's pretty much a feature that the client gets a more up to date version. However, some operations are very dangerous to repeat if the server response is lost, because the result changes in a bad way. For example, consider a client performing a MKDIR operation that it's using for locking. The first time, the client succeeds but the server's reply is lost; the second time, the client's request fails because the directory now exists, and the server's reply reaches the client. Now you have a stuck lock; the client has succeeded in obtaining the lock but thinks it failed and so nothing is ever going to release the lock.
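To make the MKDIR-for-locking pattern concrete, here is a minimal sketch of what such a client-side lock attempt looks like (the lock directory path is made up purely for illustration):

    /* A minimal sketch of mkdir()-based locking; the lock path is hypothetical. */
    #include <sys/stat.h>
    #include <errno.h>
    #include <stdio.h>

    int main(void)
    {
        if (mkdir("/nfs/shared/app.lock", 0700) == 0) {
            printf("lock acquired\n");
            /* ... do the protected work, then rmdir() the directory to release ... */
        } else if (errno == EEXIST) {
            /* Someone else holds the lock -- or our own earlier MKDIR
               succeeded on the server and only the reply was lost. */
            printf("lock busy\n");
        } else {
            perror("mkdir");
        }
        return 0;
    }

If the reply to the first MKDIR is lost and the retransmitted MKDIR comes back with EEXIST, a client like this concludes the lock is busy even though it actually owns it.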
(This isn't the only way NFS file-based locking problems can happen.)
To try to work around this issue, NFS servers soon introduced the idea of a "reply cache", which caches the NFS server's reply to various operations that are considered dangerous for clients to repeat. The hope and the idea is that when a client resends such a request that the server has already handled, the server will find its reply in this cache and repeat it to the client. Of course this isn't a guaranteed cure, since the cache has a finite size (and I think it's usually not aware of other operations that might invalidate its answers).
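As a rough sketch of the idea (this is not the Linux kernel's actual code), a reply cache entry needs to remember enough to recognize a retransmitted request and to replay the reply that was already sent:

    /* A simplified sketch of a reply cache entry; real implementations differ in detail. */
    #include <stdint.h>
    #include <stddef.h>
    #include <time.h>
    #include <sys/socket.h>

    struct reply_cache_entry {
        struct sockaddr_storage client; /* which client sent the request */
        uint32_t xid;                   /* the RPC transaction ID the client used */
        uint32_t proc;                  /* the NFS procedure (MKDIR, RENAME, ...) */
        void *reply;                    /* the encoded reply we already sent */
        size_t reply_len;
        time_t stamp;                   /* so old entries can be aged out */
    };

    /* On each incoming 'dangerous' request, the server looks it up by
       (client, xid, proc); on a hit it resends the saved reply instead
       of re-executing the operation. */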
In the days of NFS over UDP, with frequent packet loss and retransmits, the reply cache was very important. These days, NFS over TCP handles retransmission at the TCP level, below what the NFS server and client see, so a server reply that has been sent is very hard to lose and actual NFS level retransmissions are relatively infrequent (and I think they're more often from the client deciding that the server is too slow than from actual lost replies).
In past entries (eg on how NFS is unreliable for file-based locking), I've said that this is done for operations that aren't idempotent. This is not really correct. There are very few NFS operations that are truly idempotent if re-issued after a delay; a READDIR might see a new entry, for example, or a READ could see updated data in a file. But these differences are not considered dangerous in the way that a MKDIR going from success to failure is, and so such operations are generally not cached in the reply cache, in order to leave room for the operations where it really matters.
(Thus, the list of non-cached NFS v3 operations in the Linux kernel NFS server mostly isn't surprising. I do raise my eyebrows a little bit at COMMIT, since it may return an error. Hopefully the Linux NFS server ensures that a repeated COMMIT gets the same error again.)
What NFSv3 operations can be in the Linux nfsd reply cache
The Linux kernel NFS server (nfsd) provides a number of statistics
in /proc/net/rpc/nfsd
, which are often then exposed by metrics
agents such as the Prometheus host agent. One guide to what
overall information is in this rpc/nfsd file is SvennD's nfsd
stats explained. The
first line is for the "reply cache", which caches replies to NFS
requests so they can be immediately sent back if a duplicate request
comes in. The three numbers provided for the reply cache are cache
hits, cache misses, and the number of requests that aren't cacheable
in the first place. A common explanation of this 'nocache' number is,
well, I will quote the Prometheus host agent's help text for its version
of this metric:
# HELP node_nfsd_reply_cache_nocache_total Total number of NFSd Reply Cache non-idempotent operations (rename/delete/…).
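For reference, the reply cache line is the first line of /proc/net/rpc/nfsd and looks like this (the numbers here are invented purely for illustration); the three fields are hits, misses, and nocache, in that order:

    rc 0 56712 1616810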
Knowing how many renames, deletes, and so on were going on seemed like a good idea, so I put a graph of this (and some other nfsd RPC numbers) into one of our Grafana dashboards. To my slowly developing surprise, generally almost all of the requests to the NFS servers I was monitoring fell into this 'nocache' category (which was also SvennD's experience, recounted in their entry). So I decided to find out what NFSv3 operations were cacheable and which ones weren't. The answer was surprising.
For NFSv3 operations, the answer is in the big nfsd_procedures3
array at the end of fs/nfsd/nfs3proc.c.
Operations with a pc_cachetype
of RC_NOCACHE
aren't cacheable;
entries with other values are. The non-cacheable NFSv3 operations
are:
access commit fsinfo fsstat getattr lookup null pathconf read readdir readdirplus readlink
The cacheable ones are:
create link mkdir mknod remove rename rmdir setattr symlink write
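For illustration, the corresponding entries in nfsd_procedures3 look something like the following (heavily abridged and paraphrased; the exact fields and function names vary between kernel versions, so treat this as a sketch rather than a quote of the source):

    /* Sketch of fs/nfsd/nfs3proc.c entries, abridged to the field that matters here. */
    [NFS3PROC_GETATTR] = {
        .pc_func      = nfsd3_proc_getattr,
        .pc_cachetype = RC_NOCACHE,   /* read operation: not in the reply cache */
        /* ... */
    },
    [NFS3PROC_MKDIR] = {
        .pc_func      = nfsd3_proc_mkdir,
        .pc_cachetype = RC_REPLBUFF,  /* cache the full encoded reply */
        /* ... */
    },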
NFS v3 operations that read information are not cacheable in the reply cache and show up in the 'nocache' category, while NFS v3 operations that change the filesystem mostly are cacheable.
Contrary to what you might guess from the Prometheus host agent's help text and various other sources, the non-cacheable NFS v3 operations aren't things like RENAME and CREATE; instead they are the NFSv3 operations that just read things from the filesystem (with the exception of COMMIT). In particular, GETATTR is an extremely frequent operation, so it's no wonder that most of the time the 'nocache' category dominated in my stats.
If you want to track the number of creates, writes, and so on, what you want to track is the number of misses to the reply cache. Tracking the 'nocache' number tells you how many read operations are happening.
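In Prometheus terms, assuming the host agent's nfsd metric names follow the pattern of the help text quoted earlier, that means graphing something like the following (the first expression approximates modifying operations, the second read operations):

    rate(node_nfsd_reply_cache_misses_total[5m])
    rate(node_nfsd_reply_cache_nocache_total[5m])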
(All of this makes sense once you understand why the reply cache is (or was) necessary, which is for another entry. I actually knew this as background NFS protocol knowledge, but I didn't engage that part of my memory when I was putting together the Grafana graph and had that tempting help text staring me in the face.)