Why NFS servers generally have a 'reply cache'

April 9, 2021

In the beginning, NFS operated over UDP, with each NFS request and each NFS reply in a separate UDP packet (possibly fragmented). UDP has the charming property that it can randomly drop arbitrary packets (and also reorder them). If UDP drops an NFS client's request to the server, the NFS client will resend it (a 'retransmit' in the jargon of NFS). If UDP drops the server's reply to a client's request, the client will also resend the request, because it can't really tell why it didn't get a reply; it just knows that it didn't.

(Since clients couldn't tell the difference between a sufficiently slow server and packet loss, they also reacted to slow servers by retransmitting their requests.)

A lot of NFS operations are harmless to repeat when the server's response is lost. For instance, repeating any operation that reads or looks up things simply gives the client the current version of the state of things; if this state is different than it was before, it's pretty much a feature that the client gets a more up to date version. However, some operations are very dangerous to repeat if the server response is lost, because the result changes in a bad way. For example, consider a client performing a MKDIR operation that it's using for locking. The first time, the client succeeds but the server's reply is lost; the second time, the client's request fails because the directory now exists, and the server's reply reaches the client. Now you have a stuck lock; the client has succeeded in obtaining the lock but thinks it failed and so nothing is ever going to release the lock.
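To make the stuck lock concrete, here is a minimal sketch of that sequence in Python. Everything in it is hypothetical: `ToyServer` and `client_mkdir` are made-up names for illustration, the "network" is just a flag saying the first reply was lost, and nothing here is a real NFS API.

```python
class ToyServer:
    """Hypothetical in-memory stand-in for an NFS server (no reply cache)."""

    def __init__(self):
        self.dirs = set()

    def mkdir(self, path):
        # MKDIR fails if the directory already exists, as in the text above.
        if path in self.dirs:
            return "EEXIST"
        self.dirs.add(path)
        return "OK"


def client_mkdir(server, path, drop_first_reply):
    """Send MKDIR; if the first reply is 'lost', retransmit the request once."""
    reply = server.mkdir(path)  # the server executes the request
    if drop_first_reply:
        # The client never saw that reply, so it resends the same request
        # and only ever sees the answer to the second execution.
        reply = server.mkdir(path)
    return reply


server = ToyServer()
result = client_mkdir(server, "/locks/job", drop_first_reply=True)
lock_exists = "/locks/job" in server.dirs

print(result)       # what the client believes happened
print(lock_exists)  # what actually happened on the server
```

The client is told `EEXIST` while the lock directory it created is sitting on the server, which is exactly the stuck lock.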

(This isn't the only way NFS file-based locking problems can happen.)

To try to work around this issue, NFS servers soon introduced the idea of a "reply cache", which caches the NFS server's reply to various operations that are considered dangerous for clients to repeat. The hope and the idea is that when a client resends such a request that the server has already handled, the server will find its reply in this cache and repeat it to the client. Of course this isn't a guaranteed cure, since the cache has a finite size (and I think it's usually not aware of other operations that might invalidate its answers).
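As a rough sketch of the idea (not how any particular NFS server implements it), a reply cache can be modeled as a small bounded map from a request identity to the reply that was sent. The key I use here, a client name plus a per-request transaction ID (`xid`) plus the operation and path, is my assumption for illustration, as are the `ReplyCache` and `CachingServer` names and the oldest-first eviction.

```python
from collections import OrderedDict


class ReplyCache:
    """Hypothetical bounded reply cache; evicts the oldest entry when full."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self.entries = OrderedDict()

    def get(self, key):
        return self.entries.get(key)

    def put(self, key, reply):
        self.entries[key] = reply
        self.entries.move_to_end(key)
        while len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)  # drop the oldest entry


class CachingServer:
    """Toy server that replays cached MKDIR replies for retransmits."""

    def __init__(self, cache_size):
        self.dirs = set()
        self.cache = ReplyCache(cache_size)

    def mkdir(self, client, xid, path):
        key = (client, xid, "MKDIR", path)  # assumed request identity
        cached = self.cache.get(key)
        if cached is not None:
            return cached  # retransmit: repeat the original reply
        reply = "EEXIST" if path in self.dirs else "OK"
        self.dirs.add(path)
        self.cache.put(key, reply)
        return reply


server = CachingServer(cache_size=2)
first = server.mkdir("clientA", 1, "/locks/job")
retransmit = server.mkdir("clientA", 1, "/locks/job")  # served from the cache

# Two more requests push the first entry out of the two-entry cache.
server.mkdir("clientA", 2, "/locks/b")
server.mkdir("clientA", 3, "/locks/c")
late_retransmit = server.mkdir("clientA", 1, "/locks/job")

print(first, retransmit, late_retransmit)
```

The prompt retransmit gets the original `OK` back, but the late one arrives after its entry has been evicted, so the server executes it again and answers `EEXIST`. That is the finite-size caveat in action.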

In the days of NFS over UDP and frequent packet loss and retransmits, the reply cache was very important. These days, NFS over TCP uses TCP retransmits below the level that the NFS server and client see, so sent server replies are very hard to lose and actual NFS level retransmissions are relatively infrequent (and I think they're more often from the client deciding that the server is too slow than from actual lost replies).

In past entries (eg on how NFS is unreliable for file-based locking), I've said that the reply cache is used for operations that aren't idempotent. This is not really correct. There are very few NFS operations that are truly idempotent if re-issued after a delay; a READDIR might see a new entry, for example, or READ could see updated data in a file. But these differences are not considered dangerous in the way that a MKDIR going from success to failure is, and so they are generally not cached in the reply cache in order to leave room for the operations where it really matters.
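Here is a small sketch of that policy, using only the operations named in this entry. The two sets are illustrative and are not the Linux nfsd table; `handle` and its arguments are hypothetical names, and the cache is just a plain dict with no size limit.

```python
# Illustrative split based on the operations mentioned in this entry.
CACHED_OPS = {"MKDIR"}                       # dangerous to repeat
UNCACHED_OPS = {"READ", "READDIR", "LOOKUP"}  # harmless to repeat


def handle(op, key, cache, execute):
    """Run execute() for a request, consulting the cache only for cached ops."""
    if op in CACHED_OPS and key in cache:
        return cache[key]  # replay the earlier reply
    reply = execute()
    if op in CACHED_OPS:
        cache[key] = reply  # only dangerous ops take up cache space
    return reply


cache = {}

# A repeated READ is simply re-executed and sees the newer data.
data = {"value": "old"}
first_read = handle("READ", ("clientA", 7), cache, lambda: data["value"])
data["value"] = "new"
repeat_read = handle("READ", ("clientA", 7), cache, lambda: data["value"])

# A repeated MKDIR gets its original reply back from the cache.
dirs = set()


def do_mkdir():
    if "/locks/job" in dirs:
        return "EEXIST"
    dirs.add("/locks/job")
    return "OK"


first_mkdir = handle("MKDIR", ("clientA", 8), cache, do_mkdir)
repeat_mkdir = handle("MKDIR", ("clientA", 8), cache, do_mkdir)

print(first_read, repeat_read, first_mkdir, repeat_mkdir, len(cache))
```

The repeated READ returns different data and nobody minds, while the repeated MKDIR is the only request that ends up occupying a cache slot.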

(Thus, the list of non-cached NFS v3 operations in the Linux kernel NFS server mostly isn't surprising. I do raise my eyebrows a little bit at COMMIT, since it may return an error. Hopefully the Linux NFS server ensures that a repeated COMMIT gets the same error again.)



Last modified: Fri Apr 9 22:01:48 2021