Linux NFS client kernel tunable settings
We had a serious (lack of) performance issue today with a Linux NFS client machine, so I spent some time delving into the underdocumented world of kernel parameters that affect the Linux NFS client (not the NFS server, which has better-documented tunables).
(I am going to use
sysctl(8) notation for kernel tunables.)
The major tunables are
sunrpc.tcp_slot_table_entries and sunrpc.udp_slot_table_entries. These
set the maximum number of outstanding NFS RPC requests allowed per (TCP
or UDP, respectively) RPC connection; the default is 16 and the maximum
is 128 (and the minimum is 2). I believe that this
is effectively per NFS server (technically per NFS server IP address),
because it appears that the kernel reuses the same RPC connection for
all NFS filesystems mounted from the same server IP address.
Unfortunately existing RPC connections are not resized if you change the number of slot table entries.
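As a concrete sketch of how you might change this (assuming your kernel
builds sunrpc as a module and exposes these settings as module
parameters, which is the common case as far as I know):

    # takes effect for RPC connections created from now on;
    # existing connections keep their old slot table size
    sysctl -w sunrpc.tcp_slot_table_entries=128
    sysctl -w sunrpc.udp_slot_table_entries=128

    # to make it persistent and applied before any NFS mounts,
    # set it as sunrpc module options (the filename is arbitrary):
    echo 'options sunrpc tcp_slot_table_entries=128 udp_slot_table_entries=128' >/etc/modprobe.d/sunrpc.conf

Because existing connections are not resized, you want this set before
your NFS filesystems get mounted (or you get to unmount and remount
them).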
Contrary to what you might read in various places, changing
[rw]mem_max is not necessary and
does not help. The kernel RPC client code directly sets its send and
receive buffer sizes based on the read/write size and the number of
slot table entries it has, and ignores
[rw]mem_max in the process;
[rw]mem_max only limits the socket buffer sizes that user-level code can ask for with setsockopt().
(This does mean that if you set a high slot table size and mount from a lot of different NFS servers, you could possibly use up a decent amount of kernel memory with socket send buffers.)
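To put a very rough number on that: with 128 slot table entries and a
32 KB wsize (both made-up but plausible values), the send buffer for a
single server's RPC connection works out to something on the order of

    # 128 slots * 32 KiB per write, ignoring per-request RPC overhead
    echo $((128 * 32 * 1024))    # 4194304, i.e. about 4 MB

so a machine that mounts from dozens of servers could tie up a
noticeable chunk of kernel memory this way.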
If you are doing NFS over UDP, as we are for some fileservers, you may
want to check the value of
net.ipv4.ipfrag_high_thresh (the maximum amount of memory used to
reassemble fragmented IP packets), but I'm not sure what a good value
would be. I suspect that at a minimum it should be enough memory to
reassemble a full-sized read from every different NFS fileserver at
once.
(I believe this is a global amount of memory, not per connection or per fileserver, so it is safe to set it to several megabytes.)
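As an illustration, with (say) ten UDP fileservers and a 32 KB rsize
(made-up numbers again), that minimum is only 320 KB, so a few
megabytes gives you lots of headroom:

    sysctl net.ipv4.ipfrag_high_thresh              # check the current value
    sysctl -w net.ipv4.ipfrag_high_thresh=4194304   # 4 MB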
It's possible that you will also want to increase
net.core.netdev_max_backlog, the maximum number of received network
packets that can be queued for processing, because it kicks in before
fragment reassembly. It's safest to consider it a global limit, although
it's not quite that.
(It is a per-CPU queue limit, but you can't be sure that all of your network packet receive interrupts won't wind up being handled by the same CPU in a multi-CPU system).
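Checking and raising it is straightforward (the 2500 here is an
arbitrary example value, not a recommendation; the usual default is
1000 on kernels I've looked at):

    sysctl net.core.netdev_max_backlog
    sysctl -w net.core.netdev_max_backlog=2500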