Linux NFS client kernel tunable settings

September 18, 2007

We had a serious (lack of) performance issue today with a Linux NFS client machine, so I spent some time delving into the underdocumented world of kernel parameters that affect the Linux NFS client (not the NFS server, whose tunables are better documented).

(I am going to use sysctl(8) notation for kernel tunables.)

The major tunables are sunrpc.udp_slot_table_entries and sunrpc.tcp_slot_table_entries. These set the maximum number of outstanding NFS RPC requests allowed per (UDP or TCP) RPC connection; the default is 16, the minimum is 2, and the maximum is 128. I believe that this is effectively per NFS server (technically per NFS server IP address), because the kernel appears to reuse the same RPC connection for all NFS filesystems mounted from the same server IP address.

Unfortunately, existing RPC connections are not resized if you change the number of slot table entries; only connections created afterwards pick up the new value.
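
As a concrete illustration, here's a minimal Python sketch of inspecting and raising these sysctls by writing to /proc/sys directly (the moral equivalent of sysctl -w). It assumes a kernel with the sunrpc sysctls present, and writing requires root; the 128 is just the maximum from above:

    #!/usr/bin/env python3
    # Minimal sketch: inspect and raise the NFS RPC slot table sizes
    # by poking /proc/sys directly. Writing requires root, and
    # existing RPC connections keep their old size.
    SLOT_SYSCTLS = [
        "/proc/sys/sunrpc/udp_slot_table_entries",
        "/proc/sys/sunrpc/tcp_slot_table_entries",
    ]

    def show_slots():
        for path in SLOT_SYSCTLS:
            with open(path) as f:
                print(path, "=", f.read().strip())

    def set_slots(n):
        # Valid values run from 2 to 128 on the kernels discussed here.
        for path in SLOT_SYSCTLS:
            with open(path, "w") as f:
                f.write("%d\n" % n)

    show_slots()
    # set_slots(128)  # uncomment (and run as root) to raise the limit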

Contrary to what you might read in various places, changing net.core.[rw]mem_default and [rw]mem_max is not necessary and does not help. The kernel RPC client code directly sets its send and receive buffer sizes based on the read/write size and the number of slot table entries it has, and ignores [rw]mem_max in the process; rmem_max and wmem_max only limit the sizes that user-level code can set.
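
To see the user-level side of this for yourself, here's a quick sketch: ask for an oversized receive buffer and look at what you actually get. (Linux doubles the stored value to allow for bookkeeping overhead, so getsockopt() reports twice the effective setting.)

    #!/usr/bin/env python3
    # Sketch: rmem_max only clamps what user-level code can ask for.
    # Request a 16 MB receive buffer; if that exceeds net.core.rmem_max,
    # the kernel silently clamps the request. (Linux doubles the stored
    # value for bookkeeping, so getsockopt() reports twice the setting.)
    import socket

    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 16 * 1024 * 1024)
    print("effective SO_RCVBUF:",
          s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))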

(This does mean that if you set a high slot table size and mount from a lot of different NFS servers, you could possibly use up a decent amount of kernel memory with socket send buffers.)
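
As a back-of-the-envelope illustration (the kernel's real formula includes some per-request overhead, so treat these as approximations, and the server count and write size here are made-up numbers):

    # Rough estimate of socket send buffer memory for NFS mounts.
    # All of these numbers are assumptions for illustration.
    slots = 128            # sunrpc.*_slot_table_entries at the maximum
    wsize = 32 * 1024      # NFS write size, in bytes
    servers = 10           # distinct NFS server IP addresses
    per_conn = slots * wsize
    print("per connection: %d MB" % (per_conn // 2**20))             # 4 MB
    print("for all servers: %d MB" % (servers * per_conn // 2**20))  # 40 MB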

If you are doing NFS over UDP, as we are for some fileservers, you may want to check the value of net.ipv4.ipfrag_high_thresh, although I'm not sure what a good value is. I suspect that at a minimum it should be enough memory to reassemble a full-sized read from every different NFS fileserver at once.

(I believe this is a global amount of memory, not per connection or per fileserver, so it is safe to set it to several megabytes.)
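
Following that reasoning, the arithmetic for a lower bound is simple; the server count and read size here are assumptions to substitute for your own environment:

    # Rough lower bound for net.ipv4.ipfrag_high_thresh: enough to
    # reassemble one full-sized UDP read from every fileserver at once.
    # Both numbers here are assumptions.
    udp_servers = 8
    rsize = 32 * 1024      # NFS read size, in bytes
    print("minimum: %d KB" % (udp_servers * rsize // 1024))  # 256 KB
    # Since the limit is global, rounding up to several MB is safe.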

It's possible that you will also want to increase net.core.netdev_max_backlog, the maximum number of received network packets that can be queued for processing, because it kicks in before fragment reassembly. It's safest to consider it a global limit, although it's not quite that.

(It is actually a per-CPU queue limit, but you can't count on your network packet receive interrupts being spread across CPUs; they may all wind up being handled by the same CPU in a multi-CPU system.)
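
If you want to know whether this limit is actually biting you, the kernel keeps per-CPU counts of packets dropped because the input queue was full; here's a sketch of reading them out of /proc/net/softnet_stat (the fields are hex, and the second one is the drop count):

    #!/usr/bin/env python3
    # Sketch: check for input queue overflows. Each line of
    # /proc/net/softnet_stat is one CPU; the first hex field is
    # packets processed and the second is packets dropped because
    # netdev_max_backlog was exceeded.
    with open("/proc/net/softnet_stat") as f:
        for cpu, line in enumerate(f):
            fields = line.split()
            print("cpu%d: processed=%d dropped=%d"
                  % (cpu, int(fields[0], 16), int(fields[1], 16)))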
