2025-02-26
I'm not impressed by the state of NFS v4 in the Linux kernel
Although NFS v4 is (in theory) the latest great thing in NFS protocol versions, for a long time we only used NFS v3 for our fileservers and our Ubuntu NFS clients. A few years ago we switched to NFS v4 due to running into a series of problems our people were experiencing with NFS (v3) locks (cf); since NFS v4 locks are integrated into the protocol and NFS v4 is the 'modern' NFS version that's probably receiving more attention than anything to do with NFS v3.
(NFS v4 locks are handled relatively differently than NFS v3 locks.)
Moving to NFS v4 did fix our NFS lock issues in that stuck NFS locks went away, when before they'd been a regular issue on our IMAP server. However, all has not turned out to be roses, and the result has left me not really impressed with the state of NFS v4 in the Linux kernel. In Ubuntu 22.04's 5.15.x server kernel, we've now run into scalability issues in both the NFS server (which is what sparked our interest in how many NFS server threads to run and what NFS server threads do in the kernel), and now in the NFS v4 client (where I have notes that let me point to a specific commit with the fix).
(The NFS v4 server issue we encountered may be the one fixed by this commit.)
What our two issues have in common is that both are things that you only find under decent or even significant load. That these issues both seem to have still been present as late as kernels 6.1 (server) and 6.6 (client) suggests that neither the Linux NFS v4 server nor the Linux NFS v4 client had been put under serious load until then, or at least not by people who could diagnose their problems precisely enough to identify the problem and get kernel fixes made. While both issues are probably fixed now, their past presence leaves me wondering what other scalability issues are lurking in the kernel's NFS v4 support, partly because people have mostly been using NFS v3 until recently (like us).
We're not going to go back to NFS v3 in general (partly because of the clear improvement in locking), and the server problem we know about has been wiped away because we're moving our NFS fileservers to Ubuntu 24.04 (and some day the NFS clients will move as well). But I'm braced for further problems, including ones in 24.04 that we may be stuck with for a while.
PS: I suspect that part of the issues may come about because the Linux NFS v4 client and the Linux NFS v4 server don't add NFS v4 operations at the same time. As I found out, the server supports more operations than the client uses but the client's use is of whatever is convenient and useful for it, not necessarily by NFS v4 revision. If the major use of Linux NFS v4 servers is with v4 clients, this could leave the server implementation of operations under-used until the client starts using them (and people upgrade clients to kernel versions with that support).