The Prometheus host agent is missing some Linux NFSv4 RPC stats (as of 1.8.2)
Over on the Fediverse I said:
This is my face when the Prometheus host agent provides very incomplete monitoring of NFS v4 RPC operations on modern kernels that can likely hide problems. For NFS servers I believe that you get only NFS v4.0 ops, no NFS v4.1 or v4.2 ones. For NFS v4 clients things confuse me but you certainly don't get all of the stats as far as I can see.
When I wrote that Fediverse post, I hadn't peered far enough into the depths of the Linux kernel to be sure what was missing, but now that I understand the Linux kernel NFS v4 server and client RPC operations stats I can provide a better answer of what's missing. All of this applies to node_exporter as of version 1.8.2 (the current one as I write this).
(I now think 'very incomplete' is somewhat wrong, but not entirely so, especially on the server side.)
Importantly, what's missing is different for the server side and the client side, with the client side providing information on operations that the server side doesn't. This can make it very puzzling if you're trying to cross-compare two 'NFS RPC operations' graphs, one from a client and one from a server, because the client graph will show operations that the server graph doesn't.
In the host agent code, the actual stats are read from /proc/net/rpc/nfs and /proc/net/rpc/nfsd by a separate package, prometheus/procfs, and are parsed in nfs/parse.go. For the server case, if we cross compare this to the kernel's include/linux/nfs4.h, what's missing from server stats is all NFS v4.1, v4.2, and RFC 8276 xattr operations, everything from operation 40 through operation 75 (as I write this).
Because the Linux NFS v4 client stats are more confusing and aren't so nicely ordered, the picture there is more complex. The nfs/parse.go code handles everything up through 'Clone', and is missing from 'Copy' onward. However, both what it has and what it's missing are a mixture of NFS v4, v4.1, and v4.2 operations; for example, 'Allocate' and 'Clone' (both included) are v4.2 operations, while 'Lookupp', a v4.0 operation, is missing from client stats. If I'm reading the code correctly, the missing NFS v4 client operations are currently (using somewhat unofficial names):
Copy OffloadCancel Lookupp LayoutError CopyNotify Getxattr Setxattr Listxattrs Removexattr ReadPlus
Adding the missing operations to the Prometheus host agent would require updates to both prometheus/procfs (to add fields for them) and to node_exporter itself, to report the fields. The NFS client stats collector in collector/nfs_linux.go uses Go reflection to determine the metrics to report and so needs no updates, but the NFS server stats collector in collector/nfsd_linux.go directly knows about all 40 of the current operations and so would need code updates, either to add the new fields or to switch to using Go reflection.
If you want numbers for scale, at the moment node_exporter reports on 50 out of 69 NFS v4 client operations, and is missing 36 NFS v4 server operations (reporting on what I believe is 36 out of 72). My ability to decode what the kernel NFS v4 client and server code is doing is limited, so I can't say exactly how these operations match up and, for example, what client operations the server stats are missing.
(I haven't made a bug report about this (yet) and may not do so, because doing so would require making my Github account operable again, something I'm sort of annoyed by. Github's choice to require me to have MFA to make bug reports is not the incentive they think it is.)
|
|