A brief mention of some tools for debugging Linux NFS client issues

April 24, 2008

Someone here recently asked for tips on debugging a mysterious Linux NFS client hang. I didn't have any answers, but I did happen to know where to look for some Linux-specific tools. (The person had already exhausted what tools like tcpdump could tell them.)

The most obvious tool is the magic SysRq key, which can dump the kernel call stacks of all processes (the t command). Once you find the hanging processes in all of the output, you can usually see what operations they're hanging on, both high level and somewhat low level.
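If you aren't at the console, the same dump can usually be requested by writing the command letter to /proc/sysrq-trigger as root. Here is a minimal sketch of that; the function name and the trigger_path parameter are my own inventions for illustration, and the output still lands in the kernel log, where you have to dig it out yourself.

```python
def trigger_sysrq(command, trigger_path="/proc/sysrq-trigger"):
    """Ask the kernel to run one magic SysRq command, e.g. 't' for task stacks.

    trigger_path is a hypothetical parameter so the function can be pointed
    at a scratch file for testing; on a real system it needs root.
    """
    if len(command) != 1:
        raise ValueError("a SysRq command is a single character")
    # The kernel acts on the character written; results go to the kernel log.
    with open(trigger_path, "w") as trigger:
        trigger.write(command)
```

Calling trigger_sysrq("t") should be equivalent to the t command; reading the results is then a matter of looking at dmesg or wherever your kernel messages are logged.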

(Here's where I observe that it's a pity that there's no way to ask for a magic SysRq dump of a specific process. Hopefully someone will now tell me that I'm wrong.)

The Linux NFS client also has its own debugging hooks, accessible through /proc/sys/sunrpc; unfortunately, they're rather underdocumented and magical. What you want are the files rpc_debug and nfs_debug, each of which is a bitmap of flags that control which RPC or NFS events get logged; you write a decimal integer to them to set the bitmap's value, or a 0 to turn off all logging.
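As a rough sketch of what setting these bitmaps looks like in practice (it has to be done as root), here is a small Python helper. The function name and the base_dir parameter are hypothetical, added only so the sketch can be tried against a scratch directory; the file names are the ones mentioned above.

```python
import os

# The two client-side debug files discussed in this entry.
DEBUG_FILES = ("rpc_debug", "nfs_debug")


def set_debug_bitmap(name, value, base_dir="/proc/sys/sunrpc"):
    """Write a decimal bitmap value to one of the sunrpc debug files.

    A value of 0 turns logging off for that file. base_dir is a
    hypothetical parameter for testing; normally it is /proc/sys/sunrpc.
    """
    if name not in DEBUG_FILES:
        raise ValueError("unknown debug file: %r" % (name,))
    if value < 0:
        raise ValueError("the bitmap value must not be negative")
    with open(os.path.join(base_dir, name), "w") as debug_file:
        # The file takes a decimal integer, so format it as one.
        debug_file.write("%d\n" % value)
```

For example, set_debug_bitmap("nfs_debug", 0) would turn all NFS client logging back off.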

(In addition, writing any number to rpc_debug will give you a cryptic dump of RPC 'task' information. Having just read through a bunch of kernel source code, my opinion is that there is almost nothing useful in it unless you are a kernel hacker. If you really want this dump and nothing else, write a 0 to rpc_debug.)

The values of the individual flags are defined in the kernel source, in include/linux/sunrpc/debug.h (the RPCDBG_ #defines) and include/linux/nfs_fs.h (the NFSDBG_ #defines). You can write a suitably large value like 32767 to turn everything on.
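Since the files want a single decimal number, you have to OR the flags you care about together yourself. The sketch below shows the arithmetic. The two individual flag values are what I believe debug.h defines, but treat them as assumptions and check them against the headers for your own kernel version; combine_flags is a made-up helper name.

```python
# Assumed values; verify against include/linux/sunrpc/debug.h for your kernel.
RPCDBG_XPRT = 0x0001
RPCDBG_CALL = 0x0002
# 0x7fff is the same number as the decimal 32767 mentioned above.
RPCDBG_ALL = 0x7FFF


def combine_flags(*flags):
    """OR individual debug flags together into one bitmap value."""
    mask = 0
    for flag in flags:
        mask |= flag
    return mask


def decimal_for_proc(mask):
    """Format a bitmap as the decimal string the debug files expect."""
    return "%d" % mask
```

With these assumed values, decimal_for_proc(combine_flags(RPCDBG_XPRT, RPCDBG_CALL)) gives "3", which is what you would write to rpc_debug to log only those two kinds of events.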

Note that this can produce a lot of kernel messages very fast, especially if you turn on lots of things. Also, one of the big reasons this stuff is not documented is that it is primarily intended for kernel hackers, so to understand the results you may need to go dig in the kernel NFS and RPC code (in fs/nfs and net/sunrpc respectively).

(There are similar debug files for the NFS server and for the NLM. Exploring these is left as an exercise for the reader.)

Written on 24 April 2008.
