In praise of
crash (and kernel source)
You might wonder how we actually found our ext2 locking issue. That is a war story too, but fortunately it's a relatively short one.
First, we tried to use magic SysRq commands to dump various information. Let me save you some time: this doesn't work in any meaningful way. When you have several thousand processes, asking the kernel for a magic SysRq task dump is functionally equivalent to rebooting the machine, and sometimes it was literally equivalent to it. We extracted essentially no useful information with this, and rebooted several production servers in the process.
Instead we turned to the
crash program (once we stumbled over it; see
for a usage overview). Crash has many features, but the one that
we cared about is that it has a command ('
foreach bt') that will
dump a kernel stack backtrace for every process, and since crash
is just a user mode program, running it didn't blow up the system
the way using magic SysRq did. The output wasn't perfect, since
other processes kept running and could change, appear, and disappear
while crash was trying to gather all of this information, but when
you have several hundred badly frozen processes, most of them aren't
going to change and most of the time it works out. (And when it
didn't, we could just re-run the command.)
The raw crash kernel backtraces were not immediately useful, as there was too much noise and confusion (and a certain amount of things that were plain garbled). I wrote a set of program to post-process and analyze them; these created simplified call traces, trimmed off internal locking routines, aggregated processes with identical call traces to give us counts of common paths, and analyzed what processes seemed to be doing based on the call traces.
This analysis caused certain things to jump out; there were a lot of processes waiting to acquire mutexes in only a few code paths. Figuring out what mutexes these code paths were trying to acquire took reading the kernel source, and then it took more reading to come up with a comprehensive list of what paths took these mutexes in order to do what, followed by the fun game of 'so, what could cause the holder of this mutex to block?'
All of this process wound up pointing to ext2 inode allocation and deallocation blocking on disk IO in order to read bitmap blocks; in the traces we could see a whole lot of processes waiting to get the mutex, and one process that was actually waiting on disk IO to finish.
(Well, if we got somewhat lucky with crash's timing we could see that; there was no guarantee that crash would look at the process doing IO before its IO completed. But one of the great things about using crash was that I could run it over and over and over again to get huge volumes of traces to do lots of analysis on, since it wasn't as if the problem was going away on me.)
Sidebar: what I'd really like (and what I hope exists)
Crash is a nice tool, but what would be better is something that looks directly at the mutex locking state in order to show what processes are waiting on what locks, what process holds each lock and what it's doing, and so on. Such a thing would have pointed pretty much directly at our problem, without the need for post-processing and analysis and so on.
(There's the promising looking lockstat, but it doesn't
look like it's enabled in common kernels; it's
you want to build your own. I don't know what
performance impacts turning it on has, but a slow system may be better
than one that explodes.)