In praise of crash (and kernel source)

January 11, 2010

You might wonder how we actually found our ext2 locking issue. That is a war story too, but fortunately it's a relatively short one.

First, we tried to use magic SysRq commands to dump various information. Let me save you some time: this doesn't work in any meaningful way. When you have several thousand processes, asking the kernel for a magic SysRq task dump is functionally equivalent to rebooting the machine, and sometimes it was literally equivalent. We extracted essentially no useful information this way, and rebooted several production servers in the process.

Instead we turned to the crash program (once we stumbled over it; see also here for a usage overview). Crash has many features, but the one that we cared about is that it has a command ('foreach bt') that will dump a kernel stack backtrace for every process, and since crash is just a user-mode program, running it didn't blow up the system the way using magic SysRq did. The output wasn't perfect, since other processes kept running and could change, appear, and disappear while crash was trying to gather all of this information, but when you have several hundred badly frozen processes, most of them aren't going to change and most of the time this worked out. (And when it didn't, we could just re-run the command.)

The raw crash kernel backtraces were not immediately useful, as there was too much noise and confusion (and a certain amount of things that were plain garbled). I wrote a set of programs to post-process and analyze them; these created simplified call traces, trimmed off internal locking routines, aggregated processes with identical call traces to give us counts of common paths, and analyzed what processes seemed to be doing based on the call traces.
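The actual scripts aren't shown here, but a minimal sketch of the simplify-and-aggregate step might look like the following Python. The frame regex, the sample 'foreach bt' output format, and the list of locking routines to trim are all my assumptions, not the real code:

```python
import re
import sys
from collections import Counter

# Internal locking and scheduling routines to trim out of each trace
# (an assumed list; the real scripts would have a longer one).
TRIM_FUNCS = {"schedule", "mutex_lock", "__mutex_lock_slowpath"}

# crash 'foreach bt' frames look roughly like:
#   #3 [ffff8800cf3f5d38] mutex_lock at ffffffff8104bc43
FRAME_RE = re.compile(r'^\s*#\d+\s+\[\w+\]\s+(\S+)\s+at\s')

def traces(lines):
    """Yield one simplified call trace (a tuple of functions) per process."""
    trace = []
    for line in lines:
        if line.startswith("PID:"):
            # Start of a new process's backtrace.
            if trace:
                yield tuple(trace)
            trace = []
            continue
        m = FRAME_RE.match(line)
        if m and m.group(1) not in TRIM_FUNCS:
            trace.append(m.group(1))
    if trace:
        yield tuple(trace)

if __name__ == "__main__":
    # Aggregate identical simplified traces to get counts of common paths.
    counts = Counter(traces(sys.stdin))
    for trace, n in counts.most_common():
        print(n, " <- ".join(trace))
```

In use this would be fed a saved transcript of crash's 'foreach bt' output, e.g. `python simplify.py < backtraces.txt`, printing the most common call paths first.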

This analysis caused certain things to jump out; there were a lot of processes waiting to acquire mutexes in only a few code paths. Figuring out what mutexes these code paths were trying to acquire took reading the kernel source, and then it took more reading to come up with a comprehensive list of what paths took these mutexes in order to do what, followed by the fun game of 'so, what could cause the holder of this mutex to block?'

All of this wound up pointing to ext2 inode allocation and deallocation blocking on disk IO in order to read bitmap blocks; in the traces we could see a whole lot of processes waiting to get the mutex, and one process that was actually waiting on disk IO to finish.
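As an illustration of this last analysis step, here is a hedged sketch of sorting simplified per-process call traces (innermost frame first) into 'waiting on a mutex' versus 'actually waiting on disk IO'. The specific kernel function names used as markers are assumptions about the code paths involved, not something taken from the real scripts:

```python
from collections import Counter

# Marker functions for each kind of wait (assumed names; a process
# blocked on a mutex sits in the mutex slowpath, while the one
# actually doing IO sits in an IO-wait routine instead).
MUTEX_WAIT = {"mutex_lock", "__mutex_lock_slowpath"}
DISK_WAIT = {"io_schedule", "wait_on_buffer", "sync_buffer"}

def classify(trace):
    """Classify one call trace by what it appears to be waiting on."""
    for func in trace:
        if func in DISK_WAIT:
            return "disk IO"
        if func in MUTEX_WAIT:
            return "mutex"
    return "other"

def summarize(all_traces):
    """Return counts of how many processes fall into each wait category."""
    return Counter(classify(t) for t in all_traces)
```

With traces like ours, this sort of summary shows many processes in the 'mutex' bucket and a single process in 'disk IO', which is exactly the smoking gun described above.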

(Well, if we got somewhat lucky with crash's timing we could see that; there was no guarantee that crash would look at the process doing IO before its IO completed. But one of the great things about using crash was that I could run it over and over and over again to get huge volumes of traces to do lots of analysis on, since it wasn't as if the problem was going away on me.)

Sidebar: what I'd really like (and what I hope exists)

Crash is a nice tool, but what would be better is something that looks directly at the mutex locking state in order to show what processes are waiting on what locks, what process holds each lock and what it's doing, and so on. Such a thing would have pointed pretty much directly at our problem, without the need for post-processing and analysis and so on.

(There's the promising looking lockstat, but it doesn't look like it's enabled in common kernels; it's CONFIG_LOCK_STAT if you want to build your own. I don't know what performance impacts turning it on has, but a slow system may be better than one that explodes.)
