The importance of figuring out low-level symptoms of problems

April 12, 2010

Suppose that you have an IMAP server; it has mirrored local system disks, a bunch of memory, and a data filesystem (where the mailboxes are) in a RAID-10 array provided by a SAN. One day, it starts falling over unpredictably; the load average goes to the many hundreds, IMAP service times go into the toilet, and eventually the machine has to be force-booted. But this isn't consistent, and when it happens it happens very rapidly, going from a normal tiny load average to a load average of hundreds in a few minutes.

Believe it or not, this is a very high-level and abstract description of your issue, although it may sound quite precise. It clearly doesn't tell you what's wrong or what you need to do to fix things. What is not necessarily obvious until you've been through this a few times is that one of the important steps in solving such a problem is finding its lower-level symptoms (and, in the process, finding out which lower-level things are unrelated).

Finding lower-level symptoms has several important effects. First, it gives you good diagnostics to determine when the problem is happening. High-level diagnostics are necessarily broad and unspecific. They can lag behind when the problem actually starts, and making them manifest can often depend on your entire production environment, which makes them hard to use in test scenarios.

Next, it gives you a good way to measure if you've reproduced the problem in artificial test scenarios. Anyone can drive an IMAP server into the ground with sufficient load; the trick is to be sure that you're driving it into the ground in the same way that your production environment is being driven into the ground. (Even if you are running test simulations using captured trace data, you don't know for sure.)

Finally, it gives you a good lead on tracking down why the problem is happening. Now that you know a lower-level symptom or two, you can start asking focused why questions to figure out how the symptom comes about, and it becomes sensible to dig into detailed trace data, kernel source, and so on. For example, if the time it takes to touch and remove a file is a big indicator of the problem, you can now start looking at what can make that slow.
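To make the touch-and-remove example concrete, here is a minimal sketch of such a probe in Python. The function names, the sample count, and the one-second threshold are all hypothetical choices for illustration, not something taken from a real incident; you would point it at a directory on the filesystem you suspect.

```python
import os
import tempfile
import time


def time_touch_and_remove(directory):
    """Return the seconds taken to create and then remove one empty file."""
    start = time.monotonic()
    # mkstemp creates a uniquely named file, so concurrent probes don't collide.
    fd, path = tempfile.mkstemp(dir=directory, prefix="probe-")
    os.close(fd)
    os.unlink(path)
    return time.monotonic() - start


def probe(directory, samples=5, threshold=1.0):
    """Take several samples; return (slowest time, whether it exceeded threshold).

    The default threshold of 1.0 seconds is an arbitrary placeholder.
    """
    timings = [time_touch_and_remove(directory) for _ in range(samples)]
    worst = max(timings)
    return worst, worst > threshold
```

Run from cron or a loop with timestamps logged, something like this can tell you when the low-level symptom starts, which is often earlier than when the load average takes off.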

Without those low-level symptoms, you can spend weeks going around in circles, running cross-correlations against every statistic that you can gather, trying artificial test after artificial test to see if you can reproduce something that looks like the problem, and descending to guesswork and superstition in making system changes to see if they fix the problem (e.g., 'maybe adding more memory will do it').

Of course, some amount of this activity is useful for actually finding those low-level symptoms, but don't lose sight of your first goal amid the yelling and the looking. In particular, I've come to feel that trying to artificially reproduce the problem is mostly a waste of time until you actually understand what the problem is; you need a good diagnosis of what is really going on first.

(Honesty compels me to admit that this is pretty hard to carry off when people are yelling in your ear because your IMAP server is a smoking crater in the ground on a regular basis and they need a fix now.)

Sidebar: the problem with general why questions

One of the ways of working on this overall problem is to ask why questions: why is IMAP response time slow? When someone issues an IMAP command, what operations that the IMAP server does are taking so long? Why is the load average high?

The problem with why questions is that they often either run you into dead ends or rapidly become extremely difficult and complex to get the answers to. The load average is high because you have hundreds of processes in disk wait despite iostat reporting relatively normal numbers; a single IMAP operation makes a thousand system calls, and there's no clear pattern as to which ones take 'too long' given that the system has a load average in the hundreds. Asking lots of questions (and getting lots of answers) is a distraction, because it leaves you with the job of picking through the clutter to find the important bits.

(Having said that, there is an important clue in this list of symptoms. You could get interesting results from asking 'so, why are processes in disk wait despite good iostat numbers?'.)
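One way to turn 'processes in disk wait' into a number you can log next to the iostat figures is to count processes in uninterruptible sleep. The sketch below assumes a Linux system with /proc mounted, which is my assumption and not something stated above; the function names are hypothetical. It relies on the process state being the first field after the parenthesized command name in /proc/<pid>/stat, with 'D' meaning uninterruptible (usually disk) wait.

```python
import os


def state_from_stat(stat_text):
    """Extract the one-letter process state from /proc/<pid>/stat contents."""
    # The command name sits in parentheses and may itself contain spaces
    # or ')', so split on the last ')' rather than on whitespace.
    after_comm = stat_text.rsplit(")", 1)[1]
    return after_comm.split()[0]


def count_disk_wait(proc_dir="/proc"):
    """Count processes currently in state 'D' (uninterruptible sleep)."""
    count = 0
    for entry in os.listdir(proc_dir):
        if not entry.isdigit():
            continue  # not a per-process directory
        stat_path = os.path.join(proc_dir, entry, "stat")
        try:
            with open(stat_path, errors="replace") as f:
                if state_from_stat(f.read()) == "D":
                    count += 1
        except (OSError, IndexError):
            continue  # process exited mid-scan, or unexpected format
    return count
```

Sampling this every few seconds gives you a time series to set against the disk statistics, which is exactly the mismatch the question above is asking about.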


Last modified: Mon Apr 12 01:54:45 2010