Always make sure you really understand what your problem is

October 25, 2012

Our recent disk performance issue has been good for a bunch of learning experiences. I'm not talking so much about stuff like becoming familiar with DTrace or discovering blktrace, but the humbling kind where you look back at things in retrospect and learn something from your mistakes. The first of these valuable lessons I've learned is simple:

Before you solve your problem, make sure you understand it.

In particular, make sure you know the cause of your problem.

In one sense, our recent adventure started back when we looked at the performance problems with our mail spool and confidently said 'clearly we are hitting the random IO seek limits of physical disks so the solution is replacing them with SSDs'. At the time this seemed perfectly logical; we knew that physical disks could only sustain so many IOPS, we knew that mail spool performance got worse when the IO load on the physical disks went up, and we 'knew' that IO on the mail spool was heavily random. I'm pretty sure that I confidently made this exact assertion more than once during meetings about the problem.

We made two closely related mistakes here. The basic mistake is that we never made any real attempt to verify our theory. Since our theory was that the disks were saturated on IOPS, a very basic test would have been to actually measure the IOPS we were seeing during high load conditions and compare that against what the disks could sustain. In retrospect I'm pretty sure we would have found that the disks were nowhere near maxed out.
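To give a rough idea of how little work this sort of basic check can be, here is a minimal Python sketch. It is not what we actually ran; it assumes a Linux machine where cumulative per-device counters are available in /proc/diskstats, and the function names are made up for illustration. It takes two snapshots of the counters and turns the difference into completed IOs per second for each device.

```python
import time


def parse_diskstats(text):
    """Map device name -> cumulative completed reads + writes.

    Assumes the Linux /proc/diskstats layout: major, minor, device name,
    reads completed, reads merged, sectors read, ms reading,
    writes completed, ...
    """
    totals = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 8:
            continue  # skip blank or malformed lines
        name = fields[2]
        totals[name] = int(fields[3]) + int(fields[7])
    return totals


def iops(before, after, interval_seconds):
    """Completed IOs per second for each device seen in both snapshots."""
    return {
        name: (after[name] - before[name]) / interval_seconds
        for name in after
        if name in before
    }


def sample_iops(path="/proc/diskstats", interval_seconds=10.0):
    """Take two snapshots interval_seconds apart and report IOPS per device."""
    with open(path) as f:
        before = parse_diskstats(f.read())
    time.sleep(interval_seconds)
    with open(path) as f:
        after = parse_diskstats(f.read())
    return iops(before, after, interval_seconds)
```

Calling `sample_iops()` during a period of high load and comparing the result against what the disks ought to be able to sustain is the kind of test that would have either supported the theory or knocked it down.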

(As you might guess from the scare quotes around one 'knew', we also never made any significant attempt to verify that mail spool IO was mostly random IO.)

The bigger, more advanced mistake we made is that we never attempted more than a superficial investigation of why our mail spool was performing badly when it was on normal disks. We jumped straight from the problem to a theory of the cause to the attractive conclusion of 'we can solve the problem with SSDs'. Pretty much everything I did to look into the problem could have been done right from the start, so if I'd actually looked I would have found our switch issue. It's just that we never bothered. Instead we 'solved' our problem with SSDs without taking the time to understand what it was or finding out what was causing it.

(Our initial idea of the problem was 'slow IO', which was wrong in the way we thought of it. Our real problem wasn't that IO was slow in general; it was that more than 5% of it was very, very slow.)
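The difference between 'IO is slow in general' and 'a small fraction of IO is very, very slow' is easy to miss if you only look at one summary number. Here is a small Python sketch of the idea; the latency figures are entirely made up for illustration and are not our measurements, and `summarize_latencies` is a hypothetical helper, not part of any tool we used.

```python
from statistics import mean, median


def summarize_latencies(latencies_ms, slow_threshold_ms):
    """Summarize per-IO latencies, including the fraction over a threshold."""
    slow = [x for x in latencies_ms if x > slow_threshold_ms]
    return {
        "mean_ms": mean(latencies_ms),
        "median_ms": median(latencies_ms),
        "max_ms": max(latencies_ms),
        "slow_fraction": len(slow) / len(latencies_ms),
    }


# Invented sample: 94 fast IOs at 5 ms each plus 6 very slow ones at 500 ms.
sample = [5.0] * 94 + [500.0] * 6
summary = summarize_latencies(sample, slow_threshold_ms=100.0)
```

With this invented sample the median is still 5 ms, so the typical IO looks fine, while the mean is dragged up to roughly 35 ms by the 6% of IOs that took half a second. Neither number alone describes the problem; the fraction of slow IOs is what does.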

It's quite attractive to not look into your problems in depth because it saves you a lot of time and aggravation. It took me more than a week of full-time work to run down our problem; that's a week I wasn't working on any of our many other projects. At the least it takes a lot of willpower to drop everything else that's clamouring for your attention and put a lot of work into something where you're confident that you already know the answer and you're just making sure.

Of course this is where we should have started with basic verification of our theory. Basic verification would probably have disproven it with relatively little effort, and that might have provided enough of a reason to dig deeper. I might not have spent the week of time all in one block, but I could have nibbled away at it in bits and pieces. And I was even supposed to look into performance tools.

(Oh how I look back at that entry and admire the grim irony. Past me explicitly told future me that I should spend time to build up some performance tools before another crisis struck, yet I didn't listen and then exactly what past me predicted came to pass. Let's see if I can do better this time around.)

Written on 25 October 2012.
