Wandering Thoughts archives

2012-12-04

You can't assume that your performance problems will be obvious

One of the things underlying our failure to check that our harmless change actually was harmless is that we made an assumption: we assumed that if we had a performance problem it would be obvious, and therefore that the lack of any obvious performance problem meant that everything was fine. In fact we assumed this all through our disk performance issues. For example, because only the mail spool had any particularly obvious performance issues, we assumed that only the mail spool had issues (and that we understood them).

What this drives home is something that I've kind of seen before:

Not all performance problems are obvious.

The easy performance problems are the obvious ones, where something simply explodes under load or at least scrapes along being clearly unacceptably slow. But those aren't the only problems you can have. You can also have performance problems that degrade your system in a subtle way, where nothing is obviously broken and everything just kind of goes slowly. If you don't look closely, and especially if you have a complex environment, you may simply say 'well, this is pretty much the best performance we can expect'. This is exactly what happened with us. We assumed that the performance we were seeing was what we could reasonably expect and that any slow increase in problems was simply because of growing usage and activity. In the end we only found our performance problem because we were smacked in the nose and actively went looking.

(We were lucky in that when we started looking in detail we could see that things were clearly broken at a low level.)

On the one hand it's important to recognize that the absence of evidence of performance problems is not very strong evidence for the absence of performance problems. On the other hand we can't be constantly suspicious of our systems because most of the time we aren't going to find anything if we go looking (and pretty soon we'll stop wasting our time by doing so). This is another case where metrics come in; metrics are constantly suspicious on our behalf.

One important corollary of this is that performance can degrade quietly. If performance problems are not obvious, you can go from good performance to performance problems without any obvious sign; this is especially so if the performance problems develop gradually. Once again this is where metrics come in, not just because they're suspicious on your behalf but because they keep history (if configured correctly).
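As a rough illustration of why the history matters, here is a minimal sketch of comparing recent metrics against an old baseline. Everything in it is an assumption made for the example: the `drift_ratio` helper, the seven-day windows, and the made-up latency series are hypothetical and are not taken from our actual monitoring.

```python
from statistics import median


def drift_ratio(history, baseline_n=7, recent_n=7):
    """Median of the newest recent_n samples divided by the median of
    the oldest baseline_n samples. A hypothetical helper, not a real
    monitoring API."""
    if len(history) < baseline_n + recent_n:
        raise ValueError("not enough history to compare")
    baseline = median(history[:baseline_n])
    recent = median(history[-recent_n:])
    return recent / baseline


# Made-up data: a daily average latency (in ms) that creeps up by
# about 1.5% per day over 60 days.
daily_latency_ms = [10.0 * (1.015 ** day) for day in range(60)]

# The largest day-over-day change; no single day looks alarming.
worst_daily_jump = max(
    later / earlier
    for earlier, later in zip(daily_latency_ms, daily_latency_ms[1:])
)

# The change visible only when recent data is compared to old data.
ratio = drift_ratio(daily_latency_ms)

print(f"worst day-over-day change: {worst_daily_jump:.3f}x")
print(f"recent vs. baseline:       {ratio:.2f}x")
```

In this made-up series no day is more than a couple of percent worse than the day before, yet the most recent week is more than twice as slow as the first one. That is only visible if the old numbers were kept around to compare against.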

(This applies to programming as much as to system administration. In fact programs are famous for having their performance, memory usage, and so on suffer a death of a thousand tiny little cuts, each change insignificant by itself but the cumulative total resulting in disaster.)
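The arithmetic behind the thousand cuts is easy to sketch. The numbers here are purely illustrative assumptions (a hypothetical 1% cost per change, compounding over 100 changes), not measurements from any real program.

```python
def cumulative_slowdown(per_change_cost, changes):
    """Total slowdown factor if each change multiplies the running
    time by (1 + per_change_cost). Illustrative only."""
    return (1.0 + per_change_cost) ** changes


# Hypothetical: 100 changes that each make things 1% slower.
total = cumulative_slowdown(0.01, 100)
print(f"total slowdown: {total:.2f}x")
```

Under those assumptions the program ends up around 2.7 times slower, even though no individual change would have stood out in a casual before-and-after check.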

tech/NonObviousPerformanceIssues written at 01:01:57

