When you make a harmless change, check to make sure that it is

November 28, 2012

As I mentioned before, our recent disk performance issue has been good for a number of learning experiences. One of the most painful ones (for me) comes from the fact that this was actually a self-inflicted performance problem. You see, we did not start out using two different sorts of switches on our two iSCSI networks. Initially both networks used the lower-end switches; at one point, for various reasons, we decided to swap one of them out for the higher-end and highly trusted core switch. After the change everything appeared to work fine, and because we were sure it was a harmless change, we didn't try to do any performance tests.

Let me repeat that, rephrased: because we thought this was a harmless change, we didn't check to make sure that it really was. And it turned out that we were very wrong about it; our harmless change of switch models led to a quiet but significant degradation in disk IO performance that lasted for, well, a rather long time. Had we looked at performance numbers before and after the change we might well have discovered the problem immediately.

(Even crude metrics might have been good enough, like the amount of time it took to do backups or ZFS pool scrubs.)
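
(To make that concrete, here is a rough sketch, in Python, of the kind of crude before-and-after check I have in mind. This is not what we actually run; the file paths, the read size, and the 10% tolerance are all made up for illustration, and a real version would need to deal with things like the OS page cache making repeated reads look artificially fast.)

    #!/usr/bin/env python
    # Crude before/after check (a sketch): time a sequential read of a test
    # file on the affected storage and compare it against a saved baseline.
    # The paths and the 10% tolerance below are illustrative assumptions.
    import json
    import os
    import time

    TEST_FILE = "/fs3/scratch/perf-testfile"  # hypothetical file on iSCSI-backed storage
    BASELINE = "/var/tmp/perf-baseline.json"  # hypothetical place to keep the 'before' number
    CHUNK = 1024 * 1024                       # read in 1 MByte chunks

    def time_sequential_read(path):
        # Returns MB/s for one sequential read of the whole file.
        start = time.time()
        total = 0
        with open(path, "rb") as f:
            while True:
                data = f.read(CHUNK)
                if not data:
                    break
                total += len(data)
        elapsed = time.time() - start
        return (total / (1024.0 * 1024.0)) / elapsed

    rate = time_sequential_read(TEST_FILE)
    print("sequential read: %.1f MB/s" % rate)

    if os.path.exists(BASELINE):
        with open(BASELINE) as f:
            before = json.load(f)["read_mbs"]
        if rate < before * 0.9:
            print("WARNING: noticeably slower than the baseline of %.1f MB/s" % before)
    else:
        # First ('before') run: record the baseline so the 'after' run has
        # something to compare against.
        with open(BASELINE, "w") as f:
            json.dump({"read_mbs": rate}, f)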

This is one of the corollaries of our inability to understand performance in advance. In real systems, we don't actually know in advance that something's going to be a harmless change that has no effect on the system. The best we can do is make educated guesses and they're sometimes wrong. So the only way to actually know that a change is harmless is to make the change and then measure to verify that no important performance characteristics have changed.

(In fact this is a narrow subset of the general rule that we can't necessarily predict the effects of changes in advance. We can come up with heavily informed theories, but per above they can be wrong even under near-ideal situations. In the end the proof is in the pudding.)

One direct lesson I'm taking from this is that any future changes to our iSCSI and fileserver infrastructure need to have a bunch of performance tests run before and afterwards. More than that, we should be periodically checking our performance even when we think that nothing has changed; as we've seen before, things can change without us doing anything.
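
(Again purely as an illustration: the sort of periodic check I'm thinking of could be as simple as something like the following Python run from cron, which times a fixed write-and-fsync workload and appends the result to a log so that a quiet degradation shows up as a trend. The paths and workload size here are made up, not our real setup.)

    #!/usr/bin/env python
    # Sketch of a periodic performance check meant to be run from cron: time a
    # small fixed write-and-fsync workload and append the result to a history
    # log that can be graphed later. Paths and sizes are illustrative.
    import os
    import time

    LOG = "/var/log/perf-history.log"        # hypothetical history log
    TARGET = "/fs3/scratch/perf-write-test"  # hypothetical file on the affected storage
    BLOCK = b"x" * (128 * 1024)              # 128 KB per write
    COUNT = 256                              # 32 MB in total

    start = time.time()
    with open(TARGET, "wb") as f:
        for _ in range(COUNT):
            f.write(BLOCK)
            f.flush()
            os.fsync(f.fileno())  # push it to the storage, not just the page cache
    elapsed = time.time() - start
    os.unlink(TARGET)

    # One line per run; the timestamp makes it easy to spot when things changed.
    with open(LOG, "a") as log:
        log.write("%s %.2f\n" % (time.strftime("%Y-%m-%d %H:%M:%S"), elapsed))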

(There's some broader things that this is teaching me, but I'm not going to try to put them in this entry (partly because when I start to do so the clear view I have of them in my head slides away from me, which is a good sign that they're not actually clear enough yet).)


Comments on this page:

From 90.155.35.116 at 2012-11-28 04:33:21:

You're never going to remember to run this, and some changes are safe enough, or perhaps unrelated enough, that you won't think to run these tests.

Better yet would be to set something up to run load tests at off hours and graph the results (and/or just monitor performance with current natural load). Then when you have a regression you can look at the graph and say "It started happening on Friday the 13th. What did we change then?". You can also put alerts on performance dropping too far for too long so you can discover that you have a performance regression you didn't know about.

Obviously you need to figure out whether you're graphing the right thing, as you've pointed out previously (e.g., 95th percentile latency vs mean latency).

-- Perry Lorier
