Look for your performance analysis tools now

May 5, 2012

Last night and this morning we had a significant NFS performance problem on one of our ZFS fileservers, which was a bit stressful. Our fileservers have multiple ZFS pools, each with multiple NFS-exported filesystems, and the fileserver was just not responding very well for pretty much any of them. We got as far as determining that one mirror pair of disks in one particular pool was probably saturated (based on iostat figures), but that was a long way from being able to identify who was doing what on which filesystem to cause this, or how it was affecting the whole fileserver (and that's assuming it was even the only problem, instead of just the one that was easiest to notice with the tools we had at hand).
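(For illustration, the sort of iostat invocation involved looks something like the following; this is a sketch, and the exact columns vary by Solaris version.)

```shell
# Solaris extended per-device statistics with descriptive device names,
# sampled every five seconds. A disk that sits near 100 in the '%b'
# (percent busy) column with a high 'asvc_t' (average service time, in
# milliseconds) is a good candidate for being saturated.
iostat -xn 5
```

The catch, as described above, is that this only tells you which disks are busy; it says nothing about which NFS clients or which filesystems are generating the load.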

Our fileservers are Solaris machines, which means they have DTrace available. People have undoubtedly written DTrace scripts to analyze NFS server activity and performance, to track disk IO to ZFS events, and so on. Which is theoretically wonderful but leads me to a practical observation:
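(As an example of the kind of thing that's possible, here is a hedged sketch of a DTrace one-liner for this sort of question; it assumes your Solaris version has the stable nfsv3 provider, which not all of them do.)

```shell
# Count NFSv3 read operations by the file being read, printing the
# aggregated counts when you interrupt the script with Ctrl-C.
dtrace -n 'nfsv3:::op-read-start { @[args[1]->noi_curpath] = count(); }'
```

Something similar keyed on the client address (the conninfo_t argument) could answer the 'who is doing this' half of the question. But working out which providers and probes your particular Solaris release actually has is exactly the kind of investigation this entry is about.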

When you need performance analysis tools, it's too late to go find them.

When you're in the middle of a serious issue and need some diagnostic programs, you're in the position of someone who has waited until it's raining to buy roofing tools. At a minimum you're probably not going to do too well at evaluating your options and picking the best, most informative programs and then using them well; instead you're going to grab the first thing that looks like it might help and hammer on your problem some. If this doesn't work, grab the next script and see if it does any better. Repeat until you come up with something or the problem goes away on its own.

This situation may sound crazy but I think it's unfortunately a natural thing to have happen. If you don't currently have performance issues, it doesn't seem very urgent to spend limited time finding and playing around with performance analysis tools; you likely have plenty of higher-priority things on your to-do list, things that either have to be done or that have high payoffs. Such low-priority playing around is generally seen as a spare-time activity (which in practice means it almost never gets done). This is certainly what happened here with me; I always knew that there might be interesting performance analysis tools available for Solaris, but it never seemed sufficiently urgent to go investigate them and separate the wheat from the chaff, since I always had more important and engaging things to do.

What my experience today has rubbed my nose into is that this can easily be a false economy. The right thing to measure against isn't what else you could be doing with your time right now; it's how much time you would lose if (or when) you have performance problems and hadn't prepared ahead of time. Now is the right time to work out what tools you have available, how well they work, and how to use them; indeed, it may be the only time you'll have to do it well. Waiting until a crisis hits is too late. Preparing in advance is the smart thing to do (and it sounds so obvious when I write it like that).

(All of this is nice talk but I have no idea if I will be able to carry through with it, especially since trying to evaluate performance analysis tools without a performance problem is something that I usually find kind of tedious.)

PS: our ZFS fileserver issues turned out to have a somewhat interesting root cause that pretty much went away on its own.



Last modified: Sat May 5 01:46:15 2012