Look for your performance analysis tools now
Last night and this morning we had a significant NFS performance
problem on one of our ZFS fileservers, which was a bit stressful. Our fileservers
have multiple ZFS pools, each with multiple NFS exported filesystems,
and the fileserver was just not responding very well for pretty
much any of them. We got as far as determining that one mirror
pair of disks for one particular pool was probably saturated (based
on iostat figures), but that was a long way from being able to
identify who was doing what to which filesystem to cause this, or how
it was affecting the whole fileserver (and that's assuming it was even
the only problem, instead of just the one that was easiest to notice
with the tools we had at hand).
Our fileservers are Solaris machines, which means they have DTrace available. People have undoubtedly written DTrace scripts to analyze NFS server activity and performance, to track disk IO to ZFS events, and so on. Which is theoretically wonderful but leads me to a practical observation:
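To give a concrete illustration (and this is exactly the sort of thing I should have had sitting around and tested already), here is a sketch of the kind of small DTrace script that would have helped, assuming your Solaris version has the nfsv3 provider; I haven't battle-tested this, which is rather the point:

```d
/* Sketch: count NFSv3 operations by client address and operation
   type, assuming the Solaris nfsv3 DTrace provider is available. */
nfsv3:::op-*-start
{
        @ops[args[0]->ci_remote, probename] = count();
}

/* Print the current counts every ten seconds and reset them,
   instead of waiting for Ctrl-C. */
tick-10s
{
        printa(@ops);
        trunc(@ops);
}
```

You would run this with something like 'dtrace -s nfsops.d' on the fileserver; the output points you at which clients are hammering the server and with what operations, which is just the question we couldn't answer at the time.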
When you need performance analysis tools, it's too late to go find them.
When you're in the middle of a serious issue and need some diagnostic programs, you're in the position of someone who has waited until it's raining to buy roofing tools. At a minimum you're probably not going to do too well at evaluating your options and picking the best, most informative programs and then using them well; instead you're going to grab the first thing that looks like it might help and hammer on your problem some. If this doesn't work, grab the next script and see if it does any better. Repeat until you come up with something or the problem goes away on its own.
This situation may sound crazy but I think it's unfortunately a natural thing to have happen. If you don't currently have performance issues, it doesn't seem very urgent to spend limited time finding and playing around with performance analysis tools; you likely have plenty of higher priority things on your to-do list, things that either have to be done or that have high payoffs. Such low priority playing around is generally seen as a spare time activity (which in practice means it almost never gets done). This is certainly what happened here with me; I always knew that there might be interesting performance analysis tools available for Solaris but it never seemed sufficiently urgent to go investigate them and separate the wheat from the chaff, since I always had more important and engaging things to do.
What my experience today has rubbed my nose into is that this can easily be a false economy. The right thing to measure against isn't what else you could be doing with your time right now, it's how much time you would lose if (or when) you have performance problems and haven't prepared ahead of time. Now is the right time to work out what tools you have available, how well they work, and how to use them; indeed, it may be the only time you'll have to do it well. Waiting until a crisis hits is too late. Preparing in advance is the smart thing to do (and it sounds so obvious when I write it like that).
(All of this is nice talk but I have no idea if I will be able to carry through with it, especially since trying to evaluate performance analysis tools without a performance problem is something that I usually find kind of tedious.)
PS: our ZFS fileserver issues turned out to have a somewhat interesting root cause that pretty much went away on its own.