My goals for gathering performance metrics and statistics
I've written before that one of my projects is putting together something to gather OS level performance metrics. Today I want to write down what my goals for this are. First off I should mention that this is purely for monitoring, not for alerting; we have a completely separate system for that.
The most important thing is to get visibility into what's going on with our fileservers and their iSCSI backends, because this is the center of our environment. We want at least IO performance numbers on the backends, network utilization and error counts on the backends and the fileservers, perceived IO performance for the iSCSI disks on the fileservers, ZFS level stats on the fileservers, CPU utilization information everywhere, and as many NFS level stats as we can conveniently get (in a first iteration this may amount to 'none').
I'd like to have both a very long history (half a year or more would be great) and relatively fine-grained measurements, but in practice we're unlikely to need fine-grained measurements very far into the past. To put it one way, we're unlikely to try to troubleshoot in detail a performance issue that's more than a week or so old. At the same time it's important to be able to look back and say 'were things as bad as this N months ago or did they quietly get worse on us?', because we have totally had that happen. Long term stats are also a good way to notice a disk that starts to quietly decay.
(In general I expect us to look more at history than at live data. In a live incident we'll probably go directly to [...] and so on.)
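To make the 'fine-grained for recent data, coarse for old data' idea concrete, here is a minimal Python sketch of one way it could work. Everything in it is an assumption for illustration: the function names are made up, the one week window comes from my 'a week or so' guess above, and the hourly rollup is an arbitrary choice. A real metrics system would have its own retention mechanism.

```python
from collections import defaultdict
from statistics import mean

WEEK = 7 * 86400  # assumed fine-grained window, in seconds
HOUR = 3600       # assumed rollup bucket for older data, in seconds


def downsample(samples, bucket_seconds):
    """Average (timestamp, value) samples into fixed-width time buckets."""
    buckets = defaultdict(list)
    for ts, value in samples:
        # Align each sample to the start of its bucket.
        buckets[ts - ts % bucket_seconds].append(value)
    return [(start, mean(vals)) for start, vals in sorted(buckets.items())]


def apply_retention(samples, now, fine_window=WEEK, coarse_bucket=HOUR):
    """Keep recent samples as-is and roll older ones up into coarse buckets."""
    recent = [(ts, v) for ts, v in samples if now - ts <= fine_window]
    old = [(ts, v) for ts, v in samples if now - ts > fine_window]
    return downsample(old, coarse_bucket) + sorted(recent)
```

One caveat with this sort of rollup is that averaging smooths out short peaks, so if peaks matter in the long term history you would probably want to keep a per-bucket maximum as well.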
Next most important is OS performance information for a few crucial Ubuntu NFS clients such as our IMAP servers and our Samba servers (things like local IO, NFS IO, network performance, and oh sure CPU and memory stats too). These are very 'hot' machines, used by a lot of people, so if they have performance problems we want to know about it and have a good shot at tracking things down. Also, this sort of information is probably going to help with capacity planning, which means that we probably also want to track some application level stats if possible (eg the number of active IMAP connections). As with the fileservers, a long history is useful here.
Beyond that it would be nice to get the same performance stats from basically all of our Ubuntu NFS clients. If nothing else this could be used to answer questions like 'do people ever use our compute servers for IO intensive jobs' and to notice any servers with surprisingly high network IO that might be priorities for moving from 1G to 10G networking. Our general Ubuntu machines can presumably reuse much or all of the code and configuration from the crucial Ubuntu machines, so this should be relatively easy.
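As a rough illustration of the 1G versus 10G question, here is a small Python sketch that turns two readings of a cumulative interface byte counter (the kind of number Linux exposes in /proc/net/dev) into a link utilization fraction. The function names, the 80% threshold, and the host names in the usage note are all hypothetical, and it ignores counter wraparound and reboots.

```python
GIGABIT = 1_000_000_000  # link speed in bits per second


def link_utilization(bytes_before, bytes_after, interval_seconds,
                     link_bits_per_second=GIGABIT):
    """Fraction of link capacity used between two cumulative byte counter readings."""
    bits_moved = (bytes_after - bytes_before) * 8
    return bits_moved / (interval_seconds * link_bits_per_second)


def upgrade_candidates(peak_utilization, threshold=0.8):
    """Return hosts whose peak utilization is at or above the (assumed) threshold."""
    return sorted(host for host, peak in peak_utilization.items()
                  if peak >= threshold)
```

For example, `upgrade_candidates({"comp1": 0.95, "comp2": 0.10})` would pick out only `comp1` as a machine that regularly comes close to saturating its 1G link.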
In terms of displaying the results, I think that the most important thing will be an easy way of doing ad-hoc graphs and queries. We're unlikely to wind up with any particular fixed dashboard that we look at to check for problems; as mentioned, alerting is another system entirely. I expect us to use this metrics system more to answer questions like 'what sort of peak and sustained IO rates do we typically see during nightly backups' or 'is any backend disk running visibly slower than the others'.
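To give a flavour of the sort of ad-hoc question I mean, here is a minimal Python sketch of 'is any backend disk running visibly slower than the others'. It assumes we can already pull a per-disk average latency out of whatever metrics store we wind up with. The function name, the disk names, and the 1.5x cutoff are all made up for illustration.

```python
from statistics import median


def slow_disks(latency_ms, factor=1.5):
    """Return disks whose average latency exceeds factor times the median disk."""
    typical = median(latency_ms.values())
    return sorted(disk for disk, ms in latency_ms.items()
                  if ms > factor * typical)


# Hypothetical per-disk average latencies in milliseconds.
latencies = {"disk0": 8.0, "disk1": 9.0, "disk2": 8.5, "disk3": 21.0}
outliers = slow_disks(latencies)
```

Comparing against the median instead of the mean keeps one very slow disk from dragging the baseline up and hiding itself.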
I understand that some systems can ingest various sorts of logs, such as syslog and Apache logs. This isn't something that we'd do initially (just getting a performance metrics system off the ground will be a big enough project by itself). The most useful thing to have for problem correlation purposes would be markers for when client kernels report NFS problems, and setting up an entire log ingestion system just for that seems like a bit of overkill.
(There are a lot of neat things we could do with smart log processing if we had enough time and energy, but my guess is that a lot of them aren't really related to gathering and looking at performance metrics.)
Note that all of this is relatively backwards from how you would do it in many environments, where you'd start from application level metrics and drill downwards from there because what's ultimately important is how the application performs. Because we're basically just a provider of vague general computing services to the department, we work from the bottom up and have relatively few 'application' level metrics we can monitor.
(With that said, it certainly would be nice to have some sort of metrics on how responsive and fast the IMAP and Samba servers were for users and so on. I just don't know if we can do very much about that, especially in an initial project.)
PS: There are of course a lot of other things we could gather metrics for and then throw into the system. I'm focusing here on what I want to do first and for the likely biggest payoff. Hopefully this will help me get over the scariness of uncertainty and actually get somewhere on this.