My current choice of a performance metrics system and why I picked it
In response to my previous entries on gathering OS level performance metrics, people have left a number of comments recommending various systems for doing this. So now it's time to explain my current decision.
The short version: I'm planning to use graphite combined with some stats-gathering frontend, probably collectd. We may wind up wanting something more sophisticated as the web interface; we'll see.
This decision is not made from a full and careful comparison of all of the available tools with respect to what we need, partly because I don't know enough to make that comparison. Instead it's made in large part based on what seems to be popular among relatively prominent and leading-edge organizations today. Put bluntly, graphite appears to be the current DevOps hotness as far as metrics go.
That it's the popular and apparent default choice means two good things. First, since it's used by much bigger environments than ours, I can probably make it work for us, and since the world is not full of angry muttering about how annoying or terrible it is, it's probably not going to be particularly bad. Second, a tool this popular is much more likely to have a good ecosystem around it: people writing howtos and 'how I did this' articles, add-on tools, and so on. This does seem to be the case based on my trawling of the Internet so far; I've tripped over far more material about graphite than about anything else, and there seem to be any number of ways of collecting stats and feeding it data.
(That graphite's the popular choice also means that it's likely to be kept up to date, developed further, possibly packaged for me, and so on.)
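Part of why there are so many ways of feeding graphite data is that its basic input format is very simple: carbon accepts plaintext lines of the form 'metric.path value timestamp', by default on TCP port 2003. As a minimal sketch of what a homegrown feeder could look like (the metric name, host, and helper function names here are made up for illustration, and in practice I expect collectd or something like it to do this job):

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one line of graphite's plaintext protocol.

    The format is '<metric.path> <value> <unix timestamp>' plus a newline.
    """
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metrics(lines, host="localhost", port=2003):
    """Send already-formatted lines to a carbon plaintext listener.

    The host is an assumption for illustration; 2003 is carbon's default
    plaintext port, but a given installation may use something else.
    """
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)


# Hypothetical metric name; nothing is sent until send_metrics() is called.
line = format_metric("servers.host1.loadavg.1min", 0.42, 1700000000)
```

Calling send_metrics([line]) would then deliver the data point, assuming a carbon daemon is actually listening at that address.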
A side benefit of this reading is that it's shown me that people are pushing metrics into graphite-based systems at relatively high rates. This is exactly what I want to do, given that averages lie and the shorter the period you take them over, the better for avoiding some of those lies.
(I'm aware that we may run into things like disk IO limits. I'll have to see, but gathering metrics every five or ten seconds, say, is certainly my goal.)
Many of the alternatives are probably perfectly good and would do decently well for us. They're just somewhat riskier choices than the current big popular thing, and as a result they leave me with various concerns and qualms.
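To make the 'averages lie' point concrete, here is a small sketch with entirely invented numbers: one minute of per-second disk utilization that is idle except for a five second burst at 100%. The same data looks very different depending on the period you average over.

```python
def window_averages(samples, window):
    """Average consecutive, non-overlapping windows of `window` samples."""
    return [
        sum(samples[i:i + window]) / window
        for i in range(0, len(samples) - window + 1, window)
    ]


# Invented data: 60 one-second utilization samples (in percent),
# all idle apart from a burst at 100% during seconds 20 through 24.
samples = [0.0] * 60
for second in range(20, 25):
    samples[second] = 100.0

per_minute = window_averages(samples, 60)  # one value for the whole minute
per_10s = window_averages(samples, 10)     # six values
per_5s = window_averages(samples, 5)       # twelve values

print("60s average peak:", max(per_minute))
print("10s average peak:", max(per_10s))
print(" 5s average peak:", max(per_5s))
```

The one-minute average reports a disk that is a bit over 8% busy, the ten-second averages peak at 50%, and only the five-second averages show that the disk was completely saturated for a while. A burst that happened to straddle two windows would be diluted even at five seconds, which is why shorter periods only avoid some of the lies.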