My current choice of a performance metrics system and why I picked it

April 10, 2014

In response to my previous entries on gathering OS level performance metrics, people have left a number of comments recommending various systems for doing this. So now it's time to explain my current decision about this.

The short version: I'm planning to use graphite combined with some stats-gathering frontend, probably collectd. We may wind up wanting something more sophisticated as the web interface; we'll see.

This decision is not made from a full and careful comparison of all of the available tools with respect to what we need, partly because I don't know enough to make that comparison. Instead it's made in large part based on what seems to be popular among relatively prominent and leading-edge organizations today. Put bluntly, graphite appears to be the current DevOps hotness as far as metrics go.

That it's the popular and apparent default choice means two good things. First, given that it's used in much bigger environments than ours, I can probably make it work for us, and given that the world is not full of angry muttering about how annoying and/or terrible it is, it's probably not going to be particularly bad. Second, it's much more likely that such a popular tool will have a good ecology around it: people writing howtos and 'how I did this' articles for it, add-on tools, and so on. And indeed this seems to be the case based on my trawling of the Internet so far; I've tripped over far more stuff about graphite than about anything else, and there seem to be any number of ways of collecting stats and feeding it data.

(That graphite's the popular choice also means that it's likely to be kept up to date, developed further, possibly packaged for me, and so on.)
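As a rough illustration of why there are so many ways of feeding graphite data: its carbon daemon accepts a very simple plaintext protocol, one `<metric path> <value> <timestamp>` line per data point, by default on TCP port 2003. The sketch below is only that, a sketch; the hostname and metric path are made up for illustration, and a real frontend such as collectd does far more than this.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one line of carbon's plaintext protocol: '<path> <value> <timestamp>\n'."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metrics(lines, host="graphite.example.com", port=2003):
    """Send already-formatted lines to a carbon plaintext listener over TCP.

    The host name is a placeholder (an assumption); 2003 is carbon's
    default plaintext port.
    """
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)
```

Something like `send_metrics([format_metric("servers.myhost.loadavg.1min", 0.42)])` would then push a single data point, assuming a carbon listener is actually reachable at that address.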

A side benefit of this reading is that it's shown me that people are pushing metrics into graphite-based systems at relatively high rates. This is exactly what I want to do, given that averages lie and that the shorter the period you take them over, the better for avoiding some of those lies.

(I'm aware that we may run into things like disk IO limits. I'll have to see, but gathering metrics every five or ten seconds, say, is certainly my goal.)
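To make the 'averages lie' point concrete, here is a small sketch with entirely invented numbers: a minute of hypothetical per-second disk utilization that is idle apart from a five-second burst at 100%. Averaging over the whole minute buries the burst; averaging over five-second windows shows it.

```python
def window_averages(samples, window):
    """Average consecutive, non-overlapping groups of `window` samples."""
    averages = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        averages.append(sum(chunk) / len(chunk))
    return averages


# Invented data: 5% utilization for a minute, with a 5-second burst at 100%.
samples = [5.0] * 60
samples[30:35] = [100.0] * 5

one_minute = window_averages(samples, 60)   # a single average for the minute
five_second = window_averages(samples, 5)   # twelve averages, one per 5 seconds

print("one-minute average:", one_minute[0])
print("peak five-second average:", max(five_second))
```

With these numbers the one-minute average comes out at about 13%, which looks like a quiet disk, while one of the five-second windows sits at 100%.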

Many of the alternatives are probably perfectly good and would do decently well for us. They're just somewhat riskier choices than the current big popular thing, and as a result they leave me with various concerns and qualms.

Comments on this page:

I looked at Graphite but I am sceptical that it has "DevOps Hotness" if the last release was two years ago.

I'm presently thinking Zenoss because it's both Open Source and there is a company behind it, so you might be able to hire support if need be, and they have a Beta where they're swapping out RRDtool for prettier stuff. I like their philosophy of trying to track the infrastructure stack to narrow the cause of an outage. Autodiscovery may help map out the environment I've inherited, and it claims to have some CMDB features, which is another gap I have to address.

Haven't had a chance to kick the tires just yet. I'll be curious to see how it goes for you with Graphite.

By ajmaidak at 2014-05-02 20:35:44:

I like munin. Writing custom plugins for it is actually kind of fun.

Written on 10 April 2014.

Last modified: Thu Apr 10 01:01:20 2014