Nerving myself up to running experimental setups in production

February 24, 2014

One of the things that I want to do is move towards gathering OS level performance metrics for our systems, ideally for basically any performance stat that we can collect. All of the IO stats for all disks? Lots of stats for NFS mounts? CPU and memory utilization? Network link utilization and error counts? Bring them on, because the modern view is that you never know when this stuff will be useful or show you something interesting. The good news is that this is not a novel idea and there's a decent number of systems out there for doing all of the pieces of this sort of thing (collecting the stats on machines, forwarding them to a central place, aggregating and collating everything, graphing and querying them, etc). The bad news, in a sense, is that I don't know what we're doing here.
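To make the "collect the stats on machines, forward them to a central place" pieces concrete, here is a minimal sketch of the collect-and-format step. It is not how any particular tool does it. It assumes Linux-style `/proc/loadavg` text as input and a Graphite-style plaintext line format (`path value timestamp`) as output; the function names and the `servers.host1` prefix are hypothetical.

```python
import time


def parse_loadavg(text):
    """Parse the contents of a Linux /proc/loadavg file into a dict.

    Only the first three fields (1, 5 and 15 minute load averages) are used.
    """
    fields = text.split()
    return {
        "load.1min": float(fields[0]),
        "load.5min": float(fields[1]),
        "load.15min": float(fields[2]),
    }


def to_plaintext_lines(prefix, metrics, timestamp=None):
    """Format metrics as 'path value timestamp' lines, one per metric.

    The prefix is a hypothetical naming scheme such as 'servers.host1'.
    """
    if timestamp is None:
        timestamp = int(time.time())
    return [
        f"{prefix}.{name} {value} {timestamp}"
        for name, value in sorted(metrics.items())
    ]


if __name__ == "__main__":
    # On a Linux machine this reads the live load averages.
    with open("/proc/loadavg") as f:
        stats = parse_loadavg(f.read())
    for line in to_plaintext_lines("servers.host1", stats):
        print(line)
```

Actually shipping the lines to a central collector, and everything after that (aggregation, graphing, querying), is deliberately left out; those are exactly the parts that need real production experience to get right.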

Like many places, we like everything we run in production to be fully baked. We work out all of the pieces in advance with whatever experimentation is needed, test it all, document it, and then put the finalized real version into production. We don't like to be constantly changing, adjusting, and rethinking things that are in production; that's a sign that we screwed up in the pre-production steps. Unfortunately it's become obvious to me that I can't make this approach work for the whole stats gathering project.

Oh, I can build a test stats collection server and some test machines to feed it data and make sure that all of the basic bits work, and I can test the 'production' version with less important and more peripheral production machines. But it's become obvious to me that really working out the best way to gather and present stats is going to take putting a stats-gathering system on real production servers and then seeing what explodes and what doesn't work for us (and what does). I simply don't think I can build a fully baked system that's ready to deploy onto our production servers in a final, unchanging configuration; I just don't know enough and I can't learn with just an artificial test environment. Instead we're going to have to put a half-baked, tentative setup on to production servers and then evolve it. There are going to be changes on the production machines, possibly drastic ones. We won't have nice build instructions and other documentation until well after the fact (once all the dust settles and we fully understand things).

As mentioned, this is not how we want to do production systems. But it's how we're going to have to do this one and I have to live with that. More than that, I have to embrace it. I have to be willing to stop trying to polish a test setup and just go, just put things on (some of) the production servers and see if it all works and then change it.

(I've sold my co-workers on this. Now I have to sell myself on it too (and stop using any number of ways to duck out of actually doing this), which is part of what this entry is about.)

Comments on this page:

This is a shameless plug, but I'm a fan of SGI's PCP:

It's open source. It's cross-platform. It does a lot of the things you'd want from a performance-monitoring system: collecting, transporting, logging, and predicate-testing. There is a GUI to plot archived and live data. The only thing that's kinda lacking is rrdtool-style graph generation.

I strongly recommend it.

By Frank Ch. Eigler at 2014-02-25 10:02:56:

Thanks for mentioning PCP (speak of the devil; our group works on it). Direct RRD support is still not there, but OTOH a shell script, pmsnap, is included that uses the GUI pmchart in batch mode to generate graphs into image files on demand.

By dozzie at 2014-02-25 19:49:50:

Monitoring infrastructure is like a wiki engine. You shouldn't treat the metrics, graphs and alerts as something to be designed from the ground up; they're more like documentation pages, i.e. the content of a wiki.

You simply can't predict all the things you will need graphed and correlated. In the same way you can't predict all the pages you'll need.

Monitoring systems are a whole separate story. You shouldn't feel bad for not having a good system designed before production deployment, because none of the systems in the market is good enough. You just have to start with anything and progress gradually as you find problems and missing functions.

The guys above mentioned PCP, but it's a whole big monster. It's designed to monitor all the things one way and you should bend your infrastructure to it. The same goes for Nagios, Zabbix and Zenoss, and almost all the other things out there (Cacti, Munin, you name it). You can't easily change the way metrics or alerts are processed, and you can't easily derive one metric from several others (or from an alert stream, for instance). You can't easily feed several monitoring products picked arbitrarily (because X has a neat feature for alerts, but Y gives better graphs), and then you can't easily combine them in a single dashboard. (This is addressed to some degree by my DashWiki application.)

There was a Monitoring Sucks movement, but it doesn't seem too active in designing a new paradigm. The blog posts around it are still worth reading, though.

By Kimo at 2014-02-25 23:15:33:

collectd is a good stats collection agent to look into. It has a plugin system with many available plugins for collecting data. You can send data to it, and it has several options for writing stats (rrd, csv, sending to other tools like graphite).

Thanks Kimo

By Frank Ch. Eigler at 2014-02-26 11:07:15:

dozzie, one incorrect impression one gets from your comparison of PCP to the other tools is to view it as a closed data-roach-motel system. But, unlike many of the others, all data that goes into the system can easily come back out intact, for interfacing to other systems. (Just the other day some fellow with a few minutes on his hands put together a pcp->graphite data-pumper python script.)

Written on 24 February 2014.

Last modified: Mon Feb 24 22:31:58 2014