My goals for gathering performance metrics and statistics

April 8, 2014

I've written before that one of my projects is putting together something to gather OS level performance metrics. Today I want to write down what my goals for this are. First off I should mention that this is purely for monitoring, not for alerting; we have a completely separate system for that.

The most important thing is to get visibility into what's going on with our fileservers and their iSCSI backends, because this is the center of our environment. We want at least IO performance numbers on the backends, network utilization and error counts on the backends and the fileservers, perceived IO performance for the iSCSI disks on the fileservers, ZFS level stats on the fileservers, CPU utilization information everywhere, and as many NFS level stats as we can conveniently get (in a first iteration this may amount to 'none'). I'd like to have both a very long history (half a year or more would be great) and relatively fine-grained measurements, but in practice we're unlikely to need fine-grained measurements very far into the past. To put it one way, we're unlikely to try to troubleshoot in detail a performance issue that's more than a week or so old. At the same time it's important to be able to look back and say 'were things as bad as this N months ago or did they quietly get worse on us?', because we have totally had that happen. Long term stats are also a good way to notice a disk that starts to quietly decay.

(In general I expect us to look more at history than at live data. In a live incident we'll probably go directly to iostat, DTrace, and so on.)

Next most important is OS performance information for a few crucial Ubuntu NFS clients such as our IMAP servers and our Samba servers (things like local IO, NFS IO, network performance, and oh sure CPU and memory stats too). These are very 'hot' machines, used by a lot of people, so if they have performance problems we want to know about it and have a good shot at tracking things down. Also, this sort of information is probably going to help for capacity planning, which means that we probably also want to track some application level stats if possible (eg the number of active IMAP connections). As with the fileservers, a long history is useful here.

Beyond that it would be nice to get the same performance stats from basically all of our Ubuntu NFS clients. If nothing else this could be used to answer questions like 'do people ever use our compute servers for IO intensive jobs' and to notice any servers with surprisingly high network IO that might be priorities for moving from 1G to 10G networking. Our general Ubuntu machines can presumably reuse much or all of the code and configuration from the crucial Ubuntu machines, so this should be relatively easy.

In terms of displaying the results, I think that the most important thing will be an easy way of doing ad-hoc graphs and queries. We're unlikely to wind up with any particular fixed dashboard that we look at to check for problems; as mentioned, alerting is another system entirely. I expect us to use this metrics system more to answer questions like 'what sort of peak and sustained IO rates do we typically see during nightly backups' or 'is any backend disk running visibly slower than the others'.

I understand that some systems can ingest various sorts of logs, such as syslog and Apache logs. This isn't something that we'd do initially (just getting a performance metrics system off the ground will be a big enough project by itself). The most useful thing to have for problem correlation purposes would be markers for when client kernels report NFS problems, and setting up an entire log ingestion system for that seems a bit overkill.

(There are a lot of neat things we could do with smart log processing if we had enough time and energy, but my guess is that a lot of them aren't really related to gathering and looking at performance metrics.)

Note that all of this is relatively backwards from how you would do it in many environments, where you'd start from application level metrics and drill downwards from there because what's ultimately important is how the application performs. Because we're basically just a provider of vague general computing services to the department, we work from the bottom up and have relatively few 'application' level metrics we can monitor.

(With that said, it certainly would be nice to have some sort of metrics on how responsive and fast the IMAP and Samba servers were for users and so on. I just don't know if we can do very much about that, especially in an initial project.)

PS: There are of course a lot of other things we could gather metrics for and then throw into the system. I'm focusing here on what I want to do first and for the likely biggest payoff. Hopefully this will help me get over the scariness of uncertainty and actually get somewhere on this.


Comments on this page:

By Anonymous at 2014-04-08 02:47:14:

Have you come across Performance Co-Pilot? It would allow inspecting both historical and live data, and it seems to have some level of NFS support, but I haven't tested the NFS side myself yet.

http://www.performancecopilot.org/

I have to mention SGI's Performance Co-Pilot again. (I mentioned it in a comment a few months ago.) In your case, I'd set it up to collect data every 5 mins (or whatever interval you want) and rotate the archives every month. You can either log locally or via TCP. The logs don't lose resolution over time (unlike rrdtool), so you always have the same resolution. The logs compress rather well, so you could get away with storing years of logs on a compressed ZFS dataset.
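
From memory, the pmlogger configuration for that kind of interval is only a few lines, very roughly like this (the syntax is from memory and the metric names are Linux ones chosen purely as examples; check pmlogger(1) for the real details):

  # rough pmlogger config sketch; syntax from memory, metrics are just examples
  log mandatory on once {
      hinv.ncpu
  }
  log advisory on every 5 minutes {
      kernel.all.load
      kernel.all.cpu
      disk.dev
      network.interface.in.bytes
      network.interface.out.bytes
  }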

I don't know if the graphing front has changed in the past couple of years, but that used to be the worst part of PCP. It has pmchart, which is an interactive GUI app that lets you explore logs, but nothing to generate (pretty) static images from archive data. There have been a number of changes since then that may make dashboard-making easier; I just don't know.

You can easily run it on both Linux and Solaris/OmniOS and get at hundreds of metrics without any effort.

By dozzie at 2014-04-08 11:20:02:

...and how much trouble is it to set up PCP? As I remember, PCP comes with a scary 160-page manual.

Chris, you could start with Graphite. It shouldn't be difficult to install[*], it's ridiculously easy to start collecting data (just open a socket to graphite:2003 and send a line like "foo.bar.baz 10 1396968142\n" (the metric's name, its value, and a Unix timestamp)), and graphite-web gives an easy start for plotting the graphs.
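
For example, the whole sending side can be a few lines of Python (just a sketch; the metric name and the graphite host here are made up):

  # send one value to carbon-cache's plaintext port (sketch; names made up)
  import socket, time

  def send_metric(name, value, host="graphite", port=2003):
      # plaintext protocol: "<metric path> <value> <unix timestamp>\n"
      line = "%s %s %d\n" % (name, value, int(time.time()))
      conn = socket.create_connection((host, port))
      conn.sendall(line.encode("ascii"))
      conn.close()

  send_metric("servers.foo.load1", 0.42)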

All you need running is:

  • carbon-cache
  • graphite-web (quite typical Django application)
  • some script to collect stats and to send them to carbon-cache

[*] I have just spent about 30 minutes backporting the graphite-web, graphite-carbon and python-whisper packages from Debian unstable to Debian oldstable. I took some drastic shortcuts, like assuming Django 1.2 would stand in for Django 1.6+ (it mostly did) or ignoring the libjs-jquery-flot package entirely (it bit me in the Graphlot part, but it doesn't seem required to just use graphite-web), but the whole thing seems to just work. On Ubuntu Saucy it should be even easier, since all the packages are in the universe repository.

What do you think, Chris? Can you spare half an hour to set up Graphite? You can even skip running a regular WWW server if you use uWSGI with a config like this:

; graphite-web.ini
[uwsgi]
uid = _graphite
gid = _graphite
plugins = python,http
http = 0.0.0.0:8180
mount = /=/usr/share/graphite-web/graphite.wsgi

By erlogan at 2014-04-08 14:12:26:

The venerable Cacti has been my tool of choice for this type of thing in the past. Cacti probably works best from SNMP queries, but it's fairly straightforward to get it to track and plot arbitrary data generated from a script. I used NRPE to run reporting scripts remotely (since I was already using Nagios), but you could just as easily use something like xinetd.
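
As a rough illustration (the field names and the second value are made up), a script-based data input method just has to print space-separated name:value pairs that Cacti maps onto data source fields, something like:

  # sketch of a Cacti "script" data input method (field names are examples)
  import os

  # 1-minute load average from the OS
  load1 = os.getloadavg()[0]
  # a made-up second field, standing in for whatever you actually collect
  active_users = 42

  # Cacti parses space-separated "name:value" pairs from the script's output
  print("load1:%.2f active_users:%d" % (load1, active_users))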

Hi Chris,

I second the use of graphite and statsd. It's a good central tool for gathering and displaying stats from all sorts of services.

One source that seems to work pretty well for us is running collectd on the servers but, rather than keeping the stats locally in RRD files, sending them all back to a central graphite server.
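
For reference, the collectd side of that is just the write_graphite plugin, configured roughly like this (the hostname is made up and the option details vary a bit between collectd versions):

  # collectd.conf fragment (sketch); ship values to a central carbon-cache
  LoadPlugin write_graphite
  <Plugin write_graphite>
    <Node "central">
      Host "graphite.example.com"
      Port "2003"
      Protocol "tcp"
      Prefix "collectd."
    </Node>
  </Plugin>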

You can then add in other metrics from other systems, even just ad-hoc scripts on a box sending data back to graphite using netcat.

The interface is a little quirky but very powerful, and we use the dashboards quite a bit for storing groups of graphs. There are a load of other dashboards that you can hook up to it, or you can just dump out the data that would make a graph and process it yourself.

I realize I'm commenting a bit too late. (Since Chris already decided to use graphite.)

PCP is very easy to set up. You just install the packages and you have a base system all set up: http://blahg.josefsipek.net/test/?p=437 & http://blahg.josefsipek.net/test/?p=438

I'm perplexed by the implication that well documented software is harder to set up. Had you not discovered the documentation, would you have thought that PCP was easier to set up?

Yes, I have to admit that feeding in new metrics (ones that no one has implemented a PMDA for) is more complicated than in graphite: you have to write a PMDA. Nowadays you have a choice of Perl, C, or Python (IIRC the Python support is stable now), and since there are existing PMDAs, examples and documentation, you can do so very easily. (Once upon a time, I wrote a gpsd PMDA because I wanted to use PCP to log my coordinates during a road trip.)

I don't know how easy it is to install collectd. Well, I'd expect installing it on Linux to be trivial. How about on Solaris 10 or OmniOS? PCP just works on those :)

I think it is a shame that PCP isn't more well known. I've used it for long-term system monitoring (5-min logging interval) as well as debugging performance issues (1-millisecond logging interval). The logs of course grow faster the more often you log.

I'm not really familiar with graphite, but I wonder how hard it'd be to combine the cross-platform logging awesomeness of PCP with the graphing ability of graphite.

By cks at 2014-04-10 12:27:58:

My understanding is that graphite is simply a backend that handles receiving, storing, and displaying timeseries data, so if you can generate things that are 'timestamp, metric name, value' you can send this into graphite. Normally you'll run a single graphite (logical) instance to accept these timeseries from all of your machines and other stats generation sources, instead of one per machine. Since this is close to the fundamental data that PCP deals with, you can presumably get the raw-ish data out of PCP somehow and feed it to graphite with sufficient hacks.

There are actually OmniOS packages for relatively current versions of collectd, which I was pleased to see. Otherwise you get to install it from source.

Written on 08 April 2014.