The wrong way to harvest system-level performance stats

September 24, 2012

I've recently been looking at packages to harvest system-level performance stats (because this is something that we should really be doing), and in the process I've repeatedly observed an anti-pattern that I need to rant about now.

One popular implementation technique is to have a central master program that runs 'plugin' scripts (or programs) to gather the specific stats. Each time tick the master program runs all the plugin programs, each of which is supposed to spit out some statistics that the master winds up forwarding to wherever. This model is touted as being flexible yet powerful; to add some new stats to track, you just write a script and drop it in the right directory.

Unfortunately you cannot do real stats gathering this way, not unless you are prepared to offload a significant amount of work to your stats backend. The problem is that too many interesting stats must be computed from the difference between two measurements. At the system level, such delta stats are simply presented as a count of events since a start time (which is very simple to implement). If you want to compute a rate, you need to look at the change in the counts over a time interval. This fundamentally requires an ongoing process, not a 'sample every time tick' one-shot script.
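As a rough illustration of why a rate needs state that survives between ticks, here is a minimal sketch of what an ongoing collector has to remember. The `RateTracker` class and its method names are hypothetical, invented for this example; they are not taken from any particular stats package.

```python
from typing import Optional, Tuple

# (timestamp in seconds, raw counter value as read from the system)
Sample = Tuple[float, int]


class RateTracker:
    """Turns successive raw counter samples into per-second rates.

    Hypothetical helper: it only works because the object stays alive
    between samples and so can remember the previous one.
    """

    def __init__(self) -> None:
        self._prev: Optional[Sample] = None

    def update(self, timestamp: float, count: int) -> Optional[float]:
        # Swap in the new sample and keep the old one for the subtraction.
        prev, self._prev = self._prev, (timestamp, count)
        if prev is None:
            return None  # first sample: nothing to subtract from yet
        elapsed = timestamp - prev[0]
        if elapsed <= 0:
            return None  # clock did not advance; no meaningful rate
        return (count - prev[1]) / elapsed


tracker = RateTracker()
tracker.update(0.0, 100)          # first tick, no rate yet
print(tracker.update(10.0, 600))  # 500 events over 10 seconds
```

A one-shot plugin script starts from scratch on every run, so it has no `_prev` to subtract from unless it stores it somewhere outside itself.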

Delta stats are important and common. They occur all over Linux (many system performance stats are exported by the kernel this way) and you also find them in things like Solaris's network stats (and probably in other Solaris performance stats, but I haven't looked that closely). Given the simplicity of the basic stats-exporting interfaces, I'd expect to find them in pretty much any OS.

You can make delta stats work in a run-every-tick environment, but it requires significant extra work and generally an auxiliary database in some form. Doing so is hammering a round peg into a square hole; it can be done, but you're forcing things instead of using the right tool for the job. Unfortunately the nice simple 'run command X every so often and give me output' model of stats gathering is fundamentally the wrong model. The right approach for stats harvesting is a constantly-running program that has the specific stats gathering directly embedded into itself somehow.

Comments on this page:

From at 2012-09-24 03:14:30:

Maybe my morning coffee hasn't kicked in yet, but I don't get why "This fundamentally requires an ongoing process". If you run a script at t1 and it outputs v1, and you run a script at t2 and it outputs v2, you can compute a rate, no?

What I think you mean is that if a stats monitoring system can only use values produced by the scripts directly, and can't compute rates for you, then you need to make your script compute the rate. And it can't do that if it simply returns a kernel counter. One way to resolve this would be to switch to a model of ongoing stats collection processes; such a process can compute a rate simply because it can hold v[n-1] and t[n-1] in memory. But I can imagine another approach which does not require an ongoing process: hold that previous pair in a file somewhere (e.g. under /var/lib/blah) and read and overwrite that file each time the script runs.
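The file-based approach described above might look something like the following sketch. The function name, the JSON layout, and the state file path are all assumptions made for illustration; the comment only says to keep the previous pair in a file and overwrite it on each run.

```python
import json
import os
import tempfile
from typing import Optional


def rate_from_state_file(path: str, timestamp: float, count: int) -> Optional[float]:
    """Compute a per-second rate against the sample saved by the previous run.

    Hypothetical helper: `path` is wherever the script keeps its state.
    """
    prev = None
    try:
        with open(path) as f:
            state = json.load(f)
        prev = (float(state["t"]), int(state["count"]))
    except (OSError, ValueError, KeyError, TypeError):
        prev = None  # missing or unreadable state: treat as the first run

    # Write the new sample to a temporary file and rename it into place,
    # so an interrupted run does not leave a half-written state file.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump({"t": timestamp, "count": count}, f)
    os.replace(tmp, path)

    if prev is None:
        return None
    elapsed = timestamp - prev[0]
    if elapsed <= 0:
        return None
    return (count - prev[1]) / elapsed
```

Even this small version has to deal with a missing file, a corrupt file, and partial writes, which gives a sense of the extra work involved in making one-shot scripts stateful.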

But there's another approach, which feels to me like the correct one: Treat computing rates as a presentation issue. The stats monitoring system should receive and store the underlying counter values. By knowing (or allowing the user to configure) which series represent delta stats, it can compute the rate series from the underlying data when it draws graphs or considers whether to fire alerts, etc.

From at 2012-09-24 09:29:44:

Personally I use collectd to shoot the raw counters over to Graphite and then on display use Graphite's derivative function to show me change over time.

I am of the belief that you should store the raw data and manipulate it later to fit your purposes.

From at 2012-09-24 09:58:39:

Performance data presentation transformation does not belong at the OS level. There is a rightful place for that unknown transformation to take place, and it sits between the OS and your eyes. Anything lower-level than that is making likely-wrong assumptions about my needs.

I can't really get myself to believe that you think otherwise.

By cks at 2012-09-24 11:27:16:

I don't think that the OS should be doing the delta transformation; there are far too many problems with that. I think it belongs in the basic programs that harvest stats from the OS and either present them to the user or ship them off to higher-level systems like graphing or monitoring software.

So why not do all of the delta computation in the graphing and monitoring software? My answer is that there are two issues that are hard to deal with in the backend software: timestamps, and counter rollover and other errors. With fine-grained stats, it isn't good enough to timestamp the raw measurements when the backend receives them from the collector; the collector really needs to ship a pair of timestamp and measure off to the backend, and then the backend needs to use the timestamps. You simply can't assume that the delay between reading the measure from the system in the collector and the backend receiving the measure is constant or that any changes in it are unimportant.
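To put a rough number on the timestamp problem, here is a small worked example. All of the figures (a 10-second tick, 1000 events, delivery delays of 0.1 and 1.1 seconds) are made up for illustration, and the function names are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, int]  # (timestamp in seconds, raw counter value)


def rate(first: Sample, last: Sample) -> float:
    """Per-second rate between two (timestamp, count) samples."""
    return (last[1] - first[1]) / (last[0] - first[0])


def stamp_on_receipt(samples: List[Sample], delays: List[float]) -> List[Sample]:
    """What the backend sees if it timestamps samples when they arrive."""
    return [(t + d, v) for (t, v), d in zip(samples, delays)]


# The collector read the counter at t=0 and t=10 and saw 1000 new events.
read_by_collector = [(0.0, 0), (10.0, 1000)]
# Assumed delivery delays: the second sample arrives a full second later
# than the first one did.
seen_by_backend = stamp_on_receipt(read_by_collector, [0.1, 1.1])

print(rate(*read_by_collector))  # rate using the collector's own timestamps
print(rate(*seen_by_backend))    # rate using arrival times, roughly 9% lower
```

The shorter the sampling interval, the larger the share of it that a given amount of delivery jitter takes up.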

Raw measure counters do overflow and experience other glitches; sometimes this happens quite frequently (a 32-bit byte count on a gigabit network card will roll over in less than a minute of full-rate traffic, for example). Something has to recognize and handle this, and I believe that doing it in the backend significantly complicates the backend. It also slows down many derivative-based computations, since the backend must now scan all data points between the start and end points to find counter rollovers.
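The rollover arithmetic can be checked with a short sketch. The function names are hypothetical, and the wraparound correction assumes the counter wrapped at most once between the two samples.

```python
def seconds_until_rollover(counter_bits: int, bytes_per_second: float) -> float:
    """How long a counter of the given width lasts at a constant byte rate."""
    return 2 ** counter_bits / bytes_per_second


def wrapped_delta(prev: int, cur: int, counter_bits: int = 32) -> int:
    """Change between two readings of a fixed-width counter.

    Assumes at most one wrap between the readings; two or more wraps
    cannot be told apart from fewer and will be undercounted.
    """
    return (cur - prev) % (2 ** counter_bits)


# Gigabit Ethernet at full rate is 10**9 bits/s, i.e. 125 * 10**6 bytes/s.
GIGABIT_BYTES_PER_SECOND = 10 ** 9 / 8

print(seconds_until_rollover(32, GIGABIT_BYTES_PER_SECOND))  # about 34 seconds
print(wrapped_delta(4_294_967_000, 200))  # 296 bytes up to the wrap, plus 200
```

The modular subtraction only covers a clean wrap at the counter's width. It does not recognize a counter that has been reset to zero, which needs measure-specific handling.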

(A special case of counter rollovers is system reboots, which restart all of these running counters from zero. This means that all such counters 'roll over' eventually and you need to handle this.)

From at 2012-09-26 11:27:46:

I certainly agree that anything timestamping the metric data as it reaches the destination is completely broken. What tools do that?

Your other point about counter rollover is real. I guess I don't see the need to solve the problem at every host agent as opposed to at a policy-configurable endpoint. I'm sure there's a good analogous debate about application logic vs. database stored procedure logic to glean from here.

By cks at 2012-09-27 12:37:51:

You make a fair point about the timestamps; I don't actually know how common tools like Graphite and so on deal with timestamps. It may well be that everyone already receives and stores timestamp plus measure pairs and this is a non-issue.

I think that host agents are the right place to deal with counter rollover for two reasons. First, the host agent is the natural place to put measure-specific information about what counter rollovers and other glitches actually look like. Second, if the backend can assume that the measure data is rollover-free, any delta computations are much simpler. It can simply take the start and end points, subtract one from the other (for both the delta time and the change in the measure), and be done. If the backend handles rollovers, all delta computations must scan all of the intermediate data points in order to spot and fix up any rollovers.
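The difference between the two backend cases can be sketched as follows. Both functions are hypothetical illustrations rather than how any particular backend works, and the rollover version assumes a 32-bit counter that wraps at most once between adjacent samples.

```python
from typing import List, Tuple

Sample = Tuple[float, int]  # (timestamp in seconds, counter value)


def rate_from_endpoints(samples: List[Sample]) -> float:
    """Rollover-free data: only the first and last points are needed."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


def rate_scanning_for_rollovers(samples: List[Sample], counter_bits: int = 32) -> float:
    """Raw data: every adjacent pair must be checked and fixed up."""
    modulus = 2 ** counter_bits
    total = 0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        total += (cur - prev) % modulus  # undoes a single wrap, if any
    return total / (samples[-1][0] - samples[0][0])


clean = [(0.0, 100), (10.0, 600), (20.0, 1100)]
raw = [(0.0, 4_294_967_000), (10.0, 200), (20.0, 700)]  # wraps after the first sample

print(rate_from_endpoints(clean))        # constant work per query
print(rate_scanning_for_rollovers(raw))  # work grows with the number of points
print(rate_from_endpoints(raw))          # negative: endpoints alone are wrong here
```

The first function does the same amount of work however long the time range is, while the second has to touch every stored point in it.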

From at 2012-09-28 01:43:13:

Rollovers are easy enough to deal with, it's the double rollover that gets you.

Still, if you are counting the bits transferred on a heavily loaded gigabit interface, you'd realise pretty soon that you can't realistically use a 32-bit counter.

I agree with the other commentators that deltas are best dealt with outside of the monitoring system.

Monitoring and collection must be as lightweight as possible so that they add as little load as possible to the underlying system. This is why DTrace in Solaris 10 is so awesome. Kernel level instrumentation for everything at next to 0.0% load.


By cks at 2012-09-28 08:55:44:

The raw counters come from the operating system, limitations included. I entirely agree with you about 32-bit counters not being a good idea, but I've still seen OSes and drivers that did that (if I remember right, it was driver specific).

Written on 24 September 2012.

Last modified: Mon Sep 24 01:08:50 2012