2012-09-24
The wrong way to harvest system-level performance stats
I've recently been looking at packages to harvest system-level performance stats (because this is something that we should really be doing), and in the process I've repeatedly observed an anti-pattern that I need to rant about now.
One popular implementation technique is to have a central master program that runs 'plugin' scripts (or programs) to gather the specific stats. Each time tick the master program runs all the plugin programs, each of which is supposed to spit out some statistics that the master winds up forwarding to wherever. This model is touted as being flexible yet powerful; to add some new stats to track, you just write a script and drop it in the right directory.
Unfortunately you cannot do real stats gathering this way, not unless you are prepared to offload a significant amount of work to your stats backend. The problem is that too many interesting stats must be computed from the difference between two measurements. At the system level, such delta stats are simply presented as a running count of events since some start time (which is very simple for the system to implement). If you want to compute a rate, you need to look at the change in the counts over a time interval, which means that something has to remember the previous count and when it was taken. This fundamentally requires an ongoing process, not a 'sample every time tick' one-shot script that starts from nothing on every run.
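To make the arithmetic concrete, here is a minimal sketch of turning two readings of a cumulative counter into a rate. The names `Sample` and `rate` are made up for illustration and don't come from any particular stats package. The point is that the function needs the previous sample as an input, which a one-shot script does not have.

```python
from typing import NamedTuple, Optional


class Sample(NamedTuple):
    when: float  # seconds, from whatever clock you sample with
    count: int   # cumulative events since some start time


def rate(prev: Sample, cur: Sample) -> Optional[float]:
    """Events per second between two samples, or None if unusable."""
    elapsed = cur.when - prev.when
    if elapsed <= 0 or cur.count < prev.count:
        # No time has passed, or the counter went backwards
        # (for example because the machine rebooted).
        return None
    return (cur.count - prev.count) / elapsed
```

For example, a counter that goes from 1000 to 1500 over ten seconds works out to 50 events per second.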
Delta stats are important and common. They occur all over Linux (many system performance stats are exported by the kernel this way) and you also find them in things like Solaris's network stats (and probably in other Solaris performance stats, but I haven't looked that closely). Given the simplicity of the basic stats-exporting interfaces, I'd expect to find them in pretty much any OS.
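As one Linux illustration, `/proc/stat` has a `ctxt` line that is the total number of context switches since boot. A rough sketch of reading it might look like the following; the function names are my own invention, and the parsing assumes the usual `ctxt <number>` layout. Notice that a single reading only gives you a large, ever-growing number, which tells you nothing about the current context switch rate.

```python
from typing import Optional


def parse_ctxt(proc_stat_text: str) -> Optional[int]:
    """Pull the cumulative context switch count out of /proc/stat text."""
    for line in proc_stat_text.splitlines():
        fields = line.split()
        # Assumed layout: the literal word 'ctxt' followed by one number.
        if len(fields) == 2 and fields[0] == "ctxt":
            return int(fields[1])
    return None  # no ctxt line found


def read_ctxt(path: str = "/proc/stat") -> Optional[int]:
    """Read the current count; this only works on a Linux system."""
    with open(path) as f:
        return parse_ctxt(f.read())
```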
You can make delta stats work in a run-every-tick environment, but it requires significant extra work and generally an auxiliary database in some form to hold the previous readings between runs. This is generally trying to hammer a square peg into a round hole; it can be done, but you're forcing things instead of using the right tool for the job. Unfortunately, the nice simple 'run command X every so often and give me output' model is fundamentally the wrong model for stats gathering. The right approach for stats harvesting is a constantly-running program that has the specific stats gathering directly embedded into itself somehow.
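Here is a rough sketch of the constantly-running shape, under some assumptions of my own: each stat is gathered by a plain function that returns a cumulative count, and `Collector`, `tick` and `run` are hypothetical names rather than any real package's API. The previous readings live in the process's memory, so there is no auxiliary database and the delta computation is written once instead of in every plugin.

```python
import time
from typing import Callable, Dict, Tuple

# A gatherer returns the current cumulative count for one stat.
Gatherer = Callable[[], int]


class Collector:
    def __init__(self, gatherers: Dict[str, Gatherer]) -> None:
        self.gatherers = gatherers
        # Previous (time, count) per stat, kept in memory between ticks.
        self.last: Dict[str, Tuple[float, int]] = {}

    def tick(self, now: float) -> Dict[str, float]:
        """Take one round of readings; return per-second rates."""
        rates: Dict[str, float] = {}
        for name, gather in self.gatherers.items():
            count = gather()
            prev = self.last.get(name)
            self.last[name] = (now, count)
            if prev is None:
                continue  # first reading, nothing to diff against yet
            then, old = prev
            if now > then and count >= old:
                rates[name] = (count - old) / (now - then)
        return rates


def run(collector: Collector, interval: float = 10.0, emit=print) -> None:
    """Loop forever, handing each round of rates to emit()."""
    while True:
        rates = collector.tick(time.monotonic())
        if rates:
            emit(rates)
        time.sleep(interval)
```

The first tick produces nothing because there is no earlier reading to compare against; every later tick produces rates. A real version would want to forward the results to a stats backend instead of printing them.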