System metrics need to be documented, not just to exist

October 13, 2014

As a system administrator, I love systems that expose metrics (performance, health, status, whatever they are). But there's a big caveat to that, which is that metrics don't really exist until they're meaningfully documented. Sadly, documenting your metrics is much less common than simply exposing them, perhaps because it takes much more work.

At the best of times this forces system administrators and other bystanders to reverse engineer your metrics from your system's source code or from programs that you or other people write to report on them. At the worst this makes your metrics effectively useless; sysadmins can see the numbers and see them change, but they have very little idea of what they mean.

(Maybe sysadmins can dump them into a stats tracking system and look for correlations.)

Forcing people to reverse engineer the meaning of your stats has two bad effects. The obvious one is that people almost always wind up duplicating this work, which is just wasted effort. The subtle one is that it is terribly easy for a mistake about what a metric means to become, essentially, superstition that everyone knows and spreads. Because people are reverse engineering things in the first place, it's very easy for mistakes and misunderstandings to happen; then people write the mistake down or embody it in a useful program, and pretty soon it is being passed around the Internet, since it's one of the few resources on the stats that exist. One mistake will be propagated into dozens of useful programs, various blog posts, and so on, and through the magic of the Internet many of these secondary sources will come off as unhesitatingly authoritative. At that point, good luck getting any sort of correction out into the Internet (if you even notice that people are misinterpreting your stats).

At this point some people will suggest that sysadmins should avoid doing anything with stats that they reverse engineer unless they are absolutely, utterly sure that they're correct. I'm sorry, life doesn't work this way. Very few sysadmins reverse engineer stats for fun; instead, we're doing it to solve problems. If our reverse engineering solves our problems and appears sane, many sysadmins are going to share their tools and what they've learned. It's what people do these days; we write blog posts, we answer questions on Stack Overflow, we put up GitHub repos with 'here, these are the tools that worked for me'. And all of those things flow around the Internet.

(Also, the suggestion that people should not write tools or write up documentation unless they are absolutely sure that they are correct is essentially equivalent to asking people not to do this at all. To be absolutely sure that you're right about a statistic, you generally need to fully understand the code. That's what they call rather uncommon.)

Comments on this page:

By Ewen McNeill at 2014-10-13 05:22:55:

For better or worse I think that part of the reason that such things are not documented is that the developers/producers of the software do not want to commit to them being part of the stable API; and there's a default assumption that anything documented is part of the stable API.

Possibly there needs to be a community acceptance of a part of the API which is "fragile" or "implementation dependent", for which all that is guaranteed is that it works in version X, and that if it doesn't work the same in version X+1 (or X.1 or X.0.1) that will be documented when the new version comes out (possibly "and someone discovers it is broken"). So you can choose to use the "works for now" part of the API if you want, but if it breaks in a later version you own both halves.

In the past this undocumented part was "used by software developer/manufacturer internally developed tools only". But as you say with everyone forced to reverse engineer things just to find out anything beyond "it's a black box", they're not going to stay "internal only" forever.


Just to clarify, what parts of a metrics collection do you think need to be documented? That this application sends these stats out, that we collect said stats in this manner, or how we use the collected stats?

Would you consider it enough, or just a starting point, to say "This java application sends how many users it has seen in the past 30 minutes every 30 minutes to graphite server XYZ. We display a graph of this, trending over a month, on dashboard ABC." (Of course, this particular metric can be argued as to its value, but it's a useful marketing piece and it makes management happy to see the ticking numbers.)

By cks at 2014-10-14 00:55:42:

I'm thinking of applications and systems that make stats available in some form without saying what they mean. In Linux, for example, there's a whole collection of underdocumented stats in /proc et al (I wrote an entire series about the NFS client stats). There are many forms this can take at both the system and application level, e.g. exposed in interfaces like /proc, put in log files, reported in SNMP MIBs, sent off to metrics systems with little discussion of what they mean, and so on.

From my current perspective I'd say that your example documentation is clearly enough, as it explains both what the metric means and where to find it. An example of an undocumented metric would be just sending a 'recentusers' metric off to whatever graphite server had been configured in the application's setup without explaining what 'recentusers' meant.

(Of course the application's programmers may think that calling the metric 'recentusers' is enough, but no, not really.)
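As a hypothetical illustration of that last point (the metric name, server address, and documentation text here are invented for the example, not taken from any real application): the Graphite plaintext protocol carries only a name, a value, and a timestamp per line, so nothing on the wire tells a sysadmin what 'recentusers' means. All of that meaning has to live in documentation kept next to the code that emits the metric. A minimal sketch in Python might look like:

```python
import socket
import time

# The wire format carries no meaning, so document each metric where
# it is emitted. This dictionary stands in for real documentation:
#   recentusers: count of distinct users seen in the past 30 minutes,
#   re-sent every 30 minutes; the count resets when the app restarts.
METRIC_DOCS = {
    "recentusers": "distinct users seen in the past 30 minutes; "
                   "re-sent every 30 minutes; reset on restart",
}

def format_metric(name, value, timestamp=None):
    # Graphite plaintext protocol: one "name value timestamp" line,
    # newline-terminated, with the timestamp in Unix seconds.
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %s %d\n" % (name, value, timestamp)

def send_metric(name, value, server="graphite.example.com", port=2003):
    # Send one metric line to a Graphite plaintext listener
    # (port 2003 is Graphite's standard plaintext port).
    line = format_metric(name, value)
    with socket.create_connection((server, port), timeout=5) as conn:
        conn.sendall(line.encode("ascii"))
```

The protocol part is trivial; the point is the documentation block, which is exactly the piece that applications tend to leave out.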
