System metrics need to be documented, not just to exist

October 13, 2014

As a system administrator, I love systems that expose metrics (performance, health, status, whatever they are). But there's a big caveat to that, which is that metrics don't really exist until they're meaningfully documented. Sadly, documenting your metrics is much less common than simply exposing them, perhaps because it takes much more work.

At the best of times this forces system administrators and other bystanders to reverse engineer your metrics from your system's source code or from programs that you or other people write to report on them. At the worst this makes your metrics effectively useless; sysadmins can see the numbers and see them change, but they have very little idea of what they mean.

(Maybe sysadmins can dump them into a stats tracking system and look for correlations.)

Forcing people to reverse engineer the meaning of your stats has two bad effects. The obvious one is that people almost always wind up duplicating this work, which is just wasted effort. The subtle one is that it is terribly easy for a mistake about what the metrics means to become, essentially, superstition that everyone knows and spreads. Because people are reverse engineering things in the first place, it's very easy for mistakes and misunderstandings to happen; then people write the mistake down or embody it in a useful program and pretty soon it is being passed around the Internet since it's one of the few resources on the stats that exist. One mistake will be propagated into dozens of useful programs, various blog posts, and so on, and through the magic of the Internet many of these secondary sources will come off as unhesitatingly authoritative. At that point, good luck getting any sort of correction out into the Internet (if you even notice that people are misinterpreting your stats).

At this point some people will suggest that sysadmins should avoid doing anything with stats that they reverse engineer unless they are absolutely, utterly sure that they're correct. I'm sorry, life doesn't work this way. Very few sysadmins reverse engineer stats for fun; instead, we're doing it to solve problems. If our reverse engineering solves our problems and appears sane, many sysadmins are going to share their tools and what they've learned. It's what people do these days; we write blog posts, we answer questions on Stackoverflow, we put up Github repos with 'here, these are the tools that worked for me'. And all of those things flow around the Internet.

(Also, the suggestion that people should not write tools or write up documentation unless they are absolutely sure that they are correct is essentially equivalent to asking people not to do this at all. To be absolutely sure that you're right about a statistic, you generally need to fully understand the code. That's what they call rather uncommon.)

Written on 13 October 2014.
« Phish spammers are apparently exploiting mailing list software
Bashisms in #!/bin/sh scripts are not necessarily bugs »

Page tools: View Source, Add Comment.
Search:
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Mon Oct 13 01:24:54 2014
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.