The general lesson from the need for metrics

December 10, 2012

The lesson I learned about why metrics are important is a significant one, but it's also a specific one. It would be a shame to stop there, because there is a general lesson lurking in the underbrush behind it. That is:

Fallible humans are always going to overlook something.

This is the real lesson of fragile complexity, in all its various specific facets. Our systems are too complex for us to genuinely understand, and that complexity means we are always going to overlook something (and sooner or later that something will matter).

One of the things we need to do in system administration is to engineer large-scale, high-level approaches to our problems that can deal with this messy realization and that do not depend on post-facto specific fixes. It's always tempting to apply post-facto fixes, to say things like 'I'll make sure to check for performance problems after future changes to our fileserver infrastructure', but this is never going to be good enough. Even apart from the pragmatic issues pointed out by Perry Lorier in a comment, this is a fundamentally backward-looking solution; it deals with the problem we found this time around but it doesn't necessarily deal with a future problem.

This is the generalized reason for automated metrics collection and monitoring. If you gather metrics you're constructing a backstop for human fallibility. If and when something goes wrong because of something people overlooked, you have a chance to see it and catch it before things explode, a chance that you would not have if you relied purely on post-facto fixes.

A direct corollary of this is that it's important to gather all the metrics that you can, even for things that you don't think you have any use for. Gathering only the metrics you have a use for now is a backward-looking solution; you're assuming that you know what you need. Fragile complexity says that you're wrong: you don't yet know what you're going to need in order to spot the next problem, a problem that you didn't even foresee being possible. So gather everything you can. That way you have a chance to beat the future.
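
To make 'gather everything' a bit more concrete, here is a minimal sketch in Python of a collector that has no whitelist: it records every value that every source reports, and one broken source doesn't stop the rest from being recorded. Everything in it is an assumption made for illustration (the names `collect_all`, `disk_usage`, `load_average` and `EXAMPLE_SOURCES` are hypothetical, and the two example sources are just things the standard library happens to expose); it is not how any particular monitoring system works.

```python
import os
import shutil
import time


def collect_all(sources, now=None):
    """Run every source and keep every value it reports (no whitelist)."""
    timestamp = time.time() if now is None else now
    sample = {"timestamp": timestamp, "metrics": {}, "errors": {}}
    for name, read in sources.items():
        try:
            values = read()
        except Exception as exc:
            # A broken source is noted, but it must not stop the others.
            sample["errors"][name] = repr(exc)
            continue
        for key, value in values.items():
            # Keep every field, including ones we have no use for today.
            sample["metrics"][f"{name}.{key}"] = value
    return sample


def disk_usage():
    # Hypothetical example source: every field shutil reports for "/".
    return shutil.disk_usage("/")._asdict()


def load_average():
    # Hypothetical example source; os.getloadavg() is Unix-only.
    one, five, fifteen = os.getloadavg()
    return {"1min": one, "5min": five, "15min": fifteen}


# Illustrative only; a real setup would have many more sources.
EXAMPLE_SOURCES = {"disk": disk_usage, "load": load_average}
```

Calling `collect_all(EXAMPLE_SOURCES)` periodically and storing the results somewhere is the whole idea. The point of the design is that adding a new source never requires deciding in advance which of its fields will matter.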
