Monitoring systems should always be usefully informative

June 8, 2009

There is a fashion with monitoring systems where, once you have one, you start monitoring and alerting on everything that you can possibly think of, no matter what it is. If you can measure it, you do, and when it gets too big or too small your system lets people know about it.

This is, by and large, a mistake.

It is a mistake because you've created a system that isn't actually (usefully) informative, just noisy. What your monitoring system should be telling you about is real problems that you can do something about, not things that either aren't real problems or aren't problems that you can do anything about.

(Take, for example, the ever-popular monitoring of free user disk space. At least around here, there is nothing we can do about it if a user filesystem runs out of space; we can neither go in and remove user files to get space back nor add more space that the users haven't paid for.)

The less noise your monitoring system has, the more likely people are to look at it and actually pay attention if it has trouble indicators. A monitoring system that always shows trouble indicators is about as useful as a fire alarm that is on all the time (although probably less annoying).

Yes, yes, people can learn to ignore 'known harmless' trouble indicators. The problem is that this takes mental work, which means that it takes more effort to check the monitoring system, which means people do it less often or pay less attention to it or both. It also means that you cannot look at a top level summary and get anything useful from it, because the overall system is never in an 'all green' condition. And having something that you can quickly glance at to look for problems is a significant win.
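To make the 'all green' point concrete, here is a minimal sketch in Python of a top level summary that only goes red for problems someone can act on. The `Check` class, its `actionable` flag, and the check names are all hypothetical, invented for illustration; they are not taken from any real monitoring system.

```python
from dataclasses import dataclass


@dataclass
class Check:
    name: str
    failing: bool
    # Hypothetical flag: is this something we can actually do something about?
    actionable: bool


def summary(checks):
    """Return ('green', []) unless at least one actionable check is failing."""
    problems = [c.name for c in checks if c.failing and c.actionable]
    if problems:
        return ("red", problems)
    return ("green", [])


# Made-up example state: a full user filesystem is "failing", but since
# nobody can fix it, it does not spoil the top level summary.
checks = [
    Check("mail queue backed up", failing=False, actionable=True),
    Check("user filesystem full", failing=True, actionable=False),
]

status, problems = summary(checks)
print(status, problems)
```

If the non-actionable check counted toward the summary, the summary would be permanently red and a quick glance would tell you nothing.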

Sidebar: the case for widespread monitoring

There is a case for tracking everything you can, provided that your monitoring system keeps history and can display 'measure over time' graphs or the like. Then what you're doing is getting statistics, which is vital. But if you're tracking things for statistics, you should not alert on them.

So by all means track user disk space usage, so that you can draw nice graphs about six-month trends in space growth that clearly justify another shelf of disks. Just don't alert on it.

(This is one area where canned monitoring systems are your friends, because they have probably already got systems to keep lots of history of random measurements and graph them for you.)
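As a rough sketch of the 'track it, but don't alert on it' idea from this sidebar, the Python below records measurements into a history and works out a growth trend from them, with no alerting path at all. The `record` and `growth_per_30_days` functions, the metric name, and the sample numbers are all made up for illustration. A real canned monitoring system would do the storage and graphing for you.

```python
from datetime import date

# Metric name -> list of (day, value) samples. This is purely a history
# store; nothing in this sketch ever raises an alert.
history = {}


def record(metric, day, value):
    """Append one measurement to the history for a metric."""
    history.setdefault(metric, []).append((day, value))


def growth_per_30_days(metric):
    """Average growth per 30 days between the first and last samples."""
    samples = sorted(history[metric])
    (first_day, first_value), (last_day, last_value) = samples[0], samples[-1]
    days = (last_day - first_day).days
    if days == 0:
        raise ValueError("need samples from at least two different days")
    return (last_value - first_value) / days * 30


# Made-up numbers: used space in GB on a hypothetical user filesystem.
record("home_used_gb", date(2009, 1, 1), 400.0)
record("home_used_gb", date(2009, 6, 30), 580.0)

print(growth_per_30_days("home_used_gb"))  # 30.0 (GB per 30 days)
```

A number like this is what ends up in the graph that justifies the next shelf of disks. It is something to look at when planning, not something to be woken up about.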

Last modified: Mon Jun 8 01:45:17 2009