One of the reasons good alerting is tough
June 13, 2009
One of the reasons that alerting is a tough problem to solve well is what I'll call the dependency problem. It goes like this: imagine that you have a nice monitoring system and it's keeping track of all sorts of things in your environment. One day you get a huge string of alerts, reporting that server after server is down. Oh, and also a network switch isn't responding.
Of course, the real problem is that the switch has died. It's being camouflaged behind a barrage of spurious alerts about all of the servers behind it, which are no longer reachable and so look just like they've crashed too. This is the alerting dependency problem: the objects you're monitoring aren't independent, they're interconnected. Reporting on everything as if it were independent produces results that are not very useful, especially during major failures.
The obvious but useless solution to this is that you should configure the service dependencies when you add a new thing to be monitored. This has at least two problems. First, sysadmins are just as lazy as everyone else, especially when they're busy to start with. Second, this dependency information is subject to the problem that sooner or later, any information that doesn't have to be correct for the system to work won't be. Perhaps someone will make a mistake when adding or changing things, or maybe someone will forget to update the monitoring system when a machine is moved, and so on.
(One way to look at this is that the dependency information is effectively comprehensive documentation on how your systems are organized and connected. If this is not something you're already doing, there's no reason to think that the documentation problem is going to be any more tractable when it's done through your monitoring system. If you are already doing this, congratulations.)
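To make the hand-configured approach concrete, here is a minimal sketch of what a monitoring system could do with such dependency information, assuming it is correct. Everything here is hypothetical (the `DEPENDS_ON` map, the machine names, and the `root_causes` function); it is not taken from any particular monitoring system.

```python
# Hypothetical hand-maintained dependency map: each monitored thing
# lists what it needs in order to be reachable. Assumed to be acyclic.
DEPENDS_ON = {
    "switch1": [],
    "server1": ["switch1"],
    "server2": ["switch1"],
    "server3": [],
}


def root_causes(down, depends_on):
    """Return the down things that have no down dependency,
    direct or indirect. These are the ones worth alerting about."""

    def has_down_ancestor(name, seen):
        for parent in depends_on.get(name, []):
            if parent in seen:
                continue  # already looked at this one
            seen.add(parent)
            if parent in down or has_down_ancestor(parent, seen):
                return True
        return False

    return sorted(n for n in down if not has_down_ancestor(n, {n}))


# The switch dies and takes two servers with it: one alert, not three.
print(root_causes({"switch1", "server1", "server2"}, DEPENDS_ON))
```

The filtering is the easy part. The sketch is only as good as the map it's given, which is exactly the information that tends to go stale.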
So, really, a good alerting system needs to understand a fair bit about system dependencies and be able to automatically deduce or infer as many as possible, so that it can give you sensible problem reports. This is, as they say, a non-trivial problem.
(Bad alerting systems descend to fault reporting.)
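As an illustration of what inferring dependencies might look like, here is a deliberately naive sketch that guesses them from past outages. The heuristic, the `infer_dependencies` function, and the sample data are all assumptions made up for this example; a real system would need far more than this (it can't tell direction for things that always fail together, and it needs outage history to work from).

```python
from collections import Counter
from itertools import permutations


def infer_dependencies(outages, min_events=2):
    """Guess (dependent, dependency) pairs from past outages.

    outages is a list of sets, each naming everything seen down together.
    We guess that x depends on y when y has been down at least
    min_events times, x was down every one of those times, and x has
    also been down at least once without y.
    """
    down_count = Counter()  # how often each thing was down
    together = Counter()    # how often each ordered pair was down together
    for event in outages:
        down_count.update(event)
        together.update(permutations(sorted(event), 2))

    guesses = []
    for (x, y), both in together.items():
        if (down_count[y] >= min_events
                and both == down_count[y]
                and down_count[x] > both):
            guesses.append((x, y))
    return sorted(guesses)


# Hypothetical history: the switch failed twice, taking both servers
# with it each time, and each server also crashed once on its own.
history = [
    {"switch1", "server1", "server2"},
    {"server1"},
    {"switch1", "server1", "server2"},
    {"server2"},
]
print(infer_dependencies(history))
```

Even in this toy form, the guesses only appear after the switch has already failed a couple of times, which hints at why doing this well is hard.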