One of the reasons good alerting is tough

June 13, 2009

One of the reasons that alerting is a tough problem to solve well is what I'll call the dependency problem. It goes like this: imagine that you have a nice monitoring system and it's keeping track of all sorts of things in your environment. One day you get a huge string of alerts, reporting that server after server is down. Oh, and also a network switch isn't responding.

Of course, the real problem is that the switch has died. It's being camouflaged behind a barrage of spurious alerts about all of the servers behind it, which are no longer reachable and so look just like they've crashed too. This is the alerting dependency problem: the objects you're monitoring aren't independent, they're interconnected. Reporting on everything as if it were independent produces results that aren't very useful, especially during major failures.
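To make that concrete, here's a minimal sketch (in Python) of the kind of dependency-aware suppression you'd want, assuming you somehow already have a map of what sits behind what; the host names and the map itself are made up for illustration:

    # Collapse alerts for hosts whose upstream parent is also down, so
    # only the root failure (here, the switch) gets reported.
    parents = {
        "web1": "switch-a",
        "web2": "switch-a",
        "db1": "switch-a",
        "switch-a": None,       # top of this particular chain
    }

    def root_causes(down_hosts):
        """Return only the hosts whose failure isn't already explained
        by a parent that is itself down."""
        down = set(down_hosts)
        roots = []
        for host in down:
            parent = parents.get(host)
            if parent is None or parent not in down:
                roots.append(host)
        return roots

    # All four report as down, but only the switch is worth alerting on.
    print(root_causes(["web1", "web2", "db1", "switch-a"]))  # ['switch-a']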

The obvious but useless solution to this is that you should configure the service dependencies when you add a new thing to be monitored. This has at least two problems. First, sysadmins are just as lazy as everyone else, especially when they're busy to start with. Second, this dependency information is subject to the problem that sooner or later, any information that doesn't have to be correct for the system to work won't be. Perhaps someone will make a mistake when adding or changing things, or maybe someone will forget to update the monitoring system when a machine is moved, and so on.

(One way to look at this is that the dependency information is effectively comprehensive documentation on how your systems are organized and connected. If this is not something you're already doing, there's no reason to think that the documentation problem is going to be any more tractable when it's done through your monitoring system. If you are already doing this, congratulations.)

So, really, a good alerting system needs to understand a fair bit about system dependencies and be able to automatically deduce or infer as many as possible, so that it can give you sensible problem reports. This is, as they say, a non-trivial problem.
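As one illustration of the 'infer' part, here is a crude sketch that guesses a host's upstream dependency from the routed path to it (assuming traceroute is installed). It only sees layer 3 hops, not things like which switch a machine is actually plugged into, which is part of why this is a non-trivial problem:

    import subprocess

    def infer_parent(host):
        """Guess a host's network dependency by taking the last
        intermediate hop on the path to it.  Crude: assumes traceroute
        is available and that routing reflects physical dependency."""
        out = subprocess.run(["traceroute", "-n", host],
                             capture_output=True, text=True).stdout
        hops = []
        for line in out.splitlines()[1:]:          # skip the header line
            fields = line.split()
            if len(fields) > 1 and fields[1] != "*":
                hops.append(fields[1])
        # The final hop is the host itself; the one before it is our guess.
        return hops[-2] if len(hops) >= 2 else None

    # e.g. infer_parent("web1.example.com") might return a router's IP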

(Bad alerting systems descend to fault reporting.)


Comments on this page:

From 217.44.35.86 at 2009-06-13 04:55:05:

Yup. I've been musing about this (and related issues) for a while. Keeping related systems (CMDB, monitoring, assets) up to date either needs good change control procedures or help from the systems themselves (in fact probably both).

I wrote some of this up (starting from the problem of why asset registers tend not to be updated) at http://blogs.ncl.ac.uk/blogs/index.php/paul.haldane/2008/10/12/asset_registers

The ideal is a system which notices when a machine's connection is moved from one switch to another and updates all relevant systems (either asking for confirmation first or telling the humans afterwards, depending on how confident you feel).
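As a rough Python sketch of the detection half (the snapshots here are made up; in practice they'd come from walking each switch's bridge/forwarding table, e.g. over SNMP):

    def detect_moves(old, new):
        """Compare two {mac: "switch/port"} snapshots and report MACs
        that have moved, so the humans (or the CMDB) can be told."""
        return [(mac, old[mac], loc)
                for mac, loc in new.items()
                if mac in old and old[mac] != loc]

    # Made-up data standing in for yesterday's and today's snapshots.
    yesterday = {"00:16:3e:aa:bb:cc": "switch-a/12",
                 "00:16:3e:dd:ee:ff": "switch-a/13"}
    today = {"00:16:3e:aa:bb:cc": "switch-b/3",
             "00:16:3e:dd:ee:ff": "switch-a/13"}

    for mac, was, now in detect_moves(yesterday, today):
        print(f"{mac} moved from {was} to {now}; update records (or ask first)")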

Paul Haldane

From 67.71.28.99 at 2009-06-13 07:33:47:

I've been mulling over this too. The first step is to try to have something automatically deduce relationships between things. The next thing I've been experimenting with is this: if several systems all become alertable within two reporting periods, scan through an issue tracker looking for an issue that had the same set of failures. If one is found, attach that ticket to the alert as a "this is possibly a recurrence of this type of fault, and it should contain what the problem was and how it was fixed last time".

People can then assign a whole string of alerts to one issue (either reopening an issue that was suggested by the system, or creating one and assigning the alerts that have occurred to it). Then, as long as the alerts are outstanding, they are all grouped into that one issue. You then have a record of how issues were handled, and your alerting system becomes useful when you try to determine what the root cause failure actually is and what you need to do to fix it. (Did your switch fail because the firmware version on that switch has a known once-every-couple-of-months lockup bug, but the vendor hasn't yet given you the replacement firmware?)
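A rough Python sketch of the matching step (the issue data here is made up; in practice it would come from the tracker's API), using set overlap to decide whether the current burst of alerts looks like a past one:

    def best_matching_issue(current_failures, past_issues, threshold=0.8):
        """Find the past issue whose set of failures most resembles the
        current burst of alerts (Jaccard overlap), if the match is strong
        enough to be worth suggesting as a possible recurrence."""
        current = set(current_failures)
        best, best_score = None, 0.0
        for issue_id, failures in past_issues:
            failures = set(failures)
            overlap = len(current & failures) / len(current | failures)
            if overlap > best_score:
                best, best_score = issue_id, overlap
        return best if best_score >= threshold else None

    # Made-up past issues; real ones would be pulled from your tracker.
    past_issues = [
        ("TICKET-101", {"web1 ping", "web2 ping", "switch-a ping"}),
        ("TICKET-205", {"db1 disk", "db1 load"}),
    ]
    print(best_matching_issue({"web1 ping", "web2 ping", "switch-a ping"},
                              past_issues))   # TICKET-101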

Perry Lorier

From 68.36.54.173 at 2009-06-13 09:02:31:

I have this dream for monitoring on my network...

My current setup is this: I've got two locations, the primary and the backup. It's a warm backup currently, requiring manual intervention to go live. It does feature its own Nagios server, however. Each site's Nagios server is set up as a full live server, not as a slave to the other. Each one monitors all local machines, plus all network and VPN connections up to the gateway of the other site, plus the remote Nagios host, to make sure Nagios is up and running.

My dream is to have each one of the Nagios machines connected to a phone line & modem. Each machine should call the other every, oh, I don't know, 12 hours or so, and verify that the other can answer the phone and is listening.

Because we've got a reliable telephone solution at that point, we can add another reporting method. In the event of a critical server being down, the local nagios instance can phone me direct to let me know.

I haven't decided if the ideal would be festival or pre-recorded WAVs, but either way would work alright, I think. At least well enough to let me know that something was up.
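Something like this Python sketch could handle the "can the other site answer the phone" half, assuming a Hayes-compatible modem on a serial port and pyserial; the device path and phone number are made up, and the voice-alerting side (festival or WAVs over a voice modem) would have to sit on top of this:

    import time
    import serial   # pyserial; assumes a Hayes-compatible modem

    def other_site_answers(port="/dev/ttyS0", number="5551234", wait=60):
        """Dial the other site's modem and see whether it picks up;
        returns True if we got a CONNECT back before the timeout."""
        with serial.Serial(port, 9600, timeout=wait) as modem:
            modem.write(b"ATZ\r")                          # reset the modem
            time.sleep(1)
            modem.reset_input_buffer()
            modem.write(("ATDT" + number + "\r").encode()) # dial out
            # Crude: just wait out the timeout and look at what came back.
            reply = modem.read(256).decode(errors="replace")
            time.sleep(1)
            modem.write(b"+++")                            # escape to command mode
            time.sleep(1)
            modem.write(b"ATH\r")                          # hang up
            return "CONNECT" in reply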

Matt Simmons
http://standalone-sysadmin.blogspot.com
