Should you alert on the glaringly obvious?

December 17, 2012

First off, I will say that part of this question is due to a peculiarity of the academic environment that I work in; we don't (at least officially) do anything outside of the working day. This creates a category of system problems that are glaringly obvious. If we're at our computers at all, we're going to notice when they happen.

(All of these are actionable alerts, things that we need to act on.)

Which brings me around to the question of whether our alerting system should generate alerts for these glaringly obvious problems. As I see it, there is one argument against generating the alerts and one and a half for.

The argument against generating the alerts is that they're both unnecessary and potentially distracting in the resulting crisis. By definition the glaringly obvious is something that you notice, and the last thing you need in the middle of a problem is to be hit by more noise in the form of your alert system telling you what you already know.

(This is especially dangerous if your alert system is going to be very noisy about a glaringly obvious problem. At that point it becomes quite easy to miss other messages or to overlook alerts about other things that are also going wrong.)

On the other hand, generated alerts create a marker (at least if done well). When you go back later for post-facto analysis the alerts can tell you when things started happening and when they stopped, which is information that you're probably not going to meticulously note down in the middle of a crisis. You can deal with the noise problem by keeping the alerts as quiet as possible (no email or paging, for example, just red markers on your dashboard).
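As a rough illustration of the 'quiet marker' idea, here is a minimal sketch of an alert log that only records when a check started and stopped failing, and never emails or pages anyone. The class name `QuietAlertLog`, its methods, and the check name `dns` are all hypothetical; a real alerting system will have its own way of doing this.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, List, Tuple


@dataclass
class QuietAlertLog:
    """Hypothetical sketch: remembers alert start/stop times, sends no notifications."""

    # check name -> time the check was first seen failing
    active: Dict[str, datetime] = field(default_factory=dict)
    # completed incidents as (check name, start time, end time)
    history: List[Tuple[str, datetime, datetime]] = field(default_factory=list)

    def observe(self, check: str, failing: bool, now: datetime) -> None:
        """Record a state change; repeated identical observations are ignored."""
        if failing and check not in self.active:
            self.active[check] = now
        elif not failing and check in self.active:
            self.history.append((check, self.active.pop(check), now))

    def dashboard(self) -> List[str]:
        """The 'red markers': names of the checks that are failing right now."""
        return sorted(self.active)


if __name__ == "__main__":
    log = QuietAlertLog()
    start = datetime(2012, 12, 17, 10, 0)
    log.observe("dns", True, start)
    log.observe("dns", False, start + timedelta(minutes=20))
    for check, began, ended in log.history:
        print(check, "failed from", began, "to", ended)
```

The point of the design is that the only outputs are the dashboard list and the history; the history is what you'd go back to for post-facto analysis.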

Finally, the half point is the question of whether what you expect to be glaringly obvious actually will be. A total catastrophe probably will be, but smaller failures might be overlooked under at least some circumstances. Relatedly, having alerts for the glaringly obvious may speed up your troubleshooting because alerts effectively check a whole bunch of possibilities at once for you. Are DNS names suddenly not resolving because of a problem with your local DNS servers or because of a problem with your network link? Alerting may tell you immediately.
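To make the 'check a whole bunch of possibilities at once' point concrete, here is a small sketch of the DNS example. Everything in it is an assumption for illustration: the check names `network_link` and `local_dns`, and the idea that each check is a simple callable returning true or false. Real checks would actually probe the link and the DNS servers.

```python
from typing import Callable, Dict


def run_checks(checks: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run every check and report True (ok) or False (failing) for each."""
    results = {}
    for name, probe in checks.items():
        try:
            results[name] = bool(probe())
        except Exception:
            # A probe that blows up is treated as a failing check.
            results[name] = False
    return results


def likely_cause(results: Dict[str, bool]) -> str:
    """Guess why names are not resolving, most fundamental problem first."""
    if not results.get("network_link", True):
        return "network link"
    if not results.get("local_dns", True):
        return "local DNS servers"
    return "unknown"


if __name__ == "__main__":
    # Stand-in probes; these lambdas are placeholders, not real checks.
    results = run_checks({
        "network_link": lambda: True,
        "local_dns": lambda: False,
    })
    print("Names not resolving; likely cause:", likely_cause(results))
```

The network link is tested first because a dead link will usually make the DNS check fail too, so a DNS failure only means much if the link is fine.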

(The degenerate case of this is after-hours alerting, where you aren't in the office to notice the glaringly obvious.)

I don't have any handy answers to this question; it's just an issue that I want to note down and think about. I do think that the better your system deals with the alerting dependency problem, the easier it is to alert on the glaringly obvious, because you get less noise from such massive failures.
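One simple way of dealing with the alerting dependency problem is to suppress any alert for a check that depends (directly or indirectly) on something else that is also failing. The sketch below assumes you can write down the dependencies as a plain mapping; the function name `root_alerts` and the check names are made up for the example, and real systems handle this in their own ways.

```python
from typing import Dict, Set


def root_alerts(failing: Set[str], depends_on: Dict[str, Set[str]]) -> Set[str]:
    """Return only the failing checks that have no failing dependency above them."""

    def has_failing_ancestor(check: str, seen: Set[str]) -> bool:
        for parent in depends_on.get(check, set()):
            if parent in seen:
                continue  # already visited; also guards against dependency loops
            seen.add(parent)
            if parent in failing or has_failing_ancestor(parent, seen):
                return True
        return False

    return {check for check in failing if not has_failing_ancestor(check, set())}


if __name__ == "__main__":
    # Hypothetical dependency map: web and mail need DNS, DNS needs the link.
    depends_on = {
        "dns": {"network_link"},
        "web": {"dns"},
        "mail": {"dns"},
    }
    failing = {"network_link", "dns", "web", "mail"}
    print(sorted(root_alerts(failing, depends_on)))
```

With this sort of suppression, a massive failure shows up as one alert about the network link instead of four alerts about everything that sits behind it.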
