Wandering Thoughts archives

2010-06-21

Applying low distraction design to alerting systems

Writing yesterday's entry has left me with some thoughts on creating low-distraction alerting and monitoring systems. Obviously such a system should only include informative monitoring, but once you've got that you still need a good way to present which alerts are currently active. And because you want sysadmins to check your alerts page relatively frequently, you want it to be low distraction in the same way that email checks should be.

A low distraction system needs to show you enough information for you to make at least a preliminary decision, present events in some useful order, and let you shut it up. So, what I think you want is:

  • a display that is organized by alarm severity and, within each severity, in reverse chronological order, so that the most recent alarm is on top and thus the most visible. Either the ages or the start times of the alarms should be shown.

  • some sort of one-line summary of each alarm's specific details, so that you don't have to drill down further to find out what the actual problem is.

  • a way of hiding or dismissing specific alarms. Probably you should have a way of canceling this and re-revealing all current alarms.
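
As a rough sketch of the three points above, here is one way the display logic could look. Everything named here is an assumption made for illustration: the Alarm fields, the convention that a larger severity number means a more severe alarm, and the idea of tracking dismissals as a set of alarm names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Alarm:
    name: str        # hypothetical unique identifier for the alarm
    severity: int    # assumption: a larger number is more severe
    start: datetime  # when the alarm first triggered
    summary: str     # one-line description of the specific problem


def visible_alarms(alarms, dismissed):
    """Drop dismissed alarms, then sort by severity (most severe first)
    and by start time within a severity (most recent first)."""
    shown = [a for a in alarms if a.name not in dismissed]
    return sorted(shown, key=lambda a: (a.severity, a.start), reverse=True)


def render(alarms, now):
    """Produce one line per alarm: severity, age in minutes, name, summary."""
    lines = []
    for a in alarms:
        age_min = int((now - a.start).total_seconds() // 60)
        lines.append(f"sev{a.severity} {age_min:>4}m  {a.name}: {a.summary}")
    return lines


if __name__ == "__main__":
    now = datetime(2010, 6, 21, 12, 0)
    alarms = [
        Alarm("disk-fs1", 2, now - timedelta(minutes=90), "/home at 95% full"),
        Alarm("ping-db1", 3, now - timedelta(minutes=2), "no ping replies"),
        Alarm("load-mail", 2, now - timedelta(minutes=5), "load average over 20"),
    ]
    dismissed = {"disk-fs1"}   # alarms the sysadmin has hidden
    for line in render(visible_alarms(alarms, dismissed), now):
        print(line)
    dismissed.clear()          # cancel hiding: everything is shown again
```

Keeping dismissals as separate state instead of deleting alarms is what makes 're-revealing all current alarms' a one-step operation.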

For an added bonus, default to aggregating alarms together in some way if they are chronologically close enough (with an option to expand out the full details). This provides a natural way to condense cascade failures down into a single alert, crudely solving the alerting dependency problem.
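
A minimal sketch of this sort of time-based aggregation follows. The specific rule is my assumption, not something the idea requires: alarms are chained into one group as long as each starts within a fixed window (five minutes here, picked arbitrarily) of the previous one.

```python
from datetime import datetime, timedelta


def group_by_time(alarms, window=timedelta(minutes=5)):
    """Group (name, start) pairs that are chronologically close.

    A new group begins whenever the gap since the previous alarm's
    start time is larger than `window` (an assumed, tunable value).
    """
    groups = []
    for name, start in sorted(alarms, key=lambda pair: pair[1]):
        if groups and start - groups[-1][-1][1] <= window:
            groups[-1].append((name, start))   # close enough: same incident
        else:
            groups.append([(name, start)])     # too far apart: new incident
    return groups


if __name__ == "__main__":
    base = datetime(2010, 6, 21, 12, 0)
    alarms = [
        ("switch3-down", base),
        ("web1-unreachable", base + timedelta(minutes=2)),
        ("web2-unreachable", base + timedelta(minutes=4)),
        ("disk-fs1-full", base + timedelta(minutes=30)),
    ]
    for group in group_by_time(alarms):
        first = group[0]
        print(f"{first[1]:%H:%M} {first[0]} (+{len(group) - 1} more)")
```

With chaining, a slow cascade where each failure follows the last by a few minutes stays as one group, at the cost of occasionally merging unrelated alarms that happen to fire close together.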

Intuitively, I think that sorting events by priority and then by time (most recent first) is the right order. In most situations I care more about a recent issue than an older one (after all, if the systems haven't entirely melted down by now the older issue can probably wait a bit longer), and more about an older high priority issue than a newer lower priority one. This is arguable and may depend on local circumstances.

(And the priorities may involve things like 'what machine is this reported on', with some machines being much more important than others.)
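
One simple way to fold machine importance into the priority is to add a per-machine weight to the alarm's base severity. This is only a sketch under assumptions: the additive scheme, the host names, and the weights are all made up for illustration.

```python
# Hypothetical per-machine weights; unlisted machines get no bonus.
MACHINE_WEIGHT = {
    "fileserver1": 2,
    "mailserver": 1,
}


def effective_priority(base_severity, host, weights=MACHINE_WEIGHT):
    """Combine an alarm's own severity with how important its machine is.

    Assumes a larger number means higher priority for both inputs.
    """
    return base_severity + weights.get(host, 0)


if __name__ == "__main__":
    # A minor alarm on the fileserver outranks a moderate one on a desktop.
    print(effective_priority(1, "fileserver1"))
    print(effective_priority(2, "desktop42"))
```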

sysadmin/UsefulAlertingDesign written at 00:27:48

