Applying low distraction design to alerting systems
Writing yesterday's entry has left me with some thoughts on creating low-distraction alerting and monitoring systems. Obviously this should only include informative monitoring, but once you've got that you still need a good way to present which alerts are currently active. And because you want sysadmins to check your alerts page relatively frequently, you want it to be low distraction in the same way that email checks should be.
A low distraction system needs to show you enough information for you to make at least a preliminary decision, present events in some useful order, and let you shut it up. So, what I think you want is:
- a display that is organized by severity of alarm and then in reverse
chronological order within each severity, with the most recent alarm
on top and thus the most visible, and with either the ages or the
start times of the alarms shown.
- some sort of one-line summary of each alarm's specific details,
so that you don't have to drill down further to find out what the
actual problem is.
- a way of hiding or dismissing specific alarms. Probably you should have a way of canceling this and re-revealing all current alarms.
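As a rough illustration of the three points above, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `Alarm` fields, the `AlertPage` class, the convention that a higher severity number is more severe, and the `format_age` helper are all hypothetical and do not come from any particular monitoring system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    ident: str       # hypothetical unique identifier for the alarm
    severity: int    # assumed convention: higher number = more severe
    started: float   # start time, in seconds since the epoch
    summary: str     # one-line summary of the specific problem


def format_age(seconds):
    """Render an age in seconds as a compact string such as '12m' or '1h01m'."""
    minutes = int(seconds // 60)
    if minutes < 60:
        return f"{minutes}m"
    return f"{minutes // 60}h{minutes % 60:02d}m"


class AlertPage:
    def __init__(self, alarms):
        self.alarms = list(alarms)
        self.hidden = set()

    def dismiss(self, ident):
        """Hide one specific alarm from the display."""
        self.hidden.add(ident)

    def reveal_all(self):
        """Cancel all dismissals and show every current alarm again."""
        self.hidden.clear()

    def visible(self):
        """Undismissed alarms, most severe first, newest first within a severity."""
        shown = [a for a in self.alarms if a.ident not in self.hidden]
        return sorted(shown, key=lambda a: (-a.severity, -a.started))

    def render(self, now):
        """One line per visible alarm: severity, age, and summary."""
        return [
            f"[sev {a.severity}] {format_age(now - a.started)} ago: {a.summary}"
            for a in self.visible()
        ]
```

Dismissing only touches the `hidden` set, so `reveal_all()` can always bring back every current alarm without losing anything.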
For an added bonus, default to aggregating alarms together in some way if they are chronologically close enough (with an option to expand out the full details). This provides a natural way to condense cascade failures down into a single alert, crudely solving the alerting dependency problem.
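One crude way to do this sort of time-based aggregation is sketched below in Python. The tuple layout, the function names, and the 120-second default window are all assumptions picked for illustration; a real system would need to tune the window to how fast its own cascade failures unfold.

```python
def group_by_time(alarms, window=120.0):
    """Group (start_time, summary) tuples that are chronologically close.

    An alarm joins the current group if it started within `window`
    seconds of the previous alarm in that group; otherwise it starts
    a new group. Groups come back in chronological order.
    """
    groups = []
    for alarm in sorted(alarms, key=lambda a: a[0]):
        if groups and alarm[0] - groups[-1][-1][0] <= window:
            groups[-1].append(alarm)
        else:
            groups.append([alarm])
    return groups


def summarize(group):
    """Condense a group to one line, leading with its earliest alarm."""
    first_summary = group[0][1]
    if len(group) == 1:
        return first_summary
    return f"{first_summary} (+{len(group) - 1} more)"
```

Leading with the earliest alarm in a group is a deliberate guess: in a cascade failure, the first thing to go wrong is often closest to the root cause. The full group is kept around, so expanding out the details is just a matter of listing its members.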
Intuitively, I think that sorting events by priority and then by recency (newest first) is the right order. In most situations I care more about a recent issue than an older one (after all, if the systems haven't entirely melted down by now the older issue can probably wait a bit longer), and more about an older high priority issue than a newer lower priority one. This is arguable and may depend on local circumstances.
(And the priorities may involve things like 'what machine is this reported on', with some machines being much more important than others.)
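To make this ordering concrete, here is a small Python sketch in which an alarm's priority is its severity plus a per-machine bonus. The machine names, the weights, and the additive formula are all invented for the example; how machine importance should really feed into priority is a local decision.

```python
# Hypothetical importance bonuses; machines not listed get no bonus.
MACHINE_WEIGHT = {"fileserver": 2, "mailhub": 1}


def priority(severity, machine):
    """Combine alarm severity with how important the machine is (assumed additive)."""
    return severity + MACHINE_WEIGHT.get(machine, 0)


def display_order(alarms):
    """Sort (severity, machine, started, summary) tuples for display.

    Highest priority comes first; within the same priority, the most
    recently started alarm comes first.
    """
    return sorted(alarms, key=lambda a: (-priority(a[0], a[1]), -a[2]))
```

With these weights, a low-severity alarm on the fileserver sorts above a moderately more severe alarm on an ordinary desktop, and an older high priority alarm still sorts above a newer low priority one.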