sysadmin/UsefulAlertingDesign written at 00:27:48
Applying low distraction design to alerting systems
Writing yesterday's entry has left me with some thoughts on creating low-distraction alerting and monitoring systems. Obviously this should only include informative monitoring, but once you've got that you still need to present which alerts are currently active in a useful way. And because you want sysadmins to check your alerts page relatively frequently, you want it to be low distraction in the same way that email checks should be.
A low distraction system needs to show you enough information for you to make at least a preliminary decision, present events in some useful order, and let you shut it up. So, what I think you want is:
For an added bonus, default to aggregating alarms together in some way if they are chronologically close enough (with an option to expand out the full details). This provides a natural way to condense cascade failures down into a single alert, crudely solving the alerting dependency problem.
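As a rough illustration of the aggregation idea, here is a minimal Python sketch. Everything in it is an assumption made for the example and not part of any real alerting system: the `Alert` and `AlertGroup` types, the `group_alerts` function, and the five-minute window. It folds an alert into the current group whenever it arrives within the window of the previous alert, so a cascade that keeps producing alerts stays in one group.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Alert:
    # Hypothetical alert record; the field names are assumptions.
    host: str
    message: str
    raised_at: datetime


@dataclass
class AlertGroup:
    # Alerts in arrival order; the full list is the "expanded" view.
    alerts: list = field(default_factory=list)

    def summary(self):
        # Condensed one-line view: the first alert plus a count of the rest.
        first = self.alerts[0]
        extra = len(self.alerts) - 1
        if extra == 0:
            return f"{first.host}: {first.message}"
        return f"{first.host}: {first.message} (+{extra} more)"


def group_alerts(alerts, window=timedelta(minutes=5)):
    """Group alerts whose gap from the previous alert is within `window`."""
    groups = []
    for alert in sorted(alerts, key=lambda a: a.raised_at):
        if groups and alert.raised_at - groups[-1].alerts[-1].raised_at <= window:
            groups[-1].alerts.append(alert)
        else:
            groups.append(AlertGroup([alert]))
    return groups
```

This is deliberately crude: it knows nothing about which machines depend on which, and it will happily lump together unrelated alerts that merely happen at about the same time. That is why the expanded view needs to stay available.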
Intuitively, I think the right order to sort events into is by priority first and then by recency, newest first. In most situations I care more about a recent issue than an older one (after all, if the systems haven't entirely melted down by now, the older issue can probably wait a bit longer), and more about an older high-priority issue than a newer lower-priority one. This is arguable and may depend on local circumstances.
(And the priorities may involve things like 'what machine is this reported on', with some machines being much more important than others.)
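The ordering described above might be sketched in Python like this. The `Alert` fields, the `HOST_WEIGHT` table, and the convention that a larger number means a more urgent alert are all assumptions for the sake of the example; how priorities are actually computed is a local decision.

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical per-machine bonus: alerts on important machines rank higher.
HOST_WEIGHT = {"fileserver1": 2}


@dataclass
class Alert:
    # Hypothetical alert record; the field names are assumptions.
    host: str
    message: str
    priority: int  # assumption: higher number means more urgent
    raised_at: datetime


def effective_priority(alert):
    # Base priority plus any bonus for the machine reporting the alert.
    return alert.priority + HOST_WEIGHT.get(alert.host, 0)


def display_order(alerts):
    """Highest effective priority first; within a priority, newest first."""
    return sorted(
        alerts,
        key=lambda a: (effective_priority(a), a.raised_at),
        reverse=True,
    )
```

With this key an older high-priority alert still sorts ahead of a newer low-priority one, and recency only breaks ties between alerts of the same effective priority.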