Our (unusual) freedom to use alerts as notifications
Many guides to deciding what to alert on draw a strong distinction between alerts and less important things (call them 'notifications'). The distinction ultimately comes about because alerts will disturb whoever is on call outside of working hours, and that disturbance should be reserved for serious things that they can and should take action on. This is often threaded through assumptions and guidelines in metrics and alerting systems; for example, Prometheus implicitly follows this in its guide to alerting, and the philosophy document that the guide links to assumes that alerts will page people and so should be minimized.
Our alerts in our Prometheus setup don't follow this. I've already written up our reboot notifications, which are implemented as special Prometheus alerts that explicitly call themselves 'notifications' and are handled specially, but it extends beyond this in our alerts. We generate 'alerts' for things that we merely want to keep track of; one example is our automated reboots of hung Dell C6220 blades. A rebooted blade alerts as if a machine went down and then came back (because it did), and these alerts are exactly the same as the ones we would get for any other machine that went down and then came back up.
(This is also part of why we have set Alertmanager to send us email about resolved alerts (cf). Paging people to tell them something is now over would probably not be well received.)
The reason we have this freedom is not that we've done clever design in our Prometheus setup to avoid paging people for such notifications. Instead, it's because no one is on call here and so these notification alerts are not disturbing anyone when they trigger outside of working hours (even inside of working hours, they're just another email message). If we started having people on call, we would have to change this so that only genuine alerts paged people.
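As a rough illustration of what that change might look like, here is a minimal Alertmanager configuration sketch. This is not our actual configuration; the receiver names, the `severity="page"` label, the addresses, and the PagerDuty key are all hypothetical placeholders. The idea is that everything defaults to email (including resolved messages), and only alerts explicitly labeled as page-worthy reach whoever is on call.

```yaml
# Hypothetical sketch only; every name, address, and label here is an assumption.
global:
  smtp_smarthost: 'mail.example.org:25'
  smtp_from: 'alertmanager@example.org'

route:
  # Default: everything, including mere 'notifications', goes to email.
  receiver: sysadmin-email
  routes:
    # Only alerts whose rules set a (hypothetical) severity="page" label
    # are sent to the pager.
    - matchers:
        - severity="page"
      receiver: oncall-pager

receivers:
  - name: sysadmin-email
    email_configs:
      - to: 'sysadmins@example.org'
        # Also email us when an alert resolves.
        send_resolved: true
  - name: oncall-pager
    pagerduty_configs:
      - routing_key: '<placeholder-key>'
        # Don't page people just to say something is over.
        send_resolved: false
```

With a split like this, alert rules would have to opt in to paging by setting the label, so a notification-style alert could never page anyone by accident.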
(This doesn't mean that limiting what we alert on is unimportant. Even for alerts that are just notifying us about things, we have to both genuinely care about the thing and find the notification useful. Generally our notifications are kept down so that they fire only rarely unless we're having real problems, such as a lot of Dell C6220 blades crashing all the time for some reason. And we might filter those out if they started becoming ubiquitous, or perhaps take the blades out of service on the grounds that they're now too unreliable.)
This blurring of alerts and notifications is not without its hazards, most obviously that we could become acclimatized to notifications and treat a real problem (an alert) as merely a less important notification that can be left to sit for a bit. But it's also been important for what alerts we actually create. Our freedom to 'alert' on things that are perhaps not an immediate crisis allows us to watch more things and to be less cautious and conservative about the levels at which we alert (and to accept some false positives in the name of, say, alerting early about things that are real problems).
(I mentioned our situation with alerts not paging us back in why we generate alert notifications about rebooted machines, but I didn't think or talk about the impact it's had on what we alert about. It's sort of a 'fish in water' thing; I didn't think about how it affected what we alert on until recently.)