Our alerts are quiet most of the time (as they should be)
It's the middle of the University of Toronto's winter break right now, so officially we're off work. Unofficially we're checking in on email and we'll probably notice if something explodes. One of the things that has made this not much of a burden is that we've gotten basically no alerts from our Prometheus system during this time. This is by design.
One part of the design is that our overall environment is not supposed to have (significant) problems, and generally doesn't. The other part of the design is that a major criterion in setting up our alerts has been how noisy they are. I've been deliberately fanatical about this (both for new alerts we're considering adding and for current alerts), pretty much always choosing to err on the side of silence. This has meant that there are some situations we just won't catch, because I haven't been able to figure out how to create a non-noisy alert for them and eventually gave up.
(One of these involves the flaky SMART errors from our Crucial MX500 SSDs.)
There are some technical things that have helped in this, of course. One of them is that we have a long Prometheus metrics retention period, which means that I can almost always evaluate a potential alert condition to see how frequently it would have triggered in the past (both the near past and the far past). If an alert rule looks like it would have triggered too often, at times when we didn't really have a problem, I have to revise it or drop the idea. Another is some deliberate infrastructure for 'testing' alerts, which are sent to a special destination that goes only to me instead of to all of my co-workers. I use this both for testing new alert messages (and similar things) and simply for evaluating alerts overall when I'm not certain about them. Not all testing alerts get promoted to production alerts that notify everyone.
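As a simplified sketch of the first part (the metric and threshold here are made up for illustration, not taken from our real rules), a PromQL subquery lets me count how often a candidate condition would have been true over a stretch of history, for example every five minutes over the past 90 days:

    # How many 5-minute points over the last 90 days was the candidate
    # condition true for, per time series?
    count_over_time((node_load15 > 10)[90d:5m])

The 'testing' destination side is just ordinary label-based Alertmanager routing. Again as a rough sketch with made-up label values, receiver names, and addresses: a testing alert rule carries a marker label, and a route matches that label to a receiver that notifies only me:

    # In a Prometheus rules file: a testing alert carries a marker label.
    groups:
      - name: testing-alerts
        rules:
          - alert: HighLoad15Testing
            expr: node_load15 > 10
            for: 15m
            labels:
              send: testing

    # In alertmanager.yml (global SMTP settings omitted): anything
    # labeled 'send: testing' goes only to me; everything else goes
    # to all of us.
    route:
      receiver: all-sysadmins
      routes:
        - match:
            send: testing
          receiver: cks-only
    receivers:
      - name: all-sysadmins
        email_configs:
          - to: sysadmins@example.org
      - name: cks-only
        email_configs:
          - to: cks@example.org

In a setup like this, promoting a testing alert to a production one is mostly a matter of changing or dropping that label so the alert routes to everyone.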
On a process level, I look at the alerts we receive and continually ask whether they're really meaningful and either actionable or important to know about. If they don't seem to be any more (sometimes because conditions have changed), I will usually remove them after talking with my co-workers (who sometimes see more value in some alerts than I do). And if something is useful but not at the level of an alert, maybe I'll surface it on a dashboard.
One of the things that makes this work is that I'm both the (de facto) alert maintainer and one of the sysadmins who receives and acts on alerts. This gives me both the ability to see whether alerts are useful and the power to act on that knowledge immediately; there's no need to go back and forth between two different groups or people, and no friction in the process.
(Just in talking to people casually about alerting, I've learned that my experiences here are far from universal.)