How I set up testing alerts in our Prometheus environment
One of the things I mentioned in my entry on how our alerts are quiet most of the time is that I have some Prometheus infrastructure for 'testing' alerts. Rather than being routed to everyone (via the normal email destination), these alerts go to a special destination that only goes to interested parties (ie, me). There are a number of different ways to implement this in Prometheus, so the way I picked to do it isn't necessarily the best one (and in fact it enables a bad habit, which is for another entry).
The simplest way to implement testing alerts is to set them up purely in Alertmanager. As part of your Alertmanager routing configuration, you would have a very early rule that simply listed all of the alerts that are in testing and diverted them. This would look something like this:
    - match_re:
        alertname: 'OneAlert|DubiousAlert|MaybeAlert'
      receiver: testing-email
      [any other necessary parameters]
The problem with this is that it involves more work when you set up a new testing alert. You have to set up the alert itself in your Prometheus alert rules, and then you have to remember to go off to Alertmanager and update the big list of testing alerts. If you forget or make a typo, your testing alerts go to your normal alert receivers and annoy your co-workers. I'm a lazy person, so I picked a more general approach.
My implementation is that all testing alerts have a special Prometheus label with a special value, and then Alertmanager matches on this (Prometheus) label and its value. In Alertmanager this looks like:
    - match:
        send: testing
      receiver: testing-email
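For context, here is a minimal sketch of where such a route might sit in a complete Alertmanager configuration. The receiver name `everyone-email` and both addresses are placeholders I've made up for illustration, and the global SMTP settings that email receivers need are left out. It uses the `matchers` syntax that newer Alertmanager versions prefer over `match` and `match_re`; the older form should still behave the same way on versions that accept it.

```yaml
route:
  receiver: everyone-email          # hypothetical default (production) receiver
  routes:
    # Child routes are tried in order and, without 'continue: true',
    # the first match wins, so testing alerts are diverted here
    # before any production routes see them.
    - matchers:
        - send = "testing"
      receiver: testing-email
    # ... production routes would follow here

receivers:
  - name: everyone-email
    email_configs:
      - to: 'everyone@example.org'  # placeholder address
  - name: testing-email
    email_configs:
      - to: 'me@example.org'        # placeholder address
```

The important part is the ordering: the testing route has to come before anything else that could match the same alerts.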
Then in each Prometheus alert rule that is in testing, we explicitly add the label and its value:
    - alert: MaybeAlert
      expr: ....
      labels:
        [...]
        send: testing
      annotations:
        [...]
(We add some other labels for each alert, to tell us things such as whether the alert is a host-specific one or some other type of alert, like a machine room being too hot.)
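As a rough illustration, a complete rule file with one testing alert might look like the following. Everything except the alert name and the `send: testing` label is an assumption made up for this sketch: the group name, the expression, the `for` duration, and the annotation text are placeholders, not our real rule.

```yaml
groups:
  - name: testing-alerts              # hypothetical group name
    rules:
      - alert: MaybeAlert
        expr: up{job="node"} == 0     # placeholder expression
        for: 10m                      # placeholder duration
        labels:
          send: testing               # diverts the alert to the testing receiver
        annotations:
          summary: "{{ $labels.instance }} has been down for 10 minutes"
```

Promoting the alert to production is then a matter of deleting the `send: testing` line. A file like this can be checked with `promtool check rules` before Prometheus reloads it.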
This enables my laziness, because I only need to edit one file to create a new testing alert instead of two of them, and there's a lower chance of typos and omissions. It also has the bonus of keeping the testing status of an alert visible in the alert rule file, at the expense of making it harder to get a list of all alerts that are in testing. For me this is probably a net win, because I look at alert rules more often than I look at our Alertmanager configuration, so I have a higher chance of seeing a still-in-testing rule in passing and deciding to promote it to production. And if I'm considering promoting a testing alert to full production status, I can re-read the entire alert in one spot while I'm thinking about it.
(Noisy testing rules get removed rapidly, but quiet testing rules can just sit there with me forgetting about them.)