2024-03-25
Options for diverting alerts in Prometheus
Suppose, not hypothetically, that you have a collection of machines and some machines are less important than others or are of interest only to a particular person. Alerts about normal machines should go to everyone; alerts about the special machines should go elsewhere. There are a number of options to set this up in Prometheus and Alertmanager, so today I want to run down a collection of them for my own future use.
First, you have to decide the approach you'll use in Alertmanager. One option is to configure an early Alertmanager route that specifically knows the names of these machines. This is the most self-contained option, but it has the drawback that Alertmanager routes often intertwine in complicated ways that are hard to keep track of. For instance, you need to keep your separate notification routes for these machines in sync.
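As a rough sketch, the self-contained version might look like this (the 'host' label, the machine names, and the receiver names are assumptions for illustration, not anything standard):

    # alertmanager.yml fragment (a sketch). This child route has to come
    # early, before more general routes, or it won't capture the alerts.
    route:
      receiver: everyone
      routes:
        - receiver: special-person
          match_re:
            host: '(vmhost1|vmhost2)'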
(I should write down in one place the ordering requirements for routes in our Alertmanager configuration, because several times I've made changes that didn't have the effect I wanted because I had the route in the wrong spot.)
The other Alertmanager option is to set up general label-based markers for alerts that should be diverted and rely on Prometheus to get the necessary label on to the alerts about these special machines. My view is that you're going to want to have such 'testing' alerts in general, so sooner or later you're going to wind up with this in your Alertmanager configuration.
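The Alertmanager side of the label-based version is then a single generic route. In a sketch (the 'send' label and its 'testing' value are conventions I'm assuming here, not anything built in):

    # alertmanager.yml fragment (a sketch); anything labeled send="testing"
    # is diverted, regardless of which machine or alert rule produced it.
    route:
      receiver: everyone
      routes:
        - receiver: testing-alerts
          match:
            send: testing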
Once Prometheus is responsible for labeling the specific alerts that should be diverted, you have some options:
- The Prometheus alert rule can specifically add the appropriate label (as sketched after this list). This works great if it's a testing alert rule that you always want to divert, but less well if it's a general alert that you only want to divert some of the time.
- You can arrange for metrics from the specific machines to have the special label values necessary, for example by labeling their scrape targets (the contortions involved are sketched after this list). This has three problems. First, changing how a machine's alerts are handled means changing label values, which creates additional metrics series. Second, it may require ugly contortions to pull some scrape targets out to different sections of a static file so that you can put different labels on them. And lastly, it's error-prone, because you have to make sure all of the scrape targets for the machine have the label on them.
(You might even be doing special things in your alert rules to create alerts for the machine out of metrics that don't come from scraping it, which can require extra work to add labels to them.)
- You can add the special label marker in Prometheus alert relabeling, by matching against your 'host' label and creating a new label. This will be something like:

      - source_labels: [host]
        regex: vmhost1
        target_label: send
        replacement: testing

You'll likely want to do this at the end, or at least after any other alert label canonicalization you're doing to clean up host names, map service names to hosts, and so on.
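For the first option, a sketch of a testing alert rule that labels itself for diversion (the rule name, metric, and threshold are made up; only the 'send: testing' label matters here):

    groups:
      - name: testing
        rules:
          - alert: VMHostHighLoad
            expr: node_load15{host="vmhost1"} > 20
            for: 10m
            labels:
              send: testing
            annotations:
              summary: "15-minute load is high on {{ $labels.host }}"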
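For the second option, the sort of contortion involved looks roughly like this in a file_sd style targets file (a sketch; I'm assuming node_exporter targets):

    # The second target group exists only so that vmhost1's metrics carry
    # the extra label; every other such group has to be kept in sync.
    - targets: ['ordinary1:9100', 'ordinary2:9100']
    - targets: ['vmhost1:9100']
      labels:
        send: testing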
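And for completeness, the relabeling rule in the third option lives under 'alert_relabel_configs' in the 'alerting' section of prometheus.yml, which is roughly (a sketch, with a made-up Alertmanager target):

    alerting:
      alert_relabel_configs:
        # ... any host name canonicalization rules go before this ...
        - source_labels: [host]
          regex: vmhost1
          target_label: send
          replacement: testing
      alertmanagers:
        - static_configs:
            - targets: ['alertmanager:9093']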
Now that I've sat down and thought about all of these options, the one I think I like the best is alert relabeling. Alert relabeling in Prometheus puts this configuration in one central place, instead of spreading it out over scrape targets and alert rules, and it does so in a setting that doesn't have quite as many complex ordering issues as Alertmanager routes do.
(Adding labels in alert rules is still the right answer if the alert itself is in testing, in my view.)