Some alert inhibition rules we use in Prometheus Alertmanager
One of the things you can do with Alertmanager is to make one alert inhibit notifications for some other alerts. This is set up through inhibition rules. You can find some generic examples in sample Alertmanager configuration files, and today I'm writing up two specific inhibition rules that we use.
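As a reminder of where these live, inhibition rules go under the top-level 'inhibit_rules' key in the Alertmanager configuration file. A minimal sketch of the classic generic example (the route and receiver names here are illustrative placeholders, not our setup):

    route:
      receiver: default
    receivers:
      - name: default
    inhibit_rules:
      - source_match:
          severity: critical
        target_match:
          severity: warning
        equal: ['alertname']

(This is the usual sample-file example: a firing critical alert suppresses notifications for the matching warning alert.)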
First off, inhibition rules are very closely tied to the structure of your alerts, your labels, and your overall system; they can't be understood or written outside of that. This is because all of those determine both which alerts you want to have suppress which other alerts and how you do that matching. In our case, we group and aggregate alerts by host, and all alerts have a 'cstype' label that says what type of alert they are (a per-host alert, a temperature alert, a disk space alert, etc), and host alerts have a 'cshost' label that is the host's canonical host name (more or less).
We have some hosts that have multiple network interfaces that we check; for instance, one of our backup servers is on our non-routed firewalls network so that it can back up our firewalls. We check both the backup server's main interface and its firewall interface, because otherwise we might not notice in time if it only dropped off the firewall network (we'd find out only when the nightly backups failed). At the same time, when the host goes down we don't want to get two sets of alerts, one for its main interface and one for its firewall interface. So we have an inhibition rule like this:
  - target_match:
      cshost: HOST.fw.sandbox
    source_match:
      cshost: HOST
    equal: ['cstype', 'alertname', 'sendto', 'send']
The source match defines the alert that will suppress other alerts;
the target match defines what other alerts will be suppressed. So
this is saying that an alert for HOST can inhibit notification for
alerts for the firewall sandbox version of HOST. But not all alerts;
in order to be inhibited, the alerts must be of the same type, be
the same alert (the alertname), and be going to the same destination
(our 'sendto' label) and in the same way (our 'send' label). All of
that is set by the 'equal:' portion of the inhibition rule.
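To make the matching concrete, suppose (with hypothetical label values) that these two alerts are firing at the same time:

    # firing for the backup server's main interface:
    {alertname="HostDown", cstype="host", cshost="backupserv",
     sendto="us", send="page"}
    # firing for its firewall sandbox interface:
    {alertname="HostDown", cstype="host", cshost="backupserv.fw.sandbox",
     sendto="us", send="page"}

The first alert matches the source, the second matches the target, and all four 'equal:' labels agree between them, so only the first alert produces a notification.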
A somewhat more complicated case is our special 'there is a large scale problem' alert that inhibits per-host alerts. The top level inhibition rule is simple:
  - source_match:
      cstype: largescale
      scale: all
    target_match:
      cstype: host
This says that an alert marked as being a large scale alert for all
hosts inhibits all 'host' type alerts, which is all of our per-host
alerts. Other types of alerts will be allowed through (eg temperature
alerts), if they can still be triggered from the data that's still
available to Prometheus. Because this inhibits across two completely
different types of alerts, we don't have any labels we can sensibly
check label equality on in an 'equal:' part of the rule.
(In a large scale problem we probably can't talk to temperature sensors, get disk usage information, and so on, so any problems there will go unnoticed anyway.)
For reasons outside the scope of this entry, we also have another large scale problem alert if there are enough problems with the machines in our SLURM compute cluster. The existence of this alert creates a problem; if there is a real large scale problem, we will have enough problematic SLURM nodes to also trigger this alert. So we also inhibit the more specific SLURM large scale alert when the general large scale alert is active:
  - source_match:
      cstype: largescale
      scale: all
    target_match_re:
      cstype: largescale
      scale: 'slurm'
We have to use 'scale' here in the target match (and have it at all), because if we left it out Alertmanager would happily allow our global large scale problems alert to inhibit itself. (The use of 'target_match_re' instead of 'target_match' is a historical relic and should be changed.)
(We could use 'alertname' instead of our own custom label to tell the two apart, but I prefer a custom label to make things explicit.)
Finally, the 'large scale SLURM problems' alert looks like the general one but has to apply only to the SLURM nodes, not to all machines. We currently do this by regular expression matching instead of trying to have a suitable label on everything to do with those machines:
  - source_match:
      cstype: largescale
      scale: slurm
    target_match_re:
      cstype: "(host|notify)"
      cshost: "(cpunode|gpunode|cpuramnode|amdgpunode).*"
(Here the use of 'target_match_re' is required, since we really are matching regular expressions.)
This inhibits both per-host alerts and our reboot notifications. We don't inhibit reboot notifications in our large scale problems inhibition, because it's handy to get some sign that machines are back, but this isn't interesting for SLURM nodes.
There is a subtle bit of behavior here. Inhibition only stops notifications for an alert; the alert continues to exist in general inside Alertmanager, and in particular it can inhibit other alerts. So when we have a large scale problem, the large scale alert inhibits notification about the large scale SLURM alert and in turn the large scale SLURM alert inhibits our reboot notifications for the SLURM nodes. This is sufficiently tricky that I should probably add a comment about it to the Alertmanager configuration file.
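Such a comment might look something like this (a sketch of one way to word it, not what is actually in our configuration file):

    # Note: inhibition only suppresses *notifications*; an inhibited
    # alert still exists and can itself inhibit other alerts. So the
    # global 'largescale/all' alert silences the 'largescale/slurm'
    # alert, but the SLURM alert still inhibits per-host alerts and
    # reboot notifications for SLURM nodes.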