Some alert inhibition rules we use in Prometheus Alertmanager

February 27, 2020

One of the things you can do with Alertmanager is to make one alert inhibit notifications for some other alerts. This is set up through inhibition rules. You can find some generic examples in sample Alertmanager configuration files, and today I'm writing up two specific inhibition rules that we use.
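For orientation, inhibition rules live in an 'inhibit_rules:' section of the Alertmanager configuration. A minimal sketch, using the generic severity-based example you'll find in sample configurations (not one of our rules):

```yaml
# In alertmanager.yml; this is the common generic example where a
# 'critical' alert suppresses the matching 'warning' version of itself.
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
```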

First off, inhibition rules are very closely tied to the structure of your alerts, your labels, and your overall system; they can't be understood or written outside of that. This is because all of those determine both which alerts you want to have suppress which other alerts and how you do the matching. In our case, we group and aggregate alerts by host, and all alerts have a 'cstype' label that says what type of alert they are (a per-host alert, a temperature alert, a disk space alert, etc) and host alerts have a 'cshost' label that is the host's canonical host name (more or less).
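As a hypothetical illustration of this label scheme, a per-host alert rule in Prometheus might attach these labels roughly like so; the alert name, expression, and the 'host' label it copies from are all invented for illustration:

```yaml
# A made-up Prometheus alerting rule showing the 'cstype' and 'cshost'
# labels; the details are not our actual rules.
groups:
  - name: per-host
    rules:
      - alert: HostDown
        expr: up{job="machine"} == 0
        for: 5m
        labels:
          cstype: host                    # the type of alert
          cshost: "{{ $labels.host }}"    # the host's canonical name
```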

We have some hosts that have multiple network interfaces that we check; for instance, one of our backup servers is on our non-routed firewalls network so that it can back up our firewalls. We check both the backup server's main interface and its firewall interface, because otherwise we might not notice in time if it only dropped off the firewall network (we'd find out only when the nightly backups failed). At the same time, when the host goes down we don't want to get two sets of alerts, one for its main interface and one for its firewall interface. So we have an inhibition rule like this:

- target_match:
    cshost: HOST.fw.sandbox
  source_match:
    cshost: HOST
  equal: ['cstype', 'alertname', 'sendto', 'send']

The source match defines the alert that will suppress other alerts; the target match defines what other alerts will be suppressed. So this is saying that an alert for HOST can inhibit notification for alerts for the firewall sandbox version of HOST. But not all alerts; in order to be inhibited, the alerts must be of the same type, be the same alert (the same alertname), be going to the same destination (our 'sendto' label), and be going there in the same way (our 'send' label). All of that is set by the 'equal:' portion of the inhibition rule.
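To make the matching concrete, here is a hedged sketch of two alerts' label sets where the first would inhibit notification for the second under this rule; the 'sendto' and 'send' values are invented for illustration:

```yaml
# Hypothetical label sets, not real alerts of ours.
source_alert:       # fires, and suppresses the target's notifications
  alertname: HostDown
  cstype: host
  cshost: HOST
  sendto: sysadmins
  send: email
target_alert:       # its notification is inhibited: the 'equal' labels
  alertname: HostDown    # all match the source alert's values
  cstype: host
  cshost: HOST.fw.sandbox
  sendto: sysadmins
  send: email
```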

A somewhat more complicated case is our special 'there is a large scale problem' alert that inhibits per-host alerts. The top level inhibition rule is simple:

- source_match:
    cstype: largescale
    scale: all
  target_match:
    cstype: host

This says that an alert marked as being a large scale alert for all hosts inhibits all 'host' type alerts, which is all of our per-host alerts. Other types of alerts will be allowed through (eg temperature alerts), if they can still be triggered from the data that's still available to Prometheus. Because this inhibits across two completely different types of alerts, we don't have any labels we can sensibly check label equality on in an 'equal:' portion.

(In a large scale problem we probably can't talk to temperature sensors, get disk usage information, and so on, so any problems there will go unnoticed anyway.)
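For illustration, the large scale alert on the source side might be defined in Prometheus along these lines; the alert name, expression, and threshold here are all invented, not our actual rule:

```yaml
# A hypothetical 'large scale problem' alerting rule; what matters for
# the inhibition is that it carries the 'largescale' cstype and
# 'scale: all' labels that the inhibition rule's source_match looks for.
- alert: ManyHostsDown
  expr: count(up{job="machine"} == 0) > 10
  for: 5m
  labels:
    cstype: largescale
    scale: all
```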

For reasons outside the scope of this entry, we also have another large scale problem alert if there are enough problems with the machines in our SLURM compute cluster. The existence of this alert creates a problem; if there is a real large scale problem, we will have enough problematic SLURM nodes to also trigger this alert. So we also inhibit the more specific SLURM large scale alert when the general large scale alert is active:

- source_match:
    cstype: largescale
    scale: all
  target_match_re:
    cstype: largescale
    scale: 'slurm'

We have to use 'scale' here in the target match (and have it at all), because if we left it out Alertmanager would happily allow our global large scale problems alert to inhibit itself. (The use of 'target_match_re' instead of 'target_match' is a historical relic and should be changed.)

(We could use 'alertname' instead of our own custom label to tell the two apart, but I prefer a custom label to make things explicit.)

Finally, the 'large scale SLURM problems' alert looks like the general one but has to apply only to the SLURM nodes, not to all machines. We currently do this by regular expression matching instead of trying to have a suitable label on everything to do with those machines:

- source_match:
    cstype: largescale
    scale: slurm
  target_match_re:
    cstype: "(host|notify)"
    cshost: "(cpunode|gpunode|cpuramnode|amdgpunode).*"

(Here the use of target_match_re is required, since we really are matching regular expressions.)

This inhibits both per-host alerts and our reboot notifications. We don't inhibit reboot notifications in our large scale problems inhibition, because it's handy to get some sign that machines are back, but this isn't interesting for SLURM nodes.

There is a subtle bit of behavior here. Inhibition only stops notifications for an alert; the alert continues to exist in general inside Alertmanager, and in particular it can inhibit other alerts. So when we have a large scale problem, the large scale alert inhibits notification about the large scale SLURM alert and in turn the large scale SLURM alert inhibits our reboot notifications for the SLURM nodes. This is sufficiently tricky that I should probably add a comment about it to the Alertmanager configuration file.
