Doing a selective alert about a host's additional exporters in Prometheus

June 6, 2022

Suppose that you have some hosts that run additional exporters over and above the Prometheus host agent, such as the mtail log exporter or perhaps a Blackbox exporter that lets you check an additional network segment. Obviously you want to alert if one of these additional exporters is down, but if the host itself is down you don't necessarily want to also be told that all of its exporters are down. That's sort of implied, since everything on the host is down with it.

When I was writing alerts for mtail (and some other additional on-host exporters) I came up with a tricky way of doing this entirely within the Prometheus alert rule. Here it is:

 (up{job="mtail"} != on (host) group_left (notpresent)
  up{job="node"}) == 0

To understand why I've written this with group_left, let's start with a simpler version without it:

(up{job="mtail"} != on (host) up{job="node"}) == 0

The inner '(...)' matches only if the mtail exporter is not in the same 'up' state as the host's node_exporter. Since a comparison keeps the left-hand side's value, the '== 0' then only matches if this state difference is because mtail is the one that's down.
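
As a worked (and entirely hypothetical) example, suppose a host 'www' has its node_exporter up but its mtail down; the hosts, ports, and values here are invented for illustration:

  # Hypothetical series at evaluation time:
  #   up{job="node",  host="www", instance="www:9100"}  = 1
  #   up{job="mtail", host="www", instance="www:3903"}  = 0
  #
  # up{job="mtail"} != on (host) up{job="node"}
  #   => {host="www"}  0    (the values differ, so the element is kept
  #                          with the left-hand side's value)
  # (...) == 0
  #   => {host="www"}  0    (kept, because it is mtail that is down)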

The simpler version is a perfectly good rule, but it has a limitation: the only label you get on the result is the `host' label. In particular, you don't get the 'instance' label that will normally tell you (or someone reading the alert message) what port mtail is expected to be on. In order to get additional labels from 'up{job="mtail"}', we need to switch from a plain comparison to group_left. Group_left is what you use to keep the left-hand side's labels; it can also add labels from the other metric, although that's not relevant here.

This gives us the more complicated version, which preserves the 'instance' label (and anything else relevant) for use in the alert message. Since we don't want any labels from `up{job="node"}', we tell `group_left' to operate on a nonexistent label. I tend to use hopefully clear names for these nonexistent labels, so that later I can tell right away that there isn't supposed to be a label with that name.
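
With the same hypothetical series as before, the group_left version produces a result that keeps mtail's own labels, so an alert annotation can mention the instance. This is a sketch; the annotation wording is invented:

  # (up{job="mtail"} != on (host) group_left (notpresent)
  #  up{job="node"}) == 0
  #   => {job="mtail", host="www", instance="www:3903"}  0
  #
  # which lets an alert rule's annotations say something like:
  annotations:
    summary: 'mtail on {{ $labels.host }} (expected at {{ $labels.instance }}) is down'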

All of this is well and good, but in some cases you might not want to suppress alerts about additional exporters on a host being down. One of those cases is actually the example I gave above, of a host running the Blackbox exporter to give you visibility into other network segments (cf). In this case, Blackbox being down because the host is down has an effect beyond the host; all of your probes through it are going to go dark. You may want to include this information in the alerts you get so that you get reminded of your (temporary) observability gap.

My current view is that under normal circumstances, I should only suppress alerts about additional exporters that give us information from the machine itself. If there are exporters on the machine that give us information from beyond the machine, I probably want to include their alert in the 'this machine is down' collection of alerts, and I probably want their alert message to say something about what information we don't have right now.
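
As a hypothetical sketch of that second sort of alert, a Blackbox exporter that reaches into another network segment might get a plain, un-suppressed rule whose message spells out what we can't see; the job name, timing, and wording here are all invented for illustration:

  - alert: SegmentBlackboxDown
    expr: up{job="blackbox-segment2"} == 0
    for: 5m
    annotations:
      summary: >-
        Blackbox on {{ $labels.host }} is down, so we have no probe results
        for its network segment until it (or the host) comes back.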
