Filtering Prometheus metrics with deliberately repeated labels

May 6, 2022

We have a SLURM "cluster", by which we mean a pool of servers where people can reserve some cores and RAM for themselves. Each compute server in the cluster (a node in SLURM terms) needs to be running the slurmd daemon for people to be able to use its resources. This daemon can die under some circumstances, so we added an alert to our Prometheus setup that checks for slurmd not being active. However, we don't want to alert on slurmd not being active on all of our machines; on machines outside the SLURM cluster, it might be installed but not active for various reasons. Fortunately, all of our SLURM nodes follow a simple naming scheme; they're all called 'cpunodeNN', eg 'cpunode1' or 'cpunode23'. This leads to a straightforward alert rule expression, more or less (using a label for what host the metric comes from):

node_systemd_unit_state { state="active",
  name="slurmd.service",
  host=~"cpunode.*" } != 1

Recently we took a couple of our SLURM nodes out of the cluster so they could become test nodes in a new Ubuntu 22.04 based version of the cluster. As test nodes, these may not be running slurmd all of the time, so we don't want to alert about slurmd not being active on them. So we need to exclude them from the alert.

At first I started thinking about clever things to do with the regular expression for which hosts matched, because you certainly can write a regexp that will match all one- and two-digit numbers except for, say, 9 and 23 (ie, cpunode9 and cpunode23). Then I realized there was a simpler way. I could add a requirement that the host label not be one of those two hosts, through a second label match on host. Like this:

node_systemd_unit_state { ...,
  host=~"cpunode.*", host!~"cpunode9|cpunode23" } != 1

When you repeat a label like this, you require the label to pass both match conditions. Here, our host label must be both a 'cpunodeNN' name and not cpunode9 or cpunode23. This is exactly what we want, and it puts the excluded hosts right into the alert rule alongside the matched hosts, rather than (say) in our Alertmanager configuration.
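To make the "must pass both conditions" behaviour concrete, here is a small Python model of label matching. It is only a sketch of the semantics, not how Prometheus implements them. The `matches` helper and the extra host names 'cpunode90' and 'apps1' are made up for illustration, and Python's `re` module stands in for the RE2 syntax that Prometheus uses (the two agree for patterns this simple).

```python
import re

# Each matcher is (label, operator, value), mirroring PromQL's =, !=, =~ and !~.
def matches(labels, matchers):
    """Return True only if the label set passes every matcher (logical AND)."""
    for name, op, value in matchers:
        actual = labels.get(name, "")  # a missing label behaves like an empty value
        if op == "=":
            ok = actual == value
        elif op == "!=":
            ok = actual != value
        elif op == "=~":
            ok = re.fullmatch(value, actual) is not None  # regexps are fully anchored
        elif op == "!~":
            ok = re.fullmatch(value, actual) is None
        else:
            raise ValueError(f"unknown operator {op!r}")
        if not ok:
            return False
    return True

# The two matchers on the same label from the alert rule above.
MATCHERS = [
    ("host", "=~", "cpunode.*"),
    ("host", "!~", "cpunode9|cpunode23"),
]

# Hypothetical host names; 'cpunode90' and 'apps1' are invented test cases.
hosts = ["cpunode1", "cpunode9", "cpunode23", "cpunode90", "apps1"]
selected = [h for h in hosts if matches({"host": h}, MATCHERS)]
print(selected)  # ['cpunode1', 'cpunode90']
```

Because the regular expressions are anchored at both ends, excluding 'cpunode9' does not accidentally exclude a hypothetical 'cpunode90'.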

Using the same label name in multiple match conditions in a time series selector feels odd and it's certainly unusual. But there's no rule against it in PromQL and it fits into the general Prometheus data model, where your label matchers are just filtering the time series (starting with the metric name). In fact, repeating labels this way is specifically allowed by the Prometheus documentation:

Label matchers that match empty label values also select all time series that do not have the specific label set at all. It is possible to have multiple matchers for the same label name.

(Emphasis mine, on the second sentence.)

However, this technique of repeated matches of the same label has a limitation; it only works if you can exclude based on a single label. If you need to exclude based on a combination of labels (say 'network interface B on host A', where host A has several network interfaces and a network interface with that name exists on several hosts), you have a more difficult challenge. See this entry's sidebar for some notes on this.
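A quick way to see the limitation is to model it outside of PromQL. The Python sketch below uses made-up host and interface names ('hostA', 'eth1' and so on). It compares what a pair of per-label negative matchers would keep against what we actually want, which is to drop only the one series where both labels match. In PromQL itself that sort of exclusion has to be done with something other than label matchers; the `unless` operator is one candidate.

```python
# Hypothetical series, identified by their (host, device) label pairs.
series = [
    ("hostA", "eth0"), ("hostA", "eth1"),
    ("hostB", "eth0"), ("hostB", "eth1"),
]
excluded_pair = ("hostA", "eth1")

# Per-label negative matchers, as in host!="hostA", device!="eth1".
# Every matcher is checked on its own, so this drops far too much.
per_label = [(h, d) for (h, d) in series
             if h != excluded_pair[0] and d != excluded_pair[1]]

# What is actually wanted: drop only the series where both labels match.
pair_only = [s for s in series if s != excluded_pair]

print(per_label)  # [('hostB', 'eth0')]
print(pair_only)  # [('hostA', 'eth0'), ('hostB', 'eth0'), ('hostB', 'eth1')]
```

The per-label version loses everything on hostA and every eth1 interface, which is why label matchers alone can't express this kind of exclusion.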

These days, PromQL is supported in projects other than Prometheus, often because they want to interoperate with Prometheus users or Prometheus related tools (see an overview of PromQL compliance test results). I don't know if all of these projects support multiple matchers for the same label name (it doesn't appear to be in the current compliance test suite), so if this is relevant to you, you might want to test it yourself.
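If you do want to test another implementation, one rough approach is to send it a selector with repeated matchers through the Prometheus-style HTTP query API and see whether any excluded host comes back. The Python sketch below assumes the implementation exposes the standard `/api/v1/query` endpoint. The base URL and the helper names (`query_url`, `leaked_hosts`, `check`) are placeholders of mine, and the metric and hosts are simply the ones from this entry.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

# Assumed values: adjust the base URL, metric, and hosts for your own setup.
BASE_URL = "http://localhost:9090"
QUERY = 'node_systemd_unit_state{host=~"cpunode.*", host!~"cpunode9|cpunode23"}'
EXCLUDED = {"cpunode9", "cpunode23"}

def query_url(base_url, expr):
    """Build an instant-query URL for the Prometheus-style HTTP API."""
    return base_url + "/api/v1/query?" + urllib.parse.urlencode({"query": expr})

def leaked_hosts(response, excluded):
    """Return the excluded hosts that still appear in an instant-query response."""
    found = {r["metric"].get("host") for r in response["data"]["result"]}
    return sorted(found & excluded)

def check(base_url=BASE_URL):
    """Run the query and report whether the second matcher was honoured."""
    try:
        with urllib.request.urlopen(query_url(base_url, QUERY), timeout=10) as resp:
            body = json.load(resp)
    except urllib.error.HTTPError as e:
        # A rejected expression typically comes back as an HTTP error status.
        return f"query rejected: HTTP {e.code}"
    if body.get("status") != "success":
        return "query rejected: " + str(body.get("error"))
    leaked = leaked_hosts(body, EXCLUDED)
    return "ok" if not leaked else "second matcher ignored for: " + ", ".join(leaked)
```

Calling `check()` against a live server only tells you something if the excluded hosts are actually reporting the metric; otherwise an implementation that silently ignored the second matcher would still look fine.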

(I consider this an issue worth thinking about for other PromQL implementations because having multiple matchers for the same label potentially affects your internal data structures and matching code.)
