Filtering Prometheus metrics with deliberately repeated labels
We have a SLURM "cluster", by which we mean a pool of servers that
people can use to reserve some cores and RAM for themselves.
Each compute server in the cluster (a node in SLURM terms) needs to be
running the slurmd daemon in order for people to be able to use its
resources. This daemon can die under some circumstances, so we added an
alert to our Prometheus setup that checks for slurmd not being active.
However, we don't
want to alert on slurmd not being active on all of our machines; on
machines outside the SLURM cluster, it might be installed but not
active for various reasons. Fortunately, all of our SLURM nodes
follow a simple naming scheme; they're all called 'cpunodeNN',
eg 'cpunode1' or 'cpunode23'. This leads to a straightforward
alert rule expression, more or less (using a label for what
host the metric comes from):
node_systemd_unit_state { state="active",
     name="slurmd.service",
     host=~"cpunode.*" } != 1
Recently we took a couple of our SLURM nodes out of the cluster so they could become test nodes in a new Ubuntu 22.04 based version of the cluster. As test nodes, these may not be running slurmd all of the time, so we don't want to alert about slurmd not being active on them. So we need to exclude them from the alert.
At first I started thinking about clever things to do with the regular
expression for which hosts matched, because you certainly can write a
regexp that will match all one and two digit numbers except for, say, 9
and 23 (ie, cpunode9 and cpunode23). Then I realized there was a simpler
way. I could add a requirement that the host label not be one of those
two hosts, through a new label match on host. Like this:
node_systemd_unit_state { ...,
     host=~"cpunode.*", host!~"cpunode9|cpunode23" } != 1
When you repeat a label like this, you require the label to pass both
match conditions. Here, our host label must be both a 'cpunodeNN' name
and not cpunode9 or cpunode23. This is exactly what we want, and it
puts the excluded hosts right into the alert rule alongside the matched
hosts, rather than (say) in our Alertmanager configuration.
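To make the "must pass both match conditions" behaviour concrete, here is a minimal Python sketch that models a selector as a list of matchers that are all ANDed together. This is not Prometheus code; the `matches` helper and the sample host names (including 'cpunode90' and 'apps0') are made up for illustration. It uses Python's `re.fullmatch` to mimic the fact that PromQL regular expression matchers are fully anchored, and it assumes the patterns are simple enough that Python's regexp syntax and Prometheus's agree.

```python
import re

def matches(labels, matchers):
    """Return True if a series' labels satisfy every matcher (logical AND)."""
    for name, op, value in matchers:
        actual = labels.get(name, "")  # a missing label acts like an empty value
        if op == "=":
            ok = actual == value
        elif op == "!=":
            ok = actual != value
        elif op == "=~":
            ok = re.fullmatch(value, actual) is not None
        elif op == "!~":
            ok = re.fullmatch(value, actual) is None
        else:
            raise ValueError(f"unknown matcher operator: {op}")
        if not ok:
            return False
    return True

# The same label name ("host") appears twice; both matchers must pass.
selector = [
    ("host", "=~", "cpunode.*"),
    ("host", "!~", "cpunode9|cpunode23"),
]

# Hypothetical hosts, purely for illustration.
hosts = ["cpunode1", "cpunode9", "cpunode23", "cpunode90", "apps0"]
selected = [h for h in hosts if matches({"host": h}, selector)]
print(selected)
```

In this sketch 'cpunode90' is still selected even though it starts with 'cpunode9', because the exclusion pattern has to match the whole label value and not just a prefix of it.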
Using the same label name in multiple match conditions in a time series selector feels odd and it's certainly unusual. But there's no rule against it in PromQL and it fits into the general Prometheus data model, where your label matchers are just filtering the time series (starting with the name). In fact, repeating labels this way is specifically allowed by the Prometheus documentation on time series selectors:
Label matchers that match empty label values also select all time series that do not have the specific label set at all. It is possible to have multiple matchers for the same label name.
(Emphasis mine.)
However, this technique of repeated matches of the same label has a limitation: it only works if you can exclude based on a single label. If you need to exclude based on a combination of labels (say 'network interface B on host A', where host A has several network interfaces and a network interface with that name is on several hosts), you have a more difficult challenge. See this entry's sidebar for some notes on this.
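A small Python sketch of why a single selector can't express the combination case, using hypothetical hosts and interface names. Since every matcher in a selector must pass, writing both host!="A" and device!="B" throws away everything on host A and every interface called B, not just the one combination. The label name 'device' here is an assumption for illustration.

```python
# Hypothetical series, identified by their (host, device) labels.
series = [
    {"host": "A", "device": "eth0"},
    {"host": "A", "device": "B"},   # the only series we want to exclude
    {"host": "C", "device": "B"},
    {"host": "C", "device": "eth0"},
]

# What a selector with host!="A", device!="B" keeps: both matchers must pass.
kept_by_matchers = [
    s for s in series if s["host"] != "A" and s["device"] != "B"
]

# What we actually want: everything except the single (A, B) combination.
wanted = [
    s for s in series if not (s["host"] == "A" and s["device"] == "B")
]

print(len(kept_by_matchers), len(wanted))
```

Only one of the four series survives the two negative matchers, where we wanted to keep three of them.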
These days, PromQL is supported in projects other than Prometheus, often because they want to interoperate with Prometheus users or Prometheus related tools (see an overview of PromQL compliance test results). I don't know if all of these projects support multiple matchers for the same label name (it doesn't appear to be in the current compliance test suite), so if this is relevant to you, you might want to test it yourself.
(I consider this an issue worth thinking about for other PromQL implementations because having multiple matchers for the same label potentially affects your internal data structures and matching code.)