2022-06-05
Checking a few metrics (time series) at once in Prometheus's query language
Back in my entry on monitoring the status of Linux network interfaces with Prometheus, I mentioned that one of our alerts was for if any of the small number of network interfaces that were supposed to be at 10G weren't actually at 10G. However, I didn't explain how you did this; instead, I sort of treated it as obvious, since I had earlier written an entry that sort of did this.
(At some point what's obvious to me about PromQL will stop being so obvious any more, unless I keep regularly fiddling with our Prometheus setup on an ongoing basis. Which I suppose I have so far.)
The simple way to pick out a few things at once is with label matches. If you want a few hosts at once, for example:
node_network_speed_bytes { host=~"a|b|c|d", device="eno1" } != 1250000000
However, this doesn't help if you want to match only some specific
combinations of two labels. If all of these hosts have both an eno1
and an eno2 interface, and on host a the eno1 interface is the 10G
one but on host b the eno2 interface is, you can't match this with
label selectors. Instead, you need a more powerful set union
operation, which means 'or'.
Using 'or', you simply list off the various matches you want:
( node_network_speed_bytes { host="a", device="eno1" } or node_network_speed_bytes { host="b", device="eno2" } ) != 1250000000
Repeat for as many things as needed. Naturally you can combine label
matches and 'or'; if you have three hosts where eno2 should be at
10G and two hosts where eno1 should be, you can use appropriate
'host=~' matches. And so on. But generally I find it simpler to
just list out everything without regular expressions; that way it's
both more obvious what's included and easier to delete and add
entries.
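As a sketch of the combined form, assuming hypothetical hosts a and b where eno1 is the 10G interface and hosts c, d, and e where eno2 is, the expression might look like this:

```promql
# Host names here are placeholders for illustration.
(
  node_network_speed_bytes { host=~"a|b", device="eno1" }
  or
  node_network_speed_bytes { host=~"c|d|e", device="eno2" }
) != 1250000000
```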
(If you care about which option is faster inside Prometheus, you probably have a very unusual situation. When in doubt, go for clarity in Prometheus alert rules; your future self will thank you.)
There's a subtle danger if you try a clever variation of this.
Suppose that you actually want to check a number of different metrics
at once, for example all of the TSDB health metrics Prometheus
exposes. The lazy person's way of writing this alert expression is
to use 'or' again, and the result looks convincing:
( prometheus_tsdb_bad_1 or prometheus_tsdb_bad_2 or prometheus_tsdb_bad_3 or prometheus_tsdb_bad_4 ) > 0
Unfortunately this doesn't work, because of how exactly 'or' is
defined. Let's quote the definition of 'or' carefully with some added emphasis:
vector1 or vector2 results in a vector that contains all original elements (label sets + values) of vector1 and additionally all elements of vector2 which do not have matching label sets in vector1.
The name of a metric is not considered a label for the purposes of
'or' (and other logical/set binary operators). Since all four of
our TSDB badness metrics are scraped from the same Prometheus, all
of them will have the same label sets and so our 'or' expression
will only ever give us the value of the first metric,
'prometheus_tsdb_bad_1'. As a result, this alert will only ever
trigger if prometheus_tsdb_bad_1 is above zero; the state of
the other three is ignored, since they never appear in the result
vector of 'or'.
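One way around this, sketched here with the same placeholder metric names, is to apply the comparison to each metric before combining them. A metric that fails the comparison drops out of its vector, so it can no longer mask the ones after it. A second option, which should keep every matching series, is a regular expression match on the __name__ label; I believe the metric name survives a filtering comparison like this, but it's worth checking against your Prometheus version.

```promql
# Option 1: compare first, then combine. Only metrics that are above
# zero are left for 'or' to merge.
(prometheus_tsdb_bad_1 > 0)
or (prometheus_tsdb_bad_2 > 0)
or (prometheus_tsdb_bad_3 > 0)
or (prometheus_tsdb_bad_4 > 0)

# Option 2 (a separate expression): select all four metrics by name.
{__name__=~"prometheus_tsdb_bad_1|prometheus_tsdb_bad_2|prometheus_tsdb_bad_3|prometheus_tsdb_bad_4"} > 0
```

With the first option, if several metrics are above zero at once the result still only carries the first of them for a given label set, so you get one alert per Prometheus instance and not one per metric.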
This doesn't happen in the node_network_speed_bytes case because
we'll have different labels on everything. They're different time
series of the same metric, so by definition they have some labels
that are different from all of the other versions. As a result,
'or' will return multiple things, each of which will be checked
against the speed we want (and each of which will generate a separate
alert if their speed is too low).
The moral I take from this is that I shouldn't be too clever (or too lazy) when writing alert expressions, and also that I should make sure that logical/set binary operators are actually producing the time series I expect them to.
(I discovered this 'or' behavior in the process of writing this
entry, and before I fully understood it I had a moment of alarm at
the prospect that our 'is it at 10G' alert was only paying attention
to the first of several host and interface combinations I had listed.
Fortunately this isn't the case because of the different label
sets.)