Wandering Thoughts archives

2022-06-05

Checking a few metrics (time series) at once in Prometheus's query language

Back in my entry on monitoring the status of Linux network interfaces with Prometheus, I mentioned that one of our alerts was for if any of the small number of network interfaces that were supposed to be at 10G weren't actually at 10G. However, I didn't explain how you do this; instead, I treated it as more or less obvious, since I had earlier written an entry that did something similar.

(At some point what's obvious to me about PromQL will stop being obvious, unless I keep regularly fiddling with our Prometheus setup. Which I suppose I have so far.)

The simple way to pick out a few things at once is with label matches. If you want a few hosts at once, for example:

node_network_speed_bytes { host=~"a|b|c|d", device="eno1" } != 1250000000

However, this doesn't help if you want to match only some specific combinations of two labels. If all of these hosts have both an eno1 and an eno2 interface, and on host a the eno1 interface is the 10G one but on host b the eno2 interface is, you can't match this with label selectors. Instead, you need a more powerful set union operation, which means or.

Using or, you simply list off the various matches you want:

( node_network_speed_bytes { host="a", device="eno1" } or
  node_network_speed_bytes { host="b", device="eno2" } )
        != 1250000000

Repeat for as many things as needed. Naturally you can combine label matches and or; if you have three hosts where eno2 should be at 10G and two hosts where eno1 should be, you can use appropriate 'host=~' matches. And so on. But generally I find it simpler to just list out everything without regular expressions; that way it's both more obvious what's included and easier to delete and add entries.
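
As an illustration of combining the two, here is a sketch of the three-and-two case just described. The host names a through e are placeholders, not real hosts, and the grouping is only an assumption about how you might split them:

```promql
# Hosts a, b and c are assumed to have their 10G port on eno2;
# hosts d and e are assumed to have it on eno1.
( node_network_speed_bytes { host=~"a|b|c", device="eno2" } or
  node_network_speed_bytes { host=~"d|e", device="eno1" } )
        != 1250000000
```

This is shorter than five separate selectors, at the cost of having to edit a regular expression whenever a host is added or removed.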

(If you care about which option is faster inside Prometheus, you probably have a very unusual situation. When in doubt, go for clarity in Prometheus alert rules; your future self will thank you.)

There's a subtle danger if you try a clever variation of this. Suppose that you actually want to check a number of different metrics at once, for example all of the TSDB health metrics Prometheus exposes. The lazy person's way of writing this alert expression is to use or again, and the result looks convincing:

( prometheus_tsdb_bad_1 or prometheus_tsdb_bad_2 or
  prometheus_tsdb_bad_3 or prometheus_tsdb_bad_4 ) > 0

Unfortunately this doesn't work, because of how exactly or is defined. Let's quote the definition of or carefully, paying particular attention to its final clause:

vector1 or vector2 results in a vector that contains all original elements (label sets + values) of vector1 and additionally all elements of vector2 which do not have matching label sets in vector1.

The name of a metric is not considered a label for the purposes of or (and other logical/set binary operators). Since all four of our TSDB badness metrics are scraped from the same Prometheus, all of them will have the same label sets and so our 'or' expression will only ever give us the value of the first metric, 'prometheus_tsdb_bad_1'. As a result, this alert will only ever trigger if prometheus_tsdb_bad_1 is above zero; the state of the other three is ignored, since they never appear in 'or's result vector.

This doesn't happen in the node_network_speed_bytes case because we'll have different labels on everything. They're different time series of the same metric, so by definition each one has at least one label value that differs from all of the others. As a result, or will return multiple things, each of which will be checked against the speed we want (and each of which will generate a separate alert if its speed isn't 10G).
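
For the multiple metrics case, one approach that should work is to do the comparison on each metric separately, before the or, so that a metric that's fine drops out of its operand and can't shadow the others. A second option is to select on the metric name itself. Both are sketches that reuse the placeholder metric names from above, not real Prometheus metrics:

```promql
# Variant 1: filter each metric on its own. An operand where nothing is
# above zero is an empty vector, so it hides nothing from 'or'.
( prometheus_tsdb_bad_1 > 0 ) or ( prometheus_tsdb_bad_2 > 0 ) or
( prometheus_tsdb_bad_3 > 0 ) or ( prometheus_tsdb_bad_4 > 0 )

# Variant 2: match on the metric name with a regular expression, which
# avoids 'or' (and its label set matching) entirely.
{ __name__=~"prometheus_tsdb_bad_[1-4]" } > 0
```

With the first variant, if two of the metrics are above zero for the same label set, only the first one listed shows up in the result, although the expression still produces something for an alert to fire on. The second variant reports every metric that is above zero.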

The moral I take from this is that I shouldn't be too clever (or too lazy) when writing alert expressions, and also that I should make sure that logical/set binary operators are actually giving me the time series I expect them to.
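
One cheap way to do that sort of check is to count what comes out of the or and compare it with the number of things you listed. A sketch for the two-interface example from earlier, with hosts a and b being the same placeholders as before:

```promql
# Should evaluate to 2 if both time series exist and 'or' kept both.
count(
  node_network_speed_bytes { host="a", device="eno1" } or
  node_network_speed_bytes { host="b", device="eno2" }
)
```

Wrapping the four-metric or expression in count() the same way would be expected to give 1 instead of 4, assuming all four metrics come from a single Prometheus with identical labels, which is the warning sign.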

(I discovered this or behavior in the process of writing this entry, and before I fully understood it I had a moment of alarm at the prospect that our 'is it at 10G' alert was only paying attention to the first of several host and interface combinations I had listed. Fortunately this isn't the case because of the different label sets.)

sysadmin/PrometheusCheckAFewMetrics written at 22:27:03

