In Prometheus queries, on and ignoring don't drop labels from the result

June 17, 2021

Today I learned that one of the areas of PromQL, the query language for Prometheus that I'm a still a bit weak on is when labels will and won't get dropped from metrics as you manipulate them in a query. So I'll start with the story.

Today I wrote an alert rule to make sure that the network interfaces on our servers hadn't unexpectedly dropped down to 100 Mbit/second (instead of 1Gbit/s or for some servers 10Gbit/s). We have a couple of interfaces on a couple of servers that legitimately are at 100M (or as legitimately as a 100M connection can be in 2021), and I needed to exclude them. The speed of network interfaces is reported by node_exporter in node_network_speed_bytes, so I first wrote an expression using unless and all of the labels involved:

node_network_speed_bytes == 12500000 unless
  ( node_network_speed_bytes{host="host1",device="eno2",...} or
    node_network_speed_bytes{host="host2",device="eno1",...} )

However, most of the standard labels you get on metrics from the host agent (such as job, instance, and so on) are irrelevant and even potentially harmful to include (the full set of labels might have to change someday). The labels I really care about are the host and the device. So I rewrote this as:

node_network_speed_bytes == 12500000 unless on(host,device) [....]

When I wrote this expression I wasn't sure if it was going to drop all other labels beside host and device from the filtered end result of the PromQL expression. It turns out that it didn't; the full set of labels for node_network_speed_bytes is passed through, even though we're only matching on some of them in the unless.

(The host and the device are all that I needed for the alert message so it wouldn't have been fatal if the other labels were dropped. But it's better to retain them just in case.)

Aggregation operators discard labels unless you use without or by, as covered by their documentation (although it's not phrased that way), since aggregating over labels is their purpose. As I've found out, careless use of aggregation operators can lose labels that are valuable for alerts (which may be what left me jumpy about this case). Aggregation over time keeps all labels, though, because it's aggregating over time instead of over some or all labels. But as I was reminded today (since I'm sure I've seen it before), vector matching using on and ignoring don't drop labels, they merely restrict what labels are used in the matching (and then it's up to you to make sure you still have a one to one vector match or at least a match that you expect; I've made mistakes there).

(You can also explicitly pull in additional labels from other metrics.)

There may be other cases in PromQL where labels are dropped, but if so I can't think of them right now. My overall moral is that I still need to test my assumptions and guesses in order to be sure about this stuff.

Sidebar: Why I used unless (... or ...) in this query

In many cases, the obvious way to exclude some things from an alert rule expression is to use negative label matches. However, these can't match on the combination of several labels instead of the value of a single label. As far as I know, if you want to exclude only certain label combinations (here 'host1 and eno2' and 'host2 and eno1') where the individual label elements can occur separately (so host1 and host2 both have other network interfaces, and other hosts have eno1 and eno2 interfaces), you're stuck with more awkward construction I used. This construction is unfortunately somewhat brute force.

Written on 17 June 2021.
« The challenge of what to set server BIOSes to do on power loss
Apache directory indexes will notice and exclude blocked URLs in them »

Page tools: View Source, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Thu Jun 17 00:40:10 2021
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.