Understanding how to pull in labels from other metrics in Prometheus

May 20, 2019

Brian Brazil recently wrote 'Analyse a metric by kernel version', where he shows how to analyze a metric in a new way by, essentially, adding a label from another metric to it (in this case, the kernel version). His example is a neat trick, but it's also reasonably tricky to understand how it works, so today I'm going to break it down (partly so that I can remember this in six months or a year from now, when my PromQL knowledge has inevitably rusted).

The query example is:

avg without (instance)(
    node_sockstat_TCP_tw 
  * on(instance) group_left(release)
    node_uname_info
)

The simple version of what's happening here is that because node_uname_info's value is always 1, we're using '*' as a do-nothing arithmetic operator so we can essentially do a join between node_sockstat_TCP_tw and node_uname_info to grab a label from the latter. We have to go to these lengths because PromQL does not have an explicit 'just do a join' operator that can be used with group_left.

There are several things going on here. Let's start with the basic one, which is the '* on(instance)' portion. Taken by itself, this is one-to-one vector matching with a restriction on what label is being used to match up pairs of entries; we're implicitly restricting the multiplication to pairs of entries with matching 'instance' labels. Normally 'instance' will be the same for all metrics scraped from a single host's node_exporter, so it makes a good label for finding the node_uname_info metric that corresponds to a particular host's node_sockstat_TCP_tw metric.

(We have to use 'on (...)' because by default PromQL matches on all labels, and not all labels match here. After all, we're pulling in the 'release' label from the node_uname_info metric; if it were already available as a label on node_sockstat_TCP_tw, we wouldn't need to do this work at all.)
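To see the matching step in isolation, here is a sketch of the same multiplication without group_left. It assumes each instance has exactly one node_uname_info series (otherwise the one-to-one match is an error). With a plain 'on(instance)' match, the result keeps only the 'instance' label, so 'release' has not been pulled in yet:

```promql
# One-to-one match on 'instance' alone. The values are unchanged
# (multiplied by 1), but the result carries only the 'instance' label,
# not 'release'.
  node_sockstat_TCP_tw
* on(instance)
  node_uname_info
```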

Next is the group_left, which is being used here for its side effect of incorporating the 'release' label from node_uname_info in the label set of the results. I wrote about the basics of group_left's operation in Using group_* vector matching for database lookups, where I used group_left basically as a database join between a disk space usage metric and an alert level metric that also carried an additional label we wanted to include for who should get alerted. Brian Brazil's overall query here is similar to my case, except that here we don't care about the value that the node_uname_info metric has; we are only interested in its 'release' label.

In an ideal world, we could express this directly in PromQL to say 'match between these two metrics based on instance and then copy over the release label from the secondary one'. In this world, unfortunately, group_left and group_right have the limitation that they can only be used with arithmetic and comparison operators. In my earlier entry this wasn't a problem because we already wanted to compare the values of the two metrics. Here, we don't care about the value of node_uname_info at all. Since we need an arithmetic or comparison operator in order to use group_left and we want to ignore the value of node_uname_info, we need an operator that will leave node_sockstat_TCP_tw's value unchanged. Because the value of node_uname_info is always 1, we can simply use '*', as multiplying by one will do nothing here.
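Since the trick leans on node_uname_info always being 1, a quick sanity check may be worth running first. As a sketch, this filter should return nothing if every series really does have the value 1:

```promql
# Keeps only node_uname_info series whose value is not 1.
# An empty result means multiplying by this metric leaves values unchanged.
node_uname_info != 1
```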

(In theory we could instead use a comparison operator, which would naturally leave node_sockstat_TCP_tw's value unchanged, more or less; a comparison keeps the left side's value but drops any entry for which the comparison is false. In practice, it's often tricky to find a comparison that will always be true. You might not have any sockets in TIME_WAIT, so a '>=' could be false here, for example. Using an arithmetic operator that will have no effect is simpler.)
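As a sketch of how the comparison version goes wrong, here is the '>=' form of the join. It is a legal query, but because comparisons filter, any host whose TIME_WAIT count is 0 fails '0 >= 1' and silently drops out of the result:

```promql
# Comparison form of the join: pulls in 'release' and keeps the left
# side's value, but drops hosts where node_sockstat_TCP_tw is below 1
# (for example, hosts with no sockets in TIME_WAIT).
   node_sockstat_TCP_tw
>= on(instance) group_left(release)
   node_uname_info
```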

The case of a secondary metric that's always 1 is the easy case, as we've seen. What about a secondary metric with a label you want that isn't necessarily always 1, and in fact may have an arbitrary value? Fortunately, Brian Brazil has provided the answer to that too. The simple but clever trick is to multiply the metric by zero and then add it:

  node_sockstat_TCP_tw
+ on(instance) group_left(release)
  (node_uname_info * 0)

This works with arbitrary values; multiplying by zero turns the value for the right side to 0, and then adding 0 has no effect on node_sockstat_TCP_tw's value.
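For illustration, the zero-multiply form slots into the original aggregation in place of the '*' version. This is a sketch that reuses node_uname_info as the stand-in for a secondary metric with an arbitrary value:

```promql
# Same per-release average as the original query, but without relying
# on the secondary metric's value being 1.
avg without (instance)(
    node_sockstat_TCP_tw
  + on(instance) group_left(release)
    (node_uname_info * 0)
)
```

One caveat from ordinary floating-point arithmetic: if the secondary metric's value is NaN or infinite, multiplying it by zero gives NaN instead of 0, and the sum becomes NaN as well.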

As a side note, this illustrates a good reason to have '1' be the value of any metric that exists to publish its labels, as is the case for node_uname_info or metrics that publish, say, the version of your program. The value these metrics have is arbitrary in one sense, but '1' is both conventional and convenient.
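The same pattern works with other label-publishing metrics. As a sketch, assuming a Prometheus server that scrapes itself (so that process_resident_memory_bytes and prometheus_build_info share 'job' and 'instance' labels), this averages memory usage by Prometheus version:

```promql
# Assumption: both metrics come from the same Prometheus self-scrape
# target. prometheus_build_info has the value 1 and carries 'version'.
avg without (instance)(
    process_resident_memory_bytes
  * on(job, instance) group_left(version)
    prometheus_build_info
)
```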
