Wandering Thoughts archives

2023-12-17

Prometheus's group_left() and group_right() operators

I'll start with the motivating story. Suppose, not hypothetically, that you have some Bind nameservers and a Prometheus environment, so you're monitoring those nameservers with the Bind exporter. One thing the Bind exporter does is provide the DNS SOA serial number for every zone Bind is configured to be a primary or a secondary for. If you have a primary and some internal secondaries (as we do), you'd like to be sure that your secondaries have the same DNS SOA serial numbers as your primary does. Writing an alert expression for this requires using one of PromQL's matching operators for many-to-one matching, since you have one primary and more than one secondary. However, speaking from recent personal experience, it's surprisingly easy to gloss over the details of the expression you want, especially if you start out with only one secondary. Since I've now stubbed my toes on this repeatedly, I'm going to write down the matrix of possibilities in one spot.

To save my future self some reading, here is the actual matrix that's explained in the rest of this entry, with the note that labels normally come from the 'many' side, whichever that is.

extra labels?           'many' on the left side     'many' on the right side
none from 'one' side    group_left(notpresent)      group_right(notpresent)
some from 'one' side    group_left(label1, …)       group_right(label1, …)

The 'notpresent' can be any label name that's not actually present; I use 'notpresent' for clarity. When adding extra labels, you don't include (and can't include) any that you've used in an 'on()', since those are not extra; they're already known to be the same between both sides.

First off, your choice between group_left() and group_right() is determined by which side is the 'many' side. If the left side is the many side, you use group_left(); if the right side is the many side, you use group_right(). Often the choice of which side is which will be determined by which side's value you want to use, because the value is the one thing you can't take from the other side. If you get the side wrong (or, you could say, get the direction of the match wrong), you get the classic error:

Error executing query: found duplicate series for the match group [...] on the left hand-side of the operation: [...] many-to-many matching not allowed: matching labels must be unique on one side

(This is an error from when I used group_right() and should have used group_left(). It will say 'right hand-side' if it's the other way around. If you start out with what's currently a one to one match (because you only have one DNS resolver running Bind so far), you can have this error lurk unnoticed for a while.)
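
As a sketch of how this plays out for the SOA serial check, here are both directions side by side. The host!="primary" selector for picking out the secondaries is an assumption made for illustration, and so is the idea that each nameserver reports exactly one series per zone in the 'internal' view. These are two separate queries, not one.

```promql
# Query 1, wrong direction: group_right() declares the right side (the
# primary) to be the 'many' side, so the left side must be unique per
# zone_name. With a single secondary this happens to evaluate; with two or
# more it fails with the "found duplicate series ... on the left hand-side"
# error.
bind_zone_serial{host!="primary", view="internal"}
  != on (zone_name) group_right(notpresent)
    bind_zone_serial{host="primary", view="internal"}

# Query 2, right direction: group_left() declares the left side (the
# secondaries) to be the 'many' side, so only the primary has to be unique
# per zone_name. This keeps working as secondaries are added.
bind_zone_serial{host!="primary", view="internal"}
  != on (zone_name) group_left(notpresent)
    bind_zone_serial{host="primary", view="internal"}
```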

In my DNS SOA serial number alert, the 'many' side is the left hand side because I want the alert to include the incorrect SOA serial that the DNS secondary has. In a much earlier alert on disk space that used group_right(), the many side was the right hand side, because I wanted the alerts about low space on filesystems to mention the filesystem's current space (a 'one' metric) instead of the alert level for whoever was getting alerted (a 'many' metric when joined against the filesystem).

The second choice is whether you want any extra labels from the 'one' side. With group_left() this is the right side, and with group_right() it's the left side. In theory this sounds symmetric, but in practice it's not, because if you're forced to use group_right(), by itself your alert labels won't come from the metric whose value generated the alert. The value comes from the left side metric, but by default all the labels will come from the right side metric and you'll have to explicitly pull in all of the left side labels you may care about for generating alert messages.
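
Here is a sketch of what that looks like for a disk space alert shaped like the one described above. node_filesystem_avail_bytes is the node exporter's metric; fs_alert_level_bytes, with one series per instance, mountpoint, and recipient, is a hypothetical stand-in for wherever the alert levels come from, and the labels listed in group_right() are only examples of left side labels an alert message might want.

```promql
# Left side: the 'one' metric, assumed to have one series per filesystem.
# Its value is what the comparison passes through to the alert.
# Right side: the 'many' metric, with several alert levels per filesystem
# (hypothetical metric name and 'recipient' label).
# By default the result carries only the right side's labels, so any left
# side labels wanted in the alert (here device and fstype) must be named.
node_filesystem_avail_bytes
  < on (instance, mountpoint) group_right(device, fstype)
    fs_alert_level_bytes
```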

(If you're using group_*() in order to pull in extra labels from the right hand side of a one to one match, this is why you want to use group_left() instead of group_right(); it automatically preserves all of the labels of your left side metric.)
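
A common instance of this is joining against an 'info' style metric. As a sketch, assuming the node exporter's node_uname_info metric (which has a value of 1 and a 'nodename' label) and one such series per instance:

```promql
# One-to-one match on instance. The left side's labels are all kept
# automatically (apart from the metric name, which arithmetic drops), and
# nodename is the only thing taken from the right side. Multiplying by
# node_uname_info's value of 1 leaves the left side's value unchanged.
node_load1
  * on (instance) group_left(nodename)
    node_uname_info
```

Written with group_right() instead, every label of node_load1 that mattered would have to be listed by hand.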

Pulling in labels from the 'one' metric provides an opportunity to make an interesting mistake, which I've made in our Bind DNS SOA serial alert. Suppose that you start off with the wrong group_*() operator, but it works because you currently only have one set of metrics, from your one DNS resolver running Bind. In this case, the labels will be wrong, so you'll stick them in from the other side:

bind_zone_serial{..} != on (zone_name)
   group_right(host, instance, ...)
     bind_zone_serial{ host="primary", view="internal" }

When you bring up your second DNS resolver running Bind, this will give you the error from above, and you may react by switching to the other group_*() operator. This will give you a different error:

Error executing query: multiple matches for labels: grouping labels must ensure unique matches.

This error is happening because you overwrote the unique labels from the 'many' side with labels from the 'one' side, which after many to one matching aren't necessarily unique any more. If both of your DNS resolvers have the wrong SOA for some zone (or you flipped the '!=' to '==' to test the alert), this gives you non-unique labels in the time series generated. This error took me some time to understand when I made it.
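
To make the collision concrete, here is a sketch of the switched-over expression and what it produces. The host!="primary" selector and the ns2 and ns3 host names are hypothetical, standing in for the elided parts of the original expression, and the sketch assumes 'host' and 'instance' are the only labels that tell the secondaries apart.

```promql
# group_left(host, instance) overwrites each secondary's own host and
# instance labels with the primary's. If hypothetical secondaries ns2 and
# ns3 both have a stale serial for example.org, both result series end up as
#   {zone_name="example.org", host="primary", instance=<the primary's>, view="internal"}
# and the query fails with the "multiple matches for labels" error.
bind_zone_serial{host!="primary", view="internal"}
  != on (zone_name) group_left(host, instance)
    bind_zone_serial{host="primary", view="internal"}
```

Dropping 'host' and 'instance' from the group_left() (leaving, say, group_left(notpresent)) lets each result series keep its own secondary's labels, which are unique.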

This is also why the labels come from the 'many' side, instead of always coming from the left side, like the value. Only the 'many' side is guaranteed to produce unique labels across all of the series produced.

sysadmin/PrometheusGroupLeftAndRightNotes written at 22:50:21


