Using group_* vector matching in Prometheus for database lookups

October 22, 2018

On Mastodon, I said:

Current status: writing a Prometheus expression involving 'group_left (sendto) ...' and cackling maniacally.

Boy am I abusing metrics as a source of facts and configuration information, but it's going to beat writing and maintaining a bunch of Prometheus alert rules for people.

(If a system gives me an awkward hammer as my only tool, why yes, I will hit everything with it. Somehow.)

There are many things bundled up in this single toot, but today I'm going to write down the details of what I'm doing in my PromQL before I forget them, because it involves some tricks and hacks (including my use of group_left).

Suppose, not hypothetically, that you have quite a lot of ZFS filesystems and pools and that you want to generate alerts when they start running low on disk space. We start out with a bunch of metrics on the currently available disk space that look like this:

our_zfs_avail_gb{ pool="tank", fs="/h/281", type="fs" } 35.1
our_zfs_avail_gb{ pool="tank", fs="tank", type="pool" } 500.8

(In real life you would use units of bytes, not fractional GB, but I'm changing it to avoid having to use giant numbers. Also, this is an incomplete set of metrics; I'm just including enough for this entry.)

If life was very simple, we could write an alert rule expression for our space alerts that looked like this:

our_zfs_avail_gb < 10

The first problem with this is that we might find that space usage was oscillating right around our alert point. We want to smooth that out, and while there are probably many ways of doing that, I'll go with the simple approach of looking at the average space usage over the last 15 minutes:

avg_over_time(our_zfs_avail_gb [15m]) < 10

In PromQL, avg_over_time is one of the family of X_over_time functions that do their operation over a time range to give you a single value for each time series.
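The other members of the family work the same way, and which one you pick changes what you're alerting on. For instance, with max_over_time the alert would only fire if the available space never rose above the limit for the whole window, while min_over_time would fire if it dipped below the limit even once:

max_over_time(our_zfs_avail_gb [15m]) < 10

(avg_over_time is a middle ground between the two, which is part of why I picked it.)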

If life was simple, we could stop now. Unfortunately, not only do we have a wide variety of ZFS filesystems, but they're owned by a wide variety of people, who are the ones who should be notified when space is low because they're the only ones who can do anything about it. These people also have widely varying opinions about what level of free space is low enough to be alertable on. In other words, we need to parameterize both our alert level and who gets notified on a per-filesystem basis.

In theory you could do this with a whole collection of Prometheus alerting rules, one for each combination of an owner and a set of filesystems with the same low space alert level. In practice this would be crazy to maintain by hand; you'd have to generate all of the alert rules from templates and external information and it would get very complicated very fast. Instead we can use brute force and the only good tool that Prometheus gives us for dynamic lookups, which is metrics.

We'll create a magic metrics sequence that encodes both the free space alert level and the owner of each filesystem. These metrics will look like this:

our_zfs_minfree_gb{ fs="/h/281", sendto="cks" }     50
our_zfs_minfree_gb{ fs="tank", sendto="sysadmins" } 200

These metrics can be pushed into Prometheus in various ways, for example by writing them into a text file for the Prometheus node exporter to pick up, or sent into a Pushgateway (which will persist them for us).
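For the textfile collector route, this is just a file of metrics in the standard Prometheus exposition format, dropped into the directory that the node exporter scans (the HELP and TYPE comment lines are optional but good practice; the filename here is made up):

# zfs_minfree.prom
# HELP our_zfs_minfree_gb Per-filesystem free space alert level in GB.
# TYPE our_zfs_minfree_gb gauge
our_zfs_minfree_gb{fs="/h/281",sendto="cks"} 50
our_zfs_minfree_gb{fs="tank",sendto="sysadmins"} 200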

So our starting point for a rule is the obvious (but non-working):

avg_over_time(our_zfs_avail_gb [15m]) < our_zfs_minfree_gb

If we tried this, we would get no results at all. This doesn't work because Prometheus normally requires completely matching labels across your expression (as described in the documentation for comparison binary operators and vector matching). These metrics don't have matching labels; even if they had no other labels that clashed (and in real life they will), our_zfs_avail_gb has the pool and type labels, while our_zfs_minfree_gb has the sendto label.

As I've learned the hard way, in any PromQL expression involving multiple metrics it's vital to understand what labels you have and where they might clash. It's very easy to write a query that returns no data because you have mis-matched labels (I've done it a lot as I've been learning to work with PromQL).

To work around this issue, we need to tell PromQL to do the equivalent of a database join on the fs label to pick out the matching our_zfs_minfree_gb value for a given filesystem. Since we're doing a comparison between (instant) vectors, this is done with the on modifier for vector matches:

avg_over_time(our_zfs_avail_gb [15m]) < on (fs) our_zfs_minfree_gb

If we apply this by itself (and /h/281 has been at its current usage for the whole of our 15 minute window), we will get a result that looks like this:

{ fs="/h/281" } 35.1

What has happened here is that Prometheus is sort of doing what we told it to do. We implicitly told it that fs was the only label that mattered to us by making it the label we cross-matched on, so it reduced the labels in the result down to that label.

This is not what we want. We want to carry all of the labels from our_zfs_avail_gb over to the output, so that our alerts can be summarized by pool and so on, and we need to pull in the sendto label from our_zfs_minfree_gb so that Alertmanager knows who to send them to. To do this, we abuse the group_left many-to-one vector matching operator.

The full expression is now (with a linebreak for clarity):

avg_over_time(our_zfs_avail_gb [15m]) <
  on (fs) group_left (sendto) our_zfs_minfree_gb

When we use group_left here, two things happen for us. First, all of the labels from the metric on the left side of the expression are included in the result, so we get all of the labels from our_zfs_avail_gb, including pool. Second, group_left also includes the label we listed from the right metric. The result is:

{ pool="tank", fs="/h/281", type="fs", sendto="cks" } 35.1

Strictly speaking, this is an abuse of group_left, because our left and our right metrics have the same cardinality. So let's talk about PromQL cardinality for a moment. When PromQL does vector matches in operations like <, it normally requires that each metric on the left match exactly one metric on the right; if more than one metric on a side matches, PromQL refuses to guess and gives you an error instead. The matching is done on the full set of labels by default. When you use on or without, you narrow the matching to happen only on those labels (or on everything except those labels), but PromQL still requires a one to one match.
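For contrast, a case where group_left is genuinely necessary is a real many to one match. For example, if we instead had a hypothetical per-pool threshold metric (our_zfs_pool_minfree_gb is made up here for illustration), we might write:

avg_over_time(our_zfs_avail_gb [15m]) <
  on (pool) group_left (sendto) our_zfs_pool_minfree_gb

Here many filesystems on the left share the same pool label and all match the single per-pool metric on the right, so the match really is many to one and PromQL insists that we say so explicitly with group_left.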

Since plain on worked for us, we had that one to one matching already. So we're using group_left only for its side effects of including extra labels, not because we need it for a many to one match. If we changed group_left to group_right, we would get the same set of matches and outputs, but the labels would change:

{ fs="/h/281", sendto="cks" } 35.1

This is because now the labels are coming from the right metric, augmented by any labels from the left metric added by group_*, which in this case doesn't include anything new. If we wanted to get the same results, we would have to include the left side labels we wanted to add:

avg_over_time(our_zfs_avail_gb [15m]) <
  on (fs) group_right (pool, type) our_zfs_minfree_gb

This would get us the same labels, although in a different order because group_* appends the extra labels they add on the end:

{ fs="/h/281", sendto="cks", pool="tank", type="fs" } 35.1
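As an aside, once you have a working expression it drops into a Prometheus alerting rule in the usual way. A sketch of what that might look like (the alert name, 'for:' duration, and annotation here are all made up for illustration):

groups:
  - name: zfs-space
    rules:
      - alert: ZfsSpaceLow
        expr: |
          avg_over_time(our_zfs_avail_gb [15m]) <
            on (fs) group_left (sendto) our_zfs_minfree_gb
        for: 5m
        annotations:
          summary: "{{ $labels.fs }} in pool {{ $labels.pool }} is low on space"

Since sendto comes through as a label on the resulting alert, Alertmanager's routing tree can match on it to decide who actually gets notified.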

Now, suppose that we didn't have the sendto label and we were using our_zfs_minfree_gb purely to set a per-filesystem level. However, we still want to carry over all of the labels from our_zfs_avail_gb into the output, so that they can be used by Alertmanager. Our quick first attempt at this would probably be:

avg_over_time(our_zfs_avail_gb [15m]) <
  on (fs) group_left (fs) our_zfs_minfree_gb

If we try this, PromQL will immediately give us an error message:

[...]: label "fs" must not occur in ON and GROUP clause at once

This restriction is documented but annoying. Fortunately we can get around it because the group_* operators don't require that their new label(s) actually exist. So we can just give them a label that isn't even in our metric and they're happy:

avg_over_time(our_zfs_avail_gb [15m]) <
  on (fs) group_left (bogus) our_zfs_minfree_gb

This will give us just the labels from the left:

{ pool="tank", fs="/h/281", type="fs" } 35.1

(If we wanted just the labels from the right we could use group_right instead.)

PS: In the expression that I've built up here, any filesystem without an our_zfs_minfree_gb metric will have no free space alert level; it can run right down to 0 bytes left and you'll get no alert about it. Fixing this in the PromQL expression is complicated for reasons beyond the scope of this entry, so in my opinion the best place to fix it is in the tools that generate and check your our_zfs_minfree_gb metrics from some data file in a more convenient format.
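As a sketch of what such a generator might look like, here is a minimal version that turns a simple 'fs owner min-gb' data file into exposition-format metrics. The data file format, the function name, and everything else here are invented for illustration:

```python
# Sketch: turn a simple data file of 'fs owner min-gb' lines into
# our_zfs_minfree_gb metrics in Prometheus exposition format.
# The input format and all names are made up for illustration.

def generate_minfree_metrics(lines):
    out = ["# TYPE our_zfs_minfree_gb gauge"]
    for lineno, line in enumerate(lines, 1):
        line = line.strip()
        # Skip blank lines and comments in the data file.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) != 3:
            raise ValueError("line %d: expected 'fs owner min-gb'" % lineno)
        fs, owner, gb = fields
        float(gb)  # check that the alert level is actually a number
        out.append('our_zfs_minfree_gb{fs="%s",sendto="%s"} %s' % (fs, owner, gb))
    return "\n".join(out) + "\n"

data = """\
# fs        owner      min-gb
/h/281      cks        50
tank        sysadmins  200
"""
print(generate_minfree_metrics(data.splitlines()), end="")
```

A companion check could cross-reference the data file against the list of actual filesystems and complain about any filesystem with no entry, which covers the missing-threshold gap outside of PromQL.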
