2019-12-02
You can have Grafana tables with multiple values for a single metric (with Prometheus)
Every so often, the most straightforward way to show some information in a Grafana dashboard is with a table, for example to list how long it is before TLS certificates expire, how frequently people are using your VPN servers, or how much disk space they're using. However, sometimes you want to present the underlying information in more than one way; for example, you might want to list both how many days until a TLS certificate expires and the date on which it will expire. The good news is that Grafana tables can do this, because Grafana will merge query results with identical Prometheus label sets (more or less).
(There's a gotcha with this that we will come to.)
In a normal Grafana table, your column fields are the labels of the metric and a 'Value' field that is whatever computed value your PromQL query returned. When you have several queries, the single 'Value' field turns into, eg, 'Value #A', 'Value #B', and so on, and all of them can be displayed in the table (and given more useful names and perhaps different formatting, so Grafana knows that one is a time in seconds and another is a 0.0 to 1.0 percentage). If the Prometheus queries return the same label sets, every result with the same set of labels will get merged into a single row in the table, with all of the 'Value #<X>' fields having values. If not all sets of labels show up in all queries, the missing results will generally be shown as '-'.
(Note that what matters for merging is not what fields you display, but all of the fields. Grafana will not merge rows just because your displayed fields have the same values.)
The easiest way to get your label sets to be the same is to do the same query, just with different math applied to the query's value. You can do this to present TLS expiry as a duration and an absolute time, or usage over time as both a percentage and an amount of time (as seen in counting usage over time).
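As a sketch of this, suppose your TLS expiry information comes from a blackbox_exporter style metric, probe_ssl_earliest_cert_expiry, that holds the expiry time as a Unix timestamp (the metric name here is an assumption; substitute your own). Both queries apply a little arithmetic, which conveniently means neither result carries a __name__ label (the reason this matters comes up shortly):
(probe_ssl_earliest_cert_expiry - time()) / 86400
probe_ssl_earliest_cert_expiry + 0
The first query is the days until expiry and the second is the expiry as an absolute timestamp. Since both come from the same series with the same labels, Grafana merges them into one row per certificate, with two 'Value' columns that you can rename and format separately.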
A more advanced version is to do different queries while making sure that they return the same labels, possibly by restricting what labels are returned with 'by (...)' and similar operators (as sort of covered in this entry).
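For instance, here is a sketch of two different queries forced down to the same label set with 'by (user)'; vpn_user_sessions appears in the next entry, while vpn_user_bytes_total is a made-up companion metric for illustration:
sum(vpn_user_sessions) by (user)
sum(rate(vpn_user_bytes_total[5m])) by (user)
Both aggregations return results whose only label is 'user' (and no __name__), so each user's two values land in the same table row.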
When you're doing different queries of different metrics, an important gotcha comes up. When you do simple queries, Prometheus and Grafana acting together add a __name__ label field with the name of the metric involved. You're probably not displaying this field, but its mere presence with a different value will block field merging. To get rid of it, you have various options, such as adding '+ 0' to the query or using some operator or function (as seen in the comments of this Grafana pull request and this Grafana issue).
Conveniently, if you use 'by (...)' with an operator to get rid of some normal labels, you'll get rid of __name__ as well.
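You can see the effect of the '+ 0' trick directly in the Prometheus query dashboard. Using the slurm_node_state metric from the next entry as an example, the first query returns series that still carry __name__, while the second returns the same values with it stripped:
slurm_node_state{ state="alloc" }
slurm_node_state{ state="alloc" } + 0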
All of this only works if you want to display two values for the same set of labels. If you want to pull in labels from multiple metrics, you need to do the merging in your PromQL query, generally using the usual tricks to pull in labels from other metrics.
(I'm writing this all down because I wound up doing this recently and I want to capture what I learned before I forget how to do it.)
Calculating usage over time in Prometheus (and Grafana)
Suppose, not hypothetically, that you have a metric that says whether something is in use at a particular moment in time, such as a SLURM compute node or a user's VPN connection, and you would like to know how used it is over some time range. Prometheus can do this, but you may need to get a little clever.
The simplest case is when your metric is 1 if the thing is in use and 0 if it isn't, and the metric is always present. Then you can compute the percentage of use over a time range as a 0.0 to 1.0 value by averaging it over the time range, and then get the amount of time (in seconds) it was in use by multiplying that by the duration of the range (in seconds):
avg_over_time( slurm_node_available[$__range] )
avg_over_time( slurm_node_available[$__range] ) * $__range_s
(Here $__range is the variable Grafana uses for the time range in some format for Prometheus, which has values such as '1d', and $__range_s is the Grafana variable for the time range in seconds.)
But suppose that instead of being 0 when the thing isn't in use, the metric is absent. For instance, you have metrics for SLURM node states that look like this:
slurm_node_state{ node="cpunode1", state="idle" } 1 slurm_node_state{ node="cpunode2", state="alloc" } 1 slurm_node_state{ node="cpunode3", state="drain" } 1
We want to calculate what percentage of the time a node is in the 'alloc' state. Because the metric may be missing some of the time, we can't just average it out over time any more; the average of a bunch of 1's and a bunch of missing metrics is 1. The simplest approach is to use a subquery, like this:
sum_over_time( slurm_node_state{ state="alloc" }[$__range:1m] ) / ($__range_s / 60)
The reason we're using a subquery instead of simply a time range is so that we can control how many sample points there are over the time range, which gives us our divisor to determine the average. The relationship here is that we explicitly specify the subquery range step (here 1 minute aka 60 seconds) and then we divide the total range duration by that range step. If you change the range step, you also have to change the divisor or you'll get wrong numbers, as I have experienced the hard way when I was absent-minded and didn't think this one through.
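For example, if you switched to a five-minute range step, the divisor has to become the number of five-minute steps in the range (300 being five minutes in seconds). A sketch of the same query with that change:
sum_over_time( slurm_node_state{ state="alloc" }[$__range:5m] ) / ($__range_s / 300)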
If we want to know the total time in seconds that a node was allocated, we would multiply by the range step in seconds instead of dividing:
sum_over_time( slurm_node_state{ state="alloc" }[$__range:1m] ) * 60
Now let's suppose that we have a more complicated metric that isn't always 1 when the thing is active but that's still absent entirely when there's no activity (instead of being 0). As an example, I'll use the count of connections a user has to one of our VPN servers, which has a set of metrics like this:
vpn_user_sessions{ server="vpn1", user="cks" } 1 vpn_user_sessions{ server="vpn2", user="cks" } 2 vpn_user_sessions{ server="vpn1", user="fred" } 1
We want to work out the percentage of time or amount of time that any particular user has at least one connection to at least one VPN server. To do this, we need to start with a PromQL expression that is 1 when this condition is true. We'll use the same basic trick for crushing multiple metric points down to one that I covered in counting the number of distinct labels:
sum(vpn_user_sessions) by (user) > bool 0
The '> bool 0' turns any count of current sessions into 1. If the user has no sessions at the moment to any VPN servers, the metric will still be missing (and we can't get around that), so we still need to use a subquery to put this all together to get the percentage of usage:
sum_over_time( (sum(vpn_user_sessions) by (user) > bool 0)[$__range:1m] ) / ($__range_s / 60)
As before, if we want to know the amount of time in seconds that a user has had at least one VPN connection, we would multiply by 60 instead of doing the division. Also as before, the range step and the '60' in the division (or multiplication) are locked together; if you change the range step, you must change the other side of things.
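Spelled out, that time-in-seconds version of the query is:
sum_over_time( (sum(vpn_user_sessions) by (user) > bool 0)[$__range:1m] ) * 60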
Sidebar: A subquery trick that doesn't work (and why)
On the surface, it seems like we could get away from the need to do our complicated division by using a more complicated subquery to supply a default value. You could imagine something like this:
avg_over_time( ( slurm_node_state{ state="alloc" } or vector(0) )[$__range:] )
However, this doesn't work. If you try it interactively in the Prometheus query dashboard, you will probably see that you get a bunch of the metrics that you expect, which all have the value 1, and then one unusual one:
{} 0
The reason that 'or vector(0)' doesn't work is that we're asking Prometheus to be superintelligent, and it isn't. What we get with 'vector(0)' is a vector with a value of 0 and no labels. What we actually want is a collection of vectors with all of the valid labels that we don't already have as allocated nodes, and Prometheus can't magically generate that for us for all sorts of good reasons.