Calculating usage over time in Prometheus (and Grafana)
Suppose, not hypothetically, that you have a metric that says whether something is in use at a particular moment in time, such as a SLURM compute node or a user's VPN connection, and you would like to know how used it is over some time range. Prometheus can do this, but you may need to get a little clever.
The simplest case is when your metric is 1 if the thing is in use and 0 if it isn't, and the metric is always present. Then you can compute the percentage of use over a time range as a 0.0 to 1.0 value by averaging it over the time range, and then get the amount of time (in seconds) it was in use by multiplying that by the duration of the range (in seconds):
avg_over_time( slurm_node_available[$__range] )
avg_over_time( slurm_node_available[$__range] ) * $__range_s
(Here $__range is the variable Grafana uses for the time range in a format Prometheus understands, with values such as '1d', and $__range_s is the Grafana variable for the time range in seconds.)
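To make the arithmetic concrete, here's a toy Python sketch of what averaging a 0/1 metric does; the sample values and the once-a-minute sampling interval are made up for illustration:

```python
# Toy illustration (not Prometheus itself): the average of a 0/1 "in use"
# metric over a window is the fraction of time the thing was in use, and
# multiplying that by the window's length in seconds gives the in-use time.
samples = [1, 1, 1, 0, 0, 1, 0, 0]  # hypothetical once-a-minute samples
range_s = len(samples) * 60          # an 8-minute range, in seconds

use_fraction = sum(samples) / len(samples)  # avg_over_time(...)
use_seconds = use_fraction * range_s        # ... * $__range_s

print(use_fraction)  # 0.5
print(use_seconds)   # 240.0
```

The key point is that this only works because an unused node contributes an explicit 0 to the average, which stops being true in the next case.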
But suppose that instead of being 0 when the thing isn't in use, the metric is absent. For instance, you have metrics for SLURM node states that look like this:
slurm_node_state{ node="cpunode1", state="idle" }   1
slurm_node_state{ node="cpunode2", state="alloc" }  1
slurm_node_state{ node="cpunode3", state="drain" }  1
We want to calculate what percentage of the time a node is in the 'alloc' state. Because the metric may be missing some of the time, we can't just average it out over time any more; the average of a bunch of 1's and a bunch of missing metrics is 1. The simplest approach is to use a subquery, like this:
sum_over_time( slurm_node_state{ state="alloc" }[$__range:1m] ) / ($__range_s / 60)
The reason we're using a subquery instead of a simple time range is so that we can control how many sample points there are over the time range, which gives us our divisor for computing the average. The relationship here is that we explicitly specify the subquery range step (here 1 minute, aka 60 seconds) and then we divide the total range duration by that range step. If you change the range step, you also have to change the divisor or you'll get wrong numbers, as I have experienced the hard way when I was absentminded and didn't think this one through.
If we want to know the total time in seconds that a node was allocated, we would multiply by the range step in seconds instead of dividing:
sum_over_time( slurm_node_state{ state="alloc" }[$__range:1m] ) * 60
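Here's a toy Python sketch of why the range step and the divisor (or multiplier) are locked together; the numbers are made up, assuming a one-hour range with a one-minute step:

```python
# Toy sketch of the subquery arithmetic. Suppose the metric exists
# (with value 1) for 30 of the 60 one-minute steps in a one-hour range
# and is absent for the other 30 steps.
range_s = 3600        # $__range_s for a 1h range
step_s = 60           # the subquery range step, '1m'
present_steps = 30    # steps where the metric existed with value 1

total = present_steps * 1               # sum_over_time(...[1h:1m])
fraction = total / (range_s / step_s)   # divide by the number of steps
seconds = total * step_s                # or multiply by the step instead

print(fraction)  # 0.5
print(seconds)   # 1800
```

If you changed the step to '30s' but kept dividing by (range_s / 60), the sum would double while the divisor stayed the same, silently doubling your answer.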
Now let's suppose that we have a more complicated metric that isn't always 1 when the thing is active but that's still absent entirely when there's no activity (instead of being 0). As an example, I'll use the count of connections a user has to one of our VPN servers, which has a set of metrics like this:
vpn_user_sessions{ server="vpn1", user="cks" }   1
vpn_user_sessions{ server="vpn2", user="cks" }   2
vpn_user_sessions{ server="vpn1", user="fred" }  1
We want to work out the percentage of time or amount of time that any particular user has at least one connection to at least one VPN server. To do this, we need to start with a PromQL expression that is 1 when this condition is true. We'll use the same basic trick for crushing multiple metric points down to one that I covered in counting the number of distinct labels:
sum(vpn_user_sessions) by (user) > bool 0
The '> bool 0' turns any count of current sessions into 1. If the user has no sessions at the moment on any VPN server, the metric will still be missing (and we can't get around that), so we still need to use a subquery to put this all together and get the percentage of usage:
sum_over_time( (sum(vpn_user_sessions) by (user) > bool 0)[$__range:1m] ) / ($__range_s / 60)
As before, if we want to know the amount of time in seconds that a user has had at least one VPN connection, we would multiply by 60 instead of doing the division. Also as before, the range step and the '60' in the division (or multiplication) are locked together; if you change the range step, you must change the other side of things.
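As a toy Python sketch of the crushing step (metric values are made up, matching the example listing above), the 'sum(...) by (user) > bool 0' part works out like this:

```python
# Toy sketch of "sum(vpn_user_sessions) by (user) > bool 0": sum each
# user's sessions across all servers, then crush any nonzero total to 1.
sessions = [
    {"server": "vpn1", "user": "cks", "value": 1},
    {"server": "vpn2", "user": "cks", "value": 2},
    {"server": "vpn1", "user": "fred", "value": 1},
]

# sum(vpn_user_sessions) by (user)
totals = {}
for s in sessions:
    totals[s["user"]] = totals.get(s["user"], 0) + s["value"]

# ... > bool 0
connected = {user: int(total > 0) for user, total in totals.items()}
print(connected)  # {'cks': 1, 'fred': 1}
```

Note that a user with no sessions anywhere doesn't appear in the input at all, which is exactly the absent-metric problem that forces the subquery.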
Sidebar: A subquery trick that doesn't work (and why)
On the surface, it seems like we could get away from the need to do our complicated division by using a more complicated subquery to supply a default value. You could imagine something like this:
avg_over_time( ( slurm_node_state{ state="alloc" } or vector(0) )[$__range:] )
However, this doesn't work. If you try it interactively in the Prometheus query dashboard, you will probably see that you get a bunch of the metrics that you expect, which all have the value 1, and then one unusual one:
{} 0
The reason that 'or vector(0)' doesn't work is that we're asking Prometheus to be superintelligent, and it isn't. What we get with 'vector(0)' is a vector with a value of 0 and no labels. What we actually want is a collection of vectors with all of the valid labels that we don't already have as allocated nodes, and Prometheus can't magically generate that for us, for all sorts of good reasons.

