Calculating usage over time in Prometheus (and Grafana)

December 2, 2019

Suppose, not hypothetically, that you have a metric that says whether something is in use at a particular moment in time, such as a SLURM compute node or a user's VPN connection, and you would like to know how used it is over some time range. Prometheus can do this, but you may need to get a little clever.

The simplest case is when your metric is 1 if the thing is in use and 0 if it isn't, and the metric is always present. Then you can compute the percentage of use over a time range as a 0.0 to 1.0 value by averaging it over the time range, and then get the amount of time (in seconds) it was in use by multiplying that by the duration of the range (in seconds):

avg_over_time( slurm_node_available[$__range] )
avg_over_time( slurm_node_available[$__range] ) * $__range_s

(Here $__range is the variable Grafana uses for the time range in some format for Prometheus, which has values such as '1d', and $__range_s is the Grafana variable for the time range in seconds.)
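To see why this arithmetic works, here is a toy Python illustration (not Prometheus itself) of what avg_over_time does for an always-present 0/1 metric; the sample values and one-minute spacing are made up:

```python
# Hypothetical one-per-minute samples of a 0/1 "in use" metric over a
# ten-minute range. A 1 means in use at that sample point, 0 means not.
samples = [1, 1, 0, 1, 0, 0, 1, 1, 1, 1]

# avg_over_time: the mean of the samples is the fraction of time in use.
fraction_in_use = sum(samples) / len(samples)

# $__range_s for this range: ten one-minute samples, so 600 seconds.
range_seconds = len(samples) * 60

# Multiplying the fraction by the range duration gives seconds in use.
seconds_in_use = fraction_in_use * range_seconds

print(fraction_in_use)  # 0.7
print(seconds_in_use)   # 420.0
```

Seven of the ten samples are 1, so the thing was in use 70% of the time, which over a 600-second range is 420 seconds.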

But suppose that instead of being 0 when the thing isn't in use, the metric is absent. For instance, you have metrics for SLURM node states that look like this:

slurm_node_state{ node="cpunode1", state="idle" }   1
slurm_node_state{ node="cpunode2", state="alloc" }  1
slurm_node_state{ node="cpunode3", state="drain" }  1

We want to calculate what percentage of the time a node is in the 'alloc' state. Because the metric may be missing some of the time, we can't just average it out over time any more; the average of a bunch of 1's and a bunch of missing metrics is 1. The simplest approach is to use a subquery, like this:

sum_over_time( slurm_node_state{ state="alloc" }[$__range:1m] ) /
   ($__range_s / 60)

The reason we're using a subquery instead of simply a time range is so that we can control how many sample points there are over the time range, which gives us our divisor for computing the average. The relationship here is that we explicitly specify the subquery range step (here 1 minute, ie 60 seconds) and then divide the total range duration by that step to get the number of sample points. If you change the range step, you must also change the divisor or you will get wrong numbers, as I have experienced the hard way when I was absent-minded and didn't think this one through.
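The arithmetic can be sketched in Python; the numbers here are invented (a one-hour range, a one-minute step, and a node that was allocated for 45 of the 60 evaluation points):

```python
# Toy model of the subquery arithmetic: the metric is only present
# (with value 1) while the node is allocated, so we count the present
# samples and divide by the total number of subquery steps.
range_seconds = 3600   # $__range_s: a one-hour range
step_seconds = 60      # the subquery range step ('[...:1m]')

total_steps = range_seconds // step_seconds   # 60 evaluation points

# What sum_over_time() would return: the number of steps at which the
# 1-valued metric was actually present (hypothetical here).
present_samples = 45

fraction_alloc = present_samples / total_steps   # percentage of the time
seconds_alloc = present_samples * step_seconds   # total allocated seconds

print(fraction_alloc)  # 0.75
print(seconds_alloc)   # 2700
```

This also shows the two variants side by side: dividing by the step count gives the percentage, while multiplying the same sum by the step in seconds gives the total time.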

If we want to know the total time in seconds that a node was allocated, we would multiply by the range step in seconds instead of dividing:

sum_over_time( slurm_node_state{ state="alloc" }[$__range:1m] ) * 60

Now let's suppose that we have a more complicated metric that isn't always 1 when the thing is active but that's still absent entirely when there's no activity (instead of being 0). As an example, I'll use the count of connections a user has to one of our VPN servers, which has a set of metrics like this:

vpn_user_sessions{ server="vpn1", user="cks" }  1
vpn_user_sessions{ server="vpn2", user="cks" }  2
vpn_user_sessions{ server="vpn1", user="fred" } 1

We want to work out the percentage of time or amount of time that any particular user has at least one connection to at least one VPN server. To do this, we need to start with a PromQL expression that is 1 when this condition is true. We'll use the same basic trick for crushing multiple metric points down to one that I covered in counting the number of distinct labels:

sum(vpn_user_sessions) by (user) > bool 0

The '> bool 0' turns any count of current sessions into 1. If the user currently has no sessions to any VPN server, the metric will still be missing (and we can't get around that), so we still need a subquery to put this all together and get the percentage of usage:

sum_over_time(
   (sum(vpn_user_sessions) by (user) > bool 0)[$__range:1m]
) / ($__range_s / 60)

As before, if we want to know the amount of time in seconds that a user has had at least one VPN connection, we would multiply by 60 instead of doing the division. Also as before, the range step and the '60' in the division (or multiplication) are locked together; if you change one, you must change the other.
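The crushing-down step can be sketched as a rough Python analogue of 'sum(vpn_user_sessions) by (user) > bool 0'; the session counts are the invented values from above:

```python
# Hypothetical vpn_user_sessions metric points, one per (server, user).
sessions = [
    {"server": "vpn1", "user": "cks", "value": 1},
    {"server": "vpn2", "user": "cks", "value": 2},
    {"server": "vpn1", "user": "fred", "value": 1},
]

# 'sum(...) by (user)': add up session counts across servers per user.
totals = {}
for s in sessions:
    totals[s["user"]] = totals.get(s["user"], 0) + s["value"]

# '> bool 0': turn any nonzero total into 1 (user has at least one
# connection somewhere), which is what we average or sum over time.
connected = {user: int(total > 0) for user, total in totals.items()}

print(connected)  # {'cks': 1, 'fred': 1}
```

The point of the 'sum by (user)' step is that a user with connections to two VPN servers still comes out as a single series with value 1, not two separate series.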

Sidebar: A subquery trick that doesn't work (and why)

On the surface, it seems like we could get away from the need to do our complicated division by using a more complicated subquery to supply a default value. You could imagine something like this:

avg_over_time(
 ( slurm_node_state{ state="alloc" } or vector(0) )[$__range:]
)

However, this doesn't work. If you try it interactively in the Prometheus query dashboard, you will probably see that you get a bunch of the metrics that you expect, which all have the value 1, and then one unusual one:

{} 0

The reason that 'or vector(0)' doesn't work is that we're asking Prometheus to be superintelligent, and it isn't. What we get with 'vector(0)' is a vector with a value of 0 and no labels. What we actually want is a collection of vectors with all of the valid labels that we don't already have as allocated nodes, and Prometheus can't magically generate that for us for all sorts of good reasons.
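The label problem can be sketched in Python, modeling series as maps from label sets to values (the node name here is invented):

```python
# What the 'alloc' query returns: only nodes currently in that state.
alloc = {("node", "cpunode2"): 1}

# What vector(0) is: a single series with an EMPTY label set and value 0,
# not a 0 for every node that happens to be missing from 'alloc'.
default = {(): 0}

# 'or' keeps everything on the left, plus right-hand series whose label
# sets don't already appear on the left.
result = dict(alloc)
for labels, value in default.items():
    if labels not in result:
        result[labels] = value

print(result)  # {('node', 'cpunode2'): 1, (): 0}
```

To produce a 0 for each absent node, Prometheus would have to know the full set of label values that could exist but currently don't, and nothing in the query gives it that information.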
