Counting the number of distinct labels in a Prometheus metric

November 30, 2019

Suppose, not hypothetically, that you're collecting Prometheus metrics on your several VPN servers, including a per-user count of sessions on each server. The resulting metric looks like this:

vpn_user_sessions{ user="cks", server="vpn1", ... }  1
vpn_user_sessions{ user="fred", server="vpn1", ... } 1
vpn_user_sessions{ user="cks", server="vpn2", ... }  1

We would like to know how many different users are currently connected across our entire collection of VPN servers. As we see here, the same user may be connected to multiple VPN servers for whatever reason, including that different devices prefer to use different VPN software (such as L2TP or OpenVPN). In Prometheus terms, we want to count the number of distinct label values in vpn_user_sessions for the 'user' label, which I will shorten to the number of distinct labels.

To do this, our first step is to somehow reduce this down to something with one metric point per user, with no other labels. Throwing away labels is done with the 'by (...)' modifier to PromQL aggregation operators, which keeps only the labels you list. For our purposes we can use any of the straightforward operators such as sum, min, or max; I'll use sum. Using 'sum(...) by (user)' will produce an instant vector like this:

{ user="cks" }  2
{ user="fred" } 1

Having generated this new vector, we simply count how many elements are in it with count(). The final expression is:

count( sum( vpn_user_sessions ) by (user) )

This will give us the number of different users that are connected right now.
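To make it concrete what the two steps do, here is a small Python sketch that models them on the sample data above. It is only a rough model of the PromQL semantics, not how Prometheus implements them; the 'samples' list and the helper names 'sum_by' and 'count' are made up for illustration.

```python
from collections import defaultdict

# Hypothetical stand-in for the vpn_user_sessions instant vector:
# each element is (labels, value).
samples = [
    ({"user": "cks", "server": "vpn1"}, 1),
    ({"user": "fred", "server": "vpn1"}, 1),
    ({"user": "cks", "server": "vpn2"}, 1),
]

def sum_by(vector, label):
    """Rough model of 'sum(...) by (label)': drop every other label
    and add up the values of elements that share the same label value."""
    out = defaultdict(int)
    for labels, value in vector:
        # An element without the label is grouped under the empty value.
        out[labels.get(label, "")] += value
    return dict(out)

def count(vector):
    """Rough model of 'count(...)': how many elements the vector has."""
    return len(vector)

per_user = sum_by(samples, "user")   # one element per distinct user
distinct_users = count(per_user)     # number of distinct users
print(per_user, distinct_users)
```

The values that sum_by() produces are thrown away by count(); all that matters is that each user winds up with exactly one element, which is why min or max would work just as well as sum.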

Next, suppose that we want to know how many different users have used our VPNs over some span of time, such as the past day. To do this in the most straightforward way, we'll start by aggregating our time span down to something that has an element (with a full set of labels) if the user was connected to a particular VPN server at some point in the time span. Since we don't care about the values, we can use any reasonable <aggregation>_over_time function, such as 'min':

min_over_time( vpn_user_sessions[24h] )

(The choice of aggregation to use is relatively arbitrary; we're using it to sweep up all of the different sets of labels that have appeared in the last 24 hours, not for its output value. Min does this and is simple to compute.)

This gives us an instant vector that we can then process in the same way as we did with vpn_user_sessions when we generated our number of currently connected users; we aggregate it to get rid of all labels other than 'user', and then we count how many distinct elements we have. The resulting query is:

count(
   sum(
        min_over_time( vpn_user_sessions[24h] ) 
   ) by (user)
)

This is not the only way to create a query that does this, but it's the simplest and probably also the best performing.
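The same sort of rough Python model shows why the over-time version works. Here each time series is a list of (timestamp, value) points, and the model of min_over_time() produces an element for every series that has at least one point in the window, no matter what the values are. The 'history' data, the extra user 'pat', the timestamps, and the helper names are all invented for illustration, and the exact handling of the window edges is an assumption, not a statement about Prometheus.

```python
from collections import defaultdict

# Hypothetical scrape history: (user, server) -> [(timestamp, value), ...]
history = {
    ("cks", "vpn1"): [(100, 1), (160, 2)],
    ("fred", "vpn1"): [(100, 1)],
    ("cks", "vpn2"): [(160, 1)],
    ("pat", "vpn2"): [(10, 1)],   # only seen before the window starts
}

def min_over_time(series, start, end):
    """Rough model of 'min_over_time(metric[range])': one element per
    series that has any point in the window, valued at the minimum."""
    out = {}
    for key, points in series.items():
        in_window = [v for (t, v) in points if start < t <= end]
        if in_window:
            out[key] = min(in_window)
    return out

def sum_by_user(vector):
    """Rough model of 'sum(...) by (user)' for (user, server) keys."""
    out = defaultdict(int)
    for (user, _server), value in vector.items():
        out[user] += value
    return dict(out)

now, window = 200, 150
seen = min_over_time(history, now - window, now)  # series active in window
per_user = sum_by_user(seen)                      # one element per user
distinct_users = len(per_user)                    # model of count(...)
print(sorted(per_user), distinct_users)
```

In this model 'pat' drops out because that series has no points inside the window, while 'cks' is counted once despite showing up on two servers.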

(I initially wrote a 'how many different users over time' query that didn't produce correct numbers, which I didn't realize until I tested it, and then my next attempt used a subquery and some brute force. It wasn't until I sat down to systematically work out what I wanted and how to get there that I came up with these current versions. This is a valuable learning experience; whenever I'm faced with a complex PromQL query situation, I shouldn't just guess, I should tackle the problem systematically, building up the solution in steps and verifying each one interactively.)

PS: It's possible that this trick is either well known or obvious, but if so I couldn't find it in my initial Internet searches before I started flailing around writing my own queries.

Written on 30 November 2019.

Last modified: Sat Nov 30 00:07:34 2019