Counting the number of distinct labels in a Prometheus metric
Suppose, not hypothetically, that you're collecting Prometheus metrics on your several VPN servers, including a per-user count of sessions on each server. The resulting metric looks like this:
vpn_user_sessions{ user="cks", server="vpn1", ... } 1
vpn_user_sessions{ user="fred", server="vpn1", ... } 1
vpn_user_sessions{ user="cks", server="vpn2", ... } 1
We would like to know how many different users are currently connected
across our entire collection of VPN servers. As we see here, the
same user may be connected to multiple VPN servers for whatever
reason, including that different devices prefer to use different
VPN software (such as L2TP or OpenVPN). In Prometheus terms, we
want to count the number of distinct label values in vpn_user_sessions
for the 'user' label, which I will shorten to the number of
distinct labels.
To do this, our first step is to somehow reduce this down to something
with one metric point per user, with no other labels. Throwing
away labels is done with the 'by (...)' modifier to PromQL
aggregation operators.
For our purposes we can use any of the straightforward operators
such as sum, min, or max; I'll use sum. Using 'sum(...) by (user)'
will produce a series like this:
{ user="cks" } 2
{ user="fred" } 1
Having generated this new vector, we simply count how many elements
are in it with count(). The final expression is:
count( sum( vpn_user_sessions ) by (user) )
This will give us the number of different users that are connected right now.
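To make the two steps concrete, here is a minimal Python sketch of what the query does to the sample data above. It is only a model of the logic, not how Prometheus evaluates anything; the `samples` list and the `sum_by` and `count` helpers are hypothetical names invented for illustration.

```python
from collections import defaultdict

# Hypothetical stand-in for the instant vector: (labels, value) pairs
# mirroring the three sample series shown earlier.
samples = [
    ({"user": "cks", "server": "vpn1"}, 1),
    ({"user": "fred", "server": "vpn1"}, 1),
    ({"user": "cks", "server": "vpn2"}, 1),
]


def sum_by(vector, label):
    """Model of 'sum(...) by (label)': keep one label, add up the values."""
    totals = defaultdict(int)
    for labels, value in vector:
        # A series without the label is grouped under the empty value.
        totals[labels.get(label, "")] += value
    return [({label: key}, total) for key, total in totals.items()]


def count(vector):
    """Model of 'count(...)': how many elements the vector has."""
    return len(vector)


per_user = sum_by(samples, "user")   # one element per distinct user
distinct_users = count(per_user)     # number of distinct 'user' values
```

The summed values are thrown away at the end; all that matters is that the aggregation leaves exactly one element per distinct 'user' value.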
Next, suppose that we want to know how many different users have used our VPNs over some span of time, such as the past day. To do this in the most straightforward way, we'll start by aggregating our time span down to something that has an element (with a full set of labels) if the user was connected to a particular VPN server at some point in the time span. Since we don't care about the values, we can use any reasonable <aggregation>_over_time function, such as 'min':
min_over_time( vpn_user_sessions[24h] )
(The choice of aggregation to use is relatively arbitrary; we're using it to sweep up all of the different sets of labels that have appeared in the last 24 hours, not for its output value. Min does this and is simple to compute.)
This gives us an instant vector that we can then process in the
same way as we did with vpn_user_sessions when we generated our
number of currently connected users; we aggregate it to get rid of
all labels other than 'user', and then we count how many distinct
elements we have. The resulting query is:
count( sum( min_over_time( vpn_user_sessions[24h] ) ) by (user) )
This is not the only way to create a query that does this, but it's the simplest and probably also the best performing.
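As a rough illustration of why the over-time version works, the following Python sketch models a range vector as a list of samples per series. The data, the extra user "barney", and the helper name `min_over_time` are all made up for the example; the point is only that every series with at least one sample in the window survives, whatever its values were.

```python
# Hypothetical stand-in for vpn_user_sessions[24h]: for each series,
# the sample values that fall inside the window.
range_vector = [
    ({"user": "cks", "server": "vpn1"}, [1, 1, 2]),
    ({"user": "fred", "server": "vpn1"}, [1]),
    ({"user": "cks", "server": "vpn2"}, [1, 1]),
    ({"user": "barney", "server": "vpn2"}, []),  # no samples in the window
]


def min_over_time(rv):
    """Model of min_over_time(): one element per series that has samples."""
    return [(labels, min(values)) for labels, values in rv if values]


instant = min_over_time(range_vector)

# Aggregating by 'user' and counting the result amounts to counting the
# distinct 'user' values, which a set models directly.
users_in_window = {labels.get("user", "") for labels, _ in instant}
distinct_users_over_window = len(users_in_window)
```

Swapping `min` for `max` or any other aggregation in this model would change the values but not which series survive, which is why the choice of <aggregation>_over_time function is arbitrary here.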
(I initially wrote a 'how many different users over time' query that didn't produce correct numbers, which I didn't realize until I tested it, and then my next attempt used a subquery and some brute force. It wasn't until I sat down to systematically work out what I wanted and how to get there that I came up with these current versions. This is a valuable learning experience; whenever I'm faced with a complex PromQL query situation, I shouldn't just guess, I should tackle the problem systematically, building up the solution in steps and verifying each one interactively.)
PS: It's possible that this trick is either well known or obvious, but if so I couldn't find it in my initial Internet searches before I started flailing around writing my own queries.