Selecting metrics to gather mostly based on what we can use
Partly due to our situation with our L2TP VPN servers, I've set up a system to capture some basic usage metrics from our VPN servers. They don't directly provide metrics; instead we have to parse the output of things like OpenBSD's npppctl and use the information to generate various metrics from the raw output. As part of doing this, I had to figure out what metrics we wanted to generate.
I could have generated every single metric I can think of a way to get, but this is probably not what we want. We have to write custom code to do this, so the more metrics we gather the more code I have to write and the more complicated the program gets. Large, complicated programs are overhead, and however nifty it would be, we shouldn't pay that cost unless we're getting value from it. Writing a lot of code to generate metrics that we never look at and that just clutter up everything is not a good idea.
At the same time, I felt that I didn't want to go overboard with minimalism and put in only the metrics I could think of an immediate use for and that were easy to code. There's good reason to gather additional information (ie, generate additional metrics) when it's easy and potentially useful; I don't necessarily know what we'll be interested in in the future, or what questions we'll ask once we have data available.
In the end I started out with some initial ideas, then refined them as I wrote the code, looked at the generated metrics, thought about how I would use them, and so on. After having both thought about it in advance and seen how my initial ideas changed, I've wound up with some broad thoughts for future occasions (and some practical examples of them).
To start with, I should generate pretty much everything that's directly in the data I have, even if I can't think of an immediate use for it. For our VPN servers this is straightforward things like how many connections there are and the number of connected users (not necessarily the same thing, since a user may be connected from multiple devices). Our OpenVPN servers also provide some additional per-connection information, which I eventually decided I shouldn't just ignore. We may not use it, but the raw information is there and it's a waste to discard it rather than turn it into metrics.
I also generate some metrics which require additional processing because they seem valuable enough; for these, I should see a clear use, with them letting us answer some questions I'm pretty sure we're interested in. For our VPN servers the largest one (in terms of code) is a broad classification of the IP addresses that people are connecting from: whether it's from our wireless network, from our internal networks, from the outside world, or from a few other categories. This will give us at least some information about how people are using our VPN. I also generate the maximum number of connections a single IP address or user has. I could have made this a histogram of connections per IP or per user, but that seemed like far too much complexity for the likely value; for our purposes we probably mostly care about the maximum we see, especially since most of the time everyone only has one connection from one remote IP.
(Multiple connections from the same IP address but for different people can happen under various circumstances, as can multiple connections from the same IP for the same person. If the latter sounds odd to you, consider the case of someone with multiple devices at home that are all behind a single NAT gateway.)
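As a rough illustration of the address classification and the per-IP and per-user maximums, here is a minimal Python sketch. The network ranges, the category names, and the (user, remote IP) input shape are all assumptions made up for the example; the real categories and parsed data are site-specific.

```python
import ipaddress
from collections import Counter

# Hypothetical ranges, checked in order so that the more specific
# wireless range wins over the broader internal one.
CATEGORIES = [
    ("wireless", ipaddress.ip_network("10.10.0.0/16")),
    ("internal", ipaddress.ip_network("10.0.0.0/8")),
]


def classify(addr):
    """Return the first matching category name, or 'external'."""
    ip = ipaddress.ip_address(addr)
    for name, net in CATEGORIES:
        if ip in net:
            return name
    return "external"


def summarize(connections):
    """Summarize a list of (user, remote_ip) pairs.

    Reports connections per category plus the largest number of
    connections seen from any single IP and for any single user.
    """
    by_category = Counter(classify(ip) for _, ip in connections)
    per_ip = Counter(ip for _, ip in connections)
    per_user = Counter(user for user, _ in connections)
    return {
        "by_category": dict(by_category),
        "max_per_ip": max(per_ip.values(), default=0),
        "max_per_user": max(per_user.values(), default=0),
    }
```

Reporting only the maximums keeps this to two numbers instead of a full histogram, which matches the tradeoff described above.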
Potentially high value information is worth gathering even if it doesn't quite fit neatly into the mold of metrics or raises things like cardinality concerns in Prometheus. For our VPN metrics, I decided that I should generate a metric for every currently connected user so that we can readily track that information, save it, and potentially correlate it to other metrics. Initially I was going to make the value of this metric be '1', but then I realized that was silly; I could just as well make it the number of current connections the user has (which will always be at least 1).
(In theory this is a potentially high cardinality metric. In practice we don't have that many people who use our VPNs, and I've discovered that sometimes high cardinality metrics are worth it. While the information is already in log files, extracting it from Prometheus is orders of magnitude easier.)
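A per-user metric whose value is the connection count could be rendered in the Prometheus text exposition format roughly as follows. This is only a sketch: the metric name `vpn_user_connections`, the `server` and `user` labels, and the input shape are hypothetical, and it assumes usernames contain no characters that need escaping in label values.

```python
from collections import Counter


def per_user_metric(connections, server):
    """Render one gauge sample per connected user.

    connections is a list of (user, remote_ip) pairs; each sample's
    value is that user's current connection count, so it is always
    at least 1 for any user that appears.
    """
    counts = Counter(user for user, _ in connections)
    lines = ["# TYPE vpn_user_connections gauge"]
    for user, n in sorted(counts.items()):
        lines.append(
            f'vpn_user_connections{{server="{server}",user="{user}"}} {n}'
        )
    return "\n".join(lines) + "\n"
```

Users who are not connected simply have no sample, so the set of series at any moment is also the set of connected users.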
In general I should keep metrics de-aggregated as much as is reasonable. At the same time, some things can be worth providing in pre-aggregated versions. For instance, in theory I don't need to have separate metrics for the number of connections and the number of connected users, because they can both be calculated from the per-user connection count metric. In practice, having to do that computation every time is annoying and puts extra load on Prometheus.
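To make the redundancy concrete: given a per-user connection count, both totals fall out directly. In PromQL terms this would be something like `sum(vpn_user_connections)` and `count(vpn_user_connections)` over a hypothetical per-user gauge of that name; the Python sketch below does the same computation on a plain dictionary, which is an assumed stand-in for the parsed data.

```python
def aggregates(per_user):
    """Derive the two pre-aggregated totals from per-user counts.

    per_user maps username -> current number of connections.
    """
    return {
        "connections": sum(per_user.values()),  # total connections
        "users": len(per_user),                 # distinct connected users
    }
```

Computing these once in the generator and exporting them as their own metrics avoids repeating the aggregation in every query.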
However, I also need to think about whether aggregating things together actually makes sense. For instance, OpenVPN provides per-connection information on the amount of data sent and received over that connection. It looks tempting to pre-aggregate this into a 'total sent/received by VPN server' metric, but that's dangerous because connections come and go; when a connection closes, its byte counts drop out of the sum, so our aggregated metric would bounce around in a completely artificial way. Aggregating by user is potentially dangerous for the same reason, but we have to do something to get stable and useful flow identifiers and most of the time a user only has one connection.
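A small worked example of why the server-wide total misbehaves, using made-up numbers and hypothetical field names (`user`, `bytes_sent`) for the per-connection data:

```python
from collections import defaultdict


def total_bytes(conns):
    """Sum bytes_sent over whatever connections exist right now."""
    return sum(c["bytes_sent"] for c in conns)


def bytes_by_user(conns):
    """Sum bytes_sent per user over current connections."""
    out = defaultdict(int)
    for c in conns:
        out[c["user"]] += c["bytes_sent"]
    return dict(out)


# Two scrapes of the same server. Between them alice sends 200 more
# bytes and bob disconnects, taking his counter with him.
before = [
    {"user": "alice", "bytes_sent": 5000},
    {"user": "bob", "bytes_sent": 9000},
]
after = [
    {"user": "alice", "bytes_sent": 5200},
]

drop = total_bytes(before) - total_bytes(after)
```

Here the server-wide sum falls from 14000 to 5200 even though more data was sent, which is the artificial bounce. The per-user view keeps alice's series increasing (5000 to 5200) while bob's series just ends, although it would show the same kind of drop if one user closed one of several connections.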