Selecting metrics to gather mostly based on what we can use
Partly due to our situation with our L2TP VPN servers, I've set up a system to capture some basic usage metrics from our VPN servers. They don't directly provide metrics; instead we have to parse the output of things like OpenBSD's npppctl and use the information to generate various metrics from the raw output. As part of doing this, I had to figure out what metrics we wanted to generate.
I could have generated every single metric I can think of a way to get, but this is probably not what we want. We have to write custom code to do this, so the more metrics we gather the more code I have to write and the more complicated the program gets. Large, complicated programs are overhead, and however nifty it would be, we shouldn't pay that cost unless we're getting value from it. Writing a lot of code to generate metrics that we never look at and that just clutter up everything is not a good idea.
At the same time, I felt that I didn't want to go overboard with minimalism and put in only the metrics I could think of an immediate use for and that were easy to code. There's good reason to gather additional information (ie, generate additional metrics) when it's easy and potentially useful; I don't necessarily know what will interest us in the future, or what questions we'll ask once we have data available.
In the end I started out with some initial ideas, then refined them as I wrote the code, looked at the generated metrics, thought about how I would use them, and so on. After having both thought about it in advance and seen how my initial ideas changed, I've wound up with some broad thoughts for future occasions (and some practical examples of them).
To start with, I should generate pretty much everything that's directly in the data I have, even if I can't think of an immediate use for it. For our VPN servers this is straightforward things like how many connections there are and the number of connected users (not necessarily the same thing, since a user may be connected from multiple devices). Our OpenVPN servers also provide some additional per-connection information, which I eventually decided I shouldn't just ignore. We may not use it, but the raw information is there and it's a waste to discard it rather than turn it into metrics.
I also generate some metrics which require additional processing because they seem valuable enough; for these, I should see a clear use, with them letting us answer some questions I'm pretty sure we're interested in. For our VPN servers the largest one (in terms of code) is a broad classification of the IP addresses that people are connecting from: whether it's from our wireless network, from our internal networks, from the outside world, or from one of a few other categories. This will give us at least some information about how people are using our VPN. I also generate the maximum number of connections a single IP address or user has. I could have made this a histogram of connections per IP or per user, but that seemed like far too much complexity for the likely value; for our purposes we probably mostly care about the maximum we see, especially since most of the time everyone only has one connection from one remote IP.
(Multiple connections from the same IP address but for different people can happen under various circumstances, as can multiple connections from the same IP for the same person. If the latter sounds odd to you, consider the case of someone with multiple devices at home that are all behind a single NAT gateway.)
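As a rough sketch of this sort of processing, the following classifies remote addresses and tracks the per-IP and per-user maximums in one awk pass. Everything specific here is an assumption made for illustration: the input is a simplified "user remote-IP" listing rather than real VPN server output, the address prefixes are made-up stand-ins for "wireless" and "internal", and the metric names are hypothetical.

```shell
#!/bin/sh
# Hypothetical, simplified input: one "user remote_ip" pair per connection.
# Real session listings have a different and richer format.
sessions='alice 10.10.1.5
bob 192.0.2.44
alice 10.10.1.5
carol 172.16.3.9'

# Read sessions on stdin, print Prometheus-style metric lines on stdout.
# The prefixes and metric names below are illustrative assumptions.
classify_sessions() {
    awk '
    NF >= 2 {
        if ($2 ~ /^10\.10\./)        class = "wireless"
        else if ($2 ~ /^172\.16\./)  class = "internal"
        else                         class = "outside"
        byclass[class]++
        # Track only the maximum, not a full histogram.
        if (++byip[$2] > maxip) maxip = byip[$2]
        if (++byuser[$1] > maxuser) maxuser = byuser[$1]
    }
    END {
        for (c in byclass)
            printf "vpn_connections_by_source{source=\"%s\"} %d\n", c, byclass[c]
        printf "vpn_max_connections_per_ip %d\n", maxip
        printf "vpn_max_connections_per_user %d\n", maxuser
    }' | LC_ALL=C sort
}

metrics=$(printf '%s\n' "$sessions" | classify_sessions)
printf '%s\n' "$metrics"
```

Keeping only the running maximum costs two extra lines of awk, which is about the level of complexity this information seems to be worth.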
Potentially high value information is worth gathering even if it doesn't quite fit neatly into the mold of metrics or raises things like cardinality concerns in Prometheus. For our VPN metrics, I decided that I should generate a metric for every currently connected user so that we can readily track that information, save it, and potentially correlate it to other metrics. Initially I was going to make the value of this metric be '1', but then I realized that was silly; I could just as well make it the number of current connections the user has (which will always be at least 1).
(In theory this is a potentially high cardinality metric. In practice we don't have that many people who use our VPNs, and I've discovered that sometimes high cardinality metrics are worth it. While the information is already in log files, extracting it from Prometheus is orders of magnitude easier.)
In general I should keep metrics de-aggregated as much as is reasonable. At the same time, some things can be worth providing in pre-aggregated versions. For instance, in theory I don't need to have separate metrics for the number of connections and the number of connected users, because they can both be calculated from the per-user connection count metric. In practice, having to do that computation every time is annoying and puts extra load on Prometheus.
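To make the per-user metric and its pre-aggregated companions concrete, here is a minimal sketch that produces all three in a single pass. It assumes a hypothetical input of one user name per current connection, and the metric names are invented for the example rather than taken from the real program.

```shell
#!/bin/sh
# Hypothetical input: one line per current connection, holding the user name.
users='alice
bob
alice
carol'

# Emit one metric per connected user, valued with that user's connection
# count, plus two pre-aggregated totals. Metric names are illustrative.
user_metrics() {
    awk '
    NF > 0 { count[$1]++; total++ }
    END {
        for (u in count) {
            printf "vpn_user_connections{user=\"%s\"} %d\n", u, count[u]
            nusers++
        }
        printf "vpn_connections %d\n", total
        printf "vpn_connected_users %d\n", nusers
    }' | LC_ALL=C sort
}

metrics=$(printf '%s\n' "$users" | user_metrics)
printf '%s\n' "$metrics"
```

The two totals are redundant with the per-user series, but producing them here is nearly free, whereas deriving them later means repeating the same aggregation in every query.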
However, I also need to think about whether aggregating things together actually makes sense. For instance, OpenVPN provides per-connection information on the amount of data sent and received over that connection. It looks tempting to pre-aggregate this together into a 'total sent/received by VPN server' metric, but that's dangerous because connections come and go; our aggregated metric would bounce around in a completely artificial way. Aggregating by user is potentially dangerous, but we have to do something to get stable and useful flow identifiers and most of the time a user only has one connection.
Capturing command output in a Bourne shell variable as a brute force option
Often, the natural form of generating and then processing something in the Bourne shell is as a pipeline:
    smartctl -A /dev/$dsk | tr A-Z- a-z_ | fgrep -v ' unknown_' | awk '<process more>'
    timeout 30s ssh somehost npppctl session brief | awk '<generate metrics>'
(Using awk is not necessarily recommended here, but it's the neutral default.)
However, there can be two problems with this. First, sometimes you want to process the command's output in several different ways, but you only want to run the command once (perhaps it's expensive). Second, sometimes you want to reliably detect that the initial command failed, or even not run any further steps if it failed because you don't trust that the output it generates on failure won't confuse the rest of the pipeline and produce bad results.
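The second problem is easy to demonstrate. In a plain Bourne shell, a pipeline's exit status is the exit status of its last command, so a failure at the front is hidden. This small sketch uses `false` as a stand-in for a failing first command:

```shell
#!/bin/sh
# The pipeline reports the status of 'cat', which succeeds, so the
# failure of the first command is invisible to the script.
false | cat
pipe_status=$?

# Run outside a pipeline, the same failure is visible.
direct_status=0
false || direct_status=$?

echo "pipeline status: $pipe_status, direct status: $direct_status"
```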
The obvious solution to this is to write the output of the first command into a temporary file, which you can then process and re-process as many times as you want. You can also directly check the first command's exit status (and results), and only proceed if things look good. But the problem with temporary files is that they're kind of a pain to deal with. You have to find a place to put them, you have to name them, you have to deal with them securely if you're putting them in a shared temporary directory (or $TMPDIR more generally), you have to remove them afterward (including removing them on error), and so on. There is a lot of bureaucracy and overhead in dealing with temporary files and it's easy to either miss some case or be tempted into cutting corners.
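For comparison, here is roughly what a careful temporary file version looks like. This is a sketch under some assumptions: `gen_output` is a hypothetical stand-in for the real command, its two-line output is invented, and `mktemp` is widely available but not specified by POSIX.

```shell
#!/bin/sh
# Hypothetical stand-in for the expensive command, with invented output.
gen_output() {
    printf 'Raw-Read-Error-Rate 0\nTemperature-Celsius 31\n'
}

# Create the file securely with an unpredictable name.
tmpf=$(mktemp "${TMPDIR:-/tmp}/cmdout.XXXXXX") || exit 1
# Remove the file however the script exits, including on signals.
trap 'rm -f "$tmpf"' EXIT
trap 'exit 1' HUP INT TERM

# Stop before any processing if the command itself failed.
if ! gen_output >"$tmpf"; then
    echo "command failed" >&2
    exit 1
fi

# Process the saved output as many times as needed.
lines=$(wc -l <"$tmpf" | tr -d ' ')
temp=$(awk '$1 == "Temperature-Celsius" { print $2 }' "$tmpf")
echo "lines=$lines temp=$temp"
```

Almost half of this is housekeeping for the file rather than work on the data, which is the bureaucracy in question.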
Lately I've been leaning on the alternate and somewhat brute force option of just capturing the command's output in the shell script, putting it into a shell variable:
    smartout="$(smartctl -A /dev/$dsk)"
    if [ $? -ne 0 ] ; then
       ....
    fi
    echo "$smartout" | tr A-Z- a-z_ | ....
    echo "$smartout" | awk '<process again>'
(Checking for empty output is optional but probably recommended.)
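A self-contained version of this pattern, including the empty output check, might look like the following sketch. The `gen_output` function is a hypothetical stand-in for something like smartctl, and its output is invented so that the example can run anywhere.

```shell
#!/bin/sh
# Hypothetical stand-in for the real command, with invented output.
gen_output() {
    printf 'Raw-Read-Error-Rate 0\nTemperature-Celsius 31\n'
}

# The exit status of a plain assignment is the exit status of the
# command substitution, so it can be tested directly.
if ! out=$(gen_output); then
    echo "command failed" >&2
    exit 1
fi
# A command can exit with success and still print nothing useful.
if [ -z "$out" ]; then
    echo "command produced no output" >&2
    exit 1
fi

# Re-process the captured output as many times as needed. printf is used
# instead of echo to sidestep echo's handling of backslashes and options.
names=$(printf '%s\n' "$out" | tr 'A-Z-' 'a-z_' | awk '{ print $1 }')
temp=$(printf '%s\n' "$out" | awk '$1 == "Temperature-Celsius" { print $2 }')
echo "temp=$temp"
```

One thing to be aware of is that command substitution strips trailing newlines from the captured output, which is why each re-processing step adds one back with `printf '%s\n'`.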
In the old Unix days when memory was scarce, this would have been horrifying (and potentially dangerous). Today, that's no longer really the case. Unless your commands generate a very large amount of output or something goes terribly wrong, you won't notice the impact of holding the entire output of your command in the shell's memory. In many cases the command will produce very modest amounts of output, on the order of a few KB or a few tens of KB, which is a tiny drop in the bucket of modern Bourne shell memory use.
(And if the command goes berserk and produces a giant amount of output, writing that to a file would probably have been just as much of a problem. If you hold it in the shell's memory, at least it automatically goes away if and when the shell dies.)
Capturing command output in a shell variable solves all of my problems here. Shell variables don't have any of the issues of temporary files, they let you directly see the exit status of the first command in what would otherwise be the pipeline, and you can repeatedly re-process them through different additional things. I won't say it's entirely elegant, but it works and sometimes that (and simplicity) is my priority.