Wandering Thoughts


The value of automation having ways to shut it off (a small story)

We have some old donated Dell C6220 blades that we use as SLURM-based compute servers. Unfortunately, these machines appear to have some sort of combined hardware and software fault that causes them to lock up periodically under some loads (building Go from source with full tests is especially prone to triggering it). Fortunately these machines support IPMI and so can be remotely power cycled, and a while back we got irritated enough at the lockups that we set up their IPMIs and built a simple set of cron-based scripts to do the power cycling for us automatically.

(The scripts take the simple approach of detecting down machines by looking for alerts in our Prometheus system. To avoid getting in our way, they only run outside of working hours; during the working day, if a Dell C6220 blade goes down we have to run the 'power cycle a machine via IPMI' script by hand against the relevant machine. This lets us deliberately shut down machines without having them suddenly restarted on us.)
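As a sketch, the cron side of this restriction can be as simple as a time-limited /etc/cron.d entry; the path, script name, and schedule here are all made-up stand-ins for our real setup:

```
# Hypothetical /etc/cron.d/auto-powercycle: look for down blades every
# 15 minutes, but only outside the 9am to 5pm working day.
*/15 0-8,17-23 * * *   root   /opt/sbin/powercycle-down-blades
```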

All of these Dell C6220 blades are located in a secondary machine room that has the special power they need. Unfortunately, this machine room's air conditioner seems to have developed some sort of fault where it just stops working until you turn it off, wait a bit, and turn it back on. Of course this hasn't been happening during the working day; instead it's happened in the evenings or at night (twice, recently). When this happens and we see the alerts from our monitoring system, we notify the relevant people and then power off all or almost all of the servers in the room, including the Dell C6220 blades.

You can probably see where this is going. Fortunately we thought of the obvious problem here before we started powering down the C6220 blades, so both times we just manually disabled the cron job that auto-restarts them. However, you can probably imagine what sort of problems we might have if we had a more complex and involved system to automatically restart nodes and servers that were 'supposed' to be up; in an unusual emergency situation like this, we could be fighting our own automation if we hadn't thought ahead to build in some sort of shutoff switch.

Or in short, when you automate something, think ahead to how you'll disable the automation if you ever need to. Everything needs an emergency override, even if that's just 'remove the cron job that drives everything'.

It's fine if this emergency stop mechanism is simple and brute force. For example, our simple method of commenting out the cron job is probably good enough for us. We could build a more complex system (possibly with finer-grained controls), but it would require us to remember (or look up) more about how to shut things off.
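To illustrate how little machinery such an override needs, here is a minimal sketch of a stop-file version of the idea; the file path, the messages, and what the automation would actually do are all hypothetical:

```shell
#!/bin/sh
# A brute force emergency override: the automation checks for a stop
# file and does nothing if it exists (path and actions are made up).
STOPFILE="${STOPFILE:-/tmp/no-auto-restart}"

auto_restart() {
    if [ -e "$STOPFILE" ]; then
        echo "override in place, doing nothing"
        return 0
    fi
    echo "would power cycle any down blades here"
}

auto_restart            # normal operation
touch "$STOPFILE"       # 'touch the stop file' is the entire emergency procedure
auto_restart            # now a no-op
rm -f "$STOPFILE"       # re-enable the automation afterward
```

The entire emergency procedure is one `touch` command that anyone can remember at midnight, which is a large part of the appeal.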

We could also give the auto-restart system some safety features. An obvious one would be to get the machine room temperature from Prometheus and refuse to start up any of the blade nodes if it's too hot. This is a pretty specific safety check, but we've already had two AC incidents in close succession so we're probably going to have more. A more general safety check would be to refuse to turn on blades if there were too many down, on the grounds that a lot of blades being down is almost certainly not because of the problem that the script was designed to deal with.
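A sketch of what the temperature safety check could look like; the threshold, metric name, and Prometheus URL are all hypothetical, and the real script would fetch the current value from Prometheus's query API instead of hard-coding it:

```shell
#!/bin/sh
# Hypothetical 'too hot' safety check before auto-restarting blades.
# The real version would get the temperature with something like:
#   curl -s http://prometheus.example.com:9090/api/v1/query \
#        --data-urlencode 'query=machineroom_temp_celsius' |
#     jq -r '.data.result[0].value[1]'

too_hot() {
    # let awk do the floating point comparison; exit status 0 means too hot
    awk -v t="$1" 'BEGIN { exit !(t > 30.0) }'
}

temp=34.5    # stand-in for the value queried from Prometheus
if too_hot "$temp"; then
    echo "machine room at ${temp}C, refusing to power blades back on"
else
    echo "temperature is fine, proceeding"
fi
```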

AutomationShutoffValue written at 23:54:20


CUPS's page log, its use of SNMP, and (probably) why CUPS PPDs turn that off

One of CUPS's many features is that it can (try to) log reports of how many pages got printed to what printer by who; this is set by the PageLog directive in cups-files.conf. We set this in our CUPS configuration because it's a useful aid for debugging and for broadly tracking usage, as probably do many people who run CUPS as a central print server. I've recently been exploring some of the things involved in this logging, due to discovering that our CUPS server wasn't recording some print jobs in our page log, in a pattern that's not clear to us.

In order to generate the page log, CUPS generally needs to know how many total pages the printer had printed before the job and then after it. In the old days, CUPS had various arcane ways of extracting this information from printers, traditionally involving Postscript queries. In the modern world, it turns out that CUPS tries to get page usage information from the printer using SNMP queries (using SNMP MIBs that are apparently standard between printer companies, or at least standard enough); in CUPS terminology, this is querying printers for 'supply levels'. In theory this has some implications if you have firewalls between your printers and your print server and the firewalls might be blocking SNMP.

In practice, if you look at modern CUPS PPDs, you will find that a certain number of them have an interesting line in them:

*cupsSNMPSupplies: False

As covered in the documentation, this turns off making SNMP queries to find out supply information, including the page count.

As it turns out, there is probably a good reason for this setting, because the way CUPS does SNMP queries appears to have a little issue in practice that makes them not work. What I've observed in testing is that CUPS doesn't poll the printer to make sure it's finished producing output before getting the 'after printout' SNMP page usage numbers. At least some printers will accept the print job well before they've actually finished producing all of the pages (or even any of them), and of course they haven't updated their 'total pages printed' count at that point. The practical effect is that CUPS makes two back-to-back SNMP queries, gets the same page count from both, and doesn't record anything to your page log.
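To make the failure concrete, here's a sketch of the arithmetic CUPS is effectively doing; the page counts are made-up values, and the snmpget command in the comment is only an illustration (the standard Printer MIB object for the lifetime page count is prtMarkerLifeCount):

```shell
#!/bin/sh
# Hypothetical illustration of the SNMP page accounting race.
# A real query for the lifetime page count might look something like:
#   snmpget -v1 -c public printer.example.com 1.3.6.1.2.1.43.10.2.1.4.1.1
before=12345   # page count queried just before the job is sent
after=12345    # queried again as soon as the printer *accepts* the job,
               # which can be before any pages have physically printed
echo "pages for this job: $((after - before))"
```

A delta of zero pages means there is nothing to write to the page log, which matches what we see.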

(Discovering this made the mystery for some of our printers not why we didn't have page logs, but why we had them for some jobs under some circumstances. Clearly CUPS has some additional sources of page count information that are sometimes available and sometimes not.)

PS: I'm not sure this CUPS and printer behavior can be said to be either side's fault; it seems more like a case of mismatched assumptions. Presumably there are some printers where it does work and produces useful per-job page count information.

CUPSPageLogAndSNMP written at 00:53:40


Why I prefer the script exporter for exposing script metrics to Prometheus

Suppose that you have some scripts that you use to extract and generate Prometheus metrics for targets, and these scripts run on your Prometheus server. These metrics might be detailed SNTP metrics of (remote) NTP servers, IMAP and POP3 login performance metrics, and so on. You have at least three methods to expose these script metrics to Prometheus; you can run them from cron and publish through either node_exporter's textfile collector or Pushgateway, or you can use the third-party script_exporter to run your scripts in response to Prometheus scrape requests (and return their metrics). Having used all three methods to generate metrics, I've come to usually prefer using the script exporter except in one special case.

Conceptually, in all three methods you're getting metrics from some targets. In the cron-based methods, which targets you're getting which metrics from (and how frequently) is embedded in and controlled by scripts, cron.d files, and so on, not in your Prometheus configuration the way your other targets are. In the script exporter method, all of that knowledge of targets and timing is in your Prometheus configuration, just like your other targets. And just like other targets, you can configure additional labels on some of your script exporter scrapes, use different timings, and so on, and it's all controlled in one place. If some targets need some different checking options, you can set that in your Prometheus configuration as well.
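As an illustration, a blackbox-style scrape job for the script exporter might look something like the following; the job name, script name, port, and parameter handling are all hypothetical and depend on which script exporter implementation you run:

```yaml
- job_name: 'script-imap'
  metrics_path: /probe
  params:
    script: ['imap-login']
  static_configs:
    - targets:
        - imap1.example.com
        - imap2.example.com
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: 127.0.0.1:9469    # where script_exporter listens
```

The targets, extra labels, and scrape interval all live here in prometheus.yml, right next to your other scrape jobs.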

You can do all of this with cron-based scripts, but you start littering your scripts and cron.d files and so on with special cases. If you push it far enough, you're basically building your own additional set of target configurations, per-target options, and so on. Prometheus already has all of that ready for you to use (and it's not that difficult to make it general with the usual tricks, or with a label-based approach).

There are two additional benefits from directly scraping metrics. First, the metrics are always current instead of delayed somewhat by however long Prometheus takes to scrape Pushgateway or the host agent. Related to this, you get automatic handling of staleness if something goes wrong and scrapes start failing. Second, you have a directly exposed metric for whether the scrape worked or whether it failed for some reason, in the form of the relevant up and script_success metrics. With indirect scraping you have to construct additional things to generate the equivalents.

The one situation where this doesn't work well is when you want a relatively slow metric generation interval. Because you're scraping directly, you have the usual Prometheus limitation where it considers any metric more than five minutes old to be stale. If you want to do your checks and generate your metrics only once every four or five minutes or slower, you're basically stuck publishing them indirectly so that they won't regularly disappear as stale, and this means one of the cron-based methods.

PrometheusScriptExporterWhy written at 01:57:07


Three ways to expose script-created metrics in Prometheus

In our Prometheus environment, we've wound up wanting (and creating) a bunch of custom metrics that are most naturally created through a variety of scripts. Some of these are general things that are simply not implemented to our tastes in existing scripts and exporters, such as SMART disk metrics, and some of these are completely custom metrics for our environment, such as our per-user, per-filesystem disk space usage information or information from our machine room temperature sensors (which come from an assortment of vendors and have an assortment of ways of extracting information from them). When you're generating metrics in scripts, you need to figure out how to get these metrics from the script into Prometheus. I know of three different ways to do this, and we've used all three.

The first and most obvious way is to have the script publish the metrics to Pushgateway. This requires very little from the host that the script is running on; it has to be able to talk to your Pushgateway host and it needs an HTTP client like curl or wget. This makes Pushgateway publication the easiest approach when you're running as little as possible on the script host. It has various drawbacks that can be boiled down to 'you're using Pushgateway', such as having to manually check for metrics going stale because the script that generates them is now failing.
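A minimal sketch of what such a publishing script does; the metric name, the labels, and the Pushgateway URL are all hypothetical:

```shell
#!/bin/sh
# Build a metric in Prometheus exposition format, then push it.
payload="ourlab_vpn_users 42"
echo "$payload"
# The actual publication step is just curl; the job and instance
# labels become part of the Pushgateway URL:
#   echo "$payload" | curl --silent --data-binary @- \
#     http://pushgateway.example.com:9091/metrics/job/vpn/instance/vpn1
```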

On servers where you're running node_exporter, the Prometheus host agent, the simplest approach is usually to have scripts expose their metrics through the textfile collector, where they write a text file of metrics into a particular directory. We wrote a general wrapper script to support this, which handles locking, writing the script's output to a temporary file, and so on, so that our metrics generation scripts only have to write everything to standard output and exit with a success status.
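A stripped-down sketch of such a wrapper, minus the locking; the directory, file name, and metric are made up, and the real version runs the actual metrics generation script instead of an inline function:

```shell
#!/bin/sh
# Write metrics to a temporary file, then rename it into place so that
# node_exporter's textfile collector never reads a partial file.
TEXTFILE_DIR="${TEXTFILE_DIR:-/tmp}"   # really node_exporter's textfile directory
name="example"
tmp="$TEXTFILE_DIR/.$name.prom.$$"

generate_metrics() {
    # stand-in for running the real metrics generation script
    echo 'example_metric 1'
}

if generate_metrics > "$tmp"; then
    mv "$tmp" "$TEXTFILE_DIR/$name.prom"
else
    # on failure, remove the metrics file so the metrics visibly go stale
    rm -f "$tmp" "$TEXTFILE_DIR/$name.prom"
fi
```

The atomic rename is the important part; writing directly to the final file risks node_exporter scraping a half-written set of metrics.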

(If a script fails, our wrapper script removes that particular metrics text file to make the metrics go stale. Now that I'm writing this entry, I've realized that we should also write a script status metric for the script's exit code, so we can track and alert on that.)

Both of these methods generally run the scripts through cron, which generally means that you generate metrics at most once a minute and they'll be generated at the start of any minute that the scripts run on. If you scrape your Pushgateway and your host agents frequently, Prometheus will see updated metrics pretty soon after they're generated (a typical host agent scrape interval is 15 seconds).

The final way we expose metrics from scripts is through the third-party script_exporter daemon. To quote its Github summary, it's a 'Prometheus exporter to execute scripts and collect metrics from the output or the exit status'. Essentially it's like the Blackbox exporter, except that instead of a limited and hard-coded set of probes, you have a whole collection of scripts generating whatever metrics you want to write and configure. The script exporter lets these scripts take parameters, for example to select what target to work on (how this works is up to each script to decide).

Unlike the other two methods, which are mostly configured on the machines running the scripts, generating metrics through the script exporter has to be set up in Prometheus by configuring a scrape configuration for it with appropriate targets defined (just like for Blackbox probes). This has various advantages that I'm going to leave for another entry.

Because you have to set up an additional daemon for the script exporter, I think it works best for scripts that you don't want to run on multiple hosts (they can target multiple hosts, though). In this it's much like the Blackbox exporter; you normally run one Blackbox exporter and use it to check on everything (or a few of them if you need to check from multiple vantage points or check things that are only reachable from some hosts). You certainly could run a script exporter on each machine and doing so has some advantages over the other two ways, but it's likely to be more work compared to using the textfile collector or publishing to Pushgateway.

(It also has a different set of security issues, since the script exporter has to be exposed to scraping from at least your Prometheus servers. The other two approaches don't take outside input in any way; the script exporter minimally allows the outside to trigger specific scripts.)

PS: All of these methods assume that your metrics are 'the state of things right now' style metrics, where it's harmless and desired to overwrite old data with new data. If you need to accumulate metrics over time that are generated by scripts, see using the statsd exporter to let scripts update metrics.

PrometheusScriptMetricsHow written at 00:49:25


The history and background of us using Prometheus

On Prometheus and Grafana after a year, a commentator asked some good questions:

Is there a reason why you went with a "metrics-based" (?) monitoring solution like Prometheus-Grafana, and not a "service-based" system like Zabbix (or Nagios)? What (if anything) was being used before the current P-G system?

I'll start with the short answer, which is that we wanted metrics as well as alerting and operating one system is simpler than operating two, even if Prometheus's alerting is not necessarily as straightforward as something intended primarily for that. The longer answer is in the history of how we got here.

Before the current Prometheus system, what we had was based on Xymon and had been in place sufficiently long that portions of it talked about 'Hobbit' (the pre-2009 name of Xymon). Xymon as we were operating it was almost entirely a monitoring and alerting system, with very little to nothing in the way of metrics and metrics dashboards. We've understood for a long time that having metrics is important and we wanted to gather and use them, but we had never managed to turn this desire into actually doing anything (at one point I sort of reached a decision on what to build, but then I never actually built anything, for various reasons).

In the fall of 2018 (last year), our existing Xymon setup reached a critical point where we couldn't just let it be, because it was hosted on an Ubuntu 14.04 machine. For somewhat unrelated reasons I wound up looking at Prometheus, and its quick-start demonstration sold me on the idea that it could easily generate useful metrics in our environment (and then let us see them in Grafana). My initial thoughts were to split metrics apart from alerting and to start by setting up Prometheus as our metrics system, then figure out alerting later. I set up a testing Prometheus and Grafana for metrics on a scratch server around the start of October.

Since we were going to run Prometheus and it had some alerting capabilities, I explored whether it could more or less sufficiently cover our alerting needs. It turned out that it could, although perhaps not in an ideal way. However, running one system and gathering information once (more or less) is less work than also trying to pick a modern alerting system, set it up, and set up monitoring for it, especially if we wanted to do it on a deadline (with the end of Ubuntu's support for 14.04 looming up on us). We decided that we would at least get Prometheus in place now to replace Xymon, even if it wasn't ideal, and then possibly implement another alerting system later at more leisure if we decided that we needed to. So far we haven't felt a need to go that far; our alerts work well enough in Prometheus, and we don't have all that many custom 'metrics' that really exist only to trigger alerts.

(Things we want to alert on often turn out to also be things that we want to track over time, more often than I initially thought. We've also wound up doing more alerting on metrics than I expected us to.)

Given this history, it's not quite right for me to say that we chose Prometheus over other alternative metrics systems. Although we did do some evaluation of other options after I tried Prometheus's demo and started exploring it, what it basically boiled down to was that we had decent confidence Prometheus could work (for metrics) and none of the other options seemed clearly better to the point where we should spend the time exploring them as well. Prometheus was not necessarily the best; it simply sold us on it being good enough.

(Some of the evaluation criteria I used turned out to be incorrect, too, such as 'is it available as an Ubuntu package'. In the beginning that seemed like an advantage for Prometheus and anything that was, but then we wound up abandoning the Ubuntu Prometheus packages as being too out of date.)

PrometheusWhyHistory written at 00:29:44


Prometheus and Grafana after a year (more or less)

We started up our permanent production Prometheus instance on November 21st of 2018, which means that it's now been running and saving metrics for over a year (actually over 13 months by now, because I'm bad at writing entries on time). Our Prometheus and Grafana setup hasn't been static over that time, but it also hasn't undergone any significant changes from our straightforward initial setup (which was essentially the same as our current setup, just with fewer additional third party exporters).

The current state of our Prometheus setup is that it's now a quiet, reliable, and issue free part of our infrastructure, one that we generally don't have to think about; it just sits there in the background, working as it should. Every so often we get an alert email, but not very often because we usually don't have problems. Periodically we may look at our Grafana dashboards to see how things are going and if there's anything we want to look at (I may do this more than my co-workers, because I tend to think the most about Prometheus).

In the earlier days of our deployment (especially the first six months), we had a bunch of learning experiences around things like mass alerts. I spent a fair amount of time working on alert rules, figuring out what to monitor and how, working out how to do clever things like reboot notifications, generating additional custom metrics in various ways, and building and modifying dashboards so they'd be useful, as well as doing the normal routine maintenance tasks. These days things are almost entirely down to the routine tasks of changing our lists of Prometheus scrape targets as we add and remove machines, and keeping up with new versions of Prometheus, Grafana, and other components.

(I still do like coming up with additional metrics and fiddling with dashboards and I indulge in it periodically, but I'm aware that this is something that I could tinker with endlessly without necessarily generating lots of value.)

Overall, I'm quite happy with how our Prometheus system has turned out. It's been trouble-free to operate and it's delivered (and continues to deliver) what we want for alerts and mostly what we want as far as dashboards go (and the failings there are mine, because I'm the one putting them together). Keeping up with new versions has been easy, and they've delivered a slow and generally reliable stream of improvements, especially in our Grafana dashboards.

(I'm not as happy with the complexity of both Prometheus and Grafana, but a lot of that complexity is probably inherent in anything with those capabilities. As far as building alerts, custom metrics, and so on goes, we would probably have had to do something similar to that for any system. We can't expect out of the box monitoring for custom systems and environments.)

At the same time, Prometheus and Grafana have not magically illuminated all of our mysterious issues. If anything, staring at Grafana dashboards and looking at direct Prometheus metrics while mysterious things were going on has made me more aware of what information we simply don't have about our systems and what they're doing. Prometheus only gives us some visibility, not perfect visibility, and that's really as expected.

(My suspicion is that we won't be able to do much better until Ubuntu 20.04 ships with a decently usable version of the eBPF toolset.)

On the whole, Prometheus has improved our life but not revolutionized it. We have better alerts and more insight than we used to, but this hasn't solved any big issues that we had before (partly because we didn't really have big issues). In some ways the largest improvement is simply that we now have more reassurance about our environment through having more visibility into it. Our dashboards mean that we can see at a glance that no TLS certificates are too close to expiring, no machines have too high a load, the mail queues are not too large, and so on (and if there are problems, we can see where they are at a glance too).

PrometheusGrafanaOneYear written at 01:02:28


Our setup of Prometheus and Grafana (as of the end of 2019)

I have written a fair amount about Prometheus, but I've never described how our setup actually looks in terms of what we're running and where it runs. Even though I feel that our setup is straightforward and small scale as Prometheus setups go, there's some value in actually writing it down, if only to show how you can run Prometheus in a modest environment without a lot of complexity.

The main Prometheus programs and Grafana all are on a single Dell 1U server (currently a Dell R230) running Ubuntu 18.04, with 32 GB of RAM, four disks, and an Intel dual Ethernet card. Two of the disks are mirrored system SSDs and the other two are mirrored 4 TB HDs that we use to hold the Prometheus TSDB metrics data. We use 4 TB HDs not because we have a high metrics volume but because we want a very long metrics retention time; we're currently aiming for at least four years. We use all four network ports on the Prometheus server in order to let the server be directly on several non-routed internal networks that we want to monitor machines on, in addition to our main internal (routed) subnet.

This server currently hosts Prometheus itself, Grafana, Alertmanager, Blackbox, and Pushgateway. Like almost all of our servers, it also runs the node_exporter host agent. We use the upstream precompiled versions of everything, rather than the Ubuntu 18.04 supplied ones, because the Ubuntu ones wound up being too far out of date. Among third-party exporters, it has the script exporter, which we use for more sophisticated 'blackbox' checks, and the Apache exporter. The web servers for Prometheus, Grafana, Alertmanager, Pushgateway, and Blackbox are all behind an Apache reverse proxy that handles TLS and authentication.

As mentioned, almost all of our Ubuntu machines run the Prometheus host agent. Currently, our mail-related machines also run mtail to generate some statistics from their logs, and our nVidia-based GPU servers also run a hacked-up version of a third-party nVidia exporter. On basically all machines running the host agent, we have a collection of scripts that generate various metrics into text files for the host agent's textfile collector. Some of these are generic scripts that run on everything (for things like SMART metrics), but some are specific to certain sorts of machines with certain services running. The basic host agent and associated scripts and /etc/cron.d files are automatically installed on new machines by our install system; other things are set up as part of our build instructions for specific machines.

(I've sort of kept an eye on Grafana Loki but haven't actively looked into using it anywhere. I haven't actively explored additional Prometheus exporters; for the most part, our system level metrics needs are already covered.)

Prometheus, Alertmanager, and so on are all configured through static files, including for what targets Prometheus should scrape. We maintain all of these by hand (although they're in a Mercurial repository), because we're not operating at the kind of scale or rate of changes where we need to automatically (re)generate the list of targets, our alert rules, or anything like that. We also don't try to have any sort of redundant Prometheus or Alertmanager instances; our approach for monitoring Prometheus itself is fairly straightforward and simple. Similarly, we don't use any of Grafana's provisioning features, we edit dashboards in the Grafana UI and just let it keep everything in its grafana.db file.
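For illustration, the static side of this is nothing more elaborate than hand-edited blocks along these lines (the job name and hosts are made up):

```yaml
- job_name: 'node'
  static_configs:
    - targets:
        - server1.example.com:9100
        - server2.example.com:9100
        # lines added and removed by hand as machines come and go
```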

(Our Grafana dashboards, Prometheus alert rules, and so on are basically all locally written for our own specific needs and metrics setup. I would like to extract our Grafana dashboards into a text format so that I could more conveniently version them in a Mercurial repository, but that's a someday project.)

We back up the Prometheus server's root filesystem, which includes things like /etc/prometheus and the grafana.db file (as well as all of the actual programs involved), but not the Prometheus TSDB metrics database, because that's too big. If we lose both mirrored HDs at the same time (or sufficiently close to it), we'll lose our database of past metrics and will have to start saving them again from the current point in time. We've decided that a deep history of metrics is nice to have but not sufficiently essential that we're going to do better than this.

We have a collection of locally written scripts and some Python programs that generate custom metrics, either on the Prometheus server itself or on other servers that are running relevant software (or sometimes have the necessary access and vantage point). For example, our temperature sensor monitoring is done with custom scripts that are run from cron on the Prometheus server and write to Pushgateway. Some of it could have been done with the SNMP exporter, but rolling our own script was the simpler way to get started. These days, a fair number of these scripts on the Prometheus server are run through the script exporter instead of from cron for reasons that need another entry. On our other machines, all of them run from cron and most of them write files for the textfile collector; a few publish to Pushgateway.

PrometheusGrafanaSetup-2019 written at 00:51:37


It's a good idea to label all of the drives in your desktop

I was doing some drive shuffling with my office workstation today, so opening it up reminded me that when I originally built it up, I did the wise thing of putting labels on all of the drives in it (both the spinning rust hard drives and the SSDs). We generally don't label the drives themselves on modern servers, because most servers have their backplane or drive cables hardwired so that a drive in a given spot in the chassis is always going to be the same disk as Linux sees it. This isn't true on most desktops, where you get to run the cables yourself in any way that you want and then play the game of finding out what order your motherboard puts ports in (which is often not the order you expect; motherboards can be wired up in quite peculiar ways).

(As we've found out, there are good reasons to label the front of the server chassis with what disk is where and what a particular disk is for, especially if the disks aren't in a straightforward order. In some cases, you may want to keep a printed index of what drive is where. But that's separate from labeling the drive itself inside the carrier or chassis.)

We have a labelmaker (as should everyone), so that's what I use to label all of the drives. My current practice in labels is to label each drive with the host it's in, its normal Linux disk name (like 'sda'), and what important filesystems (or ZFS pools, or both) are on the disk. I will also sometimes label drives as 'disk 0', 'disk 1', and so on. I have two goals with all of this labeling. When a drive is in my machine, I want to be able to see which drive it is, so that if I know that 'sdb' has died (or I want to replace it), I know what drive to uncable, remove, and so on. When I pull a drive out of my machine, either temporarily or permanently, I want to know where it came from and what it has or had on it, rather than have the drive wind up as yet another mysterious and anonymous drive sitting around (I have more than enough of those as it is).

(I'm not entirely sure what my goal was with my 'disk 0' and 'disk 1' labels. I think I wanted to keep track of which part of a software RAID array the drive had been, not just which array it had been a part of.)

Much like labeling bad hardware, I should probably put an additional label on removed drives with the date that I pulled them out. If they failed, I should obviously label that too (sometimes I pull drives because I'm replacing them with better ones, which is the current case).

Unfortunately there's one sort of drive that you can't currently really label, and that's NVMe drives; unlike normal drives, they don't really have a case to put a label on. My new NVMe drives have a manufacturer's sticker over parts of the drive, but I don't want to put a label on top of any part of it for various reasons. Right now I'm just hoping that Linux and motherboards order NVMe drives in a sensible way (although I should check that).

PS: I haven't been entirely good about this on my home machine. At some point I'll be shuffling disks around on it and I should make sure everything is fully labeled then.

(This entry elaborates on something I mentioned in passing at the bottom of my entry on labeling bad hardware. Since I was making new labels for some new drives today, the issue is on my mind.)

LabelYourDesktopDrives written at 00:24:57


You can have Grafana tables with multiple values for a single metric (with Prometheus)

Every so often, the most straightforward way to show some information in a Grafana dashboard is with a table, for example to list how long it is before TLS certificates expire, how frequently people are using your VPN servers, or how much disk space they're using. However, sometimes you want to present the underlying information in more than one way; for example, you might want to list both how many days until a TLS certificate expires and the date on which it will expire. The good news is that Grafana tables can do this, because Grafana will merge query results with identical Prometheus label sets (more or less).

(There's a gotcha with this that we will come to.)

In a normal Grafana table, your column fields are the labels of the metric and a 'Value' field that is whatever computed value your PromQL query returned. When you have several queries, the single 'Value' field turns into, eg, 'Value #A', 'Value #B', and so on, and all of them can be displayed in the table (and given more useful names and perhaps different formatting, so Grafana knows that one is a time in seconds and another is a 0.0 to 1.0 percentage). If the Prometheus queries return the same label sets, every result with the same set of labels will get merged into a single row in the table, with all of the 'Value #<X>' fields having values. If not all sets of labels show up in all queries, the missing results will generally be shown as '-'.

(Note that what matters for merging is not what fields you display, but all of the fields. Grafana will not merge rows just because your displayed fields have the same values.)

The easiest way to get your label sets to be the same is to do the same query, just with different math applied to the query's value. You can do this to present TLS expiry as a duration and an absolute time, or usage over time as both a percentage and an amount of time (as seen in counting usage over time). A more advanced version is to do different queries while making sure that they return the same labels, possibly by either restricting what labels are returned with use of 'by (...)' and similar operators (as sort of covered in this entry).
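
As a concrete sketch of the 'same query, different math' approach (assuming the Blackbox exporter's probe_ssl_earliest_cert_expiry metric, which is a Unix timestamp), these two queries return identical label sets and so merge into one table row, with one value shown as days and the other formatted as a date:

(probe_ssl_earliest_cert_expiry - time()) / 86400
probe_ssl_earliest_cert_expiry + 0

(The '+ 0' on the second query matters; we'll come to why in a moment.)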

When you're doing different queries of different metrics, an important gotcha comes up. When you do simple queries, Prometheus and Grafana acting together add a __name__ label field with the name of the metric involved. You're probably not displaying this field, but its mere presence with a different value will block field merging. To get rid of it, you have various options, such as adding '+ 0' to the query or using some operator or function (as seen in the comments of this Grafana pull request and this Grafana issue). Conveniently, if you use 'by (...)' with an operator to get rid of some normal labels, you'll get rid of __name__ as well.
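
As an illustration (using node_exporter's node_load1 as a stand-in metric), these two queries return the same numbers but different label sets:

node_load1
node_load1 + 0

The first result carries __name__="node_load1" and so won't merge with the second, even though the '+ 0' changes nothing about the values.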

All of this only works if you want to display two values for the same set of labels. If you want to pull in labels from multiple metrics, you need to do the merging in your PromQL query, generally using the usual tricks to pull in labels from other metrics.
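
A minimal sketch of that kind of cross-metric label pulling, assuming a hypothetical user_info metric that always has the value 1 and carries a dept label:

sum(vpn_user_sessions) by (user) * on (user) group_left (dept) user_info

Because user_info is always 1, the multiplication leaves the session counts unchanged; its only effect is to copy the dept label onto the results.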

(I'm writing this all down because I wound up doing this recently and I want to capture what I learned before I forget how to do it.)

GrafanaMultiValueTables written at 23:16:46

Calculating usage over time in Prometheus (and Grafana)

Suppose, not hypothetically, that you have a metric that says whether something is in use at a particular moment in time, such as a SLURM compute node or a user's VPN connection, and you would like to know how used it is over some time range. Prometheus can do this, but you may need to get a little clever.

The simplest case is when your metric is 1 if the thing is in use and 0 if it isn't, and the metric is always present. Then you can compute the percentage of use over a time range as a 0.0 to 1.0 value by averaging it over the time range, and then get the amount of time (in seconds) it was in use by multiplying that by the duration of the range (in seconds):

avg_over_time( slurm_node_available[$__range] )
avg_over_time( slurm_node_available[$__range] ) * $__range_s

(Here $__range is the variable Grafana uses for the time range in some format for Prometheus, which has values such as '1d', and $__range_s is the Grafana variable for the time range in seconds.)
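
If you're experimenting outside Grafana, where the $__ variables don't exist, you can write the range out by hand; for the past week this would be something like:

avg_over_time( slurm_node_available[7d] )
avg_over_time( slurm_node_available[7d] ) * 604800

(604800 is 7 days in seconds, which is what $__range_s would be for a 7d range.)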

But suppose that instead of being 0 when the thing isn't in use, the metric is absent. For instance, you have metrics for SLURM node states that look like this:

slurm_node_state{ node="cpunode1", state="idle" }   1
slurm_node_state{ node="cpunode2", state="alloc" }  1
slurm_node_state{ node="cpunode3", state="drain" }  1

We want to calculate what percentage of the time a node is in the 'alloc' state. Because the metric may be missing some of the time, we can't just average it out over time any more; the average of a bunch of 1's and a bunch of missing metrics is 1. The simplest approach is to use a subquery, like this:

sum_over_time( slurm_node_state{ state="alloc" }[$__range:1m] ) /
   ($__range_s / 60)

We use a subquery instead of a simple time range so that we can control how many sample points there are over the time range, which gives us the divisor we need to work out the average. The relationship is that we explicitly set the subquery's range step (here 1 minute, ie 60 seconds) and then divide the total range duration by that step. If you change the range step, you must also change the divisor or you'll get wrong numbers, as I've experienced the hard way when I was absent-minded and didn't think this through.
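
To make the pairing explicit, here is the same query with a 5 minute (300 second) range step; note that the divisor changes to match:

sum_over_time( slurm_node_state{ state="alloc" }[$__range:5m] ) /
   ($__range_s / 300)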

If we want to know the total time in seconds that a node was allocated, we would multiply by the range step in seconds instead of dividing:

sum_over_time( slurm_node_state{ state="alloc" }[$__range:1m] ) * 60

Now let's suppose that we have a more complicated metric that isn't always 1 when the thing is active but that's still absent entirely when there's no activity (instead of being 0). As an example, I'll use the count of connections a user has to one of our VPN servers, which has a set of metrics like this:

vpn_user_sessions{ server="vpn1", user="cks" }  1
vpn_user_sessions{ server="vpn2", user="cks" }  2
vpn_user_sessions{ server="vpn1", user="fred" } 1

We want to work out the percentage of time or amount of time that any particular user has at least one connection to at least one VPN server. To do this, we need to start with a PromQL expression that is 1 when this condition is true. We'll use the same basic trick for crushing multiple metric points down to one that I covered in counting the number of distinct labels:

sum(vpn_user_sessions) by (user) > bool 0

The '> bool 0' turns any count of current sessions into 1. If the user has no sessions at the moment to any VPN servers, the metric will still be missing (and we can't get around that), so we still need to use a subquery to put this all together to get the percentage of usage:

sum_over_time(
   (sum(vpn_user_sessions) by (user) > bool 0)[$__range:1m]
) / ($__range_s / 60)

As before, if we want to know the amount of time in seconds that a user has had at least one VPN connection, we would multiply by 60 instead of doing the division. Also as before, the range step and the '60' in the division (or multiplication) are locked together; if you change the range step, you must change the other side of things.
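
Spelled out, the 'amount of time in seconds' version of the VPN query is the same subquery multiplied by the 60 second range step:

sum_over_time(
   (sum(vpn_user_sessions) by (user) > bool 0)[$__range:1m]
) * 60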

Sidebar: A subquery trick that doesn't work (and why)

On the surface, it seems like we could get away from the need to do our complicated division by using a more complicated subquery to supply a default value. You could imagine something like this:

 ( slurm_node_state{ state="alloc" } or vector(0) )[$__range:]

However, this doesn't work. If you try it interactively in the Prometheus query dashboard, you will probably see that you get a bunch of the metrics that you expect, which all have the value 1, and then one unusual one:

{} 0

The reason that 'or vector(0)' doesn't work is that we're asking Prometheus to be superintelligent, and it isn't. What we get with 'vector(0)' is a vector with a value of 0 and no labels. What we actually want is a collection of vectors with all of the valid labels that we don't already have as allocated nodes, and Prometheus can't magically generate that for us for all sorts of good reasons.

PrometheusCountUsageOverTime written at 00:09:48
