Using Prometheus subqueries to look for spikes in rates

March 6, 2019

One of the things that people traditionally want to do with graphs of things like network bandwidth or disk IO rates is to look for brief spikes in usage (or brief dips, depending on what you're hunting for). The easy case of this is when the resolution of your graph is high enough that it can display the instantaneous rate at every metric point; this will always capture and display both spikes and dips. When the resolution of your graph is not this high, the first thing you need to do is decide what you care about, because you have to condense the information from multiple metric points into one point in some way. If the instantaneous rates at the five sample points that are going to be displayed as one point are 5, 1, 5, 20, and 5, what do you opt to display? If you care about spikes, you want the maximum; if you care about dips, you want the minimum.

(If you care about both, you need to plot two separate lines or otherwise display more than a single piece of information for that point.)
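As a concrete illustration of that choice (in Python, which is my own addition and not part of the original query discussion), condensing those five hypothetical sample rates:

```python
# The five hypothetical instantaneous rates to condense into one point.
rates = [5, 1, 5, 20, 5]

print(max(rates))               # 20: what you want when hunting spikes
print(min(rates))               # 1: what you want when hunting dips
print(sum(rates) / len(rates))  # 7.2: averaging flattens both away
```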

For the rest of this I'm going to assume that we care about the maximum. In PromQL, the way to get the maximum value over a time range of a simple gauge metric like the 1-minute load average is the max_over_time function:

max_over_time( node_load1 [$__interval] )

Here the $__interval Grafana templating variable is standing in for whatever range we are condensing into one point on the graph.

However, a lot of rates of things like network bandwidth or disk IO are not represented in Prometheus metrics as instantaneous gauges of the activity level over some time range; instead their metrics are running counters (for good reason), and you need to use either rate() or irate() to turn them into a simple number. Before Prometheus introduced subqueries, you couldn't combine irate() with max_over_time(), so you couldn't do this query (at least as an ad-hoc thing).
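To make the counter-to-rate step concrete, here is a rough Python sketch (my own illustration, not Prometheus's actual code, with made-up sample data) of what irate() computes from a running counter: the per-second rate between the two most recent points in its range.

```python
def irate(samples):
    """Sketch of irate(): the per-second rate between the two most
    recent (timestamp, counter_value) samples in the range.
    (The real irate() also handles counter resets, ignored here.)"""
    (t1, v1), (t2, v2) = samples[-2], samples[-1]
    return (v2 - v1) / (t2 - t1)

# A running counter scraped every 15 seconds, with a burst at the end.
samples = [(0, 100), (15, 250), (30, 400), (45, 4000)]
print(irate(samples))  # (4000 - 400) / 15 = 240.0 per second
```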

With subqueries, we can write it out like this, assuming that our scrape interval is 15 seconds:

max_over_time( ( irate( our_counter_metric[45s] ) )[$__interval:10s] )

The 45s is the irate() sample range and 10s is the query step of the subquery; the $__interval is the same as before. Since this is a high-resolution (sub)query, we're using irate() for the reasons covered in rate() versus irate().

The irate() over 45 seconds is straightforward; there should always be at least two metric points within the past 45 seconds from whatever instant irate() is evaluated at, and irate() will use the two most recent ones to compute its per-second rate. Our 10 second query step is deliberately smaller than our scrape interval to be cautious, so that we're guaranteed that at least one irate() will use every pair of metric points in our interval. We're oversampling, but since we're just taking the maximum of all of the computed per-second rates it doesn't matter if we duplicate samples. It could matter if we were doing something more sophisticated, in which case you would have to figure out how careful you really need to be.
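The whole subquery evaluation can be sketched in Python (my own illustration with made-up sample data, not how Prometheus is implemented): evaluate irate() at every 10-second step across the interval, then take the maximum.

```python
def irate_at(samples, t, lookback=45):
    """Sketch of irate() evaluated at instant t: the per-second rate
    between the two most recent samples within [t - lookback, t].
    (Counter resets are ignored here.)"""
    window = [(ts, v) for ts, v in samples if t - lookback <= ts <= t]
    if len(window) < 2:
        return None
    (t1, v1), (t2, v2) = window[-2], window[-1]
    return (v2 - v1) / (t2 - t1)

# A counter scraped every 15 seconds; a burst happens between t=30 and t=45.
samples = [(0, 0), (15, 150), (30, 300), (45, 3300), (60, 3450)]

# Emulate max_over_time( (irate(...[45s]))[60s:10s] ) evaluated at t=60:
# irate() at every 10-second step over the last 60 seconds, then the
# maximum.  Because the 10s step is finer than the 15s scrape interval,
# every adjacent pair of samples is used by at least one step.
steps = [irate_at(samples, t) for t in range(10, 61, 10)]
rates = [r for r in steps if r is not None]
print(max(rates))  # → 200.0, the burst's per-second rate
```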

The 45s and 10s will vary depending on what your scrape interval is for the particular metric you care about. The general form is that the irate()'s sample range must be larger than twice your scrape interval, while the query step must be no larger than the scrape interval and might be somewhat smaller to be cautious.
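For example (this is my own hypothetical variant, using the same our_counter_metric name as before), a metric scraped every 60 seconds might use something like:

```
max_over_time( ( irate( our_counter_metric[150s] ) )[$__interval:45s] )
```

Here 150s is comfortably more than twice the 60-second scrape interval, and the 45s query step is somewhat smaller than it.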

Of course your scrape interval itself puts a limit on how brief a spike or dip you can detect at all, or detect reliably, especially for rates derived from counters. If you sample a counter every fifteen seconds, you can't really see a five-second burst.

(All of this was started by Why irate from Prometheus doesn't capture spikes, which posed a challenge about doing this with subqueries that sparked a Reddit comment from me here.)
