Using Prometheus subqueries to look for spikes in rates
One of the things that people traditionally want to do with graphs of things like network bandwidth or disk IO rates is to look for brief spikes in usage (or brief dips, depending on what you're hunting for). The easy case of this is when the resolution of your graph is high enough that it can display the instantaneous rate at every metric point; this will always capture and display both spikes and dips. When the resolution of your graph is not this high, the first thing you need to do is decide what you care about, because you have to condense the information from multiple metric points into one point in some way. If the instantaneous rates at the five sample points that are going to be displayed as one point are 5, 1, 5, 20, 5, what do you opt to display? If you care about spikes, you want the maximum; if you care about dips, you want the minimum.
(If you care about both, you need to plot two separate lines or otherwise display more than a single piece of information for that point.)
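As a small illustration of that choice, here is a sketch in Python (not PromQL) that condenses the five example rates into one displayed point in a few different ways. The variable names are mine and purely illustrative.

```python
# Instantaneous rates at the five sample points that share one graph point.
rates = [5, 1, 5, 20, 5]

spike = max(rates)                  # what to plot if you are hunting spikes
dip = min(rates)                    # what to plot if you are hunting dips
average = sum(rates) / len(rates)   # what a plain average would show instead

print(spike, dip, average)
```

The average lands between the two and shows neither the spike nor the dip, which is why you have to decide up front which one you are looking for.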
For the rest of this I'm going to assume that we care about the maximum. In PromQL, the way to get the maximum value over a time range of a simple gauge metric like the 1-minute load average is the max_over_time function:
max_over_time( node_load1 [$__interval] )
(Here the $__interval Grafana templating variable is standing in for whatever range we are condensing into one point on the graph.)
However, a lot of rates of things like network bandwidth or disk IO are not represented in Prometheus metrics as instantaneous gauges of the activity level over some time range; instead their metrics are running counters (for good reason), and you need to use either rate() or irate() to turn them into a simple number. Before Prometheus introduced subqueries, you couldn't apply max_over_time to the result of rate() or irate(), so you couldn't do this query (at least as an ad-hoc thing).
With subqueries, we can write it out like this, assuming that our scrape interval is 15 seconds:
max_over_time( ( irate( our_counter_metric[45s] ) )[$__interval:10s] )
The 45s is the irate() sample range and 10s is the query step of the subquery; the $__interval is the same as before. Since this is a high resolution (sub)query, we're using irate() rather than rate(), because irate() looks only at the two most recent points and so doesn't average a spike away over the whole range. Using irate() over 45 seconds is straightforward; there should always be at least two metric points within the past 45 seconds from whatever instant irate() is evaluated at, and irate() will use the two most recent ones to compute its per-second rate. Our 10 second query step is deliberately smaller than our scrape interval to be cautious, so that we're guaranteed that every adjacent pair of metric points in our interval is used by at least one irate() evaluation. We're oversampling, but since we're just taking the maximum of all of the computed per-second rates it doesn't matter if we duplicate samples. It could matter if we were doing something more sophisticated, in which case you would have to figure out how careful you really need to be.
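To make the reasoning about the query step concrete, here is a rough Python model of what the subquery does. It is a sketch under simplifying assumptions, not how Prometheus is implemented: the function names are mine, the range is treated as covering (t - window, t], evaluation times are simply the multiples of the step, and counter resets are ignored.

```python
def irate(samples, t, window):
    """Per-second rate from the last two samples in (t - window, t], or None."""
    inside = [(ts, v) for ts, v in samples if t - window < ts <= t]
    if len(inside) < 2:
        return None
    (t0, v0), (t1, v1) = inside[-2], inside[-1]
    return (v1 - v0) / (t1 - t0)


def max_irate_subquery(samples, start, end, window, step):
    """Model of max_over_time( irate(metric[window]) [(end - start):step] )."""
    rates = []
    t = (start // step + 1) * step  # first multiple of step after start
    while t <= end:
        r = irate(samples, t, window)
        if r is not None:
            rates.append(r)
        t += step
    return max(rates) if rates else None


# A counter scraped every 15 seconds; one interval (45s to 60s) has a spike.
increments = [15, 15, 15, 600, 15, 15, 15, 15]
samples = [(0, 0)]
for i, inc in enumerate(increments, start=1):
    samples.append((i * 15, samples[-1][1] + inc))

fine = max_irate_subquery(samples, 0, 120, window=45, step=10)
coarse = max_irate_subquery(samples, 0, 120, window=45, step=40)
print(fine, coarse)
```

With the 10 second step, some evaluation lands while the 45s and 60s points are the two most recent ones, so the spike's rate of 40 per second is found. With a 40 second step (larger than the scrape interval), every evaluation happens to see some other pair and the maximum comes out as 1 per second.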
The 45s and 10s will vary depending on what your scrape interval is for the particular metric you care about. The general form is that irate()'s sample range must be bigger than twice your scrape interval, while the query step must be no larger than the scrape interval and might be somewhat smaller to be cautious.
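If you generate dashboards or queries from a script, the rule of thumb can be written down as a small helper. This is only a sketch: spike_query() is a hypothetical function of mine, and its defaults (three times the scrape interval for the range, two thirds of it for the step) are arbitrary choices that happen to reproduce the 45s and 10s used here for a 15 second scrape interval.

```python
def spike_query(metric, scrape_s, range_s=None, step_s=None,
                interval="$__interval"):
    """Build the max_over_time-of-irate subquery for a given scrape interval.

    All durations are in whole seconds. The defaults are illustrative
    assumptions, not anything Prometheus or Grafana prescribes.
    """
    if range_s is None:
        range_s = 3 * scrape_s
    if step_s is None:
        step_s = max(1, (2 * scrape_s) // 3)
    if range_s <= 2 * scrape_s:
        raise ValueError("irate() range must be bigger than twice the scrape interval")
    if step_s > scrape_s:
        raise ValueError("subquery step must be no larger than the scrape interval")
    return (f"max_over_time( ( irate( {metric}[{range_s}s] ) )"
            f"[{interval}:{step_s}s] )")


print(spike_query("our_counter_metric", 15))
print(spike_query("our_counter_metric", 60))
```

For a 60 second scrape interval this produces a 180s range and a 40s step; you can pass range_s and step_s explicitly if you want something tighter or more cautious.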
Of course your scrape interval itself puts a limit on how brief a spike or dip you can detect at all or detect reliably, especially for rates that are counters. If you sample a counter every fifteen seconds, you can't really see a five second burst.
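Here is a quick Python sketch of why that is. The numbers are invented for illustration: a counter that is idle except for a single five second burst at 300 units per second, scraped every fifteen seconds.

```python
SCRAPE = 15  # seconds between scrapes


def counter_at(t):
    """Counter value at time t, with one burst of 300 units/sec from t=20 to t=25."""
    burst_start, burst_end, burst_rate = 20, 25, 300
    active = min(max(t - burst_start, 0), burst_end - burst_start)
    return active * burst_rate


# What Prometheus would have stored: one sample per scrape.
samples = [(t, counter_at(t)) for t in range(0, 61, SCRAPE)]

# The best possible per-second rates: one per adjacent pair of samples.
rates = [(v1 - v0) / (t1 - t0)
         for (t0, v0), (t1, v1) in zip(samples, samples[1:])]
observed_peak = max(rates)
print(observed_peak)
```

The burst really ran at 300 units per second, but the highest rate you can recover from the samples is 100 per second, because the 1500 units show up as the change over a whole fifteen second interval. If the burst had straddled a scrape, it would have been split across two intervals and looked smaller still.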
(All of this was started by "Why irate from Prometheus doesn't capture spikes", which posed a challenge about doing this with subqueries and sparked a Reddit comment from me.)