Some uses for Prometheus's resets() function
One of the functions available in Prometheus's PromQL
query language is resets(), which is described this way in the
documentation:
For each input time series, resets(v range-vector) returns the number of counter resets within the provided time range as an instant vector. Any decrease in the value between two consecutive samples is interpreted as a counter reset.
resets should only be used with counters.
Up until recently I've ignored resets()
because knowing when a
counter had reset didn't seem particularly useful to me. This changed
due to roidelapluie's comment about it on this entry (I'll get to it), which caused me
to start thinking about resets()
in general. But before I get to
its possible uses, there's an important qualification on the
documentation.
If a time series disappears for a while then reappears, the last
value from before the disappearance is consecutive with the first
value after it reappears. All that resets()
cares about is the
stream of values when a time series exists, so if the post-reappearance
value is lower than the pre-disappearance value, this is counted as
a reset. Much like the changes()
function,
the code that evaluates this is completely blind to there even being
periods of time where the time series isn't there. By extension, if
a counter time series disappears for a while but comes back with the
same value or higher, this won't be considered a reset.
(This makes reasonable sense if you think about scrape failures. You don't really want a scrape failure or a series of them to make Prometheus declare that counters have reset.)
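To make the gap behavior concrete, here is a minimal Go sketch. It is not the Prometheus code; the sample type and the countResets helper are hypothetical names made up for illustration. The point is only that the scan compares neighboring values and never looks at timestamps, so a long gap changes nothing.

```go
package main

import "fmt"

// sample is a hypothetical stand-in for one scraped point:
// a timestamp in seconds and a value.
type sample struct {
	T int64
	V float64
}

// countResets compares each value with the one before it.
// It deliberately never looks at the timestamps, so a gap in
// the series is invisible to it.
func countResets(points []sample) int {
	if len(points) == 0 {
		return 0
	}
	resets := 0
	prev := points[0].V
	for _, p := range points[1:] {
		if p.V < prev {
			resets++
		}
		prev = p.V
	}
	return resets
}

func main() {
	// The series is absent between t=30 and t=300 and comes back lower.
	lower := []sample{{0, 100}, {15, 110}, {30, 120}, {300, 5}, {315, 9}}
	// The same gap, but the series comes back higher.
	higher := []sample{{0, 100}, {15, 110}, {30, 120}, {300, 125}, {315, 130}}

	fmt.Println(countResets(lower))  // 1
	fmt.Println(countResets(higher)) // 0
}
```

The first series counts one reset across the gap and the second counts none, even though both were missing for the same stretch of time.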
As pointed out by roidelapluie, the first thing you can do with
resets()
is apply it to a continuous metric that's either 0 or
1, such as a success metric from Blackbox, in order to count
how many times the time series has started to fail over a time
interval (or more generally has gone to 0). Resets()
is blind to
what type the metric is, so when probe_success
goes from 1 to
0 it will happily consider this a counter reset and count it for you.
(This won't work so well if the metric can take on additional
non-zero values, because then every time the value goes down
resets()
will think it reset, even if it didn't go all the way
down to 0. There are workarounds for this.)
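A small Go sketch of both cases, using made-up sample values rather than real Blackbox data (countResets is a hypothetical helper, not a Prometheus function). It applies the same "any decrease counts" rule to a 0/1 series and to a series that can take other values.

```go
package main

import "fmt"

// countResets counts every decrease between consecutive values,
// which is the rule the documentation describes for resets().
func countResets(values []float64) int {
	resets := 0
	for i := 1; i < len(values); i++ {
		if values[i] < values[i-1] {
			resets++
		}
	}
	return resets
}

func main() {
	// A 0/1 metric in the style of probe_success: it starts failing twice.
	probe := []float64{1, 1, 0, 0, 1, 0, 1}
	fmt.Println(countResets(probe)) // 2

	// A metric with more values: it reaches 0 only once,
	// but every step down is counted.
	multi := []float64{3, 2, 1, 0, 3}
	fmt.Println(countResets(multi)) // 3
}
```

For the 0/1 series the count matches the number of times it went to 0. For the multi-valued series the count is three even though it hit 0 only once.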
Another thing you can do is apply resets()
to non-continuous
metrics that drop (or reset) when something interesting happens.
For example, suppose you have a metric that is how many packets a
VPN user's session has transmitted or received (with the time series
for a user being absent if they have no sessions). You can be pretty
certain that when a session is shut down and a new one started up,
the new session will have a lower packet count than the old one.
Given this, you can use resets()
to count more or less how many
times the user shut down their old VPN connection and made a new
one over a time range, even if there was some time between the old
session being shut down and the new one starting.
(As mentioned in my entry on changes(),
it's probably better if you have a metric for the start time of a
user's VPN session. But you may not always have that data, especially
in a Prometheus metric.)
Applying resets()
to a continuous metric that counts up until
something happens will obviously tell you how many times that thing
has happened over your time range. However, most counter metrics are
associated with more directly usable indicators of things like host
reboots or process restarts, and most gauge metrics are too likely
to go down on their own instead of resetting.
The final use for resets()
I can think of is telling you how many
times a gauge metric goes down (as always, over some time range). This
can be combined with changes()
to let you determine how many times
a gauge metric has gone up over a time range:
changes(some_metric[1h]) - resets(some_metric[1h])
I don't know if Prometheus will optimize this to only load the time
series points for some_metric[1h]
into memory once, or if it
will do two loads (one for changes(), one for resets()). This
might be an especially relevant thing to consider if you're using
a subquery instead of a simple metric lookup.
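The arithmetic behind that subtraction can be checked with a short Go sketch over made-up gauge values. The countChanges and countResets helpers here are hypothetical and assume there are no NaN values in the series: every change is either a step down (which resets() counts) or a step up, so what is left after subtracting is the number of increases.

```go
package main

import "fmt"

// countChanges counts every pair of consecutive values that differ.
// This sketch assumes the series contains no NaNs.
func countChanges(values []float64) int {
	n := 0
	for i := 1; i < len(values); i++ {
		if values[i] != values[i-1] {
			n++
		}
	}
	return n
}

// countResets counts every decrease between consecutive values.
func countResets(values []float64) int {
	n := 0
	for i := 1; i < len(values); i++ {
		if values[i] < values[i-1] {
			n++
		}
	}
	return n
}

func main() {
	gauge := []float64{5, 7, 7, 3, 4, 9, 2}

	changes := countChanges(gauge) // 5: every step except 7 -> 7
	resets := countResets(gauge)   // 2: 7 -> 3 and 9 -> 2
	fmt.Println(changes - resets)  // 3: 5 -> 7, 3 -> 4 and 4 -> 9
}
```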
Sidebar: resets() isn't less efficient if there are lots of resets
As far as I can tell from the code, resets()
is just as efficient
on metrics that have lots of resets as it is on metrics with only
a few of them. So is changes(). In both cases, the code does a
simple scan over all of the points in a time series and counts up
the number of times it sees the relevant thing happening. The
relevant Go code for resets()
is small enough to just put here:
resets := 0
prev := samples.Points[0].V
for _, sample := range samples.Points[1:] {
	current := sample.V
	if current < prev {
		resets++
	}
	prev = current
}
The changes() code is the same thing with the condition changed
slightly (including to account for the fact that NaNs don't compare
equal to each other). You can find all of the code in promql/functions.go.