Counting how many times something started or stopped failing in Prometheus
When I recently wrote about Prometheus's changes() function and its
resets() function, I left a larger-scale question not entirely
answered. Suppose that you have a metric that is either 0 or 1, such
as Blackbox's probe_success, and you want to know either how many
times it has started failing or how many times it has stopped failing
over a time interval.
Counting how many times a probe_success time series has started
to fail over a time interval is simple. As explained at greater length
in the resets() entry, we can simply use it:
resets( probe_success [1d] )
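To see why this works, here is a minimal Python sketch that models what resets() does over one series' samples. The sample values are made up for illustration, and the function is only my approximation of the PromQL behavior (count every decrease between consecutive samples), not Prometheus's actual code.

```python
def resets(samples):
    """Count decreases between consecutive samples, roughly as resets() does."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur < prev)


# Hypothetical probe_success samples across the range, oldest first.
samples = [1, 1, 0, 0, 1, 0, 1, 1]

# For a 0/1 metric, every decrease is a 1 -> 0 transition,
# which is the probe starting to fail.
started_failing = resets(samples)
print(started_failing)
```

Since a 0/1 metric can only go down by going from 1 to 0, counting decreases is the same as counting how many times the probe started failing.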
We can't do this any more efficiently than with resets(), because
no matter what we do Prometheus has to scan all of the time series
values across our one day range. The only way this could be more
efficient would be if Prometheus gained some general feature to
stream through all of the time series points it has to look at over
that one day span, instead of loading them all into memory.
Counting how many times a probe_success time series has started
to succeed (after a failure) over the time interval is potentially
more complex, depending on how much you care about efficiency. The
straightforward answer is to use changes() to count how many times
it has changed state between success and failure and then use
resets() to subtract how many times it started to fail:
changes( probe_success [1d] ) - resets( probe_success [1d] )
But unless Prometheus optimizes things, this will load one day's
worth of every probe_success time series twice, first for
changes() and then again for resets().
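As a sanity check on the subtraction, here is a small Python sketch that models changes() and resets() over a made-up list of samples. Both functions are my simplified stand-ins for the PromQL ones, and the sample values are hypothetical.

```python
def changes(samples):
    """Count value changes between consecutive samples, roughly as changes() does."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur != prev)


def resets(samples):
    """Count decreases between consecutive samples, roughly as resets() does."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur < prev)


# Hypothetical probe_success samples across the range, oldest first.
samples = [1, 1, 0, 0, 1, 0, 1, 1]

# Every change is either a 1 -> 0 (counted by resets) or a 0 -> 1,
# so what is left after subtracting is the number of recoveries.
started_succeeding = changes(samples) - resets(samples)
print(changes(samples), resets(samples), started_succeeding)
```

Note that the sketch walks the sample list twice, once per function, which mirrors the double load of the range in the query.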
One approach to avoiding this extra load is to count changes and
divide by two, but this goes wrong if the probe started out in a
different state than it finished. If this happens, changes() will
be odd and we will have a fractional count, which needs to be
rounded down if the probe started out succeeding and rounded up if
the probe started out failing. We can apparently achieve our desired
rounding in a simple, brute force way as follows:
floor( ( changes( probe_success[1d] ) + probe_success )/2 )
What this does at one level is add one to changes() if the probe
was succeeding at the end of the time period. This extra change
doesn't matter if the probe also started out succeeding, because then
changes() will be even, the addition will make it odd, and then
dividing by two and flooring will ignore the addition. But if the
probe started out failing and ended up succeeding, changes() will
be odd, the addition will make it even, and dividing by two will
'round up'.
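The parity argument can be checked by brute force. The following Python sketch is my own check under simplifying assumptions: changes() is modeled as a count of value changes between consecutive samples, and the last sample stands in for the instant value of probe_success. It tries every 0/1 sequence of up to ten samples and compares the formula against a direct count of 0 to 1 transitions.

```python
from itertools import product


def changes(samples):
    """Count value changes between consecutive samples, roughly as changes() does."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur != prev)


def recoveries(samples):
    """Directly count 0 -> 1 transitions, the number we actually want."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur > prev)


def estimate(samples):
    """Model of floor((changes(x[1d]) + x) / 2), with samples[-1] as the current value."""
    return (changes(samples) + samples[-1]) // 2


# Collect every 0/1 sequence of length 1 through 10 where the formula is wrong.
mismatches = [
    seq
    for length in range(1, 11)
    for seq in product((0, 1), repeat=length)
    if estimate(seq) != recoveries(seq)
]
print(len(mismatches))
```

This only covers series that are present for the whole range, including at the end.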
However, this has the drawback that it will completely ignore time
series that didn't exist at the end of the time period. Because addition
in Prometheus does set intersection and the disappeared time series aren't
present in the right side set, their changes() disappears entirely.
As far as I can see, there is no way out of this that avoids a second
full scan of your metric over the time range. At that point you might as
well use resets().
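Here is a rough Python model of why the vanished series drop out. It treats each side of the addition as a dictionary keyed by label set and keeps only the keys present on both sides, which is my simplification of Prometheus's default one-to-one vector matching. The instance names and values are hypothetical.

```python
import math


def add_vectors(left, right):
    """Model one-to-one matching: keep only label sets present on both sides."""
    return {
        labels: left[labels] + right[labels]
        for labels in left.keys() & right.keys()
    }


# Hypothetical results of changes(probe_success[1d]) for two targets.
changes_1d = {"instance=a": 3, "instance=b": 2}

# Hypothetical current probe_success; instance=b is no longer being probed,
# so it has no current value at all.
current = {"instance=a": 1}

result = {
    labels: math.floor(value / 2)
    for labels, value in add_vectors(changes_1d, current).items()
}
print(result)
```

In this model instance=b has two changes over the day but never appears in the result, because there is nothing on the right side to match it against.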
For dashboard display purposes you might accept the simple
'changes()/2' approach with no clever compensation for odd
changes() values, and add a note about why the numbers could have
a <N>.5 value. Not everything on your dashboards and graphs has to
be completely, narrowly correct all of the time even at the cost of
significant overhead.
(This is one of the entries I'm writing partly for my future self. I'd hate to have to rederive all of this logic in the future when I already did it once.)

