## Counting how many times something started or stopped failing in Prometheus

April 12, 2021

When I recently wrote about Prometheus's `changes()` and `resets()` functions, I left a larger question not entirely answered. Suppose that you have a metric that is either 0 or 1, such as Blackbox's `probe_success`, and you want to know either how many times it started failing or how many times it stopped failing over a time interval.

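Counting how many times a `probe_success` time series has started to fail over a time interval is simple. As explained at greater length in the `resets()` entry, `resets()` counts every decrease in value, and for a 0 or 1 metric the only possible decrease is going from 1 to 0, so we can simply use it: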

```
resets( probe_success [1d] )
```

We can't do this any more efficiently than with `resets()`, because no matter what we do Prometheus has to scan all of the time series values across our one day range. The only way this could be more efficient would be if Prometheus gained some general feature to stream through all of the time series points it has to look at over that one-day span, instead of loading them all into memory.
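To make concrete what `resets()` is counting here, this is a minimal Python sketch that models it on a plain list of samples. The function name `count_resets` and the sample values are made up for illustration; this only mimics the "count every decrease" behavior, not Prometheus's actual implementation.

```python
def count_resets(samples):
    """Count decreases between consecutive samples, as resets() does over a range."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur < prev)


# Hypothetical probe_success samples: up, up, down, down, up, down.
samples = [1, 1, 0, 0, 1, 0]
print(count_resets(samples))  # 2: the probe started failing twice
```

For a 0 or 1 metric, every decrease is a 1 to 0 transition, which is why the count is the number of times the probe started failing.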

Counting how many times a `probe_success` time series has started to succeed (after a failure) over the time interval is potentially more complex, depending on how much you care about efficiency. The straightforward answer is to use `changes()` to count how many times it has changed state between success and failure and then subtract the `resets()` count of how many times it started to fail:

```
changes( probe_success [1d] ) - resets( probe_success [1d] )
```

But unless Prometheus optimizes things, this will load one day's worth of every `probe_success` time series twice, first for `changes()` and then again for `resets()`.
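The subtraction works because every change in a 0 or 1 series is either a 1 to 0 drop or a 0 to 1 recovery. Here is a small Python sketch that checks this on one made-up sample list; the helper names are hypothetical and only model the counting, not how Prometheus evaluates the query.

```python
def count_changes(samples):
    """Count value changes between consecutive samples, like changes()."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur != prev)


def count_resets(samples):
    """Count decreases between consecutive samples, like resets()."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur < prev)


def count_recoveries(samples):
    """Directly count 0 -> 1 transitions, which is what we actually want."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur > prev)


# Hypothetical probe_success samples over the range.
samples = [0, 1, 1, 0, 1, 0, 0, 1]
changes = count_changes(samples)
resets = count_resets(samples)
assert changes - resets == count_recoveries(samples)
print(changes, resets, changes - resets)  # 5 2 3
```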

One approach to avoiding this extra load is to count changes and divide by two, but this goes wrong if the probe started out in a different state than it finished. If this happens, `changes()` will be odd and we will have a fractional success, which needs to be rounded down if the probe started out succeeding and rounded up if the probe started out failing. We can apparently achieve our desired rounding in a simple, brute force way as follows:

```
floor( ( changes( probe_success[1d] ) + probe_success )/2 )
```

What this does at one level is add one to `changes()` if the probe was succeeding at the end of the time period. This extra one doesn't matter if the probe also started out succeeding, because then `changes()` will be even, the addition will make it odd, and then dividing by two and flooring will ignore the addition. But if the probe started out failing and ended up succeeding, `changes()` will be odd, the addition will make it even, and dividing by two will 'round up'. If the probe ended up failing, nothing is added; an odd `changes()` then means the probe started out succeeding, and flooring rounds down as we want.
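Since this sort of parity argument is easy to get subtly wrong, here is a brute force Python check over every 0 or 1 sequence up to ten samples long. It is a sketch under the assumption that the instant value of `probe_success` is simply the last sample in the range; the helper names are hypothetical.

```python
from itertools import product


def count_changes(samples):
    """Count value changes between consecutive samples, like changes()."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur != prev)


def count_recoveries(samples):
    """Directly count 0 -> 1 transitions."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur > prev)


def recoveries_via_rounding(samples):
    """Mirror floor((changes(...) + probe_success) / 2), using the last sample."""
    return (count_changes(samples) + samples[-1]) // 2


# Try every possible 0/1 history of 1 to 10 samples.
for length in range(1, 11):
    for samples in product((0, 1), repeat=length):
        assert recoveries_via_rounding(samples) == count_recoveries(samples)
```

The loop finishing without an assertion error says the rounding trick agrees with a direct count in all of these cases, as long as the series has a current value to add.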

However, this has the drawback that it will completely ignore time series that didn't exist at the end of the time period. Because addition in Prometheus only produces results for label sets that are present on both sides (in effect a set intersection) and the disappeared time series aren't present in the right side set, their `changes()` disappears entirely. As far as I can see, there is no way out of this that avoids a second full scan of your metric over the time range. At that point you might as well use `resets()`.
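The vanishing time series problem can be illustrated with a small Python model where each instant vector is a dict keyed by label set. The label strings and values are invented, and `add_vectors` is a hypothetical stand-in for Prometheus's one-to-one matching on `+`, where only label sets present on both sides produce a result.

```python
import math

# Hypothetical instant vectors, keyed by a label-set string.
changes_1d = {"instance=a": 3, "instance=b": 2, "instance=gone": 4}
probe_success_now = {"instance=a": 1, "instance=b": 1}  # "gone" has no current sample


def add_vectors(left, right):
    """Model one-to-one matching: keep only label sets found on both sides."""
    return {labels: left[labels] + right[labels] for labels in left.keys() & right.keys()}


summed = add_vectors(changes_1d, probe_success_now)
recoveries = {labels: math.floor(value / 2) for labels, value in summed.items()}
print(sorted(recoveries.items()))  # [('instance=a', 2), ('instance=b', 1)]
```

The `instance=gone` series had four changes over the range, but because it has no current `probe_success` value it drops out of the result entirely.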

For dashboard display purposes you might accept the simple '`changes()/2`' approach with no clever compensation for odd `changes()` values, and add a note about why the numbers could have a <N>.5 value. Not everything on your dashboards and graphs has to be completely, narrowly correct all of the time even at the cost of significant overhead.

(This is one of the entries I'm writing partly for my future self. I'd hate to have to re-derive all of this logic in the future when I already did it once.)
