Counting how many times something started or stopped failing in Prometheus

April 12, 2021

When I recently wrote about Prometheus's changes() and resets() functions, I left a larger question not entirely answered. Suppose that you have a metric that is either 0 or 1, such as Blackbox's probe_success, and you want to know either how many times it's started failing or how many times it's stopped failing over a time interval.

Counting how many times a probe_success time series has started to fail over a time interval is simple. As explained at more length in the resets() entry, we can simply use resets():

resets( probe_success [1d] )
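To make the counting concrete, here is a small Python sketch that models what resets() does to the raw samples of a single series. This is my own simulation of the semantics (count every sample that is lower than the one before it), not Prometheus code, and the sample list is made up for illustration.

```python
def resets(samples):
    """Count samples that are lower than the sample just before them.

    This models PromQL's resets() on one series. For a 0/1 metric,
    each drop is a 1 -> 0 transition, i.e. the probe starting to fail.
    """
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur < prev)


# Hypothetical probe_success samples: up, fails, recovers, fails again.
samples = [1, 1, 0, 0, 1, 1, 0]
print(resets(samples))  # prints 2
```

A series that starts out failing and then recovers has no drops, so it counts as zero 'started failing' events in the range.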

We can't do this any more efficiently than with resets(), because no matter what we do Prometheus has to scan all of the time series values across our one day range. The only way this could be more efficient would be if Prometheus gained some general feature to stream through all of the time series points it has to look at over that one-day span, instead of loading them all into memory.

Counting how many times a probe_success time series has started to succeed (after a failure) over the time interval is potentially more complex, depending on how much you care about efficiency. The straightforward answer is to use changes() to count how many times it has changed state between success and failure and then subtract resets(), which is how many times it started to fail:

changes( probe_success [1d] ) - resets( probe_success [1d] )

But unless Prometheus optimizes things, this will load one day's worth of every probe_success time series twice, first for changes() and then again for resets().
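The subtraction works because on a 0/1 series every change is either a drop (a new failure) or a rise (a recovery). The following Python sketch models changes() and resets() on one series and compares the difference with a direct count of 0 to 1 transitions. The function bodies are my simplified models of the PromQL functions and the samples are hypothetical.

```python
def changes(samples):
    """Model of PromQL changes(): count adjacent samples that differ."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur != prev)


def resets(samples):
    """Model of PromQL resets(): count samples lower than their predecessor."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur < prev)


def recoveries(samples):
    """Directly count 0 -> 1 transitions, the number we actually want."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur > prev)


# Hypothetical samples: starts failing, recovers twice, ends failing.
samples = [0, 1, 1, 0, 1, 0, 0]
print(changes(samples), resets(samples), recoveries(samples))  # prints 4 2 2
assert changes(samples) - resets(samples) == recoveries(samples)
```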

One approach to avoiding this extra load is to count changes and divide by two, but this goes wrong if the probe started out in a different state than it finished. If this happens, changes() will be odd and we will have a fractional success, which needs to be rounded down if the probe started out succeeding and rounded up if the probe started out failing. We can apparently achieve our desired rounding in a simple, brute force way as follows:

floor( ( changes( probe_success[1d] ) + probe_success )/2 )

What this does at one level is add one to changes() if the probe was succeeding at the end of the time period. This extra change doesn't matter if the probe started out succeeding, because then changes() will be even, the addition will make it odd, and then dividing by two and flooring will ignore the addition. But if the probe started out failing and ended up succeeding, changes() will be odd, the addition will make it even, and dividing by two will 'round up'.
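If you don't trust the parity argument, it can be checked by brute force. This Python sketch tries every 0/1 sample sequence up to a small length and compares the floor() expression against a direct count of recoveries. It assumes that the bare probe_success in the query evaluates to the last sample in the range, which is what I'd expect for a series that is still being scraped; the helper names are my own.

```python
from itertools import product


def changes(samples):
    """Model of PromQL changes(): count adjacent samples that differ."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur != prev)


def recoveries(samples):
    """Directly count 0 -> 1 transitions."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur > prev)


def estimated_recoveries(samples):
    """floor((changes + final value) / 2), done with integer floor division.

    Assumption: the instant value of the metric is the last sample in range.
    """
    return (changes(samples) + samples[-1]) // 2


def formula_holds(max_len=8):
    """Check every 0/1 sample sequence of length 1..max_len."""
    return all(
        estimated_recoveries(list(seq)) == recoveries(list(seq))
        for n in range(1, max_len + 1)
        for seq in product((0, 1), repeat=n)
    )


print(formula_holds())  # prints True
```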

However, this has the drawback that it will completely ignore time series that didn't exist at the end of the time period. Because addition in Prometheus is done on a set intersection basis (only time series present on both sides of the + produce a result) and the disappeared time series aren't present in the right side set, their changes() disappears entirely. As far as I can see, there is no way out of this that avoids a second full scan of your metric over the time range. At that point you might as well use resets().
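The way a disappeared series drops out can be illustrated with a toy model of the matching that + does between two instant vectors. This is a simplified Python sketch, not how Prometheus is implemented: real series are matched on their full label sets, while here each series is reduced to a single hypothetical instance name, and all of the values are invented.

```python
def add_vectors(left, right):
    """Model one-to-one matching for '+': keep only series on both sides."""
    return {
        labels: left[labels] + right[labels]
        for labels in left
        if labels in right
    }


# Hypothetical results of changes(probe_success[1d]), keyed by instance.
changes_1d = {"web1": 3, "web2": 2, "old-host": 5}
# Hypothetical current probe_success values; 'old-host' no longer exists.
current = {"web1": 1, "web2": 0}

print(add_vectors(changes_1d, current))  # prints {'web1': 4, 'web2': 2}
```

The five changes for 'old-host' never reach the division and floor() at all, which is why its recoveries go uncounted.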

For dashboard display purposes you might accept the simple 'changes()/2' approach with no clever compensation for odd changes() values, and add a note about why the numbers could have a <N>.5 value. Not everything on your dashboards and graphs has to be completely, narrowly correct all of the time even at the cost of significant overhead.

(This is one of the entries I'm writing partly for my future self. I'd hate to have to re-derive all of this logic in the future when I already did it once.)

Written on 12 April 2021.

Last modified: Mon Apr 12 00:01:41 2021