An example of how Prometheus's delta() function will extrapolate time ranges

January 20, 2024

Recently, someone came to the Prometheus mailing list with an interesting issue: they were using Prometheus's delta() function to look at the amount of change over some time range, but were getting results that they didn't expect. They had a relatively slowly changing metric where they could look at the value at the start of a fifteen minute time interval, the value at the end of it, and the delta() result for 'metric[15m]', and the delta() result didn't equal the difference between those two values; it was instead visibly higher. To make things more confusing, this was a frequently scraped metric, collected every fifteen seconds.

What is happening is explained by an innocent sounding sentence in the documentation for delta():

The delta is extrapolated to cover the full time range as specified in the range vector selector, so that it is possible to get a non-integer result even if the sample values are all integers.

(Similar wording is in the documentation for both increase() and everyone's favorite, rate().)

The way to get raw time series data from Prometheus, including its timestamps, is with an instant query whose result is a range vector, such as 'metric[15m]'. You can do this in the web interface or via 'promtool query instant', and the person asking for help shared the results of their query:

promtool query instant --time "$(date -d'2024-01-18 14:15:00 UTC' +%s)" $url 'metric[15m]'
9732212 @[1705586407.092]
[...]
9848219 @[1705587292.092]

The true difference between the first and the last sample values is 116007 (9848219 minus 9732212), but delta() reported its result as '117973.22033898304':

promtool query instant --time "$(date -d'2024-01-18 14:15:00 UTC' +%s)" $url 'delta(metric [15m])'
{} => 117973.22033898304 @[1705587300]

Surprisingly, this is what you would actually expect from the query results, and we can work through this from the raw information we have. The fifteen minute time range covers 14:00:00 UTC to 14:15:00 UTC, but the first actual time series point in it was from 14:00:07 UTC and the last was from 14:14:52 UTC. This means that delta() will extrapolate out to cover 15 more seconds than the samples in the range vector cover (roughly 7 seconds at the start and 8 seconds at the end), and the samples themselves cover 15 minutes less 15 seconds, or 885 seconds.

(We can also get the coverage of the range vector from subtracting the first timestamp from the last one; this also gives us 885 seconds.)
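If you'd rather not do the timestamp arithmetic by hand, here is a small Python sketch of it. The timestamps and the query time are taken from the promtool output above; the variable names are just my own choices for illustration.

```python
# Values copied from the promtool output above.
query_time = 1705587300.0   # 14:15:00 UTC, the end of the range
range_seconds = 15 * 60     # the [15m] in the range vector selector
first_ts = 1705586407.092   # timestamp of the first sample
last_ts = 1705587292.092    # timestamp of the last sample

range_start = query_time - range_seconds

gap_at_start = first_ts - range_start   # uncovered time before the first sample
gap_at_end = query_time - last_ts       # uncovered time after the last sample
covered = last_ts - first_ts            # time actually spanned by the samples

print(f"gap at start: {gap_at_start:.3f}s")
print(f"gap at end:   {gap_at_end:.3f}s")
print(f"covered:      {covered:.3f}s")
```

The two gaps come out at slightly over 7 seconds and slightly under 8 seconds, which together account for the missing 15 seconds.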

Turning to bc (or the calculator of your choice), we can calculate first the scaling factor of the extrapolation and then the actual numerical result:

$ bc -l
(15*60) / 885
1.01694915254237288135

(( 15 * 60 ) / 885 ) * 116007
117973.22033898305084676945
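The same calculation can be written as a small Python function. This is only a simplified sketch of the extrapolation as described here, not Prometheus's actual code; simple_delta() is a hypothetical name of mine, and as far as I know the real implementation has extra rules (for example, limiting how far it extrapolates when the gaps at the ends are large) that this ignores.

```python
def simple_delta(first, last, range_seconds):
    """Scale the raw first-to-last difference up to the full range.

    first and last are (timestamp, value) pairs. This is a simplified
    model that assumes delta() always extrapolates to both ends of the
    range, which is an assumption and not the full Prometheus logic.
    """
    first_ts, first_val = first
    last_ts, last_val = last
    covered = last_ts - first_ts        # seconds spanned by the samples
    raw_difference = last_val - first_val
    return raw_difference * (range_seconds / covered)


# The first and last samples from the promtool output.
result = simple_delta(
    (1705586407.092, 9732212),
    (1705587292.092, 9848219),
    15 * 60,
)
print(result)
```

Up to floating point rounding in the last digits, this should match both the bc result and what delta() reported.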

It certainly feels weird that a mere fifteen second gap in a (nominally) fifteen minute range can cause such a clear difference, but that's how it works out. The absolute difference will be smaller if the numbers involved are smaller, but for a given gap in a given range, the ratio between the delta() result and the true difference will always be the same.

(You might also feel that a fifteen second scrape interval should be fast enough to avoid this sort of issue but again, it's clearly not the case on a fifteen minute range, or even a smaller one. This may especially be an issue if you're doing rate() on relatively small time ranges as part of a Grafana graph.)
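To get a feel for how this changes with the size of the range, here is a rough Python sketch that computes the scaling factor for a few range durations. It assumes that the total gap is always 15 seconds, as it was in this example, and that Prometheus always extrapolates across the whole gap; both are assumptions for illustration, and scaling_factor() is a made-up helper, not anything from Prometheus.

```python
def scaling_factor(range_seconds, total_gap_seconds=15):
    """Ratio between the extrapolated result and the raw difference.

    Assumes the samples cover everything except total_gap_seconds of
    the range, and that the result is scaled up to the full range.
    """
    return range_seconds / (range_seconds - total_gap_seconds)


for label, seconds in [("1m", 60), ("5m", 300), ("15m", 900), ("1h", 3600)]:
    factor = scaling_factor(seconds)
    percent_higher = (factor - 1) * 100
    print(f"[{label}] factor {factor:.4f} ({percent_higher:.2f}% higher)")
```

Under these assumptions the inflation is under 2% for the fifteen minute range here, a bit over 5% for a five minute range, and a third for a one minute range, which is why short ranges in graphs are where this tends to be most visible.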

