An example of how Prometheus's delta() function will extrapolate time ranges
Recently, someone came to the Prometheus mailing list with an
interesting issue they were having, where they were using Prometheus's
delta() function to look at the amount of change over some time range, but
were getting results that they didn't expect. They had a relatively
slow changing metric where they could look at the value at the start
of a fifteen minute time interval, the value at the end of it, and
the delta() result for 'metric[15m]', but the delta() result didn't
equal the difference they saw; it was instead visibly higher. To
make things more confusing, this was a frequently scraped metric,
collected every fifteen seconds.
What is happening is explained by an innocent sounding sentence
in the documentation for delta():
The delta is extrapolated to cover the full time range as specified in the range vector selector, so that it is possible to get a non-integer result even if the sample values are all integers.
(Similar wording is in the documentation for both increase()
and everyone's favorite, rate().)
The way to get raw time series data out of Prometheus, timestamps included, is with an instant query whose result is a range vector, such as 'metric[15m]'. You can do this in the web interface or via 'promtool query instant', and the person asking for help shared the results of their query:
promtool query instant --time "$(date -d'2024-01-18 14:15:00 UTC' +%s)" $url 'metric[15m]'
9732212 @[1705586407.092]
[...]
9848219 @[1705587292.092]
The true difference between the first and the last metric point is 116007, but delta() reported its result as '117973.22033898304':
promtool query instant --time "$(date -d'2024-01-18 14:15:00 UTC' +%s)" $url 'delta(metric [15m])'
{} => 117973.22033898304 @[1705587300]
Surprisingly, this is what you would actually expect from the query results, and we can work through this from the raw information we have. The fifteen minute time range covers 14:00:00 UTC to 14:15:00 UTC, but the first actual time series point in it was from 14:00:07 UTC and the last was from 14:14:52. This means that delta() will extrapolate out to cover 15 more seconds than the range vector covers (7 seconds at the start, 8 seconds at the end), and the range vector itself covers 15 minutes less 15 seconds, or 885 seconds.
(We can also get the coverage of the range vector from subtracting the first timestamp from the last one; this also gives us 885 seconds.)
Turning to bc (or the calculator of your choice), we can calculate first the scaling factor of the extrapolation and then the actual numerical result:
$ bc -l
(15*60) / 885
1.01694915254237288135
(( 15 * 60 ) / 885 ) * 116007
117973.22033898305084676945
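The same arithmetic can be sketched in Python, working directly from the raw timestamps and values in the query results. This is a simplified model of the extrapolation as described here, not Prometheus's actual code; the function name extrapolated_delta() and its parameters are made up for illustration, and the real implementation has additional conditions that this sketch ignores.

```python
def extrapolated_delta(first_value, first_ts, last_value, last_ts,
                       range_start, range_end):
    """Simplified model of delta()'s extrapolation (hypothetical helper).

    Scales the raw difference between the first and last samples by
    (full range duration) / (duration actually covered by the samples).
    """
    covered = last_ts - first_ts            # 885 seconds in this example
    full_range = range_end - range_start    # 900 seconds for [15m]
    return (last_value - first_value) * full_range / covered


# Values taken from the promtool output above.
range_end = 1705587300                # 14:15:00 UTC, the query time
range_start = range_end - 15 * 60     # 14:00:00 UTC

result = extrapolated_delta(
    9732212, 1705586407.092,          # first sample and its timestamp
    9848219, 1705587292.092,          # last sample and its timestamp
    range_start, range_end,
)
print(result)
```

Up to floating point rounding, this should print the same value of roughly 117973.22 that delta() reported.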
It certainly feels weird that a mere fifteen second gap in a (nominally) fifteen minute range can cause such a clear difference, but that's how it works out. The absolute difference will be smaller if the numbers involved are smaller, but for a given gap, the ratio between delta()'s result and the true difference will always be the same (here 900/885, or about 1.7% too high).
(You might also feel that a fifteen second scrape interval should
be fast enough to avoid this sort of issue but again, it's clearly
not the case on a fifteen minute range, or even a smaller one. This
may especially be an issue if you're doing rate() on relatively
small time ranges as part of a Grafana graph.)
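To get a feel for how this plays out on smaller ranges, here is a rough Python sketch of the scaling factor for a few range durations. It assumes the same total of 15 uncovered seconds as in the example above, which is only an illustration; the real gap depends on where the scrapes happen to fall relative to the query time. The helper name scaling_factor() is made up for this sketch.

```python
def scaling_factor(range_seconds, uncovered_seconds):
    """Ratio of the full range to the part actually covered by samples.

    Hypothetical helper: models only the simple extrapolation discussed
    in this entry, not everything Prometheus does.
    """
    covered = range_seconds - uncovered_seconds
    return range_seconds / covered


# Assume 15 seconds in total are not covered by samples, as in the example.
for minutes in (15, 5, 1):
    factor = scaling_factor(minutes * 60, 15)
    print(f"{minutes:>2}m range: result scaled by {factor:.4f}")
```

Under that assumption the result is scaled up by about 1.7% over fifteen minutes, about 5.3% over five minutes, and about 33% over one minute.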