The expected size of a gap in a Prometheus range vector (sometimes)
In yesterday's example of delta() extrapolating to cover a full time
range, we saw that an example fifteen minute range vector for a metric
actually covered a time range less than fifteen minutes. In fact, it
covered fifteen seconds less than fifteen minutes, and the scrape
interval for the metric in question was fifteen seconds. In thinking
about it, I've realized that this isn't a coincidence. I believe that
nearly all of the time, a range vector whose time range is a multiple
of the metric's scrape interval will actually cover that time range
less one scrape interval, whatever that scrape interval is.
As we've seen before, Prometheus randomizes the scrape time for any particular target. If you scrape a target every fifteen seconds, it will almost never be scraped at x:00, x:15, x:30, and x:45; instead it will almost always be scraped at some constant offset that varies from target to target. This is sensible behavior to keep all of your scrape targets from being hammered like clockwork at the start of every minute, but it also means that a range query will almost never align exactly with the scrape times for a particular target.
If the scrape times aren't aligned with the range query's start and end time but the scrape interval evenly divides the range (for example, a 15 minute range and a 15 second scrape interval), then the first time series point will be some offset after the start of the range and the last point will be some offset before the end of the range. If the scrape durations are consistent and low (as they often are), what we have is a situation where the timeline of scrapes is offset from the range's timeline by some amount. The first time series point is 'late' (from the range's perspective) by this amount, and what would be the first time series point after the range is also 'late' by that same amount, so it falls outside the range. That means the last point within the range is 'early' by the scrape interval minus the offset, and the distance from the first point to the last point is the range less one scrape interval.
Let's make this concrete. Imagine a one minute range vector and a 15 second scrape interval where the scrapes happen at 0:07, 0:22, 0:37, 0:52, and 1:07 (relative to the start of the range). The scrapes are 'late' by seven seconds relative to the range vector, but the scrape at 1:07 is outside the one minute range vector. This leaves the point at 0:52 as the last included point, 15 seconds before the excluded one and (15-7) or 8 seconds before the end of the range vector. The points we do have run from 0:07 to 0:52, which is 45 seconds, or the one minute range less one scrape interval.
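Here is a minimal sketch of that arithmetic in Python. This is my own illustration, not anything Prometheus runs; the covered_span() function and its arguments are made up for the example, times are in whole seconds relative to the start of the range, and I treat both ends of the range as inclusive (a simplification that only matters if a scrape lands exactly on a boundary).

```python
def covered_span(range_s, interval_s, offset_s):
    """Return (first, last, span) for scrape samples inside [0, range_s].

    offset_s is how 'late' the scrapes are relative to the start of the
    range, and is assumed to satisfy 0 <= offset_s < interval_s.
    """
    # Sample times relative to the start of the range.
    times = []
    t = offset_s
    while t <= range_s:
        times.append(t)
        t += interval_s
    return times[0], times[-1], times[-1] - times[0]


# One minute range, 15 second scrape interval, scrapes 7 seconds 'late'.
first, last, span = covered_span(60, 15, 7)
print(first, last, span)  # 7 52 45
```

The same function says that a 15 minute range (900 seconds) with a 15 second scrape interval covers 885 seconds for any non-zero offset, which matches the fifteen seconds short that started this.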
A more complex and less predictable situation happens if the range
is not a multiple of the scrape interval. This can happen either
because your scrape interval is not even and your ranges are, or
your scrape interval is even but your ranges vary widely, for example
if they're set by the step resolution of a Grafana graph panel
('$__interval' in Grafana dashboard jargon, which is a sensible interval setting). Locally, we have a lot of slower
scrape intervals that are prime numbers so that they deliberately
don't get into some fixed alignment with wall clock time; these
are fairly unlikely to line up with range intervals this way.
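To see why this case is less predictable, here is a small Python sketch under the same simplifying assumptions as before (whole seconds, consistent scrape timing, both ends of the range inclusive). The 17 second scrape interval is a hypothetical value picked for illustration, not one of our actual settings, and span_for_offset() is my own helper.

```python
def span_for_offset(range_s, interval_s, offset_s):
    """Seconds between the first and last sample inside [0, range_s]."""
    # Number of whole scrape intervals that fit after the first sample.
    n = (range_s - offset_s) // interval_s
    return n * interval_s


# One minute range, hypothetical 17 second scrape interval,
# tried against every possible whole-second scrape offset.
spans = {o: span_for_offset(60, 17, o) for o in range(17)}
print(sorted(set(spans.values())))  # [34, 51]
```

Unlike the evenly dividing case, the covered span here depends on where the scrape offset happens to fall: offsets of 0 through 9 seconds give 51 seconds of coverage and offsets of 10 through 16 seconds give only 34.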
(Possibly this is obvious to everyone but me but it felt a little bit surprising to me when it came up yesterday, so I want to write it down so I remember it in the future.)
PS: Where this may be an issue even for us is in alert rules, which may well use fixed, nice-number range durations (like '1m' or '5m') and draw from metrics sources, such as the host agent, that we scrape every fifteen seconds.