Wandering Thoughts archives

2024-01-21

The expected size of a gap in a Prometheus range vector (sometimes)

In yesterday's example of delta() extrapolating to cover a full time range, we saw that an example fifteen minute range vector for a metric actually covered less than fifteen minutes, measured from its first time series point to its last. In fact, it covered fifteen seconds less than fifteen minutes, and the scrape interval for the metric in question was fifteen seconds. In thinking about it, I've realized that this isn't a coincidence. I believe that for many common time ranges, a range vector will almost always cover that time range less one scrape interval for the metric in question, whatever that scrape interval is. Specifically, any time range that is a multiple of the scrape interval will likely behave this way.

As we've seen before, Prometheus randomizes the scrape time for any particular target. If you scrape a target every fifteen seconds, it will almost never be scraped at x:00, x:15, x:30, and x:45; instead it will almost always be scraped at some constant offset that varies from target to target. This is sensible behavior to keep all of your scrape targets from being hammered like clockwork at the start of every minute, but it also means that a range query will almost never align exactly with the scrape times for a particular target.

If the scrape times aren't aligned with the range query's start and end time but the scrape interval evenly divides the range (for example, a 15 minute range and a 15 second scrape interval), then the first time series point will be some offset after the start of the range and the last point will be some offset before the end of it. If the scrape durations are consistent and low (as they often are), what we have is a situation where the timeline of scrapes is offset from the range's timeline by some amount. The first time series point is 'late' (from the range's perspective) by this amount, and then what would be the first time series point after the range is also 'late' by that same amount, which means that the last point within the range is 'early' by the scrape interval minus the offset. Add the two gaps together and the offset cancels out; the points inside the range always span one scrape interval less than the range itself.

Let's make this concrete. Imagine a one minute range vector and a 15 second scrape interval where the scrapes happen at 0:07, 0:22, 0:37, 0:52, and 1:07 (relative to the start of the range). The scrapes are 'late' by seven seconds relative to the range vector, but the time series point at 1:07 is outside the one minute range vector, leaving us with the last included point being the one at 0:52, which is 15 seconds before it and (15-7) or 8 seconds before the end of the range vector. The points we do have run from 0:07 to 0:52, which is 45 seconds, or one minute less one scrape interval.
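The same arithmetic can be sketched in a few lines of Python. This is only an illustration; the function name covered_span is made up, and it assumes idealized scrapes that happen exactly one scrape interval apart with no jitter or missed scrapes, which real Prometheus scrapes only approximate.

```python
def covered_span(range_seconds, scrape_interval, offset):
    """Return (first, last, span) for the scrapes that land inside a range.

    All times are seconds relative to the start of the range. This assumes
    idealized scrapes at offset, offset + scrape_interval, and so on, with
    no jitter and no missed scrapes.
    """
    if not 0 < offset < scrape_interval:
        raise ValueError("offset must be strictly between 0 and the scrape interval")
    # Collect every scrape time that is no later than the end of the range.
    points = []
    t = offset
    while t <= range_seconds:
        points.append(t)
        t += scrape_interval
    if not points:
        return None  # the range is too short to contain any scrape
    return points[0], points[-1], points[-1] - points[0]


# The one minute example: scrapes at 0:07, 0:22, 0:37, and 0:52 are inside.
print(covered_span(60, 15, 7))   # (7, 52, 45)
# A fifteen minute range with a fifteen second scrape interval.
print(covered_span(900, 15, 3))  # (3, 888, 885)
```

In both cases the span comes out as the range less one scrape interval (45 seconds and 885 seconds), whatever offset you pick.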

A more complex and less predictable situation happens if the range is not a multiple of the scrape interval. This can happen either because your scrape interval is not a round number and your ranges are, or because your scrape interval is a round number but your ranges vary widely, for example if they're set by the step resolution of a Grafana graph panel ('$__interval' in Grafana dashboard jargon, which is a sensible interval setting). Locally, we have a lot of slower scrape intervals that are prime numbers so that they deliberately don't get into some fixed alignment with wall clock time; these are fairly unlikely to line up with range intervals this way.
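To see how the less predictable case differs, here is a small Python sketch that tries every whole-second offset and tallies how much time the points inside the range actually cover. The function name spans_by_offset is made up for illustration, the 13 second interval is just an example of a prime scrape interval, and it again assumes idealized, perfectly regular scrapes.

```python
def spans_by_offset(range_seconds, scrape_interval):
    """Map each possible covered span to how many offsets produce it.

    Tries every whole-second offset from 1 up to scrape_interval - 1 and
    assumes idealized, perfectly regular scrapes.
    """
    counts = {}
    for offset in range(1, scrape_interval):
        if offset > range_seconds:
            continue  # no scrape lands inside the range at all
        # Number of scrape intervals between the first and last point inside.
        steps = (range_seconds - offset) // scrape_interval
        span = steps * scrape_interval
        counts[span] = counts.get(span, 0) + 1
    return counts


print(spans_by_offset(60, 15))  # {45: 14}
print(spans_by_offset(60, 13))  # {52: 8, 39: 4}
```

With a 15 second scrape interval every offset gives the same 45 second span. With a 13 second interval the points cover either 52 or 39 seconds of the one minute range depending on the offset, so the gap is 8 or 21 seconds instead of one fixed scrape interval.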

(Possibly this is obvious to everyone but me but it felt a little bit surprising to me when it came up yesterday, so I want to write it down so I remember it in the future.)

PS: Where this may be an issue even for us is in alert rules, which may well use fixed, nice-number range durations (like '1m' or '5m') and draw from metrics sources, such as the host agent, that we scrape every fifteen seconds.

sysadmin/PrometheusRangeVectorGapSize written at 23:24:27;


