A little surprise with Prometheus scrape intervals, timeouts, and alerts

January 23, 2019

Prometheus pulls metrics from metric sources or, to put it in Prometheus terms, scrapes targets. Every scrape configuration and thus every target has a scrape interval and a scrape timeout as part of its settings; these can be specified explicitly or inherited from global values. In a perfect world where scraping a target either completes or fails in zero time, this results in simple timing; a target is scraped at time T, then T + interval, then T + interval + interval, and so on. However, the real world is not simple and scraping a target can take a non-zero amount of time, possibly quite a lot of time if the scrape times out. You might sensibly wonder if the next scrape is pushed back by the non-zero scrape time.
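As an illustration, here is a minimal sketch of how the interval and timeout can be set, either globally or per scrape configuration; the job name and target here are made up:

```yaml
global:
  scrape_interval: 90s   # scrape each target every 90 seconds
  scrape_timeout: 60s    # give up on a scrape after 60 seconds

scrape_configs:
  - job_name: 'ssh-blackbox'
    # Jobs inherit the global interval and timeout unless they
    # override them here with their own scrape_interval and
    # scrape_timeout settings.
    static_configs:
      - targets: ['localhost:9115']
```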

The answer is that it is not, or at least it is sort of not. Regardless of the amount of time a scrape at time T takes, the next scrape is scheduled for T + interval and will normally happen then. Scrapes are driven by a ticker, which runs independently of how long each scrape takes and adjusts things as necessary to keep ticking exactly on time.

So far, so good. But this means that slow scrapes can have an interesting and surprising interaction with alerting rules and Alertmanager group_wait settings. The short version is that you can get a failing check and then a successful one in close succession, close enough to suppress an Alertmanager alert that you would normally expect to fire.

To make this concrete, suppose that you perform SSH blackbox checks every 90 seconds, time out at 60 seconds, trigger a Prometheus alert rule the moment an SSH check fails, and have a one minute group_wait in Alertmanager. Then if an SSH check times out instead of failing rapidly, you can have a sequence where you start the check at T, have it fail via timeout at T + 60, send a firing alert to Alertmanager shortly afterward, have the next check succeed at T + 90, and withdraw the alert shortly afterward from Alertmanager, before the one minute group_wait is up. The net result is that your 'alert immediately' SSH alert rule has not sent you an alert despite an SSH check failing.

It's natural to expect this result if your scrape interval is less than your group_wait, because then it's obvious that you can get a second scrape in before Alertmanager makes the alert active. It's not as obvious when the second scrape is possible only because the difference between the scrape interval and the scrape timeout is less than group_wait.

(If nothing else, this is going to make me take another look at our scrape timeout settings. I'm going to have to think carefully about just what all of the interactions are here, especially given all of the other alert delays. Note that a resolved alert is immediately sent to Alertmanager.)

PS: It's a pity that there's no straightforward way that I know of to get either Prometheus or Alertmanager to write a log record of pending, firing, and cleared alerts (with timestamps and details). The information is more or less captured in Prometheus metrics, but getting the times when things happened is a huge pain; being able to write optional logs of this would make some things much easier.

(I believe both report this if you set their log level to 'debug', but of course then you get a flood of other information that you probably don't want.)

Sidebar: How Prometheus picks the start time T of scrapes

If you've paid attention to your logs from things like SSH blackbox checks, you'll have noticed that Prometheus does not hit all of your scrape targets at exactly the same time, even if they have the same scrape interval. How Prometheus picks the start time for each scrape target is not based on when it learns about the scrape target, as you might expect; instead, well, let me quote the code:

base   = interval - now%interval
offset = t.hash() % interval
next   = base + offset

if next > interval {
	next -= interval
}
All of these values are in nanoseconds, and t.hash() is a 64-bit hash value, hopefully randomly distributed. The next result value is an offset to wait before starting the scrape interval ticker.

In short, Prometheus randomly smears the start time for scrape targets across the entire interval, hopefully resulting in a more or less even distribution.
