A little surprise with Prometheus scrape intervals, timeouts, and alerts
Prometheus pulls metrics from metric sources or, to put it in Prometheus terms, scrapes targets. Every scrape configuration, and thus every target, has a scrape interval and a scrape timeout as part of its settings; these can be specified explicitly or inherited from global values. In a perfect world where scraping a target either completes or fails in zero time, this results in simple timing: a target is scraped at time T, then at T + interval, then at T + 2*interval, and so on. However, the real world is not that simple, and scraping a target can take a non-zero amount of time, possibly quite a lot of it if the scrape times out. You might sensibly wonder whether the next scrape is pushed back by this non-zero scrape time.
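(For concreteness, these settings in a scrape configuration might look like the following sketch. The job name and target are hypothetical, but scrape_interval and scrape_timeout are the real option names, settable globally or per job.)

    global:
      scrape_interval: 90s     # default for every job
      scrape_timeout:  60s     # default for every job

    scrape_configs:
      - job_name: ssh-blackbox          # hypothetical job name
        scrape_interval: 90s            # per-job override of the global value
        scrape_timeout:  60s
        static_configs:
          - targets: ['somehost:9115']  # hypothetical blackbox exporter target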
The answer is that the next scrape is not pushed back, or at least it is mostly not. Regardless of how long a scrape at time T takes, the next scrape is scheduled for T + interval and will normally happen then. Scrapes are driven by a ticker, which runs independently of how long each scrape takes and adjusts things as necessary to keep ticking exactly on time.
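Prometheus's actual scrape loop is more elaborate, but Go's time.Ticker is a minimal sketch of the same behavior: ticks keep landing on the original schedule even when the work for one tick runs long.

    package main

    import (
        "fmt"
        "time"
    )

    func main() {
        start := time.Now()
        ticker := time.NewTicker(time.Second)
        defer ticker.Stop()

        for i := 0; i < 5; i++ {
            <-ticker.C
            fmt.Printf("tick at +%.1fs\n", time.Since(start).Seconds())
            if i == 1 {
                // Simulate one slow 'scrape'. The tick due at +3s is
                // delivered as soon as we're ready again (at +3.5s), but
                // the ticks after it arrive back on the original schedule
                // (+4s, +5s); nothing is pushed back by the delay.
                time.Sleep(1500 * time.Millisecond)
            }
        }
    }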
So far, so good. But this means that slow scrapes can have an interesting and surprising interaction with alerting rules and Alertmanager group_wait settings. The short version is that you can get a failing check and then a successful one in close succession, close enough to suppress an Alertmanager alert that you would normally expect to fire.
To make this concrete, suppose that you perform SSH blackbox checks every 90 seconds, time out at 60 seconds, trigger a Prometheus alert rule the moment an SSH check fails, and have a one minute group_wait in Alertmanager. Then if an SSH check times out instead of failing rapidly, you can have a sequence where you start the check at T, have it fail via timeout at T + 60, send a firing alert to Alertmanager shortly afterward, have the next check succeed at T + 90, and withdraw the alert from Alertmanager shortly afterward, before the one minute group_wait is up. The net result is that your 'alert immediately' SSH alert rule has not sent you an alert despite an SSH check failing.
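In configuration terms, the two pieces involved in that scenario might look something like this sketch. The group, alert, and receiver names are hypothetical; probe_success is the blackbox exporter's success metric, and omitting a 'for:' clause is what makes the rule fire on the first failing sample.

    # Prometheus alert rule file: fires on the first failing sample.
    groups:
      - name: ssh-alerts                  # hypothetical group name
        rules:
          - alert: SSHCheckFailed         # hypothetical alert name
            expr: probe_success{job="ssh-blackbox"} == 0

    # Alertmanager configuration (a separate file): one minute group_wait.
    route:
      receiver: default-receiver          # hypothetical receiver
      group_wait: 1m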
It's natural to expect this result if your scrape interval is less than your group_wait, because then it's obvious that you can get a second scrape in before Alertmanager makes the alert active. It's not as obvious when the second scrape is possible only because the difference between the scrape interval and the scrape timeout is less than group_wait.
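Put as arithmetic: the failing sample lands at roughly T + timeout and the recovering one at T + interval, so the alert is only firing for about interval - timeout. A trivial sketch of the resulting condition (my reading of the scenario above, not anything Prometheus itself computes):

    package main

    import (
        "fmt"
        "time"
    )

    // canBeMasked reports whether a single timed-out check can be
    // resolved again before Alertmanager's group_wait expires: the
    // failure is recorded at about T+timeout and the recovery at
    // about T+interval, so the alert only fires for interval-timeout.
    func canBeMasked(interval, timeout, groupWait time.Duration) bool {
        return interval-timeout < groupWait
    }

    func main() {
        // 90s interval, 60s timeout, 1m group_wait: 30s < 60s, so yes.
        fmt.Println(canBeMasked(90*time.Second, 60*time.Second, time.Minute))
    }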
(If nothing else, this is going to make me take another look at our scrape timeout settings. I'm going to have to think carefully about just what all of the interactions are here, especially given all of the other alert delays. Note that a resolved alert is immediately sent to Alertmanager.)
PS: It's a pity that there's no straightforward way that I know of to get either Prometheus or Alertmanager to write a log record of pending, firing, and cleared alerts (with timestamps and details). The information is more or less captured in Prometheus metrics, but getting the times when things happened is a huge pain; being able to write optional logs of this would make some things much easier.
(I believe both report this if you set their log level to 'debug', but of course then you get a flood of other information that you probably don't want.)
Sidebar: How Prometheus picks the start time T of scrapes
If you've paid attention to your logs from things like SSH blackbox checks, you'll have noticed that Prometheus does not hit all of your scrape targets at exactly the same time, even if they have the same scrape interval. How Prometheus picks the start time for each scrape target is not based on when it learns about the scrape target, as you might expect; instead, well, let me quote the code:
    base   = interval - now%interval
    offset = t.hash() % interval
    next   = base + offset

    if next > interval {
        next -= interval
    }
All of these values are in nanoseconds, and t.hash() is a 64-bit hash value, hopefully randomly distributed. The next result value is an offset to wait before starting the scrape interval ticker.
In short, Prometheus randomly smears the start time for scrape targets across the entire interval, hopefully resulting in a more or less even distribution.
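As a self-contained sketch of that computation, the following uses FNV-1a as a stand-in for whatever hash Prometheus actually uses for t.hash(); it shows different (hypothetical) targets getting different start offsets spread across a 90 second interval.

    package main

    import (
        "fmt"
        "hash/fnv"
        "time"
    )

    // offset mirrors the quoted calculation: pin the schedule to
    // absolute time via 'base', then shift each target by a
    // hash-derived amount so targets are smeared across the interval.
    func offset(target string, interval time.Duration, now int64) time.Duration {
        h := fnv.New64a() // stand-in for Prometheus's t.hash()
        h.Write([]byte(target))

        base := int64(interval) - now%int64(interval)
        off := int64(h.Sum64() % uint64(interval))
        next := base + off
        if next > int64(interval) {
            next -= int64(interval)
        }
        return time.Duration(next)
    }

    func main() {
        now := time.Now().UnixNano()
        for _, t := range []string{"host1:9115", "host2:9115", "host3:9115"} {
            fmt.Printf("%s waits %v before its first scrape tick\n",
                t, offset(t, 90*time.Second, now).Round(time.Millisecond))
        }
    }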