Some things on delays and timings for Prometheus alerts
One of the things I've been doing lately is testing and experimenting with Prometheus alerts. When you're testing alerts you become quite interested in having your alerts fire and clear rapidly, so you can iterate tests rapidly; it is no fun sitting there for five or ten minutes waiting for everything to reset so you can try something new from a clean slate. Also, of course, I have been thinking about how we want to set various alert-related parameters in an eventual production deployment.
Let's start with the timeline of a simple case, where an event produces a single alert:
1. the actual event happens (your service falls over)
2. Prometheus notices when it next scrapes (or tries to scrape) the service's metrics. This may be up to your scrape_interval later, if your timing is unlucky. At this point the event is visible in Prometheus metrics.
3. Prometheus evaluates alerting rules and realizes that this is alertable. This may be up to your evaluation_interval later. If the alert rule has no 'for: <duration>' clause, the alert is immediately firing (and we go to #5); otherwise, it is pending.
At this point, the alert's existence now appears in Prometheus's ALERTS metric, which means that your dashboards can potentially show it as an alert (if they refresh themselves, or you tell them to).
4. if the alert is pending, Prometheus continues to check the alert rule; if it remains true in every check made through your for: duration, the alert becomes firing. This takes at least your for: duration, maybe a bit more. Prometheus uses whatever set of metrics it has on hand at the time it makes each of these checks, and presumably they happen every evaluation_interval as part of alert rule evaluation.
This means that there isn't much point to a for duration that is too short to allow for a second metrics scrape. Sure, you may check the alert rule more than once, but you're checking with the same set of metrics and you're going to get the same answer. You're just stalling a bit.
(So, really, I think you should think of the for duration as 'how many scrapes do I want this to have to be true for'. Then add a bit more time for the delays involved in scrapes and rule evaluations. There's an example rule sketched just after this list.)
5. Prometheus sends the firing alert to Alertmanager, and will continue to do so periodically while it's firing (cf).
6. Alertmanager figures out the grouping for the alert. If the grouping has a group_wait duration, it starts waiting for that much time.
7. If the alert is (still) firing at the end of the group_wait period, Alertmanager sends notifications. You finally know that the event is happening (if you haven't already noticed from your dashboards, your own use of your service, or people telling you).
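To make these knobs concrete, here is a minimal sketch of the Prometheus side of this. The job name, alert name, expression, and interval values are all made up for illustration, not recommendations:

    # prometheus.yml (fragment); the values are illustrative only
    global:
      scrape_interval: 15s        # how often targets get scraped
      evaluation_interval: 15s    # how often alerting rules get evaluated
    rule_files:
      - "alerts.yml"

    # alerts.yml; the alert name and expression are hypothetical
    groups:
      - name: example
        rules:
          - alert: ServiceDown
            expr: up{job="myservice"} == 0
            for: 1m               # must stay true this long before it goes firing
            labels:
              severity: page

With a 15 second scrape_interval, the one minute for: here means the alert condition has to hold across several scrapes before the alert moves from pending to firing.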
The end to end delay for this alert is a composite of the event to
scrape delay, the scrape to rule evaluation delay, at least your
for duration with maybe a bit more time as you wait for the next
alert rule evaluation (if you use a
for: duration), and the
group_wait duration (if you have one). If you want
fast alerts, you will have to figure out where you can chop out
time. Alternately, if you're okay with slower alerts when that gets you some advantage, you need to think about which advantages you care about and what the tradeoffs are.
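As a made-up illustration: with 15 second scrape and evaluation intervals, a one minute for: duration, and a 30 second group_wait, the worst case stacks up to roughly 15 + 15 + 60 + 15 + 30 seconds (the second 15 being the extra wait for the rule evaluation that flips the alert to firing), or a bit over two minutes from the event to the notification, ignoring slow scrapes and slow rule evaluations.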
The obvious places to add or cut time are in Prometheus's for: duration and in Alertmanager's group_wait. There's a non-obvious and
not entirely documented difference between them, which is that
Alertmanager only cares about the state of the alert at the start
and at the end of the
group_wait time; unlike Prometheus, it
doesn't care if the alert stops being firing for a while in the
middle. This means that only Prometheus can de-bounce flapping
alerts. However, if your alerts don't flap and you're only
waiting in the hopes that they'll cure themselves, in theory you
can wait in either place. Well, if you have a simple situation.
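For concreteness, group_wait is a setting on Alertmanager routes. Here is a minimal sketch, with a made-up receiver name, grouping labels, and webhook URL:

    # alertmanager.yml (fragment); names and values are illustrative
    route:
      receiver: 'team-ops'
      group_by: ['alertname', 'job']
      group_wait: 30s      # how long a new alert group waits before its first notification
      group_interval: 5m   # how long before notifying about new alerts added to the group
      repeat_interval: 4h  # how long before re-sending a still-firing notification

    receivers:
      - name: 'team-ops'
        webhook_configs:
          - url: 'http://127.0.0.1:5001/'    # hypothetical notification endpoint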
Now suppose that your service falling over creates multiple alerts
that are driven by different metric scrapes, and you would like to
aggregate them together into one alert message. Now it matters where
you wait, because the firing alerts can only be aggregated together
once they reach Alertmanager and only if they all get there within the same group_wait time window. The shorter the group_wait, the narrower your window to have everything scraped and evaluated in Prometheus is, so that they all escape their for duration close enough to each other for Alertmanager to group them together.
(Or you could get lucky, with both sets of metrics scraped sufficiently
close to each other that their alert rules are both processed in
the same alert rule evaluation cycle, so that their for: durations start and end at the same time and they'll go to Alertmanager together.)
So, I think that the more you care about alert aggregation, the
more you need to shift your delay time from Prometheus's for: duration to Alertmanager's group_wait. To get a short group_wait and still reliably aggregate alerts together, I think you need to set up your scrape and rule evaluation intervals so that different metrics scrapes are all reliably ingested and processed within the group_wait window.
Suppose, for example, that you have both set to 15 seconds. Then
when an event begins at time T, you know that all metrics reflecting it will be scraped by Prometheus within at most 15 seconds after T (plus up to almost your scrape timeout interval) and their alert rules should be processed within at most 15 seconds or so after that.
At this point all alerts with
for conditions will have become
pending and started counting down, and they will all transition
to firing at most 30 seconds apart (plus wiggle room for scrape
and rule evaluation slowness). If you give Alertmanager a 45 second
group_wait, it's almost certain that you'll get them aggregated
together. 30 seconds might be pushing it, but you'll probably make
it most of the time; you would have to be pretty unlucky for one
scrape to happen immediately after T with an alert rule evaluation
right after it (so that it becomes pending at T+2 or so), then
another metric scrape at T+14.9 seconds, have that scrape be slow,
and then only get its alert rules evaluated at, say, T+33.
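Written out as configuration, that scenario is something like the following; the numbers are just the ones from the example:

    # prometheus.yml (fragment)
    global:
      scrape_interval: 15s
      evaluation_interval: 15s

    # alertmanager.yml (fragment)
    route:
      group_wait: 45s    # wide enough for all of the related alerts to arrive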
Things are clearly different if you have one scrape source on a 15 second scrape_interval (perhaps an actual metrics endpoint) and another one on a one or two minute scrape_interval or update interval (perhaps an expensive blackbox check, or perhaps a Pushgateway metric that only gets updated once a minute from cron). Here you'd have to be lucky to have Alertmanager aggregate the alerts together with a 30 or 45 second group_wait.
(One thing that was useful here is the article 'Prometheus: understanding the delays on alerting', which has pictures. It dates from 2016 so some of the syntax is a bit different, but the concepts don't seem to have changed.)
PS: As far as I can tell from testing, Alertmanager does not send out any message
if it receives a firing alert from Prometheus and then the alert
goes away before the
group_wait period is up, not even a 'this
was resolved' message if you have
send_resolved turned on. This
is reasonable from one perspective and potentially irritating from
another, depending on what you want.
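For what it's worth, send_resolved is a per-notifier option on Alertmanager receivers; a sketch, again with a made-up webhook URL:

    receivers:
      - name: 'team-ops'
        webhook_configs:
          - url: 'http://127.0.0.1:5001/'   # hypothetical notification endpoint
            send_resolved: true             # also notify when the alert resolves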