Some things on delays and timings for Prometheus alerts

October 18, 2018

One of the things I've been doing lately is testing and experimenting with Prometheus alerts. When you're testing alerts you become quite interested in having your alerts fire and clear rapidly, so you can iterate tests rapidly; it is no fun sitting there for five or ten minutes waiting for everything to reset so you can try something new from a clean slate. Also, of course, I have been thinking about how we want to set various alert-related parameters in an eventual production deployment.

Let's start with the timeline of a simple case, where an event produces a single alert:

  1. the actual event happens (your service falls over)
  2. Prometheus notices when it next scrapes (or tries to scrape) the service's metrics. This may be up to your scrape_interval later, if your timing is unlucky. At this point the event is visible in Prometheus metrics.

  3. Prometheus evaluates alerting rules and realizes that this is alertable. This may be up to your evaluation_interval later. If the alert rule has no 'for: <duration>' clause, the alert is immediately firing (and we go to #5); otherwise, it is pending.

    At this point, the alert's existence now appears in Prometheus's ALERTS metric, which means that your dashboards can potentially show it as an alert (if they refresh themselves, or you tell them to).

  4. if the alert is pending, Prometheus continues to check the alert rule; if it remains true in every check made through your for: duration, the alert becomes firing. This takes at least your for: duration, maybe a bit more. Prometheus uses whatever set of metrics it has on hand at the time it makes each of these checks, and presumably they happen every evaluation_interval as part of alert rule evaluation.

    This means that there isn't much point to a for duration that is too short to allow for a second metrics scrape. Sure, you may check the alert rule more than once, but you're checking with the same set of metrics and you're going to get the same answer. You're just stalling a bit.

    (So, really, I think you should think of the for duration as 'how many scrapes do I want this to have to be true for'. Then add a bit more time for the delays involved in scrapes and rule evaluations.)

  5. Prometheus sends the firing alert to Alertmanager, and will continue to do so periodically while it's firing (cf).
  6. Alertmanager figures out the grouping for the alert. If the grouping has a group_wait duration, it starts waiting for that much time.
  7. If the alert is (still) firing at the end of the group_wait period, Alertmanager sends notifications. You finally know that the event is happening (if you haven't already noticed from your dashboards, your own use of your service, or people telling you).
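To make the knobs in this timeline concrete, here is a minimal sketch of a Prometheus alert rule with a for: clause; the alert name, job label, and rule group name are all made up for illustration:

```yaml
# Prometheus rules file sketch; all names here are hypothetical.
groups:
  - name: example-alerts
    rules:
      - alert: ServiceDown
        expr: up{job="myservice"} == 0
        # The alert sits in 'pending' and must stay true through
        # rule evaluations spanning this duration before it fires.
        for: 30s
```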

The end to end delay for this alert is a composite of the event-to-scrape delay, the scrape-to-rule-evaluation delay, at least your for: duration plus maybe a bit more as you wait for the next alert rule evaluation (if you use a for: duration), and Alertmanager's group_wait duration (if you have one). If you want fast alerts, you will have to figure out where you can chop out time. Alternatively, if you're okay with slower alerts in exchange for other advantages, you need to think about which advantages you care about and what the tradeoffs are.
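The composite delay above can be sketched as simple arithmetic. This is a rough upper bound under the simplifying assumptions that scrapes and rule evaluations themselves take no time and that the for: countdown completes on a rule evaluation boundary; the function and parameter names are mine, not anything from Prometheus:

```python
import math

def worst_case_delay(scrape_interval, evaluation_interval,
                     for_duration=0, group_wait=0):
    """Rough upper bound (in seconds) from event to notification."""
    delay = scrape_interval           # event -> first scrape that sees it
    delay += evaluation_interval      # scrape -> rule evaluation (alert exists)
    if for_duration > 0:
        # pending -> firing: the for: duration, rounded up to a whole
        # number of rule evaluation cycles
        cycles = math.ceil(for_duration / evaluation_interval)
        delay += cycles * evaluation_interval
    delay += group_wait               # Alertmanager's initial wait
    return delay

# 15s scrape and evaluation intervals, 'for: 1m', 45s group_wait:
print(worst_case_delay(15, 15, for_duration=60, group_wait=45))  # 135
```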

The obvious places to add or cut time are in Prometheus's for and Alertmanager's group_wait. There's a non-obvious and not entirely documented difference between them, which is that Alertmanager only cares about the state of the alert at the start and at the end of the group_wait time; unlike Prometheus, it doesn't care if the alert stops being firing for a while in the middle. This means that only Prometheus can de-bounce flapping alerts. However, if your alerts don't flap and you're only waiting in the hopes that they'll cure themselves, in theory you can wait in either place. Well, if you have a simple situation.

Now suppose that your service falling over creates multiple alerts that are driven by different metric scrapes, and you would like to aggregate them together into one alert message. Now it matters where you wait, because the firing alerts can only be aggregated together once they reach Alertmanager, and only if they all get there within the group_wait time window. The shorter the group_wait time, the narrower your window is for having everything scraped and evaluated in Prometheus, so that all the alerts escape their for durations close enough together to land in the same Alertmanager group.

(Or you could get lucky, with both sets of metrics scraped sufficiently close to each other that their alert rules are both processed in the same alert rule evaluation cycle, so that their for waits start and end at the same time and they'll go to Alertmanager together.)

So, I think that the more you care about alert aggregation, the more you need to shift your delay time from Prometheus's for to Alertmanager's group_wait. To get a short group_wait and still reliably aggregate alerts together, I think you need to set up your scrape and rule evaluation intervals so that different metrics scrapes are all reliably ingested and processed within the group_wait interval.
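For the Alertmanager side of this, here is a minimal route sketch showing where group_wait lives; the receiver name and the grouping labels are made up for illustration:

```yaml
# Alertmanager configuration sketch; names are hypothetical.
route:
  receiver: team-email
  # Alerts sharing these label values are aggregated into one notification.
  group_by: ['alertname', 'job']
  # After the first alert of a new group arrives, wait this long before
  # sending the initial notification, so related alerts can join it.
  group_wait: 45s
```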

Suppose, for example, that you have both set to 15 seconds. Then when an event begins at time T, you know that all metrics reflecting it will be scraped by Prometheus within at most 15 seconds of T (plus up to almost your scrape timeout interval), and their alert rules should be processed at most 15 seconds or so after that. At this point all alerts with for conditions will have become pending and started counting down, and they will all transition to firing at most 30 seconds apart (plus wiggle room for scrape and rule evaluation slowness). If you give Alertmanager a 45 second group_wait, it's almost certain that you'll get them aggregated together. 30 seconds might be pushing it, but you'll probably make it most of the time; you would have to be pretty unlucky for one scrape to happen immediately after T with an alert rule evaluation right after it (so that it becomes pending at T+2 or so), then another metric scrape at T+14.9 seconds, have that scrape be slow, and then only get its alert rules evaluated at, say, T+33.
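The bound in this example is just the intervals added up; a tiny sketch of the arithmetic, with the event at T = 0:

```python
# Both intervals 15 seconds, event at time T = 0.
scrape_interval = 15
evaluation_interval = 15

latest_scrape = scrape_interval                       # scraped by T+15
latest_pending = latest_scrape + evaluation_interval  # pending by T+30

# The earliest an alert can go pending is essentially T+0, so the
# pending (and hence firing) times of the various alerts are spread
# over at most this many seconds:
max_firing_spread = latest_pending
print(max_firing_spread)  # 30
```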

Things are clearly different if you have one scrape source on a 15 second scrape_interval (perhaps an actual metrics point) and another one on a one or two minute scrape_interval or update interval (perhaps an expensive blackbox check, or perhaps a Pushgateway metric that only gets updated once a minute from cron). Here you'd have to be lucky to have Alertmanager aggregate the alerts together with a 30 or 45 second group_wait time.

(One thing that was useful here is Prometheus: understanding the delays on alerting, which has pictures. It dates from 2016 so some of the syntax is a bit different, but the concepts don't seem to have changed.)

PS: Having tested this: Alertmanager does not send out any message if it receives a firing alert from Prometheus and the alert then goes away before the group_wait period is up, not even a 'this was resolved' message if you have send_resolved turned on. This is reasonable from one perspective and potentially irritating from another, depending on what you want.
