Wandering Thoughts archives


Some things on delays and timings for Prometheus alerts

One of the things I've been doing lately is testing and experimenting with Prometheus alerts. When you're testing alerts you become quite interested in having your alerts fire and clear rapidly, so you can iterate quickly; it is no fun sitting there for five or ten minutes waiting for everything to reset so you can try something new from a clean slate. Also, of course, I have been thinking about how we want to set various alert-related parameters in an eventual production deployment.

Let's start with the timeline of a simple case, where an event produces a single alert:

  1. the actual event happens (your service falls over)
  2. Prometheus notices when it next scrapes (or tries to scrape) the service's metrics. This may be up to your scrape_interval later, if your timing is unlucky. At this point the event is visible in Prometheus metrics.

  3. Prometheus evaluates alerting rules and realizes that this is alertable. This may be up to your evaluation_interval later. If the alert rule has no 'for: <duration>' clause, the alert is immediately firing (and we go to #5); otherwise, it is pending.

    At this point, the alert's existence now appears in Prometheus's ALERTS metric, which means that your dashboards can potentially show it as an alert (if they refresh themselves, or you tell them to).

  4. if the alert is pending, Prometheus continues to check the alert rule; if it remains true in every check made through your for: duration, the alert becomes firing. This takes at least your for: duration, maybe a bit more. Prometheus uses whatever set of metrics it has on hand at the time it makes each of these checks, and presumably they happen every evaluation_interval as part of alert rule evaluation.

    This means that there isn't much point to a for duration that is too short to allow for a second metrics scrape. Sure, you may check the alert rule more than once, but you're checking with the same set of metrics and you're going to get the same answer. You're just stalling a bit.

    (So, really, I think you should think of the for duration as 'how many scrapes do I want this to have to be true for'. Then add a bit more time for the delays involved in scrapes and rule evaluations.)

  5. Prometheus sends the firing alert to Alertmanager, and will continue to do so periodically while it's firing (cf).
  6. Alertmanager figures out the grouping for the alert. If the grouping has a group_wait duration, it starts waiting for that much time.
  7. If the alert is (still) firing at the end of the group_wait period, Alertmanager sends notifications. You finally know that the event is happening (if you haven't already noticed from your dashboards, your own use of your service, or people telling you).
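
To make the Prometheus side of steps 3 and 4 concrete, here is a minimal sketch of an alert rule with a for: clause. The alert name, expression, labels, and duration are invented for illustration, not taken from any real setup:

```yaml
# rules.yml (hypothetical): this alert must stay true through every rule
# evaluation for 30 seconds before it goes from pending to firing.
groups:
  - name: example-alerts
    rules:
      - alert: ServiceDown
        expr: up{job="myservice"} == 0
        for: 30s
        labels:
          severity: page
        annotations:
          summary: "myservice is not responding to scrapes"
```

With 15-second scrape and evaluation intervals, a 30-second for: gives the rule a chance to see at least two scrapes' worth of metrics before firing, per the note in step 4.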

The end to end delay for this alert is a composite of the event to scrape delay, the scrape to rule evaluation delay, at least your for duration with maybe a bit more time as you wait for the next alert rule evaluation (if you use a for: duration), and the Alertmanager group_wait duration (if you have one). If you want fast alerts, you will have to figure out where you can chop out time. Alternately, if you're okay with slow alerts if it gets you advantages, you need to think about which advantages you care about and what the tradeoffs are.
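
The composite delay can be sketched as simple arithmetic. This is a toy Python model of the timeline above, not anything Prometheus itself computes, and the durations are purely illustrative:

```python
# Worst-case end to end delay for a single alert, summing the stages in
# the timeline above. All durations are in seconds and are illustrative.

def worst_case_delay(scrape_interval, evaluation_interval,
                     for_duration, group_wait):
    return (
        scrape_interval        # event -> next (unlucky) scrape
        + evaluation_interval  # scrape -> rule evaluation; alert goes pending
        + for_duration         # pending -> eligible to become firing
        + evaluation_interval  # waiting for the evaluation that flips it to firing
        + group_wait           # Alertmanager holds the group before notifying
    )

print(worst_case_delay(15, 15, 30, 30))  # → 105 seconds
```

Seeing the stages as a sum makes it obvious that no single knob dominates; you have to decide which stage's time you can afford to chop.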

The obvious places to add or cut time are in Prometheus's for and Alertmanager's group_wait. There's a non-obvious and not entirely documented difference between them, which is that Alertmanager only cares about the state of the alert at the start and at the end of the group_wait time; unlike Prometheus, it doesn't care if the alert stops being firing for a while in the middle. This means that only Prometheus can de-bounce flapping alerts. However, if your alerts don't flap and you're only waiting in the hopes that they'll cure themselves, in theory you can wait in either place. Well, if you have a simple situation.

Now suppose that your service falling over creates multiple alerts that are driven by different metric scrapes, and you would like to aggregate them together into one alert message. Now it matters where you wait, because the firing alerts can only be aggregated together once they reach Alertmanager, and only if they all get there within the group_wait time window. The shorter the group_wait time, the narrower your window for getting everything scraped and evaluated in Prometheus, so that all the alerts escape their for durations close enough together for Alertmanager to group them.

(Or you could get lucky, with both sets of metrics scraped sufficiently close to each other that their alert rules are both processed in the same alert rule evaluation cycle, so that their for waits start and end at the same time and they'll go to Alertmanager together.)

So, I think that the more you care about alert aggregation, the more you need to shift your delay time from Prometheus's for to Alertmanager's group_wait. To get a short group_wait and still reliably aggregate alerts together, I think you need to set up your scrape and rule evaluation intervals so that different metrics scrapes are all reliably ingested and processed within the group_wait interval.
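
As a sketch of what this alignment might look like (all names and values invented for illustration), you could pin the Prometheus intervals and then size group_wait to cover a full scrape-plus-evaluation cycle:

```yaml
# prometheus.yml (hypothetical): keep every scrape source on the same
# short cadence so that related alerts become firing close together.
global:
  scrape_interval: 15s
  evaluation_interval: 15s
```

```yaml
# alertmanager.yml (hypothetical): group alerts on a shared label, with
# a group_wait longer than one scrape_interval + evaluation_interval so
# that stragglers still land in the same notification.
route:
  receiver: team-pager
  group_by: ['service']
  group_wait: 45s
```

The 'service' grouping label here is an assumption; in practice you would group on whatever labels your related alerts actually share.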

Suppose, for example, that you have both set to 15 seconds. Then when an event begins at time T, you know that all metrics reflecting it will be scraped by Prometheus at most 15 seconds after T (plus up to almost your scrape timeout interval), and their alert rules should be processed at most 15 seconds or so after that. At this point all alerts with for conditions will have become pending and started counting down, and they will all transition to firing at most 30 seconds apart (plus wiggle room for scrape and rule evaluation slowness). If you give Alertmanager a 45 second group_wait, it's almost certain that you'll get them aggregated together. 30 seconds might be pushing it, but you'll probably make it most of the time; you would have to be pretty unlucky for one scrape to happen immediately after T with an alert rule evaluation right after it (so that it becomes pending at T+2 or so), then another metric scrape at T+14.9 seconds, have that scrape be slow, and then only get its alert rules evaluated at, say, T+33.
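
The same worst-case reasoning can be written out as arithmetic. Another toy Python sketch, using the 15-second intervals from this example:

```python
# Worst-case spread between when two alerts for the same event reach the
# firing state, when both are on the same scrape and evaluation cadence.
# Durations in seconds; the values match the 15s/15s example above.

scrape_interval = 15
evaluation_interval = 15

# Best case: one alert's metric is scraped right after the event and its
# rule is evaluated immediately, so its for: countdown starts at ~T+0.
earliest_start = 0

# Worst case: another alert's metric is scraped almost a full
# scrape_interval later, then waits almost a full evaluation_interval
# before its rule is evaluated.
latest_start = scrape_interval + evaluation_interval

# Equal for: durations mean the spread in countdown start times is also
# the spread in when the alerts become firing.
spread = latest_start - earliest_start
print(spread)  # → 30
```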

Things are clearly different if you have one scrape source on a 15 second scrape_interval (perhaps an actual metrics point) and another one on a one or two minute scrape_interval or update interval (perhaps an expensive blackbox check, or perhaps a Pushgateway metric that only gets updated once a minute from cron). Here you'd have to be lucky to have Alertmanager aggregate the alerts together with a 30 or 45 second group_wait time.

(One thing that was useful here is Prometheus: understanding the delays on alerting, which has pictures. It dates from 2016 so some of the syntax is a bit different, but the concepts don't seem to have changed.)

PS: As I found when testing this, Alertmanager does not send out any message if it receives a firing alert from Prometheus and the alert then goes away before the group_wait period is up, not even a 'this was resolved' message if you have send_resolved turned on. This is reasonable from one perspective and potentially irritating from another, depending on what you want.

sysadmin/PrometheusAlertDelays written at 23:53:14

Link: Vectorized Emulation [of CPUs and virtual machines]

Vectorized Emulation: Hardware accelerated taint tracking at 2 trillion instructions per second (via) is about, well, let me quote from the introduction rather than try to further summarize it:

In this blog I’m going to introduce you to a concept I’ve been working on for almost 2 years now. Vectorized emulation. The goal is to take standard applications and JIT them to their AVX-512 equivalent such that we can fuzz 16 VMs at a time per thread. The net result of this work allows for high performance fuzzing (approx 40 billion to 120 billion instructions per second [the 2 trillion clickbait number is theoretical maximum]) depending on the target, while gathering differential coverage on code, register, and memory state.

Naturally you need to do all sorts of interesting tricks to make this work. The entry is an overview, and the author is going to write more entries later on the details of various aspects of it, which I'm certainly looking forward to even if I'm not necessarily going to fully follow the details.

I found this interesting both by itself and for giving me some more insight into modern SIMD instructions and what goes into using them. SIMD and GPU computing feel like something that I should understand some day.

(I find SIMD kind of mind bending and I've never really dug into how modern x86 machines do this sort of stuff and what you use it for.)

links/VectorizedEmulation written at 20:06:28

Why you should be willing to believe that ed(1) is a good editor

Among the reactions to my entry on how ed(1) is no longer a good editor today were people wondering out loud whether ed was ever a good editor. My answer is that yes, ed is and was a good editor in the right situations, and I intend to write an entry about that. But before I write about why ed is a good editor, I need to write about why you should be willing to believe that it is. To put it simply, why you should believe that ed is a good editor has nothing to do with its technical merits and everything to do with its history.

Ed was created and nurtured by the same core Bell Labs people who created Unix, people like Dennis Ritchie and Ken Thompson. Ed wasn't their first editor; instead, it was the end product of a whole series of iterations of the same fundamental idea, created in the computing environment of the late 1960s and early to mid 1970s. The Bell Labs Unix people behind ed were smart, knew what they were doing, had done this many times before, had good taste, were picky about their tools, used ed a great deal themselves, and were not afraid to completely redo Unix programs that they felt were not up to what they should be (the Unix shell was completely redesigned from the ground up between V6 and V7, for example). And what these people produced and used was ed, not anything else, even though it's clear that they could have had something else if they'd wanted it and they certainly knew that other options were possible. Ed is clearly not the product of limited knowledge, imagination, skill, taste, or indifference to how good the program was.

It's certainly possible to believe that the Bell Labs Research Unix people had no taste in general, if you dislike Unix as a whole; in that case, ed is one more brick in the wall. But if you like Unix and think that V7 Unix is well designed and full of good work, it seems a bit of a stretch to believe that all of the Bell Labs people were so uniquely blinded that they made a great Unix but a bad editor, one that they didn't recognize as such even though they used it to write the entire system.

Nor do I think that resource constraints are a convincing explanation. While the very limited hardware of the very earliest Unix machines might have forced early versions of ed to be more limited than prior editors like QED, by the time of V7, Bell Labs was running Unix on reasonably good hardware for the time.

The conclusion is inescapable. The people at Bell Labs who created Unix found ed to be a good editor. Since they got so much else right and saw so many things so clearly, perhaps we should consider that ed itself has merits that we don't see today, or don't see as acutely as they did back then.

unix/EdBelieveGoodEditor written at 00:42:26
