2018-10-18
Some things on delays and timings for Prometheus alerts
One of the things I've been doing lately is testing and experimenting with Prometheus alerts. When you're testing alerts you become quite interested in having your alerts fire and clear rapidly, so you can iterate tests rapidly; it is no fun sitting there for five or ten minutes waiting for everything to reset so you can try something new from a clean slate. Also, of course, I have been thinking about how we want to set various alert-related parameters in an eventual production deployment.
Let's start with the timeline of a simple case, where an event produces a single alert:
1. The actual event happens (your service falls over).
2. Prometheus notices when it next scrapes (or tries to scrape) the service's metrics. This may be up to your scrape_interval later, if your timing is unlucky. At this point the event is visible in Prometheus metrics.
3. Prometheus evaluates alerting rules and realizes that this is alertable. This may be up to your evaluation_interval later. If the alert rule has no 'for: <duration>' clause, the alert is immediately firing (and we go to #5); otherwise, it is pending. At this point the alert's existence appears in Prometheus's ALERTS metric, which means that your dashboards can potentially show it as an alert (if they refresh themselves, or you tell them to).
4. If the alert is pending, Prometheus continues to check the alert rule; if it remains true in every check made through your for: duration, the alert becomes firing. This takes at least your for: duration, maybe a bit more. Prometheus uses whatever set of metrics it has on hand at the time it makes each of these checks, and presumably they happen every evaluation_interval as part of alert rule evaluation. This means that there isn't much point to a for: duration that is too short to allow for a second metrics scrape. Sure, you may check the alert rule more than once, but you're checking with the same set of metrics and you're going to get the same answer; you're just stalling a bit. (So, really, I think you should think of the for: duration as 'how many scrapes do I want this to have to be true for'. Then add a bit more time for the delays involved in scrapes and rule evaluations. There's a minimal sketch of these settings right after this list.)
5. Prometheus sends the firing alert to Alertmanager, and will continue to do so periodically while it's firing (cf).
6. Alertmanager figures out the grouping for the alert. If the grouping has a group_wait duration, it starts waiting for that much time.
7. If the alert is (still) firing at the end of the group_wait period, Alertmanager sends notifications. You finally know that the event is happening (if you haven't already noticed from your dashboards, your own use of your service, or people telling you).
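To make these settings concrete, here's a minimal sketch of where they live in configuration. The interval values, the job name, the alert name, and the expression are all made-up illustrations, not settings from any real deployment:

```yaml
# prometheus.yml fragment: the two intervals from the timeline above.
global:
  scrape_interval: 15s       # how often targets get scraped (step 2)
  evaluation_interval: 15s   # how often alerting rules get evaluated (step 3)

# rules file fragment: a hypothetical alert with a 'for:' duration,
# which is what holds it in 'pending' before it becomes 'firing' (step 4).
groups:
  - name: example-alerts
    rules:
      - alert: ServiceDown
        expr: up{job="myservice"} == 0
        for: 1m
        annotations:
          summary: "myservice has not responded to its recent scrapes"
```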
The end to end delay for this alert is a composite of the event to scrape delay, the scrape to rule evaluation delay, at least your for duration with maybe a bit more time as you wait for the next alert rule evaluation (if you use a for: duration), and the Alertmanager group_wait duration (if you have one). If you want fast alerts, you will have to figure out where you can chop out time. Alternately, if you're okay with slow alerts if it gets you advantages, you need to think about which advantages you care about and what the tradeoffs are.
The obvious places to add or cut time are in Prometheus's for and Alertmanager's group_wait. There's a non-obvious and not entirely documented difference between them, which is that Alertmanager only cares about the state of the alert at the start and at the end of the group_wait time; unlike Prometheus, it doesn't care if the alert stops being firing for a while in the middle. This means that only Prometheus can de-bounce flapping alerts. However, if your alerts don't flap and you're only waiting in the hopes that they'll cure themselves, in theory you can wait in either place. Well, if you have a simple situation.
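For contrast with the Prometheus-side for:, group_wait is something you set on Alertmanager's routes. This is only a sketch with made-up receiver names and times, but it shows roughly where that waiting happens:

```yaml
# alertmanager.yml fragment: group_wait is a routing setting.
route:
  receiver: default-email   # hypothetical receiver, defined elsewhere
  group_by: ['alertname']
  group_wait: 30s      # how long to hold a new alert group before notifying
  group_interval: 5m   # how long before notifying about new alerts added to the group
  repeat_interval: 4h  # how long before re-sending a still-firing notification
```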
Now suppose that your service falling over creates multiple alerts that are driven by different metric scrapes, and you would like to aggregate them together into one alert message. Here it matters where you wait, because the firing alerts can only be aggregated together once they reach Alertmanager, and only if they all get there within the group_wait time window. The shorter the group_wait time, the narrower your window for having everything scraped and evaluated in Prometheus so that all the alerts escape their for durations close enough to each other for Alertmanager.
(Or you could get lucky, with both sets of metrics scraped sufficiently close to each other that their alert rules are both processed in the same alert rule evaluation cycle, so that their for waits start and end at the same time and they'll go to Alertmanager together.)
So, I think that the more you care about alert aggregation, the more you need to shift your delay time from Prometheus's for to Alertmanager's group_wait. To get a short group_wait and still reliably aggregate alerts together, I think you need to set up your scrape and rule evaluation intervals so that different metrics scrapes are all reliably ingested and processed within the group_wait interval.
Suppose, for example, that you have both set to 15 seconds. Then when an event begins at time T, you know that all metrics reflecting it will be scraped by Prometheus at most 15 seconds after T (plus up to almost your scrape timeout interval), and their alert rules should be processed at most 15 seconds or so after that. At this point all alerts with for conditions will have become pending and started counting down, and they will all transition to firing at most 30 seconds apart (plus wiggle room for scrape and rule evaluation slowness). If you give Alertmanager a 45 second group_wait, it's almost certain that you'll get them aggregated together. 30 seconds might be pushing it, but you'll probably make it most of the time; you would have to be pretty unlucky for one scrape to happen immediately after T with an alert rule evaluation right after it (so that its alert becomes pending at T+2 or so), then another metric scrape at T+14.9 seconds, have that scrape be slow, and then only get its alert rules evaluated at, say, T+33.
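Sketched out as configuration, the combination in this example would look something like the following. The 15 second and 45 second values are just the numbers from the example, not a recommendation, and the receiver name is invented:

```yaml
# prometheus.yml fragment
global:
  scrape_interval: 15s
  evaluation_interval: 15s

# alertmanager.yml fragment
route:
  receiver: default-email   # hypothetical receiver
  group_by: ['alertname']
  group_wait: 45s           # comfortably wider than the worst-case scrape plus evaluation skew
```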
Things are clearly different if you have one scrape source on a 15 second scrape_interval (perhaps an actual metrics point) and another one on a one or two minute scrape_interval or update interval (perhaps an expensive blackbox check, or perhaps a Pushgateway metric that only gets updated once a minute from cron). Here you'd have to be lucky to have Alertmanager aggregate the alerts together with a 30 or 45 second group_wait time.
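The kind of mismatch I mean looks roughly like this; the job names and targets are invented, and a real blackbox or Pushgateway setup would have more to it:

```yaml
# prometheus.yml fragment: a per-job scrape_interval overrides the global one.
scrape_configs:
  - job_name: 'myservice'
    scrape_interval: 15s
    static_configs:
      - targets: ['myservice:9100']

  - job_name: 'expensive-checks'
    scrape_interval: 2m      # scraped far less often than the 15s job
    scrape_timeout: 1m
    static_configs:
      - targets: ['slowcheck:8080']
```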
(One thing that was useful here is Prometheus: understanding the delays on alerting, which has pictures. It dates from 2016 so some of the syntax is a bit different, but the concepts don't seem to have changed.)
PS: As I found out when I tested this, Alertmanager does not send out any message if it receives a firing alert from Prometheus and then the alert goes away before the group_wait period is up, not even a 'this was resolved' message if you have send_resolved turned on. This is reasonable from one perspective and potentially irritating from another, depending on what you want.
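(For what it's worth, send_resolved is set per notification configuration on a receiver; a hypothetical webhook receiver with it turned on might look like this.)

```yaml
# alertmanager.yml fragment: a made-up receiver with send_resolved turned on.
receivers:
  - name: default-webhook
    webhook_configs:
      - url: 'http://alerts.example.org/hook'   # invented endpoint
        send_resolved: true
```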
Link: Vectorized Emulation [of CPUs and virtual machines]
Vectorized Emulation: Hardware accelerated taint tracking at 2 trillion instructions per second (via) is about, well, let me quote from the introduction rather than try to further summarize it:
In this blog I’m going to introduce you to a concept I’ve been working on for almost 2 years now. Vectorized emulation. The goal is to take standard applications and JIT them to their AVX-512 equivalent such that we can fuzz 16 VMs at a time per thread. The net result of this work allows for high performance fuzzing (approx 40 billion to 120 billion instructions per second [the 2 trillion clickbait number is theoretical maximum]) depending on the target, while gathering differential coverage on code, register, and memory state.
Naturally you need to do all sorts of interesting tricks to make this work. The entry is an overview, and the author is going to write more entries later on the details of various aspects of it, which I'm certainly looking forward to even if I'm not necessarily going to fully follow the details.
I found this interesting both by itself and for giving me some more insight into modern SIMD instructions and what goes into using them. SIMD and GPU computing feel like something that I should understand some day.
(I find SIMD kind of mind bending and I've never really dug into how modern x86 machines do this sort of stuff and what you use it for.)
Why you should be willing to believe that ed(1) is a good editor
Among the reactions to my entry on how ed(1) is no longer a good editor today were people wondering out loud if ed was ever a good editor. My answer is that yes, ed is and was a good editor in the right situations, and I intend to write an entry about that.
But before I write about why ed is a good editor, I need to write about why you should be willing to believe that it is. To put it simply, why you should believe that ed is a good editor has nothing to do with its technical merits and everything to do with its history.
Ed was created and nurtured by the same core Bell Labs people who created Unix, people like Dennis Ritchie and Ken Thompson. Ed wasn't their first editor; instead, it was the end product of a whole series of iterations of the same fundamental idea, created in the computing environment of the late 1960s and early to mid 1970s. The Bell Labs Unix people behind ed were smart, knew what they were doing, had done this many times before, had good taste, were picky about their tools, used ed a great deal themselves, and were not afraid to completely redo Unix programs that they felt were not up to what they should be (the Unix shell was completely redesigned from the ground up between V6 and V7, for example). And what these people produced and used was ed, not anything else, even though it's clear that they could have had something else if they'd wanted it and they certainly knew that other options were possible. Ed is clearly not the product of limited knowledge, imagination, skill, taste, or indifference to how good the program was.
It's certainly possible to believe that the Bell Labs Research Unix people had no taste in general, if you dislike Unix as a whole; in that case, ed is one more brick in the wall. But if you like Unix and think that V7 Unix is well designed and full of good work, it seems a bit of a stretch to believe that all of the Bell Labs people were so uniquely blinded that they made a great Unix but a bad editor, one that they didn't recognize as such even though they used it to write the entire system.
Nor do I think that resource constraints are a convincing explanation. While the very limited hardware of the very earliest Unix machines might have forced early versions of ed to be more limited than prior editors like QED, by the time of V7, Bell Labs was running Unix on reasonably good hardware for the time.
The conclusion is inescapable. The people at Bell Labs who created Unix found ed to be a good editor. Since they got so much else right and saw so many things so clearly, perhaps we should consider that ed itself has merits that we don't see today, or don't see as acutely as they did back then.