My new solution for quiet monitoring of our Prometheus alerts
Our Prometheus setup delivers all alert messages through email, because we do everything through email (as a first approximation). As we saw yesterday, doing everything through email has problems when your central email server isn't responding; Prometheus raised alerts about the problems but couldn't deliver them via email because the core system needed to deliver that email wasn't working. Today, I built myself a little X based system to get around that, using the same approach as my non-interrupting notification of new email.
At a high level, what I now have is an xlbiff based notification of our current Prometheus alerts. If there are no alerts, everything is quiet. If new alerts appear, xlbiff will pop up a text window over in the corner of my screen with a summary of what hosts have what alerts; I can click the window to dismiss it. If the current set of alerts changes, xlbiff will re-display the alerts. I currently have xlbiff set to check the alerts every 45 seconds, and I may lengthen that at some point.
(The current frequent checking is because of what started all of this; if there are problems with our email alert notifications, I want to know about it pretty promptly.)
The work of fetching, checking, and formatting alerts is done by a Python program I wrote. To get the alerts, I directly query our Prometheus server rather than talking to Alertmanager; as a side effect, this lets me see pending alerts as well (although then I have to have the Python program ignore a bunch of pending alerts that are too flaky). I don't try to do the ignoring with clever PromQL queries; instead the Python program gets everything and does the filtering itself.
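A minimal sketch of this fetch and filter step looks something like the following; the Prometheus URL and the set of flaky pending alert names are made-up placeholders, and the real program does more than this:

#!/usr/bin/python3
# Sketch only: pull current alerts straight from Prometheus and drop
# pending alerts considered too flaky to bother with. The URL and the
# alert names here are placeholders, not our real values.
import json
import urllib.parse
import urllib.request

PROM_QUERY_URL = "http://prometheus.example.org:9090/api/v1/query"
IGNORE_PENDING = {"SomeFlakyAlert", "AnotherFlakyAlert"}

def prom_query(promql):
    # Run an instant query and return the result vector.
    url = PROM_QUERY_URL + "?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["data"]["result"]

def current_alerts():
    # Each element of ALERTS carries the alert's labels plus
    # 'alertname' and 'alertstate' (pending or firing).
    alerts = []
    for res in prom_query("ALERTS"):
        labels = res["metric"]
        if labels["alertstate"] == "pending" and labels["alertname"] in IGNORE_PENDING:
            continue
        alerts.append(labels)
    return alerts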
Pulling the current alerts directly from Prometheus means that I can't readily access the explanatory text we add as annotations (and that then appears in our alert notification emails), but for the purposes of a simple notification that these alerts exist, the name of the alert or other information from the labels is good enough. This isn't intended to give me full details about the alerts, just to let me know what's out there. Most of the time I'll get email about the alert (or alerts) soon anyway, and if not I can directly look at our dashboards and Alertmanager.
To support this sort of thing, xlbiff has the notion of a 'check' program that can print out a number every time it runs, and will get passed the last invocation's number on the command line (or '0' at the start). Using this requires boiling down the state of the current alerts to a single signed 32-bit number. I could have used something like the count of current alerts, but me being me I decided to be more clever. The program takes the start time of every current alert (from the ALERTS_FOR_STATE Prometheus metric), subtracts a starting epoch to make sure we're not going to overflow, and adds them all up to be the state number (which I call a 'checksum' in my code because I started out thinking about more complex tricks like running my output text through CRC32).
(As a minor wrinkle, I add one second to the start time of every firing alert so that when alerts go from pending to firing the state changes and xlbiff will re-display things. I did this because pending and firing alerts are presented differently in the text output.)
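In code, the state number computation is more or less the following sketch (the starting epoch here is an arbitrary illustrative value, not necessarily the one my program uses):

# Sketch of the state number ('checksum') computation. EPOCH is an
# arbitrary recent timestamp, subtracted from every start time to keep
# the sum within a signed 32-bit number.
EPOCH = 1600000000

def state_number(alerts):
    # 'alerts' is a list of (start_time, alertstate) pairs, produced
    # from the joined query shown below.
    total = 0
    for start, state in alerts:
        n = int(start) - EPOCH
        if state == "firing":
            # Nudge firing alerts by one second so that an alert going
            # from pending to firing changes the total and makes xlbiff
            # re-display things.
            n += 1
        total += n
    return total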
To get both the start time and the alert state in one query, I have to use the usual trick for pulling in extra labels:
ALERTS_FOR_STATE * ignoring(alertstate) group_left(alertstate) ALERTS
I understand why ALERTS_FOR_STATE doesn't include the alert state, but sometimes it does force you to go out of your way.
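Unpacking the results of this query and wiring things up as an xlbiff check program looks roughly like this sketch (reusing the hypothetical prom_query() and state_number() helpers from the earlier sketches):

# Sketch: run the joined query, pull out each alert's start time (the
# metric's value) and its state (the pulled-in label), and print the
# resulting state number for xlbiff.
import sys

JOINED_QUERY = ("ALERTS_FOR_STATE * ignoring(alertstate) "
                "group_left(alertstate) ALERTS")

def alert_states():
    out = []
    for res in prom_query(JOINED_QUERY):
        # An instant query result's value is [eval_time, "<value>"];
        # for ALERTS_FOR_STATE the value is the alert's start time as a
        # Unix timestamp. (The real program also filters out the flaky
        # pending alerts here.)
        start = float(res["value"][1])
        state = res["metric"]["alertstate"]
        out.append((start, state))
    return out

def main():
    # xlbiff passes the previous state number on the command line ('0'
    # the first time) and expects the new number on standard output.
    prev = int(sys.argv[1]) if len(sys.argv) > 1 else 0  # unused in this sketch
    print(state_number(alert_states()))

if __name__ == "__main__":
    main()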
PS: If we had alerts going off all of the time, this would be far too obtrusive an approach. Instead, our default state is that there are no alerts happening, so this alert notifier spends most of its time displaying nothing (well, having no visible window, which is even better).