My new solution for quiet monitoring of our Prometheus alerts

November 22, 2024

Our Prometheus setup delivers all alert messages through email, because we do everything through email (as a first approximation). As we saw yesterday, doing everything through email has problems when your central email server isn't responding; Prometheus raised alerts about the problems but couldn't deliver them via email because the core system necessary to deliver email wasn't working. Today, I built myself a little X-based system to get around that, using the same approach as my non-interrupting notification of new email.

At a high level, what I now have is an xlbiff based notification of our current Prometheus alerts. If there are no alerts, everything is quiet. If new alerts appear, xlbiff will pop up a text window over in the corner of my screen with a summary of what hosts have what alerts; I can click the window to dismiss it. If the current set of alerts changes, xlbiff will re-display the alerts. I currently have xlbiff set to check the alerts every 45 seconds, and I may lengthen that at some point.

(The current frequent checking is because of what started all of this; if there are problems with our email alert notifications, I want to know about them pretty promptly.)

The work of fetching, checking, and formatting alerts is done by a Python program I wrote. To get the alerts, I directly query our Prometheus server rather than talking to Alertmanager; as a side effect, this lets me see pending alerts as well (although then I have to have the Python program ignore a bunch of pending alerts that are too flaky). I don't try to do the ignoring with clever PromQL queries; instead the Python program gets everything and does the filtering itself.
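
As a rough sketch of the idea (not my actual program), the fetching and filtering could look like the following. It uses Prometheus's standard /api/v1/query HTTP API; the server URL and the set of 'too flaky' pending alert names are made-up placeholders.

# Simplified sketch: pull all current alerts from Prometheus's query API
# and drop pending alerts that we consider too flaky to care about.
# The server URL and the flaky alert names are placeholders.
import requests

PROM_URL = "http://prometheus.example.org:9090"
FLAKY_PENDING = {"SomeFlakyAlert", "AnotherFlakyAlert"}

def fetch_alerts(query="ALERTS"):
    r = requests.get(PROM_URL + "/api/v1/query",
                     params={"query": query}, timeout=30)
    r.raise_for_status()
    return r.json()["data"]["result"]

def interesting_alerts():
    kept = []
    for sample in fetch_alerts():
        labels = sample["metric"]
        # Skip pending alerts we've decided are too flaky to report.
        if labels["alertstate"] == "pending" and \
           labels["alertname"] in FLAKY_PENDING:
            continue
        kept.append(labels)
    return kept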

Pulling the current alerts directly from Prometheus means that I can't readily access the explanatory text we add as annotations (and that then appears in our alert notification emails), but for the purposes of a simple notification that these alerts exist, the name of the alert or other information from the labels is good enough. This isn't intended to give me full details about the alerts, just to let me know what's out there. Most of the time I'll get email about the alert (or alerts) soon anyway, and if not I can directly look at our dashboards and Alertmanager.

To support this sort of thing, xlbiff has the notion of a 'check' program that can print out a number every time it runs, and will get passed the last invocation's number on the command line (or '0' at the start). Using this requires boiling down the state of the current alerts to a single signed 32-bit number. I could have used something like the count of current alerts, but me being me I decided to be more clever. The program takes the start time of every current alert (from the ALERTS_FOR_STATE Prometheus metric), subtracts a starting epoch to make sure we're not going to overflow, and adds them all up to be the state number (which I call a 'checksum' in my code because I started out thinking about more complex tricks like running my output text through CRC32).

(As a minor wrinkle, I add one second to the start time of every firing alert so that when alerts go from pending to firing the state changes and xlbiff will re-display things. I did this because pending and firing alerts are presented differently in the text output.)
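
Put together, the state number computation is roughly the following sketch. It reads the previous number from the command line argument that xlbiff passes to the check program and prints the new one; the starting epoch and the example alerts are made-up placeholders.

# Simplified sketch: boil the current alerts down to one number for xlbiff.
# Each alert contributes its start time minus a fixed starting epoch, with
# firing alerts nudged by one second so pending -> firing changes the number.
import sys

START_EPOCH = 1_700_000_000  # arbitrary recent timestamp (a placeholder)

def state_number(alerts):
    # 'alerts' is a sequence of (start_time_in_unix_seconds, alertstate) pairs.
    total = 0
    for start, state in alerts:
        start = int(start)
        if state == "firing":
            start += 1
        total += start - START_EPOCH
    return total

if __name__ == "__main__":
    # xlbiff hands the check program the previous run's number (or '0').
    previous = int(sys.argv[1]) if len(sys.argv) > 1 else 0
    # The real program gets the alert list from Prometheus; this fakes two.
    current = state_number([(1_732_300_000, "firing"),
                            (1_732_300_120, "pending")])
    # If current differs from previous, the set of alerts has changed.
    print(current)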

To get both the start time and the alert state, I have to use the usual trick for pulling in extra labels:

ALERTS_FOR_STATE * ignoring(alertstate) group_left(alertstate) ALERTS

I understand why ALERTS_FOR_STATE doesn't include the alert state, but sometimes it does force you to go out of your way.
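
As an illustration (again a sketch, with a placeholder server URL), the result of that query can be unpacked like this; each sample's value is the ALERTS_FOR_STATE value, which is the alert's start time in Unix seconds, and the alertstate label is the one pulled over from ALERTS.

# Simplified sketch: run the joined query and pull out each alert's name,
# state, and start time for later formatting and checksumming.
import requests

PROM_URL = "http://prometheus.example.org:9090"  # placeholder
QUERY = ("ALERTS_FOR_STATE * ignoring(alertstate) "
         "group_left(alertstate) ALERTS")

def alert_states():
    r = requests.get(PROM_URL + "/api/v1/query",
                     params={"query": QUERY}, timeout=30)
    r.raise_for_status()
    for sample in r.json()["data"]["result"]:
        labels = sample["metric"]
        # The sample value is the alert's start time in Unix seconds.
        start_time = float(sample["value"][1])
        yield (labels["alertname"], labels.get("instance", ""),
               labels["alertstate"], start_time)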

PS: If we had alerts going off all of the time, this would be far too obtrusive an approach. Instead, our default state is that there are no alerts happening, so this alert notifier spends most of its time displaying nothing (well, having no visible window, which is even better).


Comments on this page:

By Perry Lorier at 2024-11-24 17:15:35:

We had a similar problem, and we had a bit more control over the full end to end. We have a test which always fails, which triggers an alert that always fires, which generates a page into a queue (but doesn't generate emails or pagers). Then we have our pager android app poll the queue, and if it doesn't see a new recent alert, it generates a page locally saying "Monitoring is not paging" (or words to that effect).

This tests the full end to end from the test runner all the way through to page delivery on the phone. If any part in the chain breaks, the phone autonomously alerts us that monitoring is broken.
