Our Prometheus alerting problem if our central mail server isn't working

November 21, 2024

Over on the Fediverse, I said something:

Ah yes, the one problem that our Prometheus based alert system can't send us alert email about: when the central mail server explodes. Who rings the bell to tell you that the bell isn't working?

(This is of course an aspect of monitoring your Prometheus setup itself, and also seeing if Alertmanager is truly healthy.)

There is a story here. The short version of the story is that today we wound up with a mail loop that completely swamped our central Exim mail server, briefly running its one minute load average up to a high water mark of 3,132 before a co-worker who'd noticed the problem forcefully power cycled it. Plenty of alerts fired during the incident, but since we do all of our alert notification via email and our central email server wasn't delivering very much email (on account of that load average, among other factors), we didn't receive any.

The first thing to note is that this is a narrow and short term problem for us (which is to say, me and my co-workers). On the short term side, we send and receive enough email that not receiving email for very long during working hours is unusual enough that someone would have noticed before too long; in fact, my co-worker noticed the problems even without an alert actively being triggered. On the narrow side, I failed to notice this as it was going on because the system stayed up; it just wasn't responsive. Once the system was rebooting, I noticed almost immediately because I was in the office and some of the windows on my office desktop disappeared.

(In an older version of my desktop I would have noticed the issue right away, because an xload for the machine in question was right in the middle of these things. These days it's way off to the right side, out of my routine view, but I could change that back.)

One obvious approach is some additional delivery channel for alerts about our central mail server. Unfortunately, we're entirely email focused; we don't currently use Slack, Teams, or other online chatting systems, so sending selected alerts to any of them is out as a practical option. We do have work smartphones, so in theory we could send SMS messages; in practice, free email-to-SMS gateways have basically vanished, so we'd have to pay for something (either for direct SMS access, with us building some sort of system on top, or for a SaaS provider who would take some sort of notification and arrange to deliver it via SMS).

For myself, I could probably build some sort of script or program that regularly polled our Prometheus server to see if there were any relevant alerts. If there were, the program would signal me somehow, either by changing the appearance of a status window in a relatively unobtrusive way (eg turning it red) or popping up some sort of notification (perhaps I could build something around a creative use of xlbiff to display recent alerts, although this isn't as simple as it looks).

(This particular idea is a bit of a trap, because I could spend a lot of time crafting a little X program that, for example, had a row of boxes that were green, yellow, or red depending on the alert state of various really important things.)
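
As a rough illustration of the polling idea, here is a minimal sketch in Python. It assumes the Prometheus server's standard HTTP API endpoint /api/v1/alerts is reachable from my desktop. The server URL and the alert names are made up for the example, and "signalling" is reduced to printing a status line that some other program could turn into a red window or a notification.

```python
import json
import time
import urllib.request

# Hypothetical values for illustration; neither comes from our real setup.
PROMETHEUS_URL = "http://prometheus.example.org:9090"
WATCHED_ALERTS = {"MailServerDown", "MailQueueTooLarge"}


def firing_alerts(payload, watched=None):
    """Return the sorted names of firing alerts in an /api/v1/alerts reply.

    If 'watched' is given, only alerts with those names are reported.
    """
    if payload.get("status") != "success":
        raise ValueError("Prometheus query failed: %r" % (payload.get("error"),))
    names = set()
    for alert in payload.get("data", {}).get("alerts", []):
        if alert.get("state") != "firing":
            continue  # ignore alerts that are only pending
        name = alert.get("labels", {}).get("alertname", "")
        if watched is None or name in watched:
            names.add(name)
    return sorted(names)


def fetch_alerts(base_url=PROMETHEUS_URL, timeout=10):
    """Fetch and decode the current alert list from Prometheus."""
    with urllib.request.urlopen(base_url + "/api/v1/alerts", timeout=timeout) as resp:
        return json.load(resp)


def poll_forever(interval=60):
    """Print one status line per poll; a GUI wrapper could react to it."""
    while True:
        try:
            names = firing_alerts(fetch_alerts(), WATCHED_ALERTS)
            status = "ALERT: " + ", ".join(names) if names else "ok"
        except (OSError, ValueError) as exc:
            # Not being able to ask Prometheus is itself worth showing.
            status = "UNKNOWN: cannot query Prometheus (%s)" % exc
        print(status, flush=True)
        time.sleep(interval)
```

Calling poll_forever() starts the loop. Treating "can't talk to Prometheus" as its own visible state is deliberate, since a silent poller would recreate the original problem of the bell that can't report it is broken.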


Comments on this page:

By dana at 2024-11-22 00:30:13:

One obvious approach is some additional delivery channel for alerts about our central mail server. Unfortunately, we're entirely email focused

The idea of a secondary e-mail server comes to my mind. I imagine y'all wouldn't have much trouble setting up a tiny mostly-internal one; you could probably skip spam-filtering, DKIM, web-mail, and all that complexity. Give an account to each admin, who'll set their work phone to poll it and warn them if the polling fails.

How about using a self-hosted instance of Ntfy?

It can expose an (unauthenticated) inbound SMTP endpoint as an alternative to the HTTP-based API for alert submission, and then you can use either the Android or iOS application for receiving alerts, or integrate it any other way.

For example I use it with great success for my own homelab for SMARTd, MDadm, UPS and other alerts.

My approach has been to use a third-party email-to-SMS service. The monitoring server, having been told which monitored services are critical to the mail delivery path, will use SMS-via-email to notify regarding failures in those services.

This means the monitoring server must be able to deliver email in its own right, rather than using the enterprise server as smarthost, but I find that a small additional cost.

In one case, where for various reasons the above was not possible, we attached a GSM modem to the monitoring server via RS232, gave it a PAYG SIM, and had it deliver SMSes directly via GSM. That, of course, meant writing a service check for PAYG balance, but that wasn't too tricky.

Written on 21 November 2024.

Last modified: Thu Nov 21 23:04:14 2024