The complexity of seeing if your Prometheus Alertmanager is truly healthy

January 10, 2022

We have a fairly straightforward external check for our Alertmanager being up (as part of our general setup), and I thought we were pretty well covered with it. Then over on Twitter I got into an educational thread with Augie Fackler, who'd had an interesting Alertmanager failure mode:

alertmanager was up, but had a borked smtp config (old hostname that got deleted). As a result alertmanager's UI acted like it had no alerts, which is super confusing to me as a failure mode.

Our basic monitoring wouldn't have caught this, because all it looks for is that Alertmanager is up at all. Once you start worrying about things that could be wrong in your Alertmanager configuration and its operation, there's a lot that could be going wrong. In our own configuration, I've seen template expansion errors and incorrect SMTP parameters, and occasionally I've made syntax errors in the Alertmanager configuration such that a reload failed.

One of the options is a constantly firing "deadman" alert, which led to David Leadbeater mentioning an interesting program, the Prometheus Monitoring Safety Device (PromMSD). To quote from its readme, PromMSD is designed to (constantly) receive an alert from your Alertmanager; if it doesn't get an alert, it raises one. PromMSD is a good safety check and it has the advantage that you don't have to build anything yourself, but in an environment like ours it doesn't quite do an end-to-end check, because the alert sent to PromMSD doesn't use the same alert delivery mechanism as our real alerts. PromMSD will detect a lot, but it might not have detected Augie Fackler's SMTP issue.
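As a sketch of the "constantly firing" half of a deadman setup (the alert name, label, and group name here are my own inventions, and the PromMSD or watchdog side is configured separately in Alertmanager's routing):

```yaml
# Prometheus alerting rule that always fires, so Alertmanager always
# has something to deliver to the watchdog receiver. If the watchdog
# stops hearing from Alertmanager, something in the pipeline is broken.
groups:
  - name: deadman
    rules:
      - alert: DeadMansSwitch        # hypothetical name
        expr: vector(1)              # always true, so always firing
        labels:
          severity: deadman          # hypothetical routing label
        annotations:
          summary: "Always-firing watchdog alert; its absence means trouble"
```

In Alertmanager you would then route alerts with this label to a webhook receiver pointing at your watchdog, with a short repeat_interval so that silence is noticed quickly.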

Alertmanager itself has some metrics that may give you clues, not all of which are as useful as they look. One metric that isn't currently useful is 'alertmanager_alerts{state="active"}', which currently counts up endlessly, as covered in issue #2619 and issue #1439 (at least).

Some potentially useful Alertmanager metrics that I may want to monitor are:

alertmanager_config_last_reload_successful
Was the last configuration reload successful. If this is zero for a while, it probably means that I failed to notice that my 'amtool check-config' actually failed, triggered a reload anyway, and walked away.

alertmanager_notifications_failed_total
A per-integration count of how many notifications failed. An integration is 'email' or 'webhook', rather than a particular destination or configuration, so this is very broad. But in many cases you expect this to be zero, for example if you're sending only to your own captive receivers (email or webhook) that should always be up.
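Sketches of alert rules for these two metrics (the alert names, the 'for' duration, and the one-hour window are my own choices, not anything standard):

```yaml
# Hypothetical Prometheus alerting rules for the metrics above.
- alert: AlertmanagerReloadFailed
  # Alertmanager is running on a stale configuration because its
  # last reload attempt failed.
  expr: alertmanager_config_last_reload_successful == 0
  for: 10m
  annotations:
    summary: "Alertmanager failed to reload its configuration"

- alert: AlertmanagerNotificationsFailing
  # Any notification failure in the past hour; the metric's
  # 'integration' label is 'email', 'webhook', and so on.
  expr: increase(alertmanager_notifications_failed_total[1h]) > 0
  annotations:
    summary: "Alertmanager {{ $labels.integration }} notifications are failing"
```

Of course, if what's broken is notification delivery itself, these alerts may never reach you, which is part of why an external check or a deadman setup is still useful.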

There's also a set of metrics for notification requests. I don't know what the difference is between notifications and notification requests, but in our Alertmanager, there are fewer total notification requests than there are notifications.

I don't think there's any specific metric for template expansion failures. It might fall under the general one of notification failures. I generally don't touch our templates these days so I'm not as nervous about this as I used to be.

Sidebar: SMTP is complicated to track and report failures for

Our Alertmanager configuration can use two different SMTP servers and send alerts to a variety of different email addresses, including user-provided ones. There are no less than eight different Alertmanager receivers involved in all of this, and although today they all use the same sender address, that could change in the future. A given SMTP server we try to use could be bad in general, it could refuse some but not all sender addresses, or it could refuse some but not all recipient addresses. If Alertmanager wanted to provide granular information on SMTP notification failures, how should it split it up?

I sort of wish that Alertmanager would provide a set of failure numbers on a per-receiver basis, or better yet on the combination of receiver and integration (so 'x receiver, for email'). However, I guess the answer is to monitor for total failures and then go check the logs if any are reported.
