The complexity of seeing if your Prometheus Alertmanager is truly healthy
We have a fairly straightforward external check for our Alertmanager being up (as part of our general setup), and I thought we were pretty well covered with it. Then over on Twitter I got into an educational thread with Augie Fackler, who'd had an interesting Alertmanager failure mode:
alertmanager was up, but had a borked smtp config (old hostname that got deleted). As a result alertmanager's UI acted like it had no alerts, which is super confusing to me as a failure mode.
Our basic monitoring wouldn't have caught this, because all it looks for is that Alertmanager is up at all. Once you start worrying about your Alertmanager's configuration and operation, there's a lot that can go wrong. In our own configuration, I've seen template expansion errors and incorrect SMTP parameters, and occasionally I've made syntax errors in the Alertmanager configuration such that a reload failed.
One of the options is a constantly firing "deadman" alert, which led to David Leadbeater mentioning an interesting program, the Prometheus Monitoring Safety Device (PromMSD). To quote from its readme, PromMSD is designed to (constantly) receive an alert from your Alertmanager; if it doesn't get an alert, it raises one. PromMSD is a good safety check and it has the advantage that you don't have to build any additional tools yourself, but in an environment like ours it doesn't quite do an end-to-end check, because the alert sent to PromMSD doesn't use the same delivery mechanism as our regular alerts. PromMSD will detect a lot, but it might not have detected Augie Fackler's SMTP issue.
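To make the deadman pattern concrete, here's a minimal sketch of the Prometheus side: an alerting rule that always fires (the rule name and labels are my own illustration, not anything PromMSD requires; its readme covers the actual setup):

    groups:
      - name: deadman
        rules:
          - alert: DeadMansSwitch
            # vector(1) always evaluates to a value, so this alert
            # fires forever and never resolves.
            expr: vector(1)
            labels:
              severity: deadman

On the Alertmanager side you would then route these alerts to a webhook receiver pointing at PromMSD, which starts worrying if they stop arriving.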
Alertmanager itself has some metrics that may give you clues, not all of which are as useful as they look. One metric that isn't currently useful is 'alertmanager_alerts{state="active"}', which currently counts up endlessly instead of tracking the number of active alerts, as covered in issue #2619 and issue #1439 (at least).
Some potentially useful Alertmanager metrics that I may want to monitor (there's a sketch of alert rules for them after the list) are:

- alertmanager_config_last_reload_successful: Was the last configuration reload successful. If this is zero for a while, I failed to notice that my 'amtool check-config' actually failed, triggered a reload, and walked away.
- alertmanager_notifications_failed_total: A per-integration count of how many notifications failed. An integration is 'email' or 'webhook', rather than particular destinations or configurations, so this is very broad. But in many cases you expect this to be zero, if you're sending only to your own captive receivers (email or webhook) that should always be up.
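As a sketch of how one might watch these two metrics (the alert names, 'for' duration, and time window here are my own choices, not anything official):

    groups:
      - name: alertmanager-health
        rules:
          - alert: AlertmanagerReloadFailed
            # 0 means the last configuration reload failed; waiting a
            # while avoids firing during a quick fix-and-reload cycle.
            expr: alertmanager_config_last_reload_successful == 0
            for: 10m
          - alert: AlertmanagerNotificationsFailing
            # Any failed notifications over the past hour, broken out
            # by integration ('email', 'webhook', and so on).
            expr: increase(alertmanager_notifications_failed_total[1h]) > 0

There's an obvious catch here, of course: if Alertmanager's notification path is broken, these alerts about the breakage may never reach you, which is part of why an outside check like PromMSD is attractive.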
There's also a set of metrics for notification requests. I don't know what the difference is between notifications and notification requests, but in our Alertmanager, there are fewer total notification requests than there are notifications.
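If you're curious about this in your own Alertmanager, the two counter families can be compared per integration; a sketch using the standard metric names:

    sum by (integration) (alertmanager_notifications_total)
    sum by (integration) (alertmanager_notification_requests_total)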
I don't think there's any specific metric for template expansion failures; they might just fall under the general count of notification failures. I generally don't touch our templates these days, so I'm not as nervous about this as I used to be.
Sidebar: SMTP is complicated to track and report failures for
Our Alertmanager configuration can use two different SMTP servers and send alerts to a variety of different email addresses, including user-provided ones. There are no less than eight different Alertmanager receivers involved in all of this, and although today they all use the same sender address, that could change in the future. A given SMTP server we try to use could be bad in general, it could refuse some but not all sender addresses, or it could refuse some but not all recipient addresses. If Alertmanager wanted to provide granular information on SMTP notification failures, how should it split it up?
I sort of wish that Alertmanager would provide a set of failure numbers on a per-receiver basis, or better yet for the combination of receiver and integration (so 'receiver X, for email'). However, I guess the answer is to monitor for total failures and then go check the logs if any are reported.