Some things about Prometheus Alertmanager's notification metrics
Alertmanager exposes some metrics that are potentially useful if you want a more complete view of its health than just whether or not it's up. However, some of the metrics surrounding its notifications aren't clearly named or documented, which left me with questions about them in yesterday's entry. Today I went looking in the code, and here is the current state of affairs (as of the start of 2022).
Alertmanager has two sets of metrics about total and failed notifications (and one set about their latency); one set is about notification_requests, and one set is about plain notifications. Although the latency histogram buckets don't have the 'requests' bit in their metrics names, they are actually about notification requests, not plain notifications.
An Alertmanager 'notification', as counted by these metrics, is a high level attempt to deliver a (new) message about changed alert state to some integration (for some receiver). A notification request is the concrete attempt to prepare and submit the notification message to the particular integration delivery point. In the process of submitting this notification request to the endpoint, Alertmanager can experience either temporary failures (your SMTP server isn't accepting connections) or permanent ones (a template has errors). If there are temporary errors, Alertmanager will make some number of retries. Each such retry is a new notification request, but not a new notification.
(I believe that retries happen with backoff until the next group_interval, at which point the entire notification is aborted, but I'm not convinced I fully understand the code. The notification code and its metrics live in the notify package, while I believe that the Go context that's canceled to stop notification retries comes from dispatch/dispatch.go's aggrGroup.)
The one complication in this is that if all of the alerts in the notification have been resolved (with no firing ones left), and you don't have send_resolved set to true for the integration (in the configuration of that receiver), then there will be a 'notification' but Alertmanager will make no notification requests. This can cause the number of notifications to slowly rise above the number of notification requests. If you have notification request retries, the reverse can happen (but these are always accompanied by notification request failures).
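For reference, send_resolved is controlled per integration in the receiver's configuration. A hypothetical fragment (names invented for illustration):

```yaml
receivers:
  - name: "example-team"
    email_configs:
      - to: "oncall@example.com"
        # With this true, resolved-only notifications still produce
        # notification requests; with it false (the default for email),
        # they produce a 'notification' but no requests.
        send_resolved: true
```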
If Alertmanager reports plain notification failures, in the metric alertmanager_notifications_failed_total, I believe that you've missed some alerts (through whatever specific integration the label says). If Alertmanager reports that notification requests are failing but there are no plain notification failures, you have an issue that Alertmanager thinks is temporary and is retrying. As far as I can see, you can't have a notification failure without having at least one notification request failure, reported in the metric alertmanager_notification_requests_failed_total.
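One way to turn this distinction into alerting is a pair of PromQL expressions (a sketch; the rate windows and the zero thresholds are arbitrary choices, not recommendations):

```
# Notifications are failing outright; some alerts were probably missed.
rate(alertmanager_notifications_failed_total[5m]) > 0

# Requests are failing but Alertmanager considers it temporary and is retrying.
rate(alertmanager_notification_requests_failed_total[5m]) > 0
```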
For Alertmanager email notification requests specifically, most SMTP time failures appear to be considered temporary, and this is done without checking something like the SMTP error code. If I'm correctly reading the code in notify/email/email.go, the exception is if you experience any sort of error while sending the actual text of the email itself (either the headers or the body). The moral of this story appears to be that your email server had better accept everything (and not crash) once it's given a positive reply to Alertmanager's initial DATA command. Template expansion failures are permanent and are not retried, but if your SMTP server doesn't accept the MAIL FROM or RCPT TO addresses (or commands), it's a temporary error and Alertmanager will retry.
Sidebar: All of the metrics names for Internet search purposes
The general notification metrics are:

alertmanager_notifications_total
alertmanager_notifications_failed_total

The notification request metrics are:

alertmanager_notification_requests_total
alertmanager_notification_requests_failed_total
alertmanager_notification_latency_seconds (a histogram, so it has _bucket, _sum, and _count series)

All of these carry an integration label.
I believe that the latency histogram count metric should be the same as the total notification requests metric. Failed notification requests are included in the latency metrics, which means that things like SMTP timeouts could drive up your overall latency.
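If that's right, these two series should normally match per integration (a sketch; in PromQL a histogram's count of observations gets a _count suffix):

```
alertmanager_notification_latency_seconds_count == alertmanager_notification_requests_total
```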