Some things about Prometheus Alertmanager's notification metrics

January 11, 2022

Alertmanager exposes some metrics that are potentially useful if you want a more complete view of its health than just whether or not it's up. However, some of the metrics surrounding its notifications aren't clearly named or documented, which left me with questions about them in yesterday's entry. Today I went looking in the code, and here is the current state of affairs (as of the start of 2022).

Alertmanager has two sets of metrics about total and failed notifications (and one set about their latency); one set is about notification_requests, and one set is about plain notifications. Although the latency histogram metrics don't have the 'requests' bit in their names, they are actually about notification requests, not plain notifications.
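
If you just want to eyeball all of these counters at once in Prometheus, you can select them by metric name with a regular expression. This is plain PromQL, nothing Alertmanager specific, and the regex below matches both the plain and the 'requests' counters:

# all of the notification counters, plain and 'requests'
{__name__=~"alertmanager_notification.*_total"}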

An Alertmanager 'notification', as counted by these metrics, is a high level attempt to deliver a (new) message about changed alert state to some integration (for some receiver). A notification request is the concrete attempt to prepare and submit the notification message to the particular integration delivery point. In the process of submitting a notification request to the endpoint, Alertmanager can experience either temporary failures (your SMTP server isn't accepting connections) or permanent ones (a template has errors). If there are temporary errors, Alertmanager will make some number of retries. Each such retry is a new notification request, but not a new notification.

(I believe that retries happen with backoff until the next group_interval, at which point the entire notification is aborted, but I'm not convinced I fully understand the code. The notification code and its metrics are in notify/notify.go, while I believe that the Go context that's canceled to stop notification retries comes from dispatch/dispatch.go's aggrGroup, in its run().)

The one complication in this is that if all of the alerts in the notification have been resolved (with no firing ones left), and you don't have send_resolved set to true for the integration (in that receiver), then there will be a 'notification' but Alertmanager will make no notification requests. This can cause the number of notifications to slowly rise above the number of notification requests. If you have notification request retries, the reverse can happen (but these are always accompanied by notification request failures).
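
One way to watch this drift is to subtract the rate of notification requests from the rate of notifications. This is only a sketch; the 5 minute window is an arbitrary choice you'd tune for your scrape interval:

# positive: resolved-only notifications with no requests; negative: retries
rate(alertmanager_notifications_total[5m])
  - rate(alertmanager_notification_requests_total[5m])

If this is persistently positive for an integration, you're probably skipping notification requests for resolved-only notifications (because send_resolved is false); if it goes negative, Alertmanager is retrying notification requests.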

If Alertmanager reports plain notification failures, in the metric alertmanager_notifications_failed_total, I believe that you've missed some alerts (through whatever specific integration the label says). If Alertmanager reports that notification requests are failing but there are no plain notification failures, you have an issue that Alertmanager thinks is temporary and is retrying. As far as I can see, you can't have a notification failure without having at least one notification request failure, reported in the metric alertmanager_notification_requests_failed_total.
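
If you want to turn this into alert conditions (or just ad-hoc queries), the obvious sketch is to look for any recent increase in the two failure counters; again the 5 minute window is an assumption you'd adjust:

# notifications definitely lost for some integration
increase(alertmanager_notifications_failed_total[5m]) > 0
# notification requests failing (possibly only temporarily)
increase(alertmanager_notification_requests_failed_total[5m]) > 0

The first expression firing means some notifications were definitely lost for that integration; the second firing on its own means Alertmanager is hitting errors it considers temporary and is retrying.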

For Alertmanager email notification requests specifically, most failures at SMTP time appear to be considered temporary, and this is decided without checking something like the SMTP error code. If I'm reading the code in notify/email/email.go correctly, the exception is if you experience any sort of error while sending the actual text of the email itself (either the headers or the body). The moral of this story appears to be that your email server had better accept everything (and not crash) once it's given a positive reply to Alertmanager's initial DATA command. Template expansion failures are permanent and are not retried, but if your SMTP server doesn't accept the MAIL FROM or RCPT TO addresses (or commands), it's a temporary error and Alertmanager will retry.

Sidebar: All of the metrics names for Internet search purposes

The general notification metrics are:

alertmanager_notifications_total
alertmanager_notifications_failed_total

The notification request metrics are:

alertmanager_notification_requests_total
alertmanager_notification_requests_failed_total
alertmanager_notification_latency_seconds_bucket
alertmanager_notification_latency_seconds_sum
alertmanager_notification_latency_seconds_count

I believe that the latency histogram count metric should be the same as the total notification requests metric. Failed notification requests are included in the latency metrics, which means that things like SMTP timeouts could drive up your overall latency.
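
One consequence is that the usual average latency query includes the time spent on failing notification requests, SMTP timeouts and all. As a sketch, the per-integration average over the last five minutes is:

# average notification request latency, failures included
rate(alertmanager_notification_latency_seconds_sum[5m])
  / rate(alertmanager_notification_latency_seconds_count[5m])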
