Some things about Prometheus Alertmanager's notification metrics
Alertmanager exposes some metrics that are potentially useful if you want a more complete view of its health than just whether or not it's up. However, some of the metrics surrounding its notifications aren't clearly named or documented, which left me with questions about them in yesterday's entry. Today I went looking in the code, and here is the current state of affairs (as of the start of 2022).
Alertmanager has two sets of metrics about total and failed notifications (and one set about their latency); one set is about notification_requests, and the other is about plain notifications. Although the latency histogram buckets don't have the 'requests' bit in their metric names, they are actually about requests, not plain notifications.
An Alertmanager 'notification', as counted by these metrics, is a high-level attempt to deliver a (new) message about changed alert state to some integration (for some receiver). A notification request is a concrete attempt to prepare and submit the notification message to the particular integration's delivery point. In the process of submitting a notification request to the endpoint, Alertmanager can experience either temporary failures (your SMTP server isn't accepting connections) or permanent ones (a template has errors). If there are temporary errors, Alertmanager will retry some number of times. Each such retry is a new notification request, but not a new notification.
(I believe that retries happen with backoff until the next group_interval, at which point the entire notification is aborted, but I'm not convinced I fully understand the code. The notification handling and its metrics are in notify/notify.go, while I believe that the Go context that's canceled to stop notification retries comes from dispatch/dispatch.go's aggrGroup, in its run() method.)
The one complication in this is that if all of the alerts in the notification have been resolved (with no firing ones left), and you don't have send_resolved set to true for the integration (in that receiver), then there will be a 'notification' but Alertmanager will make no notification requests. This can cause the number of notifications to slowly rise above the number of notification requests. If you have notification request retries, the reverse can happen (but these are always accompanied by notification request failures).
If Alertmanager reports plain notification failures, in the metric alertmanager_notifications_failed_total, I believe that you've missed some alerts (through whatever specific integration the label says). If Alertmanager reports that notification requests are failing but there are no plain notification failures, you have an issue that Alertmanager thinks is temporary and is retrying. As far as I can see, you can't have a notification failure without having at least one notification request failure, reported in the metric alertmanager_notification_requests_failed_total.
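If you want to check for this from a program rather than a dashboard, here is a small sketch using the official Prometheus Go API client to run both checks against your Prometheus server. The address and the one-hour windows are just assumptions of mine; adapt to taste.

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Assumes a Prometheus server that scrapes your Alertmanager; the
	// address here is only an example.
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		log.Fatal(err)
	}
	promapi := v1.NewAPI(client)
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	queries := map[string]string{
		// Non-zero here means notifications were abandoned entirely;
		// alerts probably went missing for that integration.
		"lost notifications": `increase(alertmanager_notifications_failed_total[1h]) > 0`,
		// Non-zero here (with zero above) means Alertmanager hit errors
		// it considers temporary and is retrying.
		"request failures": `increase(alertmanager_notification_requests_failed_total[1h]) > 0`,
	}
	for what, q := range queries {
		result, warnings, err := promapi.Query(ctx, q, time.Now())
		if err != nil {
			log.Fatal(err)
		}
		if len(warnings) > 0 {
			log.Println("warnings:", warnings)
		}
		fmt.Printf("%s:\n%v\n", what, result)
	}
}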
For Alertmanager email notification requests in particular, most failures at SMTP time appear to be considered temporary, and this is done without checking something like the SMTP error code. If I'm correctly reading the code in notify/email/email.go, the exception is if you experience any sort of error while sending the actual text of the email itself (either the headers or the body). The moral of this story appears to be that your email server had better accept everything (and not crash) once it's given a positive reply to Alertmanager's initial DATA command. Template expansion failures are permanent and are not retried, but if your SMTP server doesn't accept the MAIL FROM or RCPT TO addresses (or commands), it's a temporary error and Alertmanager will retry.
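To make the stages concrete, here is a sketch of that SMTP conversation using Go's standard net/smtp package, annotated with where the temporary versus permanent split appears to fall. The classification in the comments is my summary of the behaviour described above, not Alertmanager's real code, and the server and addresses are made up.

package main

import (
	"fmt"
	"log"
	"net/smtp"
)

// sendMail walks through the SMTP conversation the way a mail notifier
// would. Errors up to and including the server's response to DATA appear
// to be treated as temporary (and retried), while errors once the message
// text itself is being sent appear to be permanent.
func sendMail(addr, from, to, message string) error {
	// Connection failures ("your SMTP server isn't accepting
	// connections") are temporary: Alertmanager will retry.
	c, err := smtp.Dial(addr)
	if err != nil {
		return fmt.Errorf("temporary: dial: %w", err)
	}
	defer c.Quit()

	// Rejected MAIL FROM or RCPT TO (the addresses or the commands
	// themselves) is also treated as temporary.
	if err := c.Mail(from); err != nil {
		return fmt.Errorf("temporary: MAIL FROM: %w", err)
	}
	if err := c.Rcpt(to); err != nil {
		return fmt.Errorf("temporary: RCPT TO: %w", err)
	}

	// Once the server has given a positive reply to DATA, any error while
	// writing the headers and body (or closing the data stream) is
	// permanent and the notification request is not retried.
	wc, err := c.Data()
	if err != nil {
		return fmt.Errorf("temporary: DATA: %w", err)
	}
	if _, err := fmt.Fprint(wc, message); err != nil {
		return fmt.Errorf("permanent: writing message: %w", err)
	}
	if err := wc.Close(); err != nil {
		return fmt.Errorf("permanent: finishing message: %w", err)
	}
	return nil
}

func main() {
	// Hypothetical server and addresses, purely for illustration.
	err := sendMail("mail.example.com:25", "alertmanager@example.com",
		"oncall@example.com", "Subject: test\r\n\r\nhello\r\n")
	if err != nil {
		log.Fatal(err)
	}
}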
Sidebar: All of the metric names for Internet search purposes
The general notification metrics are:
alertmanager_notifications_total
alertmanager_notifications_failed_total
The notification request metrics are:
alertmanager_notification_requests_total
alertmanager_notification_requests_failed_total
alertmanager_notification_latency_seconds_bucket
alertmanager_notification_latency_seconds_sum
alertmanager_notification_latency_seconds_count
I believe that the latency histogram count metric should be the same as the total notification requests metric. Failed notification requests are included in the latency metrics, which means that things like SMTP timeouts could drive up your overall latency.
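As a worked example of using the latency metrics, here are the two PromQL expressions I would reach for, wrapped in a trivial Go program so they stay in the same language as the earlier sketches. These are my own example queries, not anything official.

package main

import "fmt"

const (
	// Average notification request latency per integration over the last
	// five minutes. Failed requests (for example SMTP timeouts) are part
	// of this average, so a broken mail server shows up here too.
	avgLatency = `sum by (integration) (rate(alertmanager_notification_latency_seconds_sum[5m]))
  / sum by (integration) (rate(alertmanager_notification_latency_seconds_count[5m]))`

	// If the histogram count tracks notification requests one for one,
	// this difference should stay at (or very near) zero.
	countDrift = `alertmanager_notification_latency_seconds_count - alertmanager_notification_requests_total`
)

func main() {
	fmt.Println(avgLatency)
	fmt.Println(countDrift)
}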