Some thoughts on Prometheus Alertmanager's alert reminders
One of the Alertmanager configuration parameters you can set is the 'repeat interval' for an alert route, which is described as:
How long to wait before sending a notification again if it has already been sent successfully for an alert. (Usually ~3h or more).
Translated, this is how often you get a reminder about an alert (or a group of them) that is still active (and otherwise unchanged; if something changes in the group of alerts, that's covered by a different configuration setting, group_interval).
We set our repeat_interval to 24 hours, and we recently came back from a holiday break where an alert triggered on December 28th and stayed on until we returned and fixed it, resulting in reminder email on the 29th, the 30th, and so on, none of which we were dealing with over the break. This has given me an opportunity to think about our setting and about alert reminders in general.
The first question to ask yourself is whether an alert reminder is ever going to be useful. In some places the answer is probably 'no', for example if you have a dashboard of active alerts that people look at all the time. Otherwise, if an alert reminder is useful, you want to ask what it is useful for, to whom, and when. The answers for a 24/7 operations team with shift changes every six hours might be quite different than for a small group of university system administrators who only work regular office hours.
For us, the answer is probably that reminders are useful to make sure that we don't let something drop through the cracks. But there is no fixed 'time since the last email about the alert' that's ideal; instead, in an ideal world we'd probably want to be reminded on workdays at about 8:30 am for anything that had come up before 5pm the previous workday, and at about 4pm for anything since 5pm the previous day that was more than an hour or two old. The 8:30 am reminder would be 'this probably slipped through the cracks yesterday', and the 4pm one would be 'this is still going on, maybe it can be fixed today before we stop'.
(Of course supporting this sort of flexible 'repeat interval' in Alertmanager would probably be a lot more work than the current code, although Alertmanager does have support for time intervals since 0.22.0.)
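As a rough illustration of the two reminder rules described above, here is a minimal Python sketch. Everything in it is my own assumption rather than anything Alertmanager provides: the function names are hypothetical, 'an hour or two' is taken as two hours, 'the previous day' in the 4pm rule is read as the previous workday (so weekend alerts aren't missed on Monday), workdays are simply Monday through Friday with no holiday handling, and times are naive local times.

```python
from datetime import datetime, time, timedelta

WORKDAY_END = time(17, 0)      # 5pm
MIN_AGE = timedelta(hours=2)   # assumption: "an hour or two" taken as two hours


def previous_workday_end(now: datetime) -> datetime:
    """Return 5pm on the most recent Monday-Friday before now's date."""
    day = now.date() - timedelta(days=1)
    while day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        day -= timedelta(days=1)
    return datetime.combine(day, WORKDAY_END)


def in_morning_reminder(started: datetime, now: datetime) -> bool:
    """8:30 am rule: the alert started before 5pm on the previous workday."""
    return started < previous_workday_end(now)


def in_afternoon_reminder(started: datetime, now: datetime) -> bool:
    """4pm rule: the alert started at or after that cutoff and is at least MIN_AGE old."""
    return started >= previous_workday_end(now) and now - started >= MIN_AGE
```

With these rules, an alert that starts on a Saturday would first show up in Monday's 4pm reminder and then in Tuesday's 8:30 am one if it was still active.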
Given our overall use for alert reminders, a repeat interval of 24 hours is probably as good as anything else. It has the benefit of being predictable and not sending too much email, and we mostly don't get reminders anyway since we generally fix the problem within a day (during the work week; weekend problems usually wait until Monday). Pragmatically, if we start to need alert reminders on any frequent basis that's a sign of several things being wrong, and possibly one way to deal with it is to build a specific Grafana dashboard to show (only) all of our current alerts.
(Another option is that the Alertmanager API exposes enough information that we could generate our own early morning 'here are all of the outstanding alerts' email.)
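A minimal sketch of what generating that email body might look like, in Python. It assumes Alertmanager's v2 HTTP API is reachable at the address below (a placeholder, not our real setup) and that alerts carry the conventional 'alertname' and 'instance' labels; the function names are made up for illustration. It only builds the text; actually mailing it (for instance by running it from cron) is left out.

```python
import json
import urllib.request

# Assumption: Alertmanager listens here; adjust for your own setup.
ALERTMANAGER_URL = "http://localhost:9093"


def fetch_active_alerts(base_url: str = ALERTMANAGER_URL) -> list:
    """Fetch alerts that are active and neither silenced nor inhibited."""
    url = base_url + "/api/v2/alerts?active=true&silenced=false&inhibited=false"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def summarize_alerts(alerts: list) -> str:
    """Turn the API's list of alert objects into a plain-text summary."""
    if not alerts:
        return "No outstanding alerts."
    lines = [f"{len(alerts)} outstanding alert(s):"]
    # Plain string sort on startsAt; this orders oldest first as long as
    # the timestamps share the same format and timezone.
    for alert in sorted(alerts, key=lambda a: a.get("startsAt", "")):
        labels = alert.get("labels", {})
        name = labels.get("alertname", "(unnamed)")
        instance = labels.get("instance", "-")
        started = alert.get("startsAt", "unknown")
        lines.append(f"  {name} on {instance}, since {started}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(summarize_alerts(fetch_active_alerts()))
```

Keeping the formatting separate from the fetching means the summary can be checked against canned data without a running Alertmanager.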