2024-03-27
Some questions to ask about what silencing alerts means
A commonly desired feature of an alert notification system is the ability to silence (some) alert notifications for a while. You might silence alerts about things that are under planned maintenance, or do it generally in the dead of night for things that aren't important enough to wake someone. This sounds straightforward, but in practice my simple description here is under-specified and raises some questions about how things behave (or should behave).
The simplest implementation of silencing alert notifications is for the alerting system to go through its normal process for sending notifications but not actually deliver them; the notifications are discarded, diverted to /dev/null, or the like. In the view of the overall system, the alert notifications were successfully delivered, while in your view you didn't get emailed, paged, or notified in some chat channel.
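As a rough illustration of this 'discard' approach, here is a minimal sketch. The Notifier class, its silenced flag, and its delivered list are all hypothetical names invented for this example, not the API of any real alerting system.

```python
from dataclasses import dataclass, field


@dataclass
class Notifier:
    """Hypothetical notifier that discards notifications while silenced."""

    silenced: bool = False
    # Stand-in for real delivery (email, pager, chat channel).
    delivered: list = field(default_factory=list)

    def send(self, message: str) -> bool:
        """Return True if the rest of the system should treat this as sent."""
        if not self.silenced:
            self.delivered.append(message)
        # While silenced, the message is dropped, but the caller is still
        # told that delivery succeeded, so nothing is retried or queued.
        return True
```

The important property is that send() reports success either way, so nothing upstream knows (or remembers) that a notification was dropped.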
However, there are a number of situations where you may not want to discard alert notifications this way, but instead defer them until after the silence has ended. Here are some cases:
- If an alert starts during the silence and is still in effect when
the silence ends, many people will want to get an alert notification
about it at (or soon after) the end of the silence. Otherwise,
you have to remember to look at dashboards or other sources of
alert information to see what current problems you have.
- If an alert started before the silence and ends (resolves) during
the silence, some people will want to get an alert notification
about the alert having been resolved at the end of the silence.
Otherwise you're once again left to look at your dashboards to
notice that some things cleared up during the silence.
(This assumes you normally send notifications about resolved alerts, which not everyone does.)
- If an alert both starts and ends during the silence, most people will
say that you shouldn't get an alert notification about it afterward.
Otherwise silences would simply defer alert notifications about things
like planned maintenance, not eliminate them. However, some people
would like to get some sort of summary or general notification about
alerts that came up and got resolved during the silence.
(This is perhaps especially likely for the 'silence in the depths of the night' or 'silence over the weekend' sorts of schedule-based silencing. You may still want to know that things happened, just not bother people with them on the spot.)
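The three cases above can be sketched as a small decision function that runs when a silence ends. This is only one possible reading of the cases, under some assumptions: the Alert type, the post_silence_notifications() function, and the notify_resolved and summarize_transient options are hypothetical names, and times are plain numbers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    name: str
    started: float
    resolved: Optional[float] = None  # None means the alert is still firing


def post_silence_notifications(alerts, silence_start, silence_end,
                               notify_resolved=True,
                               summarize_transient=False):
    """Decide which deferred notifications to send when a silence ends."""
    out = []
    for a in alerts:
        started_during = silence_start <= a.started < silence_end
        resolved_during = (a.resolved is not None
                           and silence_start <= a.resolved < silence_end)
        still_firing = a.resolved is None or a.resolved >= silence_end

        if started_during and still_firing:
            # Case 1: started during the silence and is still in effect.
            out.append(("firing", a.name))
        elif a.started < silence_start and resolved_during:
            # Case 2: started before the silence, resolved during it.
            if notify_resolved:
                out.append(("resolved", a.name))
        elif started_during and resolved_during:
            # Case 3: came and went entirely within the silence.
            if summarize_transient:
                out.append(("summary", a.name))
    return out
```

With the defaults, an alert that starts and ends inside the silence produces nothing, which matches what most people would expect for planned maintenance. Turning on summarize_transient is closer to what you might want for overnight or weekend silences.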
Whether you want post-silence alert notifications in some or all of these situations will depend in part on what you use alert notifications for (or how the designers of your system expect this to work). In some environments, an alert notification is in effect a message that says 'go look at your dashboards', so you don't need this at the end of a planned maintenance since you're probably already doing that. In other environments, the alert notification is either the primary signal that something is wrong or the primary source of information for what to do about it (by carrying links to runbooks, suggested remediations, relevant dashboards, and so on). Getting an alert notification for 'new' alerts is then vital because that's primarily how you know you have to do something and maybe know what to do.
(And in some environments, getting alert notifications about resolved alerts is the primary method people use to track outstanding alerts, making those important.)