Sometimes alerts have inobvious reasons for existing
Somewhat recently I saw people saying negative things about common alerting practices, such as generating an alert when a TLS certificate gets close to expiring. This got me to tweet something:
We don't have 'your TLS certificate is X days from expiring' alerts to tell us that we need to renew a certificate; we have them to tell us that our Let's Encrypt automation broke, and early enough that we have plenty of time to deal with the situation.
(All of our alerts go to email and are only dealt with during working hours.)
Certbot normally renews our TLS certificates when they're 30 days from expiring, and we alert if a certificate is less than 23 days from expiring. This gives Certbot a week of leeway for problems (including the machine being down for a day or three), and gives us three weeks to deal with the problem in some way, including by manually getting a certificate from another source if we have to. We also have a 'how many days to expiry' table for TLS certificates in our overall dashboard, so we can notice even before the alert if a certificate isn't getting renewed when it should be.
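The arithmetic behind those thresholds can be sketched out in code. This is only an illustration of the scheme described above; the function names and the fixed clock are made up for the example:

```python
from datetime import datetime, timedelta, timezone

# Thresholds matching the scheme described above: Certbot renews when a
# certificate is 30 days from expiring, and we alert below 23 days.
RENEW_AT_DAYS = 30
ALERT_BELOW_DAYS = 23

def days_until_expiry(not_after: datetime, now: datetime) -> float:
    """Days remaining before the certificate's notAfter time."""
    return (not_after - now).total_seconds() / 86400

def should_alert(not_after: datetime, now: datetime) -> bool:
    # If we're under the alert threshold, renewal should already have
    # happened (at 30 days out), so the automation has had a week of
    # leeway and something is probably wrong.
    return days_until_expiry(not_after, now) < ALERT_BELOW_DAYS

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
fresh = now + timedelta(days=60)   # recently renewed: no alert
stale = now + timedelta(days=20)   # renewal missed its window: alert
print(should_alert(fresh, now))    # False
print(should_alert(stale, now))    # True
```

The 7-day gap between the renewal point and the alert threshold is the leeway for transient problems; the 23 days below the threshold is the time budget for humans to react.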
But none of this is visible in a simple description of what we alert on. The bare fact that we alert when a TLS certificate is less than 23 days from expiring doesn't tell you why that alert exists, and there can be a good reason behind the why (as I feel there is for this alert). As a corollary, you can't tell whether an alert is sensible just from its description.
Another very important corollary is the same thing we saw for configuration management and procedures: by themselves, your alerts don't tell you why you put them in place, just what you're alerting on. Understanding why an alert exists is important, so you want to document that too, in comments, in the alert messages, or both. Even if the bare alert seems obviously sensible, you should document why you picked that particular condition as your signal for the overall problem. It's probably also useful to describe what high-level problem (or problems) the alert is trying to pick up on, since that isn't necessarily obvious either.
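As one way of carrying the 'why' along with the alert itself, here is a sketch using a made-up alert definition; the structure, field names, and wording are all hypothetical, and real alerting systems each have their own equivalent of a comment and a message annotation:

```python
# A sketch of documenting an alert's intent in both a comment (for the
# people maintaining the rule) and the message (for the people paged by
# it). Everything here is illustrative, not a real alerting system.
CERT_EXPIRY_ALERT = {
    "name": "TLSCertExpiringSoon",
    "condition": "cert_days_to_expiry < 23",
    # Why: Certbot renews at 30 days out, so anything under 23 days
    # means the renewal automation broke at least a week ago. The
    # threshold leaves about three weeks to fix the automation or get
    # a certificate from another source.
    "message": (
        "TLS certificate is {days} days from expiring. Certbot should "
        "have renewed it at 30 days out, so the renewal automation is "
        "probably broken; investigate it (we have time)."
    ),
}

print(CERT_EXPIRY_ALERT["name"])
```

The point is not the structure; it's that both future readers of the rule and recipients of the alert see the intent, not just the threshold.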
Having this sort of 'why' documentation is especially important for alerts because alerts are notorious for drifting out of sync with reality, at which point you need to bring them back in line for them to be useful. This is effectively debugging, and now I will point everyone to Code Only Says What it Does and paraphrase a section of the lead paragraph:
Fundamentally, [updating alerts] is an exercise in changing what [an alert] does to match what it should do. It requires us to know what [an alert] should do, which isn't captured in the alert.
So, alerts have intentions, and we should make sure to document those intentions. Without the intentions, any alert can look stupid.