Sometimes alerts have non-obvious reasons for existing

October 26, 2020

Somewhat recently I saw people saying negative things about common alerting practices, specifically the practice of generating some sort of alert when a TLS certificate is getting close to expiring. This got me to tweet something:

We don't have 'your TLS certificate is X days from expiring' alerts to tell us that we need to renew a certificate; we have them to tell us that our Let's Encrypt automation broke, and early enough that we have plenty of time to deal with the situation.

(All of our alerts go to email and are only dealt with during working hours.)

Certbot normally renews our TLS certificates when they're 30 days from expiring, and we alert if a certificate is less than 23 days from expiring. This gives Certbot a week of leeway for problems (including the machine being down for a day or three), and gives us three weeks to deal with the problem in some way, including by manually getting a certificate from another source if we have to. We also have a 'how many days to expiry' table for TLS certificates in our overall dashboard, so we can notice even before the alert if a certificate isn't getting renewed when it should be.
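
As a concrete illustration, here's a minimal sketch of the kind of check that sits behind this alert (the certificate path and domain are placeholders, and this isn't our actual alerting implementation, just the threshold logic):

    #!/usr/bin/env python3
    # Sketch: alert if a certbot-managed certificate is under 23 days from
    # expiring. Certbot renews at 30 days, so under 23 days means renewal
    # has been failing for at least a week.
    import ssl
    import subprocess
    import time

    ALERT_DAYS = 23
    CERT = "/etc/letsencrypt/live/example.org/cert.pem"   # placeholder path

    def days_to_expiry(certfile):
        # 'openssl x509 -enddate -noout' prints 'notAfter=Jan  2 03:04:05 2021 GMT'
        out = subprocess.run(
            ["openssl", "x509", "-enddate", "-noout", "-in", certfile],
            capture_output=True, text=True, check=True).stdout.strip()
        not_after = ssl.cert_time_to_seconds(out.split("=", 1)[1])
        return int((not_after - time.time()) // 86400)

    days = days_to_expiry(CERT)
    if days < ALERT_DAYS:
        print(f"ALERT: {CERT} expires in {days} days; "
              "our Let's Encrypt automation is probably broken")

The real work here is in the thresholds, not the check: 30 minus 23 is the leeway we give the automation, and 23 days is how long we give ourselves.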

But none of this is visible in a simple description of what we alert on. The bare fact that we alert if a TLS certificate is less than 23 days from expiring doesn't tell you why that alert exists, and there can be a good reason behind that why (as I feel there is for this alert). As a corollary, you can't tell whether an alert is sensible or not just from its description.

Another very important corollary is the same thing we saw for configuration management and procedures: by themselves, your alerts don't tell you why you put them into place, just what you're alerting on. Understanding why an alert exists is important, so you want to document that too, in comments or in the alert messages or both. Even if the bare alert seems obviously sensible, you should document why you picked that particular thing as your signal for the overall problem. It's probably useful to describe what high level problem (or problems) the alert is trying to pick up on, since that isn't necessarily obvious either.
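
As a purely illustrative sketch (the structure and field names here are hypothetical, not our actual alert configuration), the why can travel with the alert definition and surface in the alert message itself:

    # Illustrative only; the format and field names are made up.
    TLS_EXPIRY_ALERT = {
        "name": "TLSCertExpiresSoon",
        "condition": "days_to_expiry < 23",
        # Why 23: certbot renews at 30 days out, so 23 gives the automation a
        # week of leeway and still leaves us three weeks to fix things or get
        # a certificate from somewhere else by hand.
        "why": "Catch broken Let's Encrypt automation early; this is not a "
               "prompt to renew certificates by hand on a regular basis.",
        "message": "TLS certificate for {host} is {days} days from expiring; "
                   "certbot should have renewed it a week ago, so our "
                   "automation is probably broken.",
    }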

Having this sort of 'why' documentation is especially important for alerts because alerts are notorious for drifting out of sync with reality, at which point you need to bring them back in line for them to be useful. This is effectively debugging, and now I will point everyone to Code Only Says What it Does and paraphrase a section of the lead paragraph:

Fundamentally, [updating alerts] is an exercise in changing what [an alert] does to match what it should do. It requires us to know what [an alert] should do, which isn't captured in the alert.

So, alerts have intentions, and we should make sure to document those intentions. Without the intentions, any alert can look stupid.


Comments on this page:

By Perry Lorier at 2020-10-28 16:05:28:

The general alerting philosophy for SRE at Google is to alert on symptoms, not causes. There's an infinite number of causes, but what you care about is the symptom of the problem.

It might seem obvious to alert on certbot failing to run, but that's a "cause based" alert. Perhaps certbot isn't configured to run at all, so it's never failing? Perhaps certbot is running successfully, but updating the wrong cert(s), or the right certs in the wrong files? And so on.

So we end up looking at the service from as close to the point of view of the user (or consumer) as possible, and seeing the result of what's actually happening.

The symptom here is that the certificates being served to users are not getting updated correctly, and so have less time left until expiry than they should.
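
As a rough sketch (the hostname is a placeholder), checking the symptom means looking at the certificate the server actually hands to clients, rather than at whether certbot ran:

    # Symptom check: what expiry is on the certificate users actually get?
    # This catches certbot never running, renewing the wrong certificate,
    # and the server never reloading the new one.
    import datetime
    import socket
    import ssl

    def served_cert_days_left(host, port=443):
        ctx = ssl.create_default_context()
        with socket.create_connection((host, port), timeout=10) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
        expires = datetime.datetime.fromtimestamp(
            ssl.cert_time_to_seconds(cert["notAfter"]), tz=datetime.timezone.utc)
        return (expires - datetime.datetime.now(datetime.timezone.utc)).days

    print(served_cert_days_left("www.example.org"))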

The downside with symptom based alerting (as you point out), is that they tend to say "HELP HELP SOMETHING IS WRONG!" with almost zero context as to what might actually be wrong. So you then need to figure out how it's supposed to be working, and why it isn't working, which can be stressful and difficult (especially since it's not working now, so it's not obvious how it was supposed to work).

Some teams end up writing playbooks: "When you get this alert, check X, Y and Z, and perhaps validate that it's not bug #1234 hitting us again." I think this is not a great plan, because if it's not X, Y, Z or bug #1234, then what? (And if it frequently is X, Y, Z or bug #1234, then perhaps you need to put a more automated mitigation in place for them.)

Instead, you want to link to something that says "here's how it's supposed to work, and how we're checking that it's working" and gives people enough information that they can debug from first principles. Perhaps a wiki page with embedded diagrams. ("certbot is run from /etc/crontab daily at 4am. certbot's configuration is in /etc/certbot/, and checked into git at <wherever>. certbot's logs are kept in /var/log/certbot/... known bugs in certbot are <link to search for all open bugs tagged with certbot>" and so on.)

(which I think is a rephrasing of what you said in your post, just using the terms that Google talks about in the various SRE books)
