Using alerts as tests that guard against future errors

September 30, 2019

On Twitter, I said:

These days, I think of many of our alerts as tests, like code tests to verify that bugs don't come back. If we broke something in the past and didn't notice or couldn't easily spot what was wrong, we add an alert (and a metric or check for it to use, if necessary).

So we have an alert for 'can we log in with POP3' (guess what I broke once, and surprise, GMail uses POP3 to pull email from us), and one for 'did we forget to commit this RCS file and broke self-serve device registration', and so on.

(The RCS file alert is a real one; I mentioned it here.)

In modern programming, it's conventional that when you find a bug in your code, you usually write a test that checks for it (before you fix the bug). This test is partly to verify that you actually fixed the bug, but it's also there to guard against the bug ever coming back; after all, if you got it wrong once, you might accidentally get it wrong again in the future. You can find a lot of these tests throughout modern codebases, especially in tricky areas, and if you read the commit logs you can usually find people saying exactly this about the newly added tests.

As sysadmins here, how we operate our systems isn't exactly programming, but I think that some of the same principles apply. Like programmers, we're capable of breaking things or setting up something that is partially but not completely working. When that happens, we can fix it (like programmers fixing a bug) and move on, or we can recognize that if we made a mistake once, we might make the same mistake later (or a similar one that has the same effects), just like issues in programs can reappear.

(If anything, I tend to think that traditional-style sysadmins are more prone to re-breaking things than programmers are because we routinely rebuild our 'programs', i.e. our systems, due to things like operating systems and programs getting upgraded. Every new version of Ubuntu and its accompanying versions of Dovecot, Exim, Apache, and so on is a new chance to recreate old problems, and on top of that we tend to build things with complex interdependencies that we often don't fully understand or realize.)

In this environment, my version of tests has become alerts. As I said in the tweets, if we broke something in the past and didn't notice, I'll add an alert for it to make sure that if we do it again, we'll find out right away this time around. Just as with the tests that programmers add, I don't expect these alerts to ever fire, and certainly not very often; if they do fire frequently, then either they're bad (just as tests can be bad) or we have a process problem, where we need to change how we operate so we stop making this particular mistake so often.
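As an illustration of what such a 'regression test' alert check could look like, here is a minimal sketch of a POP3 login check in Python using the standard poplib module. This is not the check we actually run; the function name, the (ok, detail) return shape, and the injectable connect parameter are all assumptions made for the example, and a real setup would still need something to run the check periodically and turn a failure into an alert.

```python
import poplib


def check_pop3_login(host, user, password, connect=poplib.POP3_SSL, timeout=10):
    """Try a POP3 login and return (ok, detail).

    `connect` is a hypothetical injection point (defaulting to
    poplib.POP3_SSL) so the check can be exercised without a real server.
    """
    try:
        conn = connect(host, timeout=timeout)
    except (OSError, poplib.error_proto) as exc:
        # Could not connect, TLS failed, or the server greeted us with -ERR.
        return False, f"connect failed: {exc}"
    try:
        conn.user(user)
        conn.pass_(password)
    except poplib.error_proto as exc:
        # The server answered, but refused the login.
        return False, f"login rejected: {exc}"
    except OSError as exc:
        return False, f"connection error: {exc}"
    finally:
        # Always try to close the session politely; ignore failures here.
        try:
            conn.quit()
        except (OSError, poplib.error_proto):
            pass
    return True, "login ok"
```

Like a regression test, the check is deliberately narrow: it answers only 'does a POP3 login still work', which is exactly the thing that was once broken without anyone noticing.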

This is somewhat of a divergence from the usual modern theory of alerts, which is that you should have only a few alerts and they should mostly be about things that cause people pain. However, I think it's in the broad scope of that philosophy, because as I understand it the purpose of the philosophy is to avoid alerts that aren't meaningful and useful and will just annoy people. If we broke something, telling us about it definitely isn't just annoying; it's something we need to fix.

(In an environment with sophisticated alert handling, you might want to not route these sorts of alerts to people's phones and the like. We just send everything to email, and generally if we're reading email it's during working hours.)

Comments on this page:

By Perry Lorier at 2019-10-01 04:01:47:

The reasoning behind having few, high quality alerts is that alerts require maintenance too. After a while you start accreting alerts, some of which start to fire randomly and you have to debug why they fire. Are these false positives? Did your infrastructure change so the assumptions of the alert don't hold? Are they monitoring things that still matter? Are their thresholds still correct for your environment?

I've seen teams that have a philosophy of investigating every alert that didn't fire in the last calendar year with an eye to deleting it, as it's likely either tuned wrong, or broken. If it never fires, how do you know if it's actually working and not just running the moral equivalent of /bin/true? How do you test an alert? Quis custodiet ipsos custodes?

So you tend to end up replacing groups of alerts with one "larger" alert that tests more things, and is more likely to fire if anything goes wrong. For instance instead of testing that POP3 auth is working, testing that you can inject an email and be able to read that email from POP3 within a time bound. Thus testing that POP3 is up, authentication is working, POP3 mail delivery is working, and that messages can be successfully downloaded. This would cover things like a broken NFS mount, where you can auth, but you can't actually read any messages.
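The 'inject an email and read it back within a time bound' check described above could be sketched roughly as follows. This is only an outline under stated assumptions: send and list_subjects are hypothetical callables that the caller would implement (for example with smtplib and poplib), and the token format, timeout, and polling interval are made-up values.

```python
import time
import uuid


def check_mail_roundtrip(send, list_subjects, timeout=120.0, interval=5.0,
                         clock=time.monotonic, sleep=time.sleep):
    """Send a uniquely tagged message, then poll until it can be read back.

    send(subject) injects a test message; list_subjects() returns the
    subjects currently readable from the mailbox. Both are hypothetical
    callables supplied by the caller. `clock` and `sleep` are injectable
    so the polling logic can be tested without waiting.

    Returns the elapsed seconds on success, or None if the deadline passed.
    """
    # A unique token keeps leftovers from earlier runs from matching.
    token = f"roundtrip-check-{uuid.uuid4().hex}"
    start = clock()
    send(token)
    while True:
        if token in list_subjects():
            return clock() - start
        if clock() - start >= timeout:
            return None
        sleep(interval)
```

A None result means that something in the chain (delivery, authentication, mailbox storage, or download) is broken or too slow; the check by itself does not say which.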

Perry: in testing parlance, that would be moving from unit tests toward integration tests, I guess.

By cks at 2019-10-03 17:00:36:

My impression is that in unit testing terms, Perry is basically talking about functional (end to end) tests and I'm sort of talking about integration tests, although the mapping gets odd here. As I think of it, 'can we connect to the IMAP port and get a greeting banner' is similar to a unit test, while 'can we log in' is more similar to an integration test because it requires multiple separate pieces of functionality to be working.

One way I think about it is that the more things that could be wrong if an alert check fails, the more it is like a functional test, and the fewer things, the more it is like a unit test. A really basic 'unit test' here is something like 'does machine X respond to pings'. If it doesn't ping, well, all sorts of higher level tests are probably going to also be failing, but you pretty much know why; the machine is down or otherwise off the network. Similarly, if a unit test fails, you know why higher level integration and functional tests are (hopefully) failing.
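One way to make use of this layering, sketched here purely as an illustration, is to order check results from lowest level to highest and report the lowest-level failure as the most likely cause. The function name and the check names in the usage example are hypothetical.

```python
def likely_cause(results):
    """Return the name of the lowest-level failing check, or None.

    `results` is a list of (check_name, passed) pairs ordered from the
    lowest-level check (such as ping) to the highest-level one.
    """
    for name, passed in results:
        if not passed:
            return name
    return None


# Hypothetical results: the machine pings and answers on the IMAP port,
# but logins fail, so the end-to-end check fails as well.
results = [
    ("ping", True),
    ("imap banner", True),
    ("imap login", False),
    ("mail roundtrip", False),
]
print(likely_cause(results))
```

This only works to the extent that the ordering really reflects dependencies between the checks, which matches the '(hopefully)' in the comment above.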

Written on 30 September 2019.

Last modified: Mon Sep 30 21:35:11 2019