2019-09-30
Using alerts as tests that guard against future errors
On Twitter, I said:
These days, I think of many of our alerts as tests, like code tests to verify that bugs don't come back. If we broke something in the past and didn't notice or couldn't easily spot what was wrong, we add an alert (and a metric or check for it to use, if necessary).
So we have an alert for 'can we log in with POP3' (guess what I broke once, and surprise, GMail uses POP3 to pull email from us), and one for 'did we forget to commit this RCS file and broke self-serve device registration', and so on.
(The RCS file alert is a real one; I mentioned it here.)
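As a sketch of what such a check can look like (the host name and the way credentials are passed in here are made up for the example; the entry doesn't show our real check), a 'can we log in with POP3' probe doesn't have to be much more than this:

    #!/usr/bin/env python3
    # Minimal 'can we log in with POP3' probe. Exits 0 on success and 1 on
    # failure, so whatever runs it can turn the result into a metric or alert.
    # The host name is made up; credentials come from the command line purely
    # to keep the sketch short.
    import poplib
    import sys

    POP3_HOST = "pop.example.org"   # hypothetical; not our real server

    def pop3_login_works(host, user, password):
        try:
            conn = poplib.POP3_SSL(host, timeout=30)
            conn.user(user)
            conn.pass_(password)
            conn.quit()
            return True
        except (poplib.error_proto, OSError):
            return False

    if __name__ == "__main__":
        user, password = sys.argv[1], sys.argv[2]
        sys.exit(0 if pop3_login_works(POP3_HOST, user, password) else 1)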
In modern programming, it's conventional that when you find a bug in your code, you usually write a test that checks for it (before you fix the bug). This test is partly to verify that you actually fixed the bug, but it's also there to guard against the bug ever coming back; after all, if you got it wrong once, you might accidentally get it wrong again in the future. You can find a lot of these tests all over modern codebases, especially in tricky areas, and if you read the commit logs you can usually find people saying exactly this about the newly added tests.
As sysadmins here, how we operate our systems isn't exactly programming, but I think that some of the same principles apply. Like programmers, we're capable of breaking things or setting up something that is partially but not completely working. When that happens, we can fix it (like programmers fixing a bug) and move on, or we can recognize that if we made a mistake once, we might make the same mistake later (or a similar one that has the same effects), just like issues in programs can reappear.
(If anything, I tend to think that traditional-style sysadmins are more prone to re-breaking things than programmers are, because we routinely rebuild our 'programs', ie our systems, due to things like operating systems and programs getting upgraded. Every new version of Ubuntu and its accompanying versions of Dovecot, Exim, Apache, and so on is a new chance to recreate old problems, and on top of that we tend to build things with complex interdependencies that we often don't fully understand or even realize exist.)
In this environment, my version of tests has become alerts. As I said in the tweets, if we broke something in the past and didn't notice, I'll add an alert for it to make sure that if we do it again, we'll find out right away this time around. Just as with the tests that programmers add, I don't expect these alerts to ever fire, and certainly not very often; if they do fire frequently, then either they're bad (just as tests can be bad) or we have a process problem, where we need to change how we operate so we stop making this particular mistake so often.
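To give a concrete (if simplified) illustration, the 'did we forget to commit this RCS file' check from the tweet can be little more than asking rcsdiff whether the working file still matches its last checked-in revision. This is a sketch only; the file path is made up and our real check isn't necessarily structured this way:

    #!/usr/bin/env python3
    # Minimal 'is this RCS-controlled file checked in' probe. rcsdiff exits 0
    # when the working file matches its latest checked-in revision, 1 when it
    # differs, and 2 on trouble; we treat anything non-zero as 'not properly
    # committed'. The file path here is hypothetical.
    import subprocess
    import sys

    FILES = ["/etc/some-registration-data"]   # hypothetical path

    bad = []
    for path in FILES:
        res = subprocess.run(["rcsdiff", "-q", path],
                             stdout=subprocess.DEVNULL,
                             stderr=subprocess.DEVNULL)
        if res.returncode != 0:
            bad.append(path)

    if bad:
        print("uncommitted RCS changes:", ", ".join(bad))
        sys.exit(1)
    sys.exit(0)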
This is somewhat of a divergence from the usual modern theory of alerts, which is that you should have only a few alerts and they should mostly be about things that cause people pain. However, I think it's within the broad scope of that philosophy, because as I understand it the purpose of the philosophy is to avoid alerts that aren't meaningful and useful and will just annoy people. If we broke something, being told about it definitely isn't just an annoyance; it's something we need to fix.
(In an environment with sophisticated alert handling, you might want to not route this sort of alert to people's phones and the like. We just send everything to email, and generally if we're reading email it's during working hours.)
Link: The asymmetry of Internet identity
David Crawshaw's The asymmetry of Internet identity is, among other things, a marvelously cynical yet pretty much completely accurate review of people's identity on the modern Internet in practice. I wish it didn't work the way Crawshaw describes it, but it does.
(Via lobste.rs.)
ZFS performance really does degrade as you approach quota limits
Every so often (currently monthly), there is an "OpenZFS leadership meeting". What this really means is 'lead developers from the various ZFS implementations get together to talk about things'. Announcements and meeting notes from these meetings get sent out to various mailing lists, including the ZFS on Linux ones. In the September meeting notes, I read a very interesting (to me) agenda item:
- Relax quota semantics for improved performance (Allan Jude)
- Problem: As you approach quotas, ZFS performance degrades.
- Proposal: Can we have a property like quota-policy=strict or loose, where we can optionally allow ZFS to run over the quota as long as performance is not decreased.
(The video of the September meeting is here and the rolling agenda document is here; you want the 9/17 portion.)
This is very interesting to me for two reasons. First, in the past we have definitely seen significant problems on our OmniOS machines, both when an entire pool hits a quota limit and when a single filesystem hits a refquota limit. It's nice to know that this wasn't just our imagination and that there is a real issue here. Even better, it might someday be improved (and perhaps in a way that we can use at least some of the time).
Second, any number of people here run very close to and sometimes at the quota limits of both filesystems and pools, fundamentally because people aren't willing to buy more space. We have in the past assumed that this was relatively harmless and would only make people run out of space. If this is a known issue that causes serious performance degradation, well, I don't know if there's anything we can do, but at least we're going to have to think about it and maybe push harder at people. The first step will have to be learning the details of what's going on at the ZFS level to cause the slowdown.
(It's apparently similar to what happens when the pool is almost full, but I don't know the specifics of that either.)
With that said, we don't seem to have seen clear adverse effects on our Linux fileservers, and they've definitely run into quota limits (repeatedly). One possible reason for this is that having lots of RAM and SSDs makes the effects mostly go away. Another possible reason is that we haven't been looking closely enough to notice global slowdowns that correlate with filesystems hitting quota limits. We've had issues before with somewhat subtle slowdowns that we didn't understand (cf), so I can't rule out that it's happening again.
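As a sketch of the kind of looking we could start with, here is a minimal report of filesystems that are getting close to their quotas. The 90% threshold is arbitrary, and this only considers the quota property; a real version would probably also want to compare refquota against referenced space:

    #!/usr/bin/env python3
    # Minimal report of ZFS filesystems that are close to their quota.
    # 'zfs list -Hp' gives tab-separated, exact byte values with no header
    # line; filesystems with no quota set report 0 for the quota property.
    # The 90% threshold is arbitrary.
    import subprocess

    THRESHOLD = 0.90   # arbitrary cutoff for 'close to quota'

    out = subprocess.run(["zfs", "list", "-Hp", "-t", "filesystem",
                          "-o", "name,used,quota"],
                         capture_output=True, text=True, check=True).stdout

    for line in out.splitlines():
        name, used, quota = line.split("\t")
        used, quota = int(used), int(quota)
        if quota > 0 and used / quota >= THRESHOLD:
            print(f"{name}: {used / quota:.0%} of quota used")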