Flaky alerts are telling you something

May 26, 2024

Sometimes, monitoring and alerting systems have flaky alerts, either in the form of flapping alerts (where the alert will repeatedly trigger and then go away) or alerts that go off when there is no problem. Broadly speaking, these flaky alerts aren't just noise; they're telling you something.

To put it one way, flaky monitoring system alerts are like flaky tests in programming. Each of these is telling you that your understanding of things is incorrect or that something odd and unusual is going on, and sometimes both. This comes about because you don't generally create either alerts or tests intending them to be flaky; you intend for them to work (or sometimes for tests, to reliably fail before you fix things). If the result of your work is flaky, either you didn't correctly understand how your system (or your code) behaves when you did your work, such that you aren't actually testing what you think you're testing, or there is something going on that genuinely causes unexpected sporadic failures.

(For example, our discovery of OpenSSH sshd's 'MaxStartups' setting came from investigating a 'flaky' alert.)

In both flaky alerts and flaky tests, you can deal with the noise by either disabling the alert or test, or by making it 'try harder' in some way (for alerts this is often 'make this condition have to be true for longer than before'). However, this doesn't change the underlying reality of what is happening, nor does it improve your understanding of the system (at least, not beyond a superficial level of 'I was wrong that this is a reliable signal of ...'). The obvious drawback of this non-approach is that whatever made the alert flaky is still there, and you still don't know what it is.
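
(To make the 'true for longer than before' approach concrete, here is a minimal Python sketch. The `HeldAlert` class, its `hold_for` parameter, and the sample data are all made up for illustration; this is not how any particular monitoring system implements it.)

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HeldAlert:
    """Fire only once the condition has been continuously true for hold_for seconds."""
    hold_for: float
    _true_since: Optional[float] = None

    def observe(self, now: float, condition: bool) -> bool:
        if not condition:
            # Any false sample resets the clock; this is what hides flapping.
            self._true_since = None
            return False
        if self._true_since is None:
            self._true_since = now
        return now - self._true_since >= self.hold_for


# Hypothetical flapping check, sampled every 30 seconds: (time, condition is bad).
samples = [(0, True), (30, False), (60, True), (90, True), (120, True), (150, False)]

# A naive alert fires on every bad sample.
naive = [t for t, bad in samples if bad]

# The 'try harder' alert needs 60 continuous seconds of badness.
alert = HeldAlert(hold_for=60)
held = [t for t, bad in samples if alert.observe(t, bad)]

print(naive)  # [0, 60, 90, 120]
print(held)   # [120]
```

The held version is much quieter, but the brief failure at time 0 still happened. All we've done is stop looking at it.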

This doesn't mean that every flaky alert deserves a deep investigation. Sometimes the range of things that might be misunderstood or going wrong is not important enough to justify an investigation. And even if you plan an investigation, it's perfectly reasonable to remove the alert until then, or de-flake it with various 'try harder' brute force mechanisms. For that matter, it's okay to remove a flaky alert if you simply have higher priorities right now. If the flaky alert is trying to tell you about something serious, sooner or later it will probably escalate to obvious, non-flaky symptoms.

(This isn't necessarily how programmers should deal with flaky tests, but system administration is in part an art of tradeoffs. We can never do everything, so we need to pick the important somethings.)


