2023-04-03
Automated status tests need to have little or no 'noise'
In a comment on my entry on how you should automate some basic restore testing of your backups, Simon made a perfectly reasonable suggestion:
Another relatively basic, but useful, check is to do some plausibility check on the backup size. For most systems huge jumps in backup size (in either direction) likely mean you are not backing up what you are thinking. This is a bit more complicated than what you mention in your article, since it needs some tuning and can generate false positives. But I think it still can be a valuable check that in most cases won't be too hard to implement.
I've come to believe that automated system status tests, and automated system things in general, need to have little or no 'noise'. By noise I mean errors, warnings, alerts, or messages that happen when there isn't actually any problem. The fundamental reason for this comes down to the problem of false positives with large numbers.
Unless you're in a bad situation, your systems are almost always working; your backups are happening properly, your machines are up, your disks aren't dangerously full, and so on. Actual failures happen only a small percentage of the time. This means that even with a very low false positive rate on a percentage basis, almost every time your tests raise some sort of alert, they're giving you noise, a false alert. This gives you the security alert problem; it will be very easy to get habituated to ignoring or downplaying the warning messages. The very rare occasion when they're warning you about a real problem will be drowned out by the noise of non-problems.
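To put rough numbers on this, here is a minimal sketch of the arithmetic. The function name and the specific rates are made up for illustration (they're not measurements from any real system), and it assumes by default that the check catches every real failure.

```python
def real_alert_fraction(failure_rate, false_positive_rate, detection_rate=1.0):
    """Return the fraction of raised alerts that reflect a real problem.

    failure_rate: fraction of check runs where something is actually broken.
    false_positive_rate: fraction of healthy runs that still raise an alert.
    detection_rate: fraction of real failures the check catches (assumed 1.0).
    """
    true_alerts = failure_rate * detection_rate
    false_alerts = (1.0 - failure_rate) * false_positive_rate
    total = true_alerts + false_alerts
    return true_alerts / total if total else 0.0


# Hypothetical numbers: one run in a thousand sees a real failure,
# and the check misfires on 1% of the healthy runs.
fraction = real_alert_fraction(failure_rate=0.001, false_positive_rate=0.01)
print(f"{fraction:.1%} of alerts are real")
```

With these illustrative numbers, only about one alert in eleven is for a real problem, even though the check is wrong on just 1% of healthy runs. The rarer real failures are, the worse this ratio gets for the same false positive rate.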
As a system administrator, it can feel morally wrong to not send out a warning if we detect something potentially broken. But here again, the perfect is the enemy of the good. It's better to reliably generate warnings that people will notice and heed when something is definitely broken, even if this means not sending warnings in some situations where you can't be sure.
This isn't a new thing and this isn't unique to system administrators. Programmers have their own version of this for linters, compiler warnings, dependency monitoring, tests (unit and otherwise), and so on. For all of these, programmers have lots of painful experience saying that noisy things are often not worth it because they'll hide actual problems.
(If you get 200 compiler warnings today, you'll probably not spot the ten critical ones, or notice that tomorrow you have 202 compiler warnings.)