Wandering Thoughts archives

2023-07-30

Our alerts and monitoring can never be comprehensive

A while ago I wrote about how an obvious problem isn't necessarily obvious, where one thing I said was that in many situations, there are too many obvious problem causes for people to keep track of them all. A corollary to this is that there are too many things that could go wrong on your systems to monitor and alert on all of them. In fact, I'm not convinced that we could even identify all of the possible things that could go wrong. Among other issues, systems can fail in many, many different ways.

(For example, a while back we had an incident where our internal forwarding DNS resolvers stopped resolving internal names. While we might have been able to imagine this failure in the abstract, I'm relatively sure we couldn't have predicted all of the things that failed, or some of the things that kept working. Predicting this sometimes takes specialized knowledge, such as knowing which Prometheus metrics collection connections are persistent.)

One of the things this means is that at a certain point, I need to put down the keyboard and stop writing new alert rules. I can imagine a lot of failures that I could alert on (while making sure that each alert isn't noisy), but since I can never be comprehensive, the real question is whether the payoff from one more alert is worth the extra time, complexity, and so on. Sometimes it is, for example because alerts remember things for us, or because we've experienced a particular failure before (for example, we've now improved our DNS resolver monitoring and we have alerts for too-low network link speeds). But there are a lot of alerts that I could write but won't, not now and not until we have a clear need for them.
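
(As a rough illustration of how small a check like the too-low link speed one can be, and of how even a small check leaves gaps, here is a minimal Python sketch. It is not our actual alert rule. The interface names and minimum speeds are made up, and it assumes a Linux machine, where the kernel reports link speed in Mbit/s in /sys/class/net/<interface>/speed.)

```python
from pathlib import Path

# Hypothetical expectations: interface name -> minimum acceptable speed (Mbit/s).
EXPECTED_MIN_MBITS = {"eno1": 10000, "eno2": 1000}


def read_link_speed(iface, sysfs_root="/sys/class/net"):
    """Return the link speed in Mbit/s as Linux reports it, or None if unknown."""
    try:
        speed = int(Path(sysfs_root, iface, "speed").read_text().strip())
    except (OSError, ValueError):
        # Missing interface, link down, or a driver that doesn't report a speed.
        return None
    # The kernel uses -1 for "unknown"; treat any non-positive value that way.
    return speed if speed > 0 else None


def too_slow_links(speeds, expected=EXPECTED_MIN_MBITS):
    """Return (iface, speed, minimum) tuples for links below their minimum.

    Interfaces whose speed is unknown (None or absent) are skipped, not reported.
    """
    problems = []
    for iface, minimum in sorted(expected.items()):
        speed = speeds.get(iface)
        if speed is not None and speed < minimum:
            problems.append((iface, speed, minimum))
    return problems


if __name__ == "__main__":
    observed = {iface: read_link_speed(iface) for iface in EXPECTED_MIN_MBITS}
    for iface, speed, minimum in too_slow_links(observed):
        print(f"{iface}: link speed {speed} Mbit/s is below expected {minimum} Mbit/s")
```

(Notice that this sketch deliberately stays silent when a speed can't be read at all. Whether that case deserves its own alert is exactly the sort of payoff-versus-complexity decision I mean.)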

Another thing this means is that we can't expect to get an alert every time something goes wrong. If things fail in a way that we didn't think of, or in a way that we didn't think was worth a specific alert, at best we're going to get indirect alerts (for symptoms created by the problem) and at worst we're going to get no alerts at all. The corollary to this is that we shouldn't blame ourselves afterward for not having an alert; that would be hindsight bias.

(Sometimes this is a hard thing for system administrators to accept. We want to think that we're perfectly monitoring our systems and that we've covered everything we could. But that's never going to be true. Monitoring and alerting are always full of tradeoffs and limitations.)

sysadmin/AlertsNeverComprehensive written at 22:13:39;

