In practice, 'alerts' can have different meanings in different organizations

September 2, 2023

One of the things I've become more and more aware of over time as I talk about our metrics, monitoring, and alerting system is that what 'alerts' are can vary quite a bit between environments, despite everyone using the same term and often the same technology to implement their particular form of 'alerts'. Some of the difference in what alerts mean is technological and some of it is organizational (or 'operational').

The first big difference, which is partly technology (in how alerts are delivered) and partly organizational, is whether 'alerts' must be noticed and responded to outside of regular working hours. In other words, it's whether or not alerts can wake people up in the middle of the night. If they can, then you want to be very sure that these alerts genuinely matter; you will, for example, have good reason to only alert on problems in 'user journeys' and defer notifications about anything else to regular working hours in some way or another.

(Well, you don't have to, but if you insist on waking people up in the middle of the night for everything, pretty soon you won't have very many people left to wake up. Especially good people.)

The second difference is how much alerts interrupt people and require them to interrupt their work. For example, there may be a policy that all alerts must be acknowledged and investigated promptly, even if they're 'during regular working hours' alerts. This pushes people to make alerts visible and interrupting, and requires people to drop what they're doing to investigate them. Again, this drives sensible places to make sure that 'alerts' really matter and to err on the side of doing more work in alert setup to be sure of that.

(Sometimes people have different sorts of 'alerts' where only some sorts (eg, 'P1 alerts') require 24/7 response or immediate action.)
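
As a rough illustration of that sort of split, here is a minimal Python sketch of routing alerts by severity and time of day. Everything in it is an assumption made up for the example: the 'p1' label, the 9-to-5 weekday working hours, and the 'page', 'email', and 'hold' destinations are hypothetical and aren't taken from any particular alerting system.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass
class Alert:
    name: str
    severity: str  # hypothetical labels such as "p1" or "p2"


# Assumed working hours for the example: weekdays, 09:00 to 17:00.
WORK_START = time(9, 0)
WORK_END = time(17, 0)


def is_working_hours(now: datetime) -> bool:
    # Monday is 0, so weekdays are 0 through 4.
    return now.weekday() < 5 and WORK_START <= now.time() < WORK_END


def route(alert: Alert, now: datetime) -> str:
    """Decide what to do with an alert at a given time.

    Returns "page" (wake someone up), "email" (send a notification now),
    or "hold" (defer until working hours).
    """
    if alert.severity == "p1":
        # Only the most severe alerts are allowed to interrupt at any hour.
        return "page"
    if is_working_hours(now):
        return "email"
    return "hold"
```

With this, a 'p1' alert at 3am on a Saturday is paged, while anything less severe is held until working hours and then sent as an ordinary email.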

We are a relatively extreme version of the other side of both of these differences. For us, alerts in general are not 'you must pay attention to this 24/7' but mostly 'here is something you probably want to look at'. Of course sometimes the thing we probably want to look at is 'something exploded' and we're going to jump on that, but there's no requirement that we look into all alerts immediately. Our alerts are designed to be quiet, but that isn't because they page us in the middle of the night; it's because we want to keep our email volume down and avoid alert fatigue, where we ignore alert emails and so miss more important issues.

My sense is that most places today have 'alerts' that at least sometimes are on the 'wake people up and/or interrupt their work' end of things, so that when you talk about 'alerts' in general, this is what most people assume. I don't think we have a good, well understood term for the less intense sort of alerts. In the past I've called them 'notifications', but I suspect that people wouldn't necessarily understand what I meant by that if, for example, I talked about 'how we use Prometheus to drive our notifications' (instead of 'alerts'). There's also the issue that a lot of our technology for this specifically talks about 'alerts' and things like 'alert rules'. It's hard to write about this in a clear way without using 'alert' and 'alerts'.

Written on 02 September 2023.
Last modified: Sat Sep 2 22:32:56 2023