One of the things a metrics system does is handle state for you
Over on Mastodon, I said:
Belated obvious realization: using a metrics system for alerts instead of hand-rolled checks means that you can outsource handling state to your metrics systems and everything else can be stateless. Want to only alert after a condition has been true for an hour? Your 'check the condition' script doesn't have to worry about that; you can leave it to the metrics system.
This sounds abstract, so let me make it concrete. We have some self-serve registration portals that work on configuration files that are automatically checked into RCS every time the self-serve systems do something. As a safety measure, the automated system refuses to do anything if the file is either locked or has uncommitted changes; if it touched the file anyway, it might collide with whatever else is being done to it. These files can also be hand-edited, for example to remove an entry, and when we do this we don't always remember that we have to commit the file.
(Or we may be distracted, because we are trying to work fast to lock a compromised account as soon as possible.)
Recently, I was planning out how to detect this situation and send out alerts for it. Given that we have a Prometheus-based metrics and alerting system, one approach is to have a hand-rolled script that generates an 'all is good' or 'we have problems' metric, feed that into Prometheus, let Prometheus grind it through all of the gears of alert rules and so on, and wind up with Alertmanager sending us email. But this seems like a lot of extra work just to send email, and it requires a new alert rule, and so on. Using Prometheus also constrains what additional information we can put in the alert email, because we have to squeeze it all through the narrow channel of Prometheus metrics, the information that an alert rule has readily available, and so on. At first blush, it seemed simpler to just have the hand-rolled checking script send the email itself, which would also let the email message be completely specific and informative.
But then I started thinking about that in more detail. We don't want the script to be hair-trigger, because it might run while we are in the middle of editing things (or while the automated system is making a change); we need to wait a bit to make sure the problem is real. We also don't want to send repeat emails all the time, because it's not that critical (the self-serve registration portals aren't used very frequently). Handling all of this requires state, and that means something has to handle that state. You can handle state in scripts, but it gets complicated. The more I thought about it, the more attractive it was to let Prometheus handle all of that; it already has good mechanisms for 'only trigger an alert if it's been true for X amount of time' and 'only send email every so often' and so on, and its developers have worried about more corner cases than I have.
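To give a feel for the state involved, here is a rough sketch of what a hand-rolled script would have to track on its own. Everything in it is hypothetical (the names, the thresholds, the shape of the state); it is not what we run. In Prometheus terms, the first check is roughly what an alert rule's 'for' clause does and the second is roughly what Alertmanager's 'repeat_interval' does.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical thresholds, chosen only for illustration.
PENDING_SECONDS = 15 * 60        # problem must persist this long before we alert
REPEAT_SECONDS = 24 * 60 * 60    # do not re-send email more often than this


@dataclass
class AlertState:
    first_seen: Optional[float] = None  # when the current problem was first observed
    last_sent: Optional[float] = None   # when we last sent email about it


def step(state: AlertState, problem: bool, now: float) -> Tuple[AlertState, bool]:
    """Return the updated state and whether this run should send email."""
    if not problem:
        # Problem cleared: forget everything so the next one starts fresh.
        return AlertState(), False
    first_seen = state.first_seen if state.first_seen is not None else now
    if now - first_seen < PENDING_SECONDS:
        # Seen, but not for long enough yet; it may be an edit in progress.
        return AlertState(first_seen, state.last_sent), False
    if state.last_sent is not None and now - state.last_sent < REPEAT_SECONDS:
        # Already alerted recently; stay quiet.
        return AlertState(first_seen, state.last_sent), False
    return AlertState(first_seen, now), True
```

And this is only the decision logic. A real script would also have to save and reload this state between runs, cope with a missing or corrupt state file, and decide what happens if a run is skipped, which is where the complication comes in.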
The great advantage of feeding 'we have a problem/we have no problem' indications into the grinding maw of Prometheus merely to have it eventually send us alert email is that the metrics system will handle state for us. The extra custom things that we need to write, our highly specific checks and so on, are spared from worrying about all of those issues, which makes them simpler and more straightforward. To use jargon, the metrics system has enabled a separation of concerns.
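As an illustration of how little the check itself then has to do, here is a minimal sketch of a stateless checker that only reports what is true right now. The metric name and label are made up, and how the output reaches Prometheus (for instance, a file picked up by the node_exporter textfile collector) is an assumption rather than a description of our setup. It relies on two standard RCS behaviours: 'rlog -L -R' prints the RCS file name only when locks are set, and 'rcsdiff' exits with status 1 when the working file differs from the latest revision.

```python
import subprocess
import sys

# Hypothetical metric name; use whatever fits your own naming scheme.
METRIC = "selfserve_config_file_problem"


def has_locks(path: str) -> bool:
    # 'rlog -L -R' ignores files with no locks and otherwise prints the RCS file name.
    out = subprocess.run(["rlog", "-L", "-R", path],
                         capture_output=True, text=True).stdout
    return bool(out.strip())


def has_uncommitted_changes(path: str) -> bool:
    # rcsdiff exits 0 for no differences, 1 for differences, 2 for trouble.
    # This sketch treats trouble as a problem too.
    rc = subprocess.run(["rcsdiff", "-q", path],
                        stdout=subprocess.DEVNULL,
                        stderr=subprocess.DEVNULL).returncode
    return rc != 0


def render_metric(path: str, locked: bool, dirty: bool) -> str:
    # One line of Prometheus text exposition format: 1 means 'we have a problem'.
    label = path.replace("\\", "\\\\").replace('"', '\\"')
    value = 1 if (locked or dirty) else 0
    return f'{METRIC}{{file="{label}"}} {value}\n'


if __name__ == "__main__":
    # No memory of previous runs: just report the current condition for each file.
    for path in sys.argv[1:]:
        sys.stdout.write(
            render_metric(path, has_locks(path), has_uncommitted_changes(path)))
```

The script never has to know how long the problem has existed or whether anyone has been told about it; those questions are left to the alert rule and to Alertmanager.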
PS: This isn't specific to Prometheus. Any good metrics and alerting system should have robust general features to handle most or even all of these issues. And Prometheus itself is not perfect; for example, it's awkward at best to set up alerts that trigger only between certain times of the day or on certain days of the week.