The brute force cron-based way of flexibly timed repeated alerts

December 5, 2018

Suppose, not hypothetically, that you have a cron job that monitors something important. You want to be notified relatively fast if your Prometheus server is down, so you run your cron job frequently, say once every ten minutes. However, now we have the problem that cron is stateless, so if our Prometheus server goes down and our cron job starts alerting us, it will re-alert us every ten minutes. This is too much noise (at least for us).

There's a standard pattern for dealing with this in cron jobs that send alerts; once the alert happens, you create a state file somewhere and as long as your current state is the same as the state file, you don't produce any output or send out your warning or whatever. But this leads to the next problem, which is that you alert once and are then silent forever afterward, leaving it to people to remember that the problem (still) exists. It would be better to re-alert periodically, say once every hour or so. This isn't too hard to do; you can check to see if the state file is more than an hour old and just re-send the alert if it is.

(One way to do this is with 'find <file> -mmin +... -print'. Although it may not be Unixy, I do rather wish for olderthan and newerthan utilities as a standard and widely available thing. I know I can write them in a variety of ways, but it's not the same.)
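As a rough sketch of the state file pattern with an hourly repeat, here is what an olderthan helper and the alert decision could look like in shell. The function names, and the idea of passing the state file path as an argument, are my own illustration rather than anything standard; -mmin is not in POSIX but both GNU and BSD find support it.

```shell
# olderthan FILE MINUTES: succeed if FILE exists and was last modified
# more than MINUTES minutes ago. (Hypothetical helper, not a standard tool.)
olderthan() {
    [ -e "$1" ] && [ -n "$(find "$1" -mmin +"$2" -print 2>/dev/null)" ]
}

# should_alert STATEFILE MINUTES: succeed if there is no state file yet
# (this is the first alert) or the existing one is old enough that a
# repeat is due. In both cases the state file is (re)created, so its
# timestamp starts counting again from now.
should_alert() {
    if [ ! -e "$1" ] || olderthan "$1" "$2"; then
        touch "$1"
        return 0
    fi
    return 1
}
```

A checker would then wrap its warning in something like 'if should_alert /some/state/file 60; then echo "server is down"; fi', relying on cron to mail out whatever the job prints.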

But this isn't really what we want, because we aren't around all of the time. Re-sending the alert once an hour in the middle of the night or the middle of the weekend will just give us a big pile of junk email to go through when we get back in to the office; instead we want repeats only once every hour or two during weekdays.

When I was writing our checker script, I got to this point and started planning out how I was going to compare against the current hour and day of week in the script to know when I should clear out the state file and so on. Then I had a flash of the obvious and realized that I already had a perfectly good tool for flexibly specifying various times and combinations of time conditions, namely cron itself. The simple way to reset the state file and cause re-alerts at whatever flexible set of times and time patterns I want is to do it through crontab entries.

So now I have one cron entry that runs every ten minutes for the main script, and another cron entry that clears the state file (if it exists) several times a day on weekdays. If we decide we want to be re-notified once a day during the weekend, that'll be easy to add as another cron entry. As a bonus, everyone here understands cron entries, so it will be immediately obvious when things run and what they do in a way that it wouldn't be if all of this was embedded in a script.
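For illustration, the pair of crontab entries might look something like the following. The install path and the specific clearing times are assumptions made up for this sketch, not our actual crontab.

```
# Run the main check every ten minutes.
*/10 * * * *             /usr/local/sbin/check-promserver
# Clear the 'already alerted' state a few times a day, Monday to Friday.
5 9,11,13,15,17 * * 1-5  /usr/local/sbin/check-promserver --clear
```

A once-a-day weekend repeat would be one more line with a day-of-week field of 0,6 and a single hour.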

(It's also easy for anyone to change. We don't have to reach into a script; we just change crontab lines, something we're already completely familiar with.)

As it stands this is slightly too simplistic, because it clears the state file without caring how old it is. In theory we could generate an alert shortly before the state file is due to be cleared, clear the state file, and then immediately re-alert. To deal with that I decided to go the extra distance and only clear the state file if it was at least a minimum age (using find to see if it was old enough, because we make do with the tools Unix gives us).
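A minimal sketch of the 'only clear it if it's old enough' step, again leaning on find. The function name and its arguments are hypothetical, and this assumes a find that supports -mmin.

```shell
# clear_if_old STATEFILE MIN_MINUTES: remove STATEFILE only if it was last
# modified more than MIN_MINUTES minutes ago. A missing or too-new file
# is left alone, and neither case counts as an error.
clear_if_old() {
    [ -e "$1" ] || return 0
    find "$1" -mmin +"$2" -exec rm -f {} \; 2>/dev/null || true
}
```

With a minimum age of, say, 20 minutes, an alert that fired just before the scheduled clear survives that clear and is not repeated until the next scheduled one.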

(In my actual implementation, the main script takes a special argument that makes it just clear the state file. This way only the script has to know where the state file is or even just what to do to clear the 'do not re-alert' state; the crontab entry just runs 'check-promserver --clear'.)
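To show the overall shape, here is a hedged sketch of a script body that handles a --clear argument itself. This is not the real check-promserver; the state file path, the minimum age, the server_is_up probe (which assumes Prometheus on localhost:9090 and uses its /-/healthy endpoint), and the choice to drop the state file when the server recovers are all assumptions for illustration.

```shell
# Hypothetical settings; only the script needs to know these.
STATEFILE="${STATEFILE:-/var/run/check-promserver.alerted}"
MINAGE="${MINAGE:-20}"   # assumed minimum age, in minutes, before a clear applies

# Placeholder health probe; substitute whatever check you actually use.
server_is_up() {
    curl -fsS -o /dev/null http://localhost:9090/-/healthy
}

check_promserver() {
    if [ "${1:-}" = "--clear" ]; then
        # Drop the 'do not re-alert' state, but only if it is old enough.
        if [ -e "$STATEFILE" ]; then
            find "$STATEFILE" -mmin +"$MINAGE" -exec rm -f {} \; 2>/dev/null || true
        fi
        return 0
    fi
    if server_is_up; then
        rm -f "$STATEFILE"      # recovered, so forget the alert state
    elif [ ! -e "$STATEFILE" ]; then
        touch "$STATEFILE"      # remember that we have alerted
        echo "Prometheus server appears to be down"   # cron mails this out
    fi
}
```

The script itself would end with 'check_promserver "$@"', so that the crontab entries never need to mention the state file.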

Comments on this page:

It's pretty easy to set up two Prometheus servers to monitor each other. Just an option to consider.

Written on 05 December 2018.

Last modified: Wed Dec 5 01:23:02 2018