One simple general pattern for making sure things are alive

August 10, 2018

One perpetual problem in system monitoring is detecting when something goes away. Detecting the presence of something is often easy because it reports itself, but detecting absence is usually harder. For example, it generally doesn't work well to have some software system email you when it completes its once-a-day task, because the odds are only so-so that you'll actually notice the day the expected email isn't in your mailbox.

One general pattern for dealing with this is what I'll call a staleness timer. In a staleness timer you have a timer that effectively slowly counts down; when the timer reaches 0, you get an alert. When systems report in that they're alive, this report resets their timer to its full value. You can implement this as a direct timer, or you can write a check that is 'if system last reported in more than X time ago, raise an alert' (and have this check run every so often).
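As a minimal sketch of the 'last reported in more than X time ago' form of the check, here is one way it could look in Python. The class and method names (`StalenessTimer`, `report_in`, `stale_sources`) are made up for illustration, and the state lives only in memory; a real version would presumably persist it somewhere and hook `stale_sources()` up to whatever raises your alerts.

```python
import time


class StalenessTimer:
    """Track when each source last reported in and flag the stale ones."""

    def __init__(self, max_age_seconds):
        self.max_age = max_age_seconds
        self.last_seen = {}  # source name -> timestamp of its last report

    def expect(self, source, now=None):
        """Register a source so it goes stale even if it never reports at all."""
        self.last_seen.setdefault(source, time.time() if now is None else now)

    def report_in(self, source, now=None):
        """Record an 'I am alive' report, resetting that source's timer."""
        self.last_seen[source] = time.time() if now is None else now

    def stale_sources(self, now=None):
        """Return the sources whose last report is more than max_age old."""
        now = time.time() if now is None else now
        return sorted(
            source
            for source, seen in self.last_seen.items()
            if now - seen > self.max_age
        )
```

The periodic check is then just a matter of calling `stale_sources()` every so often (from cron or a loop) and alerting on anything it returns. The `expect()` method is there because a source that has never reported in would otherwise be invisible to the check.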

(More generally, if you have an overall metrics system you can presumably write an alert for 'last metric from source <X> is more than <Y> old'.)

In a way this general pattern works because you've flipped the problem around. Instead of the default state being silence and exceptional things having to happen to generate an alert, the default state is an alert and exceptional things have to happen to temporarily suppress the alert.

There are all sorts of ways of making programs and systems report in, depending on what you have available and what you want to check. Traditional low-rent approaches are touching files and sending email to special dedicated email aliases (which may write incoming email to a file, or simply run a program on incoming email that touches a relevant file). These can have the drawback that they depend on multiple different systems all working, but they often have the advantage that you already have them working (and sometimes it's a feature to verify all of the systems at once).
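The file-touching approach can be sketched in a few lines of Python. This is only an illustration: the function names are invented here, and the file path and age threshold are whatever suits your setup. The reporting side updates a file's modification time, and the checking side treats a missing or too-old file as stale.

```python
import os
import time


def report_alive(path):
    """Rough equivalent of 'touch path': create the file if needed, bump its mtime."""
    with open(path, "a"):
        pass
    os.utime(path, None)  # None means "set access and modification times to now"


def is_stale(path, max_age_seconds, now=None):
    """True if the file is missing or was last touched more than max_age_seconds ago."""
    now = time.time() if now is None else now
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        # A file that was never touched counts as stale, not as healthy.
        return True
    return now - mtime > max_age_seconds
```

Treating a missing file as stale keeps the 'default state is an alert' property: a job that has never run at all gets flagged too.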

(If you have a real monitoring system, it hopefully already provides a full selection of ways to submit 'I am still alive' notifications to it. There probably is a very simple system that just does this based on netcat-level TCP messages or the like, too; it seems like the kind of thing sysadmins write every so often. Or perhaps we are just unusual in never having put together a modern, flexible, and readily customizable monitoring system.)
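For what it's worth, a netcat-level receiver along these lines might look something like the following Python sketch. Everything specific here is an assumption made up for illustration: the one-line `alive <name>` message format, the names `record_heartbeat`, `HeartbeatHandler`, and `make_server`, and the in-memory `last_seen` table. It only records when each name last reported; a separate periodic check would still have to compare those timestamps against a threshold.

```python
import socketserver
import threading
import time

last_seen = {}  # name -> timestamp of the most recent heartbeat
last_seen_lock = threading.Lock()


def record_heartbeat(line, now=None):
    """Parse one 'alive <name>' line; return the recorded name, or None if malformed."""
    parts = line.strip().split()
    if len(parts) != 2 or parts[0] != "alive":
        return None
    with last_seen_lock:
        last_seen[parts[1]] = time.time() if now is None else now
    return parts[1]


class HeartbeatHandler(socketserver.StreamRequestHandler):
    """Read a single short line from each connection and record it."""

    def handle(self):
        line = self.rfile.readline(256).decode("ascii", errors="replace")
        record_heartbeat(line)


def make_server(host="127.0.0.1", port=0):
    """Create (but do not start) the listener; port 0 lets the OS pick a free port."""
    return socketserver.ThreadingTCPServer((host, port), HeartbeatHandler)
```

To run it you would call `make_server(port=...)` with a port of your choosing and then `serve_forever()` on the result. A client then needs nothing more than a way to send one line of text over TCP to that port, which is the sort of thing netcat does.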

All of this is reasonably obvious and well known in the general community, but for my own reasons I want to write it down explicitly.


Comments on this page:

By Sixty4k at 2018-08-10 13:59:25:

It’s important to document, even just for ourselves, what’s sometimes lost ‘common sense.’

By Chris D at 2018-08-11 14:45:13:

My company uses the term DeadmanCheck for this. Very, very useful technique - although sometimes sensitive to the latency of what it's checking.

By JCC at 2018-09-20 20:40:49:

A fundamental design of the Big Brother monitoring system, and more specifically its spiritual successor Xymon (formerly Hobbit), is that all messages have a TTL, with a system default to cover those that come in without one (typically 30m). An expiry time is calculated on arrival, and we walk through memory once a minute to look for old messages. All of that is kept in memory, so it's trivial to do. (Messages come in via TCP, so if the sender thinks something didn't make it, it should retry.)

It leads to the surprising (for some) result that once you start sending messages (ie, for a new type of monitor), you need to keep sending them or your monitoring admin will eventually get alerted about stale (purple) messages and go after you.

It seems like such a simple approach that I'm honestly surprised more monitoring systems don't approach things this way.

Written on 10 August 2018.

Last modified: Fri Aug 10 00:42:26 2018