The benefits of driving automation through cron

August 10, 2018

In light of our problem with timesyncd, we needed a different (and working) solution for time synchronization on our Ubuntu 18.04 machines. The obvious solution would have been to switch over to chrony; Ubuntu even has chrony set up so that if you run it, timesyncd is automatically blocked. I like chrony so I was tempted by this idea briefly, but then I realized that using chrony would mean having yet another daemon that we have to care about. Instead, our replacement for timesyncd is running ntpdate from cron.

There are a number of quiet virtues to driving automation out of cron entries. The whole approach is simple and brute force, but this creates a great deal of reliability. Cron basically never dies, and if it ever did, it's so central to how our systems operate that we'd probably notice fairly fast. If we're ever in any doubt, cron logs what it runs to syslog (and thus to our central syslog server), and if jobs fail or produce output, cron has a very reliable and well tested system for reporting that to us. A simple cron entry that runs ntpdate has no ongoing state that can get messed up, so if cron is running at all, ntpdate is running at its scheduled interval and our clocks will stay synchronized. If something goes wrong on one run, it doesn't really matter because cron will run it again later. Network down temporarily? DNS resolution broken? NTP servers unhappy? Cure the issue and we'll automatically get time synchronization back.

A cron job is simple blunt force; it repeats its activities over and over and over again, throwing itself at the system until it batters its way through and things work. Unless you program it otherwise, it's stateless and so indifferent to what happened the last time around. There's a lot to be said for this in many system tasks, including synchronizing the clock.

(Of course this can be a drawback if you have a cron job that's failing and generating email on every failure, when you'd like just one email on the first failure. Life is not perfect.)
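If you do want only one email per incident, the usual trick is to give the job a tiny bit of state. What follows is a minimal sketch in Python, not something we actually run; the function names and the idea of a per-job state file are assumptions made up for illustration. It relies on cron mailing whatever a job prints, so it prints the command's output only on the first failure in a row and stays silent until the command succeeds again.

```python
import os
import subprocess
import sys


def run_and_filter(cmd, state_file):
    """Run cmd (a list of arguments) and return (exit_status, text_to_report).

    text_to_report is non-empty only on the first failure after a success.
    state_file is a hypothetical marker file; it exists while the job is
    known to be failing.
    """
    result = subprocess.run(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    already_failing = os.path.exists(state_file)

    if result.returncode == 0:
        # Success: forget any earlier failure so the next one is reported.
        if already_failing:
            os.remove(state_file)
        return 0, ""

    if already_failing:
        # Repeat failure: say nothing, so cron has nothing to mail.
        return result.returncode, ""

    # First failure in a row: remember it and pass the output through.
    with open(state_file, "w") as f:
        f.write("%d\n" % result.returncode)
    report = "failed with status %d: %s\n%s" % (
        result.returncode, " ".join(cmd), result.stdout)
    return result.returncode, report


def main(argv):
    """argv is [program, state_file, command, args...]; returns an exit status."""
    if len(argv) < 3:
        sys.stderr.write("usage: %s STATE_FILE COMMAND [ARG...]\n" % argv[0])
        return 2
    status, report = run_and_filter(argv[2:], argv[1])
    sys.stdout.write(report)
    return status
```

A wrapper script would finish with sys.exit(main(sys.argv)), and the cron entry would run the wrapper instead of the command itself. The cost is exactly the thing the plain cron job avoids: there is now state on disk that can get out of step with reality.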

There's always a temptation in system administration to make things complicated, to run daemons and build services and so on. But sometimes the straightforward brute force way is the best answer. We could run an NTP daemon on our Ubuntu machines, and on a few of them we probably will (such as our new fileservers), but for everything else, a cron job is the right approach. Probably it's the right approach for some of our other problems, too.

(If timesyncd worked completely reliably on Ubuntu 18.04, we would likely stick with it simply because it's less work to use the system's default setup. But since it doesn't, we need to do something.)

PS: Although we don't actively monitor cron right now, there are ways to notice if it dies. Possibly we should add some explicit monitoring for cron on all of our machines, given how central it is to things like our password propagation system. Sure, we'd notice sooner or later anyway, but noticing sooner is good.
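One simple form such monitoring could take is a heartbeat file: a cron entry touches a file every minute or so, and something else checks how old the file is. Here is a minimal sketch of the checking side, assuming a hypothetical heartbeat path and threshold; none of this is what we actually run. The check obviously has to be driven by something other than the cron it's watching, such as a monitoring system on another machine.

```python
import os
import time


def heartbeat_age(path, now=None):
    """Return seconds since path was last modified, or None if it is missing."""
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        return None
    if now is None:
        now = time.time()
    return now - mtime


def cron_looks_dead(path, max_age, now=None):
    """True if the heartbeat file is missing or older than max_age seconds.

    path is a hypothetical file that a cron entry touches on every run.
    """
    age = heartbeat_age(path, now)
    return age is None or age > max_age
```

With a heartbeat touched every minute, a max_age of several minutes leaves room for a slow or briefly overloaded machine without raising false alarms.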


Comments on this page:

By yuri at 2018-08-10 23:46:18:

Have you ever tried systemd timers? I don't think you even need to care about it dying then, and there's a systemd-cron package that allows in-place conversion.

One of my favorite monitoring tools around cron tasks (as well as with systemd timers) is https://healthchecks.io/. This is distinctly within the realm of your previous post, and makes for a very useful complementary monitoring tool.

Are you taking care to use the -B option of ntpdate so that it only slews, not steps? A backwards time step can be really confusing for some applications.

By cks at 2018-08-11 07:22:49:

We're not using ntpdate's -B option because for us it's more important that the system's time be corrected relatively rapidly in the face of a large adjustment, even if this results in a time jump as seen by user-run applications. This is in large part a product of our unusual sysadmin environment, where we don't really have or run applications. Most everything that's happening on our systems is short-term processes (which mostly won't notice a time jump because they aren't running during it), and the rest are long term user-run things, which just get to deal with it.

(In practice most of the long-running processes around here are user compute jobs, which don't care either; they're just churning through their calculations as fast as possible.)

Written on 10 August 2018.

Last modified: Fri Aug 10 13:37:44 2018