Systemd timer units have the unfortunate practical effect of hiding errors

November 4, 2019

We've switched over to using Certbot as our Let's Encrypt. As packaged for Ubuntu in their PPA, this is set up as a modern systemd-based package. In particular, it uses a systemd timer unit to trigger its periodic certificate renewal checks, instead of a cron job (which would be installed as a file in /etc/cron.d). This weekend, the TLS certificates on one of our machines silently failed to renew on schedule (at 30 days before it would expire, so this was not anywhere close to a crisis).

Upon investigation, we discovered a setup issue that had caused Certbot to error out (and then fixed it). However, this is not a new issue; in fact, Certbot has been reporting errors since October 22nd (every time certbot.service was triggered from certbot.timer, which is twice a day). That we hadn't heard about them points out a potentially significant difference between cron jobs and systemd timers, which is that cron jobs email you their errors and output, but systemd timers quietly swallow all errors and output into the systemd journal. This is a significant operational difference in practice, as we just found out.

(Technically it is the systemd service unit associated with the timer unit.)

Had Certbot been using a cron job, we would have gotten email on the morning of October 22nd when Certbot first found problems. But since it was using a systemd timer unit, that error output went to the journal and was effectively invisible to us, lost within a flood of messages that we don't normally look at and cannot possibly routinely monitor. We only found out about the problem when the symptoms of Certbot not running became apparent, ie when a certificate failed to be renewed as expected.

Unfortunately there's no good way to fix this, at least within systemd. The systemd.exec StandardOutput= setting has many options but none of them is 'send email to', and I don't think there's any good way to add mailing the output with a simple drop-in (eg, there is no option for 'send standard output and standard error through a pipe to this other command'). Making certbot.service send us email would require a wholesale replacement of the command it runs, and at that point we might as well disable the entire Certbot systemd timer setup and supply our own cron job.

(We do monitor the status of some systemd units through Prometheus's host agent, so perhaps we should be setting an alert for certbot.service being in a failed state. Possibly among other .service units for important timer units, but then we'd have to hand-curate that list as it evolves in Ubuntu.)

PS: I think that you can arrange to get emailed if certbot.service fails, by using a drop in to add an 'OnFailure=' that starts a unit that sends email when triggered. But I don't think there's a good way to dig the actual error messages from the most recent attempt to start the service out of the journal, so the email would just be 'certbot.service failed on this host, please come look at the logs to see why'. This is an improvement, but it isn't the same as getting emailed the actual output and error messages. And I'm not sure if OnFailure= has side effects that would be undesirable.

Written on 04 November 2019.
« Many of our 'worklog' messages currently assume a lot of context
Systemd needs official documentation on best practices »

Page tools: View Source, Add Comment.
Search:
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Mon Nov 4 23:02:04 2019
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.