Systemd timer units have the unfortunate practical effect of hiding errors

November 4, 2019

We've switched over to using Certbot as our Let's Encrypt client. As packaged for Ubuntu in their PPA, this is set up as a modern systemd-based package. In particular, it uses a systemd timer unit to trigger its periodic certificate renewal checks, instead of a cron job (which would be installed as a file in /etc/cron.d). This weekend, the TLS certificates on one of our machines silently failed to renew on schedule (at 30 days before they would expire, so this was not anywhere close to a crisis).

Upon investigation, we discovered a setup issue that had caused Certbot to error out (and then fixed it). However, this was not a new issue; in fact, Certbot had been reporting errors since October 22nd (every time certbot.service was triggered from certbot.timer, which is twice a day). That we hadn't heard about them points out a potentially significant difference between cron jobs and systemd timers, which is that cron jobs email you their errors and output, but systemd timers quietly swallow all errors and output into the systemd journal. This is a significant operational difference in practice, as we just found out.

(Technically it is the systemd service unit associated with the timer unit that swallows the output, not the timer unit itself.)

Had Certbot been using a cron job, we would have gotten email on the morning of October 22nd when Certbot first found problems. But since it was using a systemd timer unit, that error output went to the journal and was effectively invisible to us, lost within a flood of messages that we don't normally look at and cannot possibly routinely monitor. We only found out about the problem when the symptoms of Certbot not running became apparent, ie when a certificate failed to be renewed as expected.

Unfortunately there's no good way to fix this, at least within systemd. The systemd.exec StandardOutput= setting has many options but none of them is 'send email to', and I don't think there's any good way to add mailing the output with a simple drop-in (eg, there is no option for 'send standard output and standard error through a pipe to this other command'). Making certbot.service send us email would require a wholesale replacement of the command it runs, and at that point we might as well disable the entire Certbot systemd timer setup and supply our own cron job.
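To make the 'wholesale replacement of the command' option concrete, here is a minimal sketch of a cron-style wrapper that runs a command, captures its output, and mails it if there was any output or a non-zero exit status. This is an illustration and not something we run; the function names, the 'root' recipient, and the /usr/sbin/sendmail path are all assumptions.

```python
#!/usr/bin/env python3
"""Cron-style wrapper sketch: run a command, mail its output if there was any."""
import subprocess
import sys
from email.message import EmailMessage


def run_and_capture(argv):
    """Run argv and return (exit status, combined stdout and stderr text)."""
    proc = subprocess.run(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # fold stderr into stdout so both get mailed
        text=True,
    )
    return proc.returncode, proc.stdout


def build_report(argv, status, output, to_addr="root"):
    """Return an EmailMessage, or None if the run was silent and successful."""
    if status == 0 and not output.strip():
        return None
    msg = EmailMessage()
    msg["To"] = to_addr  # assumption: local delivery to root is what you want
    msg["Subject"] = f"{' '.join(argv)} (exit status {status})"
    msg.set_content(output or "(no output)\n")
    return msg


def main(argv):
    status, output = run_and_capture(argv)
    msg = build_report(argv, status, output)
    if msg is not None:
        # Assumption: a sendmail-compatible binary lives at this path.
        subprocess.run(["/usr/sbin/sendmail", "-t", "-oi"],
                       input=msg.as_bytes(), check=False)
    # Echo the output as well, so it still lands in the journal.
    sys.stdout.write(output)
    return status


if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1:]))
```

You would then have to override the service's ExecStart= to run the real command through this wrapper, which is exactly the sort of per-package surgery that makes a plain cron job look attractive.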

(We do monitor the status of some systemd units through Prometheus's host agent, so perhaps we should be setting an alert for certbot.service being in a failed state. Possibly among other .service units for important timer units, but then we'd have to hand-curate that list as it evolves in Ubuntu.)

PS: I think that you can arrange to get emailed if certbot.service fails, by using a drop in to add an 'OnFailure=' that starts a unit that sends email when triggered. But I don't think there's a good way to dig the actual error messages from the most recent attempt to start the service out of the journal, so the email would just be 'certbot.service failed on this host, please come look at the logs to see why'. This is an improvement, but it isn't the same as getting emailed the actual output and error messages. And I'm not sure if OnFailure= has side effects that would be undesirable.
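For illustration, here is a rough sketch of what a script started through 'OnFailure=' might look like if it tried to include recent journal output in the email. It assumes the failed unit's name is handed to it as an argument, that it runs with enough privilege to read the journal, and that a sendmail-compatible binary exists at /usr/sbin/sendmail; the function names are made up. It simply takes the last N journal lines for the unit, which is not the same as precisely the output of the most recent run.

```python
#!/usr/bin/env python3
"""Sketch of an OnFailure= helper: mail recent journal lines for a failed unit."""
import socket
import subprocess
import sys
from email.message import EmailMessage


def recent_journal(unit, lines=50):
    """Return the last `lines` journal lines logged for `unit`, as text."""
    proc = subprocess.run(
        ["journalctl", "--unit", unit, "--lines", str(lines), "--no-pager"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    return proc.stdout


def failure_mail(unit, log_text, host, to_addr="root"):
    """Build the notification email; the recipient is an assumption."""
    msg = EmailMessage()
    msg["To"] = to_addr
    msg["Subject"] = f"{unit} failed on {host}"
    msg.set_content(f"Recent journal entries for {unit}:\n\n{log_text}")
    return msg


def main(unit):
    msg = failure_mail(unit, recent_journal(unit), socket.gethostname())
    # Assumption: a sendmail-compatible binary lives at this path.
    subprocess.run(["/usr/sbin/sendmail", "-t", "-oi"],
                   input=msg.as_bytes(), check=False)


if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv[1])
```

The tail of the journal may include lines from earlier runs or cut off a long error, so treat the result as a hint about where to look and not as a faithful copy of what cron would have mailed.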


Comments on this page:

By keur at 2019-11-05 00:17:45:

>PS: I think that you can arrange to get emailed if certbot.service fails, by using a drop in to add an 'OnFailure=' that starts a unit that sends email when triggered. But I don't think there's a good way to dig the actual error messages from the most recent attempt to start the service out of the journal, so the email would just be 'certbot.service failed on this host, please come look at the logs to see why'.

This is already documented [1] and this post is spreading misinformation. Notice how the unit runs with its group set to systemd-journal, so you can access the logs of the last failed attempt.

[1] https://wiki.archlinux.org/index.php/Systemd/Timers#MAILTO

By dozzie at 2019-11-05 04:50:39:

keur: that's cute how you need to know systemd inside out to think up how to plug in your separate script for sending an e-mail on error (because OnFailure is not documented in systemd.timer(5)), a feature that cron has built in.

OnFailure is of course documented in systemd.unit(5) because it applies to all kinds of units. (Although, if it's not defined in systemd.timer(5), I would not be sure what a "failed timer" means. Maybe the timer is successful if the service is found, regardless of the service's status.)

We collect logs through a central service and have an alert set up to at least capture any "systemd service failed" alerts. We still have to go look at the log aggregator to find more details. Of course, sending logs to a third party doesn't work for everyone.

It's yet another reason to hate systemd. You always find out yet another workflow has been Changed For Your Own Good™ and have to track down all-new commands to deal with it in the middle of a crisis. And the things you can put in unit files aren't even stable across versions, so what you fix today might be broken tomorrow.

From 157.131.143.221 at 2019-11-05 12:25:37:

dozzie: It doesn't require knowing systemd inside and out because it is on the Arch Wiki. You are just bad.

By cks at 2019-11-05 13:45:51:

My own view is that while the Arch Wiki is a fine resource, it is not the same thing as systemd's actual documentation, which is in manpages installed on your system and on systemd's official site. The actual, official documentation provides neither warning nor fixes on this issue. It is nice that people have documented workarounds and fixes for this; it is not nice that they are not in the official documentation or that the out of the box state is dangerous, different from cron, and cannot be fixed inside systemd without using extra scripts.

If systemd timers are supposed to be a drop in replacement for cron.d entries, they should be as easy to operate and work as well, and do so without needing you to read third party wiki pages and assemble local scripts and local service units. Other systemd replacements for things have generally had this level of easy operation.

sapphirepaw: for my purposes, I guess 'failed timer' means both a failed .timer unit (if that can even happen) and a failed .service unit for the .timer unit. Here, what failed is the Certbot service unit; the Certbot timer worked fine (in that it triggered the service unit).

By keur at 2019-11-05 15:27:57:

cks: Yes, I agree it should probably be included in the official documentation. I think you are reaching when calling systemd timers a "drop in replacement" for cron. It doesn't say that anywhere on the man page. Even OnCalendar isn't 1:1 with cron syntax. It is an alternative to cron, not a replacement.

By cks at 2019-11-05 17:14:13:

One of the social problems on display in Certbot's case is that developers and software packagers are evidently being led to switch from cron.d jobs to systemd timers. The Ubuntu PPA Certbot packaging ships both a cron.d job and a systemd timer unit, but it prefers the systemd timer unit (the cron job disables itself if it detects that systemd is running); the Fedora Certbot package doesn't even have a cron.d file (and it doesn't arrange to send email on failure). There is at least an implicit view floating around that systemd timer units are the modern drop in replacement for cron.d jobs and you should switch from the latter to the former. As exhibited here, they are not equivalent and there are implications for making the switch, implications which may not be clearly understood in at least some cases.

(Various non-systemd documentation about timers talks about them as a cron replacement, too, although I don't think any systemd documentation does.)

All I meant about "failed timer" was that you can apparently put OnFailure into a *.timer file, but it's not obvious what that means to systemd (on what condition would that actually run?) If you're not even supposed to put that directive in the timer file at all, apparently that's another point of confusion systemd has introduced here. IDK.

By Rfraile at 2019-11-06 16:26:33:

Another funny thing about timers is that you can't use one to restart an already-active unit, for example to fire a nightly restart of the foo service. When a systemd timer elapses and the unit is already active, it is "simply left running".

It's documented in systemd.timer, but why narrow the .timer functionality this way? What do they gain from it?

Switching from cron to systemd timers is definitely an operational change.

The emphasis on emails feels like status quo bias, though. Imagine the situation was reversed: that everything was using systemd timers and then someone wrote cron and people started switching to that. In that case, there is a similar operational change. You'd switch from having a centralized status (e.g. systemctl list-units --failed) and centralized logging (the journal, which also defaults to forwarding to syslog) to crond sending emails. Is that an improvement, a step backwards, neither or both? Either way, I'd say the most important thing is that you need to integrate the new tool into your environment.

FWIW, at my work, we are in the process of converting all of our cron jobs into systemd service and timer pairs. One of the big reasons for that is that we already have systemd failed units monitored by Icinga, so this eliminates a separate way of monitoring things (emails to root) in favor of our unified alarming system. Also, emails are not great if an "every minute" or "every five minute" cron job starts failing.

We also expect other advantages. For one, service units are easier to develop & debug, as you can just start them with systemctl, without having to fiddle with the cron definition to run it at "the next minute" and then remember to change it back to the production timings when you're done. Also, systemd timers can be randomized to spread the load rather than having every system wake up at the same moment and start running e.g. daily jobs. (I'm aware that cronie has RANDOM_DELAY, but Debian & Ubuntu still use Debian's vixie-cron which does not.)

Time will tell if this was a good idea or not. Assuming this goes well for us, the next phase will be to switch from crond to systemd-cron, a (third-party) systemd generator that creates service and timer units from crontabs. This will dynamically convert any package cron jobs.

If emails are what you want, systemd timers are definitely a step backwards in that regard. Emails can be done, and systemd-cron has a setup for them for the units it converts, but it is additional work. And for timer-triggered services that are provided by distro packages (i.e. not you), while you can use drop-in config files to add the relevant configuration, you have to do that per-service. This is extra work, and more importantly, you have to know about all such units you have installed, which does not scale.

Another, more general, option would be to wire up something to check for failed units and send an email based on that.
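As a minimal sketch of that last idea (not a tested production tool), something like the following could be run periodically to list failed units and mail a summary. It assumes 'systemctl list-units --failed --no-legend --plain' prints one unit per line with the unit name in the first column, that mail to 'root' is wanted, and that a sendmail-compatible binary exists at /usr/sbin/sendmail; the function names are made up for illustration.

```python
#!/usr/bin/env python3
"""Sketch: mail a summary of systemd units that are in a failed state."""
import socket
import subprocess
from email.message import EmailMessage


def list_failed_output():
    """Return the raw text of systemctl's list of failed units."""
    proc = subprocess.run(
        ["systemctl", "list-units", "--failed", "--no-legend", "--plain"],
        stdout=subprocess.PIPE,
        text=True,
    )
    return proc.stdout


def parse_failed_units(text):
    """Extract unit names, assuming the name is the first column of each line."""
    units = []
    for line in text.splitlines():
        fields = line.split()
        if fields:  # skip blank lines
            units.append(fields[0])
    return units


def build_alert(units, host, to_addr="root"):
    """Return an EmailMessage listing failed units, or None if there are none."""
    if not units:
        return None
    msg = EmailMessage()
    msg["To"] = to_addr
    msg["Subject"] = f"{len(units)} failed systemd unit(s) on {host}"
    msg.set_content("\n".join(units) + "\n")
    return msg


def main():
    units = parse_failed_units(list_failed_output())
    msg = build_alert(units, socket.gethostname())
    if msg is not None:
        # Assumption: a sendmail-compatible binary lives at this path.
        subprocess.run(["/usr/sbin/sendmail", "-t", "-oi"],
                       input=msg.as_bytes(), check=False)
```

This has the same limitation as any status-based check: it tells you that a unit failed, not why, and it will keep mailing on every run until the unit is fixed or reset.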

Written on 04 November 2019.

Last modified: Mon Nov 4 23:02:04 2019