Systemd timer units have the unfortunate practical effect of hiding errors
We've switched over to using Certbot
as our Let's Encrypt client. As packaged for
Ubuntu in their PPA, this is
set up as a modern systemd-based package. In particular, it uses
a systemd timer unit to
trigger its periodic certificate renewal checks, instead of a cron
job (which would be installed as a file in /etc/cron.d). This
weekend, the TLS certificates on one of our machines silently failed
to renew on schedule (at 30 days before they would expire, so this
was not anywhere close to a crisis).
Upon investigation, we discovered a setup issue that had caused
Certbot to error out (and then fixed it). However, this is not a
new issue; in fact, Certbot has been reporting errors since October
22nd (every time certbot.service was triggered from certbot.timer,
which is twice a day). That we hadn't heard about them points out
a potentially significant difference between cron jobs and systemd
timers, which is that cron jobs email you their errors and output,
but systemd timers quietly swallow all errors and output into the
systemd journal. This is a significant operational difference in
practice, as we just found out.
(Technically, what swallows the errors and output is the systemd service unit associated with the timer unit, not the timer unit itself.)
Had Certbot been using a cron job, we would have gotten email on the morning of October 22nd when Certbot first found problems. But since it was using a systemd timer unit, that error output went to the journal and was effectively invisible to us, lost within a flood of messages that we don't normally look at and cannot possibly routinely monitor. We only found out about the problem when the symptoms of Certbot not running became apparent, ie when a certificate failed to be renewed as expected.
Unfortunately there's no good way to fix this, at least within
systemd. The systemd.exec StandardOutput=
setting has many options but none of them is 'send email to', and
I don't think there's any good way to add mailing the output with
a simple drop-in (eg, there is no option for 'send standard output
and standard error through a pipe to this other command'). Making
certbot.service send us email would require a wholesale replacement
of the command it runs, and at that point we might as well disable
the entire Certbot systemd timer setup and supply our own cron job.
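
To make 'wholesale replacement' concrete, here is a rough sketch of what such a drop-in might look like. The wrapper /usr/local/sbin/mail-output is hypothetical (something that runs a command and mails any output it produces), and the certbot command line shown is an assumption about what the packaged certbot.service runs; 'systemctl cat certbot.service' will show the real one.

```ini
# /etc/systemd/system/certbot.service.d/mail-output.conf
[Service]
# An empty ExecStart= clears the packaged command so it can be replaced.
ExecStart=
# Hypothetical wrapper script; the certbot arguments are an assumption
# and would have to be copied from the packaged unit.
ExecStart=/usr/local/sbin/mail-output /usr/bin/certbot -q renew
```

A drop-in like this would have to be kept in sync by hand if the package ever changed the command it runs, which is why it doesn't look much better than supplying a cron job.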
(We do monitor the status of some systemd units through Prometheus's
host agent, so perhaps we should be setting an alert for
certbot.service being in a failed state. Possibly among other
.service units for important timer units, but then we'd have to
hand-curate that list as it evolves in Ubuntu.)
PS: I think that you can arrange to get emailed if certbot.service
fails, by using a drop-in to add an 'OnFailure=' that starts a unit
that sends email when triggered. But I don't
think there's a good way to dig the actual error messages from the
most recent attempt to start the service out of the journal, so the
email would just be 'certbot.service failed on this host, please
come look at the logs to see why'. This is an improvement, but it
isn't the same as getting emailed the actual output and error
messages. And I'm not sure if OnFailure= has side effects that
would be undesirable.
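
As a minimal sketch of the OnFailure= idea, the drop-in and the unit it starts might look something like the following. Both the template unit failure-email@.service and the script /usr/local/sbin/failure-email are hypothetical names made up for illustration; neither comes with Ubuntu or the Certbot packaging, and the script is assumed to take a unit name and mail a short notice about it.

```ini
# /etc/systemd/system/certbot.service.d/onfailure.conf
# Drop-in: when certbot.service enters the failed state, start the
# (hypothetical) notification unit. %n expands to this unit's full name.
[Unit]
OnFailure=failure-email@%n.service

# /etc/systemd/system/failure-email@.service
# Hypothetical template unit, in a separate file. %i is the instance
# name, which here is the name of the unit that failed.
[Unit]
Description=Send email about failure of %i

[Service]
Type=oneshot
# Hypothetical local script that mails a short notice naming the unit.
ExecStart=/usr/local/sbin/failure-email %i
```

After creating both files, a 'systemctl daemon-reload' is needed for systemd to notice them. This only reports that the unit failed; it does not by itself put Certbot's actual output in the email.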