2019-11-04
Systemd timer units have the unfortunate practical effect of hiding errors
We've switched over to using Certbot as our Let's Encrypt client. As packaged for Ubuntu in their PPA, it is set up as a modern systemd-based package. In particular, it uses a systemd timer unit to trigger its periodic certificate renewal checks, instead of a cron job (which would be installed as a file in /etc/cron.d). This weekend, the TLS certificates on one of our machines silently failed to renew on schedule (at 30 days before expiry, so this was not anywhere close to a crisis).
Upon investigation, we discovered a setup issue that had caused Certbot to error out (and then fixed it). However, this is not a new issue; in fact, Certbot has been reporting errors since October 22nd (every time certbot.service was triggered from certbot.timer, which is twice a day). That we hadn't heard about them points out a potentially significant difference between cron jobs and systemd timers, which is that cron jobs email you their errors and output, but systemd timers quietly swallow all errors and output into the systemd journal. This is a significant operational difference in practice, as we just found out.
(Technically it is the systemd service unit associated with the timer unit.)
Had Certbot been using a cron job, we would have gotten email on the morning of October 22nd when Certbot first found problems. But since it was using a systemd timer unit, that error output went to the journal and was effectively invisible to us, lost within a flood of messages that we don't normally look at and cannot possibly routinely monitor. We only found out about the problem when the symptoms of Certbot not running became apparent, ie when a certificate failed to be renewed as expected.
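For concreteness, a cron-based setup would have looked something like the following; cron mails a job's output and error messages to the MAILTO address (or the job's owner). The schedule and path here are illustrative assumptions, not Certbot's actual packaging:

```
# Hypothetical /etc/cron.d/certbot entry; schedule and path are illustrative.
# Any output the command produces (with '-q', normally only errors) is
# mailed by cron to the MAILTO address.
MAILTO=root
23 */12 * * * root /usr/bin/certbot -q renew
```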
Unfortunately there's no good way to fix this, at least within systemd. The systemd.exec StandardOutput= setting has many options, but none of them is 'send email to', and I don't think there's any good way to add mailing the output with a simple drop-in (eg, there is no option for 'send standard output and standard error through a pipe to this other command'). Making certbot.service send us email would require a wholesale replacement of the command it runs, and at that point we might as well disable the entire Certbot systemd timer setup and supply our own cron job.
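As a sketch of what that wholesale replacement might look like, here is a hypothetical drop-in override that clears the packaged command and substitutes a shell pipeline that mails any output. The file path, the mail command, and the destination address are all assumptions, and whether your mail(1) supports -E (skip sending empty messages) varies:

```ini
# Hypothetical drop-in: /etc/systemd/system/certbot.service.d/override.conf
[Service]
# The empty ExecStart= clears the packaged command; the second line
# replaces it with a pipeline that mails any output certbot produces.
# %H is systemd's specifier for the host name.
ExecStart=
ExecStart=/bin/sh -c 'certbot -q renew 2>&1 | mail -E -s "certbot output on %H" root'
```

(You'd need a 'systemctl daemon-reload' after adding such a drop-in, and this sort of override has to be re-examined whenever the package changes its own command line.)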
(We do monitor the status of some systemd units through Prometheus's host agent, so perhaps we should be setting an alert for certbot.service being in a failed state. Possibly among other .service units for important timer units, but then we'd have to hand-curate that list as it evolves in Ubuntu.)
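If we went that route, such an alert might look roughly like this, assuming the host agent's optional systemd collector is enabled and so exposes node_systemd_unit_state; this is a sketch, not our actual configuration:

```yaml
# Hypothetical Prometheus alert rule; the metric assumes node_exporter's
# (optional) systemd collector is turned on.
groups:
  - name: systemd-units
    rules:
      - alert: CertbotServiceFailed
        expr: node_systemd_unit_state{name="certbot.service", state="failed"} == 1
        for: 30m
        annotations:
          summary: "certbot.service is failed on {{ $labels.instance }}"
```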
PS: I think that you can arrange to get emailed if certbot.service fails, by using a drop-in to add an 'OnFailure=' that starts a unit that sends email when triggered. But I don't think there's a good way to dig the actual error messages from the most recent attempt to start the service out of the journal, so the email would just be 'certbot.service failed on this host, please come look at the logs to see why'. This is an improvement, but it isn't the same as getting emailed the actual output and error messages. And I'm not sure if OnFailure= has side effects that would be undesirable.
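A sketch of that OnFailure= approach, following the commonly described pattern of a small templated notification unit: both file names and the local mailing script are assumptions, not anything shipped with Certbot or systemd:

```ini
# Hypothetical drop-in: /etc/systemd/system/certbot.service.d/onfailure.conf
[Unit]
# %n expands to the full unit name, here 'certbot.service'.
OnFailure=status-email@%n.service

# Hypothetical template: /etc/systemd/system/status-email@.service
[Unit]
Description=Send a failure notification email for %i

[Service]
Type=oneshot
# /usr/local/sbin/systemd-email is an assumed local script that mails
# '%i failed on this host, please come look at the logs' to its first
# argument.
ExecStart=/usr/local/sbin/systemd-email root %i
```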
Many of our 'worklog' messages currently assume a lot of context
The primary way we keep track of things around here is our worklog system, which is a specific email log both of changes that we make and of how to do things (such as rebuild systems). Also, a while back I wrote about how keeping your past checklists doesn't help unless you can find them. In the process of recovering the checklist from 2015 that I was looking for, I wound up re-reading a bunch of our worklog messages from around that time, which taught me a lesson.
What I learned from re-reading our old messages is that most of our worklog messages assume a lot of implicit context. This makes perfect sense, since we write and send in our messages when the context is fresh in everyone's minds; it is the water that we're currently swimming in, and we're generally as oblivious to it as fish are. But looking back a few years later, that context is all gone. When I re-read our worklog messages I had to carefully rebuild the context, which I fortunately mostly could by reading enough messages from both worklog and our internal sysadmin mailing list (which is also archived; we try to discuss most everything on it, or at least mail in summaries of in-person discussions).
I don't think that we want to try to put down that implicit context in every worklog email; that would be tedious both to write and to read. But for worklog messages which we expect to refer to much later, for example as 'how we did this last time' directions, putting in much more explicit context and explanations seems like a good idea. Of course this is a bit tricky to actually do in practice for two reasons. The first is that what context needs to be explained for the future isn't necessarily clear to us, since we're immersed in it. The second is that it's not always clear what worklog messages we're going to want to refer back to in general. Some things we can predict, but others may look like one-off things until they come up again. Still, I can try, and I should, especially for big things that took a lot of planning.
(This is similar to the lessons I learned from some planned power shutdowns many years ago, in part 1 and then part 2. We've had some other power shutdowns since then, and the lessons have been useful and I've tried to carry them out.)
PS: Just in general it might not hurt to put a bit more context into my worklogs, if only in the form of a link to the first iteration of any particular thing, where I'm likely to have written out a bunch of the background. Especially if I put it in as a postscript, it's easy to skip most of the time.