Wandering Thoughts archives

2019-11-04

Systemd timer units have the unfortunate practical effect of hiding errors

We've switched over to using Certbot as our Let's Encrypt client. As packaged for Ubuntu in their PPA, this is set up as a modern systemd-based package. In particular, it uses a systemd timer unit to trigger its periodic certificate renewal checks, instead of a cron job (which would be installed as a file in /etc/cron.d). This weekend, the TLS certificates on one of our machines silently failed to renew on schedule (at 30 days before they would expire, so this was not anywhere close to a crisis).
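
For reference, the packaged timer unit looks roughly like this (a sketch from memory, so the exact schedule and options may differ between package versions):

    [Unit]
    Description=Run certbot twice daily

    [Timer]
    # Run at midnight and noon, with a random delay spread over twelve
    # hours so that everyone's renewal checks don't hit Let's Encrypt
    # at the same moment.
    OnCalendar=*-*-* 00,12:00:00
    RandomizedDelaySec=43200
    Persistent=true

    [Install]
    WantedBy=timers.target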

Upon investigation, we discovered a setup issue that had caused Certbot to error out (and then fixed it). However, this is not a new issue; in fact, Certbot has been reporting errors since October 22nd (every time certbot.service was triggered from certbot.timer, which is twice a day). That we hadn't heard about them points out a potentially significant difference between cron jobs and systemd timers, which is that cron jobs email you their errors and output, but systemd timers quietly swallow all errors and output into the systemd journal. This is a significant operational difference in practice, as we just found out.

(Technically it is the systemd service unit associated with the timer unit that actually runs Certbot and produces the errors.)

Had Certbot been using a cron job, we would have gotten email on the morning of October 22nd when Certbot first found problems. But since it was using a systemd timer unit, that error output went to the journal and was effectively invisible to us, lost within a flood of messages that we don't normally look at and cannot possibly routinely monitor. We only found out about the problem when the symptoms of Certbot not running became apparent, ie when a certificate failed to be renewed as expected.
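
The error output is recoverable from the journal if you go looking for it; the problem is that nothing pushes it at you. For example:

    # show the unit's state and its most recent log lines
    systemctl status certbot.service

    # pull out everything the service has logged since a given date
    journalctl -u certbot.service --since 2019-10-22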

Unfortunately there's no good way to fix this, at least within systemd. The systemd.exec StandardOutput= setting has many options but none of them is 'send email to', and I don't think there's any good way to add mailing the output with a simple drop-in (eg, there is no option for 'send standard output and standard error through a pipe to this other command'). Making certbot.service send us email would require a wholesale replacement of the command it runs, and at that point we might as well disable the entire Certbot systemd timer setup and supply our own cron job.
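
As a minimal sketch of that last option (the file name and the exact schedule here are invented for illustration):

    # stop the packaged timer from triggering renewal checks
    systemctl disable --now certbot.timer

    # /etc/cron.d/certbot-local: run the renewal check twice a day.
    # Unlike the timer unit, cron will mail us anything certbot prints,
    # errors included.
    MAILTO=root
    17 5,17 * * *  root  /usr/bin/certbot -q renew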

(We do monitor the status of some systemd units through Prometheus's host agent, so perhaps we should be setting an alert for certbot.service being in a failed state. Possibly among other .service units for important timer units, but then we'd have to hand-curate that list as it evolves in Ubuntu.)
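
With the host agent's systemd collector enabled, the expression such an alert rule would watch might look something like this (a sketch; the exact metric and labels depend on how the node_exporter is set up):

    node_systemd_unit_state{name="certbot.service", state="failed"} == 1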

PS: I think that you can arrange to get emailed if certbot.service fails, by using a drop-in to add an 'OnFailure=' that starts a unit that sends email when triggered. But I don't think there's a good way to dig the actual error messages from the most recent attempt to start the service out of the journal, so the email would just be 'certbot.service failed on this host, please come look at the logs to see why'. This is an improvement, but it isn't the same as getting emailed the actual output and error messages. And I'm not sure if OnFailure= has side effects that would be undesirable.
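
A sketch of that approach, assuming a hypothetical status-email@.service template unit and a local mail-sending script (both are inventions here, not something the Certbot package provides):

    # /etc/systemd/system/certbot.service.d/onfailure.conf
    [Unit]
    OnFailure=status-email@%n.service

    # /etc/systemd/system/status-email@.service
    [Unit]
    Description=Send a failure notification email for %i

    [Service]
    Type=oneshot
    # /usr/local/sbin/systemd-email is an invented local script that
    # mails its arguments to root; it could run 'systemctl status' on
    # the failed unit and include that output, but not much more.
    ExecStart=/usr/local/sbin/systemd-email root %i

After adding the drop-in you need a 'systemctl daemon-reload' before systemd will notice it.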

linux/SystemdTimersAndErrors written at 23:02:04

Many of our 'worklog' messages currently assume a lot of context

The primary way we keep track of things around here is our worklog system, which is a specific email log both of changes that we make and of how to do things (such as rebuilding systems). A while back, I wrote about how keeping your past checklists doesn't help unless you can find them. In the process of recovering the checklist from 2015 that I was looking for, I wound up re-reading a bunch of our worklog messages from around that time, which taught me a lesson.

What I learned from re-reading our old messages is that most of our worklog messages assume a lot of implicit context. This makes perfect sense, since we write and send in our messages when the context is fresh in everyone's minds; it is the water that we're currently swimming in, and we're generally as oblivious to it as fish are. But looking back a few years later, that context is all gone. When I re-read our worklog messages I had to carefully rebuild the context, which I fortunately could mostly do by reading enough messages from both worklog and our internal sysadmin mailing list (which is also archived, and on which we try to discuss most everything, or at least mail in summaries of in-person discussions).

I don't think that we want to try to put down that implicit context in every worklog email; that would be tedious both to write and to read. But for worklog messages which we expect to refer to much later, for example as 'how we did this last time' directions, putting in much more explicit context and explanations seems like a good idea. Of course this is a bit tricky to actually do in practice, for two reasons. The first is that what context needs to be explained for the future isn't necessarily clear to us, since we're immersed in it. The second is that it's not always clear which worklog messages we're going to want to refer back to. Some things we can predict, but others may look like one-off things until they come up again. Still, I can try, and I should, especially for big things that took a lot of planning.

(This is similar to the lessons I learned from some planned power shutdowns many years ago, in part 1 and then part 2. We've had some other power shutdowns since then, and the lessons have been useful and I've tried to carry them out.)

PS: Just in general it might not hurt to put a bit more context into my worklogs, if only in the form of a link to the first iteration of any particular thing, where I'm likely to have written out a bunch of the background. Especially if I put it in as a postscript, it's easy to skip most of the time.

sysadmin/WorklogsAssumeContext written at 00:17:19

