Some notes on getting email when your systemd timer services fail

November 7, 2019

Suppose, not hypothetically, that you have some things that are implemented through systemd timers instead of traditional cron.d jobs, and you would like to get email if and when they fail. The lack of this email by default is one of the known issues with turning cron.d entries into systemd timers. People have already come up with ways to do this with systemd tricks, so for full details I will refer you to the Arch Wiki section on this (brought to my attention by keur's comment on my initial entry) and this serverfault question and its answers (via @tvannahl on Twitter). This entry is my additional notes from having set this up for our Certbot systemd timers.

Systemd timers come in two parts: a .timer unit that controls timing and a .service unit that does the work. What we generally really care about is the .service unit failing. To detect this and get email about it, we add an OnFailure= to the timer's .service unit that triggers a specific instance of a template .service that sends email. So if we have certbot.timer and certbot.service, we add a .conf file in /etc/systemd/system/certbot.service.d that contains, say:

[Unit]
OnFailure=cslab-status-email@%n.service

Due to the use of '%n', this is generic; the stanza will be the same for anything we want to trigger email from on failure. The '%n' will expand to the full name of the service, e.g. 'certbot.service', which becomes the instance name of the cslab-status-email@.service template unit. My view is that you should always use %n here even if you're only doing this for one service, because it automatically gets the unit name right for you (and why risk errors when you don't have to). Inside the cslab-status-email@.service unit, that instance name (here 'certbot.service') is available as '%i', as shown in the Arch Wiki's example.
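As a minimal sketch of setting this up from a script, assuming the drop-in lives under /etc/systemd/system and is called onfailure.conf (the file name and the function are made up for illustration; any name ending in .conf should do):

```shell
#!/bin/sh
# install_onfailure_dropin UNIT [ROOT]
#   UNIT  the service to watch, e.g. certbot.service
#   ROOT  hypothetical prefix for trying this out somewhere other than
#         the real /etc; leave it empty on a live system
install_onfailure_dropin() {
    dropin_dir="${2:-}/etc/systemd/system/$1.d"
    mkdir -p "$dropin_dir" || return 1
    # The quoted 'EOF' stops the shell from touching %n; systemd is the
    # one that expands it to the full unit name.
    cat > "$dropin_dir/onfailure.conf" <<'EOF'
[Unit]
OnFailure=cslab-status-email@%n.service
EOF
}
```

On a real machine you would follow 'install_onfailure_dropin certbot.service' with 'systemctl daemon-reload' so that systemd notices the new drop-in.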

(With probably excessive cleverness you could encode the local address to email to into what the template service will get as %i by triggering, eg, cslab-status-email@root-%n.service. We just hard code 'root' all through.)
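If you did want to try the clever version, the email script would have to take the instance name apart again. Here is a sketch of one way to do that, assuming the script is handed the raw '%i' value as its argument and that the local address itself contains no '-' (the function name is hypothetical, and this is not what we run):

```shell
#!/bin/sh
# split_instance INSTANCE
#   Prints "ADDRESS UNIT" for an instance name such as
#   root-certbot.service. Fails if there is no '-' to split on.
split_instance() {
    case "$1" in
        *-*) ;;
        *) return 1 ;;
    esac
    # ${1%%-*} drops everything from the first '-' onward (the address);
    # ${1#*-} drops everything up to and including the first '-' (the
    # unit), so unit names with their own dashes survive intact.
    printf '%s %s\n' "${1%%-*}" "${1#*-}"
}
```

Splitting on the first dash means that something like 'root-systemd-tmpfiles-clean.service' still comes out as 'root' plus the full unit name.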

The Arch Wiki's example script uses 'systemctl status --full <unit>'. Unfortunately this falls into the trap that by default systemd shows only the most recent ten lines of log output. We found that we definitely wanted more; our script currently uses 'systemctl status --full -n 50 <unit>' (and also contains a warning postscript that it may be incomplete and to see journalctl on the system for full details). Having a large value here is harmless as far as I can tell, because systemd seems to only show the log output from the most recent activation attempt even if that's (much) fewer than your 50 lines or whatever.

(Unfortunately as far as I can see there is no easy way to get just the log output without the framing 'systemctl status' information about the unit, much of which is not particularly useful. We live with this.)
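The body-generating part of such a script might look roughly like the following. This is a sketch and not our actual script; the function name, the exact wording of the postscript, and the use of '--no-pager' are my assumptions here:

```shell
#!/bin/sh
# status_report UNIT [LINES]
#   Prints 'systemctl status' output for UNIT with up to LINES log lines
#   (default 50), followed by a warning postscript.
status_report() {
    # 'systemctl status' exits non-zero for a failed unit, which is
    # exactly the case we are reporting on, so ignore its exit status.
    systemctl status --full --no-pager -n "${2:-50}" "$1" 2>&1 || true
    printf '\n%s\n' "(This may be incomplete; see 'journalctl -u $1' on the system for full details.)"
}
```

The default of 50 is just the value mentioned above; pass a second argument if you want more or fewer lines.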

As with the Arch Wiki's example script, you definitely want to put the hostname into the email message if you have a fleet. We also embed more information into the Subject and From, and add a MIME-Version:

From: $HOSTNAME root <root@...>
Subject: $1 systemd unit failed on $HOSTNAME
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset=UTF-8

You definitely want to label the email as UTF-8, as 'systemctl status' puts a UTF-8 '●' in its output. The subject could be incorrect (we can't be sure the template unit was triggered through an 'OnFailure=', even if that's how it's supposed to be used), but it's much more useful in the case where everything is working as intended. My bias is towards putting as much context as possible into emails like this, because by the time we get one we'll have forgotten all about the issue and we don't want to be wondering why we got this weird email.
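A sketch of generating these headers in a shell script, with the unit name as the first argument. The To: line, the function name, and passing the sender address in as a parameter are assumptions for illustration, and the fallback to 'uname -n' is there because plain sh does not necessarily set $HOSTNAME the way bash does:

```shell
#!/bin/sh
# mail_headers UNIT TO_ADDR FROM_ADDR
#   Prints the message headers plus the blank line that ends them.
mail_headers() {
    host="${HOSTNAME:-$(uname -n)}"
    cat <<EOF
To: $2
From: $host root <$3>
Subject: $1 systemd unit failed on $host
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset=UTF-8

EOF
}
```

The 'systemctl status' output then goes straight after this as the message body.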

The Arch Wiki contains a nice little warning about how systemd may wind up killing child processes that the mail submission program creates (as noticed by @lathiat on Twitter). I decided that the easiest way for our script to ward this off was to just sleep for 10 or 15 seconds at the end. Having it exit immediately is not exactly critical and this is the easy (if brute force) way to hopefully work around any problems.
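In script form, the brute force approach is about as simple as it sounds. This sketch assumes the complete message (headers and body) arrives on standard input and that your mailer accepts the traditional 'sendmail -t' interface; the MAILER variable and the function name are hypothetical:

```shell
#!/bin/sh
# send_and_linger [SECONDS]
#   Feeds stdin to the mailer, then waits (default 15 seconds) so that
#   anything the mailer left running gets a chance to finish before the
#   script exits and systemd cleans up.
send_and_linger() {
    "${MAILER:-/usr/sbin/sendmail}" -t
    rc=$?
    sleep "${1:-15}"
    # Report the mailer's exit status, not sleep's.
    return "$rc"
}
```

The sleep happens whether or not the mailer succeeded, since a failure here is not something the script can do much about anyway.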

Finally, as the Arch Wiki kind of notes, this is not quite the same thing as what cron does. Cron will send you email if your job produces any output, whether or not it fails; this will send you the logged output (if any) if the job fails. If the job succeeds but produces output, that output will go only to the systemd journal and you will get no notification. As far as I know there's no good way to completely duplicate cron's behavior here.

(Also, on failure the journal messages you get will include both actual stuff printed by the service and also, I believe, anything it logged to places like syslog; with cron you only get the former. This is probably a useful feature.)

Written on 07 November 2019.