Some notes on getting email when your systemd timer services fail
Suppose, not hypothetically, that you have some things that are implemented through systemd timers instead of traditional cron.d jobs, and you would like to get email if and when they fail. The lack of this email by default is one of the known issues with turning cron.d entries into systemd timers, and people have already come up with ways to do this with systemd tricks, so for full details I will refer you to the Arch Wiki section on this (brought to my attention by keur's comment on my initial entry) and this serverfault question and its answers (via @tvannahl on Twitter). This entry is my additional notes from having set this up for our Certbot systemd timers.
Systemd timers come in two parts: a .timer unit that controls timing and a .service unit that does the work. What we generally really care about is the .service unit failing. To detect this and get email about it, we add an OnFailure= to the timer's .service unit that triggers a specific instance of a template .service that sends email. So if we have certbot.timer and certbot.service, we add a .conf file in /etc/systemd/system/certbot.service.d that contains, say:
[Unit]
OnFailure=cslab-status-email@%n.service
Due to the use of '%n', this is generic; the stanza will be the same for anything we want to trigger email from on failure. The '%n' will expand to the full name of the service, eg 'certbot.service', and be available in the cslab-status-email@.service template unit.
My view is that you should always use %n here even if you're only
doing this for one service, because it automatically gets the unit
name right for you (and why risk errors when you don't have to).
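Installing the drop-in can be sketched as a small shell function. This is a minimal illustration, not our actual tooling: the function name, the drop-in file name 'onfailure-email.conf', and the optional second argument (a root directory, there so you can try it against a scratch directory first) are all made up for the example.

```shell
#!/bin/sh
# Sketch: install an OnFailure= drop-in for one unit.
# add_onfailure_dropin, its arguments, and the file name
# onfailure-email.conf are hypothetical names for illustration.
add_onfailure_dropin() {
    unit="$1"                          # e.g. certbot.service
    root="${2:-/etc/systemd/system}"   # override to test in a scratch directory
    dir="$root/$unit.d"
    mkdir -p "$dir" || return 1
    # %n is written literally; systemd expands it to the unit's full name.
    printf '%s\n' '[Unit]' 'OnFailure=cslab-status-email@%n.service' \
        > "$dir/onfailure-email.conf"
}
```

After running something like 'add_onfailure_dropin certbot.service' as root, you would still need a 'systemctl daemon-reload' so that systemd notices the new drop-in.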
In the cslab-status-email@.service unit, the full name of the unit triggering it will be available as '%i', as shown in the Arch Wiki's example. Here that will be 'certbot.service'.
(With probably excessive cleverness you could encode the local address to email to into what the template service will get as %i by triggering, eg, cslab-status-email@root-%n.service. We just hard code 'root' all through.)
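For illustration, a template unit along the lines of the Arch Wiki's example might look like the following, here wrapped in a shell function that prints it. This is a sketch under assumptions: the function name and the script path /usr/local/sbin/cslab-status-email are placeholders, not what we actually use.

```shell
#!/bin/sh
# Sketch: print a minimal cslab-status-email@.service template unit.
# print_status_email_unit and the ExecStart path are hypothetical.
print_status_email_unit() {
    # The quoted 'EOF' keeps the shell from touching anything in the body;
    # systemd itself expands %i to the instance name (eg certbot.service).
    cat <<'EOF'
[Unit]
Description=Status email for %i

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/cslab-status-email %i
EOF
}
```

You would redirect the output to /etc/systemd/system/cslab-status-email@.service and then run 'systemctl daemon-reload'.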
The Arch Wiki's example script uses 'systemctl status --full <unit>'. Unfortunately this falls into the trap that by default 'systemctl status' shows only the most recent ten lines of log output. We found that we definitely wanted more; our script currently uses 'systemctl status --full -n 50 <unit>' (and also contains a warning postscript that it may be incomplete and to see journalctl on the system for full details). Having a large value here is harmless as far as I can tell, because systemd seems to only show the log output from the most recent activation attempt even if that is (much) less than your 50 lines or whatever.
(Unfortunately as far as I can see there is no easy way to get just the log output without the framing 'systemctl status' information about the unit, much of which is not particularly useful. We live with this.)
As with the Arch Wiki's example script, you definitely want to put the hostname into the email message if you have a fleet. We also embed more information into the Subject and From, and add MIME headers:
From: $HOSTNAME root <root@...>
Subject: $1 systemd unit failed on $HOSTNAME
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset=UTF-8
You definitely want to label the email as UTF-8, as 'systemctl status' puts a UTF-8 '●' in its output. The subject could be incorrect (we can't be sure the template unit was triggered through an 'OnFailure=', even if that's how it's supposed to be used), but it's much more useful in the case where everything is working as intended. My bias is towards putting as much context as possible into emails like this, because by the time we get one we'll have forgotten all about the issue and we don't want to be wondering why we got this weird email.
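Putting the headers and the longer status output together, the message-building part of such a script might be sketched like this. It is an illustration and not our real script: the function name, the MAIL_FROM variable (standing in for the From address, which I'm not spelling out here), the 'To: root' line, and the exact postscript wording are all assumptions.

```shell
#!/bin/sh
# Sketch: write a complete failure email for the unit named in $1 to stdout.
# build_failure_message and MAIL_FROM are hypothetical names.
build_failure_message() {
    unit="$1"
    host="${HOSTNAME:-$(hostname)}"    # not every shell sets HOSTNAME
    from_addr="${MAIL_FROM:-root}"     # placeholder sender address
    # Headers first, ended by a blank line.
    cat <<EOF
From: $host root <$from_addr>
To: root
Subject: $unit systemd unit failed on $host
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset=UTF-8

EOF
    # Up to 50 journal lines instead of the default ten.
    systemctl status --full -n 50 "$unit"
    printf '\n%s\n' "(This may be incomplete; see journalctl on $host for full details.)"
}
```

The output is meant to be piped to your mail submission program, for example 'sendmail -t', which takes the recipient from the To: header.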
The Arch Wiki contains a nice little warning about how systemd may wind up killing child processes that the mail submission program creates (as noticed by @lathiat on Twitter). I decided that the easiest way for our script to ward off this was to just sleep for 10 or 15 seconds at the end. Having it exit immediately is not exactly critical and this is the easy (if brute force) way to hopefully work around any problems.
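The brute force approach amounts to something like the following sketch. The function name and the SENDMAIL and LINGER_SECONDS variables are hypothetical knobs I've added for the example; the only real point is the sleep after the mail has been handed off.

```shell
#!/bin/sh
# Sketch: submit a message read from stdin, then linger before exiting
# so systemd doesn't kill whatever the mailer left running.
# send_and_linger, SENDMAIL and LINGER_SECONDS are hypothetical names.
send_and_linger() {
    # -t tells sendmail to take the recipients from the message headers.
    "${SENDMAIL:-/usr/sbin/sendmail}" -t
    status=$?
    # Brute force: give any child processes time to finish.
    sleep "${LINGER_SECONDS:-15}"
    return "$status"
}
```

This would sit at the very end of the script, with the finished message piped into it.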
Finally, as the Arch Wiki kind of notes, this is not quite the same thing as what cron does. Cron will send you email if your job produces any output, whether or not it fails; this will send you the logged output (if any) if the job fails. If the job succeeds but produces output, that output will go only to the systemd journal and you will get no notification. As far as I know there's no good way to completely duplicate cron's behavior here.
(Also, on failure the journal messages you get will include both actual stuff printed by the service and also, I believe, anything it logged to places like syslog; with cron you only get the former. This is probably a useful feature.)