A systemd mistake with a script-based service unit I recently made

November 10, 2017

I tweeted:

That sure was a bunch of debugging because I forgot that my systemd .service file that runs scripts needed

Type=oneshot
RemainAfterExit=True

(... or it'd apparently run the ExecStop script right after the ExecStart script, which doesn't work too well.)

Let's be specific here. This was the systemd .service unit to bring up my WireGuard tunnel on my work machine, which I set up to run a 'startup' script (via ExecStart=). Because I had a 'stop' script sitting around, I also set the unit's ExecStop= to point to that; the 'stop' script takes the device down and so on.

The startup script worked when I ran it by hand, but when I set up the .service unit to start WireGuard on boot, it didn't. Specifically, although journalctl reported no errors, the WireGuard tunnel network device and its associated routes just weren't there when the system finished booting. At first I thought the script was failing in a way that the systemd journal wasn't capturing, so I stuck a bunch of debugging in (capturing all output from the script in a file, and then running with 'set -x', and finally dumping out various pieces of network state after the script had finished).

All of this debugging convinced me that the WireGuard tunnel was being created during boot but then getting destroyed by the time booting finished. I flailed around for a while theorizing that this service or that service was destroying the WireGuard device when it was starting (and altering my .service to start after a steadily increasing number of other things), but nothing fixed the issue. Then, while I was staring at my .service file, the penny dropped and I actually read what was in front of my eyes:

[Service]
WorkingDirectory=/var/local/wireguard
ExecStart=/var/local/wireguard/startup
ExecStop=/var/local/wireguard/stop
Environment=LANG=C

This .service file had started out life as one that I'd copied from another .service file of mine. However, that .service file was for a daemon, where the ExecStart= was a process that was sticking around. I was running a script, and the script was exiting, which meant that as far as systemd was concerned the service was going down and it should immediately run the ExecStop script. My 'stop' script deleted the WireGuard tunnel network device, which explained why I found the device missing after booting had finished.
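
For illustration, here is a sketch of what the corrected [Service] section would look like, combining the two settings from the tweet with the unit shown above. The final unit isn't reproduced in this entry, so the exact ordering and the comments are assumptions:

```ini
[Service]
# The startup script runs to completion and exits; it is not a daemon.
Type=oneshot
# Keep the unit considered active after the script exits, so that
# ExecStop= is held back until the unit is actually stopped.
RemainAfterExit=True
WorkingDirectory=/var/local/wireguard
ExecStart=/var/local/wireguard/startup
ExecStop=/var/local/wireguard/stop
Environment=LANG=C
```

With these two settings, 'systemctl status' should report the unit as active (exited) once the startup script has finished, instead of 'inactive (dead)'.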

The journalctl output won't tell you this; it reports only that the service started and doesn't mention that it has stopped again or that the ExecStop script was run. If I'd looked at 'systemctl status ...' and paid attention, I'd at least have had a clue, because systemd would have told me that it thought that the service was 'inactive (dead)' instead of running. If I'd had both scripts explicitly log that they were running, I would have seen in the logs that my 'stop' script was being executed for some reason; I probably should add this.

This has been a pretty useful learning experience. I know, that probably sounds weird, but my view is that I'd rather make these mistakes and learn these lessons in a non-urgent, non-production situation instead of stubbing my toes on them in production and possibly under stressful conditions.


Comments on this page:

By ano at 2018-03-31 20:37:06:

A bit late to the party, but had also recently had to familiarize myself with this thing. AFAIU the problem is not the ExecStop, it's Type=oneshot. ExecStop will be called upon systemctl stop service, not on its own, and oneshot means and I quote "it is expected that the process has to exit before systemd starts follow-up units" and I think it implies that systemd will kill it unless it dies (with -9 fire, not your nice ExecStop command). RemainAfterExit doesn't tell systemd that the process shall remain started, but that the service shall remain active, which is not at all the same.

Just in case someone will find it helpful. And if one does, please read carefully https://www.freedesktop.org/software/systemd/man/systemd.service.html

By cks at 2018-10-10 01:20:42:

Very belatedly: I dug into this whole area as a result of ano's comment (very shortly after it was made), and wrote up what I found out in this entry. It turns out to be more complicated than I was expecting and (of course) not clearly documented.

Written on 10 November 2017.

Last modified: Fri Nov 10 01:44:04 2017