Sorting out my systemd mistake with a script-based service unit

April 3, 2018

Back in November I wrote about a systemd mistake I made with a script-based service unit, where I left out some service options and got a surprise when my service didn't work. A commentator recently made me realize that I didn't really understand what was going on and what had happened; instead I was working by superstition. So I've now done some experiments and read the systemd.service manpage again, and here's what I know.

The basic situation was that I wrote a .service file that had just this, where ExecStart and ExecStop are scripts that just run briefly and then exit:

[Service]
WorkingDirectory=/var/local/wireguard
ExecStart=/var/local/wireguard/startup
ExecStop=/var/local/wireguard/stop
Environment=LANG=C

(In this situation, systemd's defaults are that there is an implicit Type=simple and the default RemainAfterExit=no.)
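
As a sketch, the unit above behaves as if it had been written with those defaults spelled out. The two added lines are only there for illustration and were not in the original file; everything else is copied from it.

```ini
[Service]
# Implicit defaults, written out for illustration only.
# Type=simple is what systemd assumes when ExecStart= is set
# and neither Type= nor BusName= is given.
Type=simple
RemainAfterExit=no
WorkingDirectory=/var/local/wireguard
ExecStart=/var/local/wireguard/startup
ExecStop=/var/local/wireguard/stop
Environment=LANG=C
```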

If you don't have RemainAfterExit and your ExecStart exits with status 0, your service becomes inactive (as opposed to failing to start). If you have an ExecStop, systemd will then run it, even though you haven't explicitly asked for a 'stop' operation; in my situation this mysteriously reversed the effects of my start script. That this happens is unfortunately not clearly documented anywhere that I could see, although it makes a certain amount of sense if you consider ExecStop to be for cleanup actions, where you often want the cleanup actions to happen if the service started successfully and then stops, regardless of just why the service stopped.

(Looking through the stock Fedora 27 systemd .service units, quite a lot of the ExecStop actions appear to be this sort of cleanup, not 'signal the service to shut down' actions.)

It's easy to see this with a test service that just runs some scripts. You'll get output from 'systemctl status yourtest.service' that looks like this:

    Active: inactive (dead) since Mon 2018-04-02 15:37:19 EDT; 41s ago
   Process: 973 ExecStop=/root/stop-script (code=exited, status=0/SUCCESS)
   Process: 964 ExecStart=/root/start-script (code=exited, status=0/SUCCESS)
  Main PID: 964 (code=exited, status=0/SUCCESS)

The ExecStart script ran, was considered the main PID, exited with status 0, and then shortly afterward the ExecStop script was run. We can tell it was shortly afterward because the stop script's PID is only a bit higher, and the start script itself ran a couple of commands that account for the PIDs in between.
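
A minimal test unit that should reproduce this looks roughly like the following. The script paths are taken from the status output above; the unit's file name and location are assumptions, and the scripts themselves can be anything that runs a command or two and exits with status 0.

```ini
# Assumed location: /etc/systemd/system/yourtest.service
[Service]
# No Type= and no RemainAfterExit=, so the defaults
# (Type=simple, RemainAfterExit=no) apply.
# Both scripts run briefly and then exit 0.
ExecStart=/root/start-script
ExecStop=/root/stop-script
```

After a 'systemctl daemon-reload' and 'systemctl start yourtest.service', 'systemctl status yourtest.service' should show both processes as exited, with the ExecStop one listed as well even though nothing asked for a stop.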

Contrary to what I thought in my first entry, Type=oneshot doesn't affect this as such. As my commentator noted, what using Type=oneshot instead of Type=simple really affects is when other units get started. If you have test.service with the implicit Type=simple, and another service says 'After=test.service', your other service will get started the moment that systemd has started running test.service's ExecStart. This is often not what you want; instead you want things that are ordered after test.service to start only when its ExecStart has finished preparing things and exited. That's what Type=oneshot enforces, by making it so that test.service is only considered 'started' when your ExecStart program or script exits. Systemd does more or less document this, at the end of the description of Type=:

If set to simple [...], as systemd will immediately proceed starting follow-up units.

[...]

Behavior of oneshot is similar to simple; however, it is expected that the process has to exit before systemd starts follow-up units. [...]

(This is not particularly clearly written, unfortunately. Energetic people can propose a documentation patch in the master repo.)
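
To make the ordering difference concrete, here is a sketch of the two units involved. The name other.service and its ExecStart path are hypothetical, and test.service is assumed to reuse the start script from the earlier test.

```ini
# test.service
[Service]
# With Type=oneshot, systemd waits for this script to exit
# before it considers test.service started. With the implicit
# Type=simple, it would consider the service started as soon
# as the script had been launched.
Type=oneshot
ExecStart=/root/start-script

# other.service (hypothetical)
[Unit]
# Ordering only: if both units are being started, start this
# one after test.service counts as started.
After=test.service

[Service]
# Hypothetical program that needs whatever start-script sets up.
ExecStart=/usr/local/bin/other-daemon
```

These are two separate unit files shown in one block for brevity.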

As the documentation notes, with Type=oneshot you will almost always want RemainAfterExit=yes as well, because otherwise the service won't be considered active once its ExecStart has exited. Certainly things will be much less confusing if you use it, because then all of the units involved will stay 'active' and you won't ever have the experience of wondering why something is up and running despite a dependency having failed.

(After= doesn't actually create a dependency, of course, just an ordering. But that's another entry.)
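
Putting this together, a version of the original unit that avoids the surprise might look like the sketch below. It assumes that the start and stop scripts are unchanged and that nothing else in the setup depends on the service going inactive.

```ini
[Service]
# Follow-up units wait until the startup script has exited.
Type=oneshot
# Stay 'active' after the script exits, so ExecStop is run
# only on an explicit stop (or at shutdown), not immediately.
RemainAfterExit=yes
WorkingDirectory=/var/local/wireguard
ExecStart=/var/local/wireguard/startup
ExecStop=/var/local/wireguard/stop
Environment=LANG=C
```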
