Why a (Linux) service delaying its shutdown is a bad thing

March 30, 2022

Over on Twitter, I said something:

Every Linux daemon that refuses to stop during a reboot for "good reasons" needs to understand that it's delaying the system's return to service by a minute and a half (systemd's timeout) or more. When that's my desktop, I get quite angry with the daemon. Hi, PackageKit.

When I type 'reboot' (or invoke the equivalent in a GUI), my machine immediately goes out of service. My desktop session ends if there is one, I and everyone else on the machine get logged off, daemons start dying left and right, et cetera. The machine will only come back into service when it completes both the shutdown and the boot that follows it. One part of the delay is how fast the machine boots. The other part is how fast the machine shuts down. People pay a lot of attention to how long it takes to boot a system. They pay much less attention to how long it takes to shut one down, despite this often being a good portion of the practical return to service time.

One of the things that goes wrong in shutdown on systemd-based systems is when some daemon (more generally, some service or even a session) refuses to shut down immediately. On these systems, things that don't shut down trigger what is by default a 90 second timeout (this is system.conf's DefaultTimeoutStopSec). As covered in TimeoutStopSec, systemd will wait this long before forcefully killing the service's processes and letting the reboot continue. In other words, the reboot takes at least an extra minute and a half, so your machine is out of service for at least an extra minute and a half.
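To make the stop timeout concrete, here is a minimal Python sketch of the general "ask politely, wait, then kill" pattern. It is not systemd's actual implementation; the names stop_with_timeout, demo and STUBBORN are made up for illustration, and the "daemon" is a stand-in child process that deliberately ignores SIGTERM.

```python
import signal
import subprocess
import sys
import time


def stop_with_timeout(proc, timeout_sec):
    """Send SIGTERM to proc; if it is still alive after timeout_sec, SIGKILL it.

    Returns (killed, seconds_waited).
    """
    start = time.monotonic()
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=timeout_sec)
        killed = False
    except subprocess.TimeoutExpired:
        # The process did not stop in time, so stop asking and force it.
        proc.kill()
        proc.wait()
        killed = True
    return killed, time.monotonic() - start


# Hypothetical badly behaved "daemon": it ignores SIGTERM and keeps running.
STUBBORN = (
    "import signal, time\n"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    "print('ready', flush=True)\n"
    "time.sleep(60)\n"
)


def demo(timeout_sec=1.0):
    """Start the stubborn child, then stop it with the given timeout."""
    proc = subprocess.Popen([sys.executable, "-c", STUBBORN],
                            stdout=subprocess.PIPE)
    proc.stdout.readline()  # wait until the child is ignoring SIGTERM
    result = stop_with_timeout(proc, timeout_sec)
    proc.stdout.close()
    return result
```

Calling demo(1.0) should report that the child had to be killed, and that the caller spent the whole timeout waiting for it. Scale that one second up to 90 and you have the reboot delay.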

At one level this is not really systemd's fault. Systemd is not causing the service to be slow to stop; instead, systemd is unusual among init systems in that it actually checks to see if the service really has stopped. In the old System V style Linux init system, init ran each /etc/init.d/<whatever> script with a 'stop' argument and assumed that when the script exited, the service had shut down. If this wasn't the case, init mostly didn't stop to notice; when the system was actually rebooted, those remaining processes generally got terminated very abruptly by the kernel. People did notice and complain about init scripts that had slow 'stop' actions (and so those mostly got fixed), but they didn't notice lingering processes.

(When you tell the Linux kernel to reboot, it takes you at your word.)

Services that take more than a few seconds to shut themselves down, especially in ordinary operation, have a bug. One reason this is a bug is that there's absolutely no guarantee that you have very much time before the system as a whole goes down, for example because the UPS battery power is about to run out. Plus, the systemd timeout can be set much lower (and some people do this), so your processes can be abruptly terminated after a short time even in ordinary circumstances. And slow service shutdowns delay the system's return to service (and leave people drumming their fingers, unable to do anything with their machine because it has functionally hung).
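As an illustration of what stopping promptly can look like, here is a minimal Python sketch of a service loop that reacts to a stop request right away instead of finishing a long sleep first. The names (stop_requested, request_stop, install_handlers, run_service) and the 30 second poll interval are assumptions made up for this example, not anything from a real daemon.

```python
import signal
import threading

# Set once the service has been asked to stop.
stop_requested = threading.Event()


def request_stop(signum, frame):
    """Signal handler: only record the request; the main loop does the rest."""
    stop_requested.set()


def install_handlers():
    """Route SIGTERM (what systemd sends by default) and SIGINT to request_stop.

    This must be called from the main thread.
    """
    signal.signal(signal.SIGTERM, request_stop)
    signal.signal(signal.SIGINT, request_stop)


def run_service(do_work, poll_interval=30.0, cleanup=None):
    """Call do_work() every poll_interval seconds until a stop is requested."""
    while not stop_requested.is_set():
        do_work()
        # Unlike time.sleep(), this wait ends as soon as the event is set,
        # so a stop request does not have to sit out the rest of the interval.
        stop_requested.wait(poll_interval)
    if cleanup is not None:
        cleanup()  # keep this short: flush and close, don't start new work
```

A real service would call install_handlers() once at startup and then run_service() with its own work function. The important design choice is that the idle wait is interruptible and the cleanup is small; if do_work() itself can run for a long time, it needs to check stop_requested as well.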

(I've written about this general shutdown delay issue before in SystemdRebootIrritation, but that wasn't focused on badly behaved services.)

Written on 30 March 2022.

Last modified: Wed Mar 30 22:30:56 2022