An irritating systemd behavior when you tell it to reboot the system

July 25, 2016

For reasons well beyond the scope of this entry, I don't use a graphical login program like gdm; I log in on the text console and start X by hand through xinit (which is sometimes annoying). When I want to log out, I cause the X server to exit and then log out of the text console as normal. Now, I don't know how gdm et al handle session cleanup, but for me this always leaves some processes lingering around that just haven't gotten the message to give up.

(Common offenders are kio_http_cache_cleaner and speech-dispatcher and its many friends. Speech-dispatcher is so irritating here that I actually chmod 700 the binary on my office and home machines.)

Usually the reason I'm logging out of my regular session is to reboot my machine, and this is where systemd gets irritating. Up through at least the Fedora 24 version of systemd, when it starts to reboot a machine and discovers lingering user processes still running, it will wait for them to exit. And wait. And wait more, for at least a minute and a half based on what I've seen printed. Only after a long timer expires will systemd send them various signals, ending in SIGKILL, and force them to exit.

(Based on reading manpages it seems that systemd sends user processes no signals at all at the start of a system shutdown. Instead it probably waits TimeoutStopSec, sends a SIGTERM, then waits TimeoutStopSec again before sending a SIGKILL. If you have a program that ignores everything short of SIGKILL, you're going to be waiting two timeout intervals here.)
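To make the 'two timeout intervals' arithmetic concrete, here is a minimal Python sketch of the sequence guessed at above: wait, send SIGTERM, wait again, send SIGKILL. It only models what the manpages seem to describe and is not systemd's actual logic. The names stop_with_escalation and spawn_stubborn_child are made up for illustration, and it assumes a POSIX system.

```python
import signal
import subprocess
import sys
import time

# A stand-in for a stubborn lingering process: it ignores SIGTERM and
# then just sits there. (Hypothetical test child, not a real daemon.)
STUBBORN_CHILD = (
    "import signal, time\n"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    "print('ready', flush=True)\n"
    "time.sleep(60)\n"
)


def spawn_stubborn_child():
    """Start the SIGTERM-ignoring child and wait until it is set up."""
    proc = subprocess.Popen(
        [sys.executable, "-c", STUBBORN_CHILD],
        stdout=subprocess.PIPE,
        text=True,
    )
    # The child prints 'ready' only after it has started ignoring SIGTERM.
    proc.stdout.readline()
    return proc


def stop_with_escalation(proc, timeout):
    """Wait, SIGTERM, wait, SIGKILL. Returns the seconds spent stopping."""
    start = time.monotonic()
    try:
        # First interval: hope the process exits on its own.
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.send_signal(signal.SIGTERM)
        try:
            # Second interval: give it a chance to honor SIGTERM.
            proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            # Out of patience; SIGKILL cannot be ignored.
            proc.kill()
            proc.wait()
    return time.monotonic() - start


if __name__ == "__main__":
    child = spawn_stubborn_child()
    spent = stop_with_escalation(child, timeout=1.0)
    child.stdout.close()
    print(f"exit status {child.returncode} after {spent:.1f}s")
```

With a process that ignores SIGTERM, the time spent is at least twice the timeout. Under this model, a 90 second timeout would mean roughly three minutes of waiting.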

At one level, this is not crazy behavior. Services like database engines may take some time to shut down cleanly, and you do want them to shut down cleanly if possible, so having a relatively generous timeout is okay (and the timeout can be customized). In fact, having a service have to be force-killed is (or should be) an exceptional thing and means that something has gone badly wrong. Services are supposed to have orderly shutdown procedures.

But all of that is for system services and doesn't hold for user session processes. For a start, user sessions generally don't have a 'stop' operation that gets run explicitly; the implicit operation is the SIGHUP that all the processes should have received as the user logged out. Next, user sessions are anarchic. They can contain anything, not just carefully set up daemons that are explicitly designed to shut themselves down on demand. In fact, lingering user processes are quite likely to be badly behaved. They're also generally considered clearly less important than system services, so there's no good reason to give them much grace period.

In theory systemd's behavior is perhaps justifiable. In practice, its generosity with user sessions simply serves to delay system reboots or shutdowns for irritatingly long amounts of time. This isn't a new issue with systemd (the Internet is full of complaints about it), but it's one that the systemd authors have let persist for years.

(I suspect the systemd authors probably feel that the existing ways to change this behavior away from the default are sufficient. My view is that defaults matter and should not be surprising.)

When I started writing this entry I expected it to just be a grump, but in fact it looks like you can probably fix this behavior. The default timeout for all user units can be set in /etc/systemd/user.conf with the DefaultTimeoutStopSec setting; set this down to less than 90 seconds and you'll get a much faster timeout. However I'm not sure if systemd will try to terminate a user scope other than during system shutdown, so it's possible that this setting will have other side effects. I'm tempted to try it anyways, just because it's so irritating when I slip up and forget to carefully kill all of my lingering session processes before running reboot.

Update: I'm wrong. Setting things in user.conf does nothing for the settings you get when you log in.

(You can also set KillUserProcesses in /etc/systemd/logind.conf, but that definitely will have side effects you probably don't want, even if some people are trying to deal with them anyways.)


Comments on this page:

By Miksa at 2016-07-26 05:16:19:

My first thought when reading this was, what happens after your day at work, when the cleaner arrives at your room and presses Ctrl-Alt-F1? How much access will they have to your computer? Will they be able to install malware in your home directory and modify your $PATH?

By cks at 2016-07-26 09:35:58:

In the old days, the risk was using Ctrl-Alt-F1 to get to the text console, ^Z'ing your X processes to get back to your shell, doing evil things, then fg'ing everything again so that you wouldn't notice. These days this doesn't work because the X server now takes over the console on that virtual terminal instead of starting itself on another one. The only way to get back to that text console shell is to kill the X server, at which point there's no way to make the attack transparent.

Could someone still get in? Yes, but they have physical access so I've lost that game already. Someone with physical access to my machine can boot it single-user or boot off alternate media. Locking a machine down against 'evil maid' level attacks is non-trivial, especially for desktop machines once you get actively paranoid.

(If someone attacks my machine this way, I can tell that something happened but if they know what they're doing I can't really tell it from 'my machine rebooted mysteriously in the middle of the night, oh well, that happens sometimes, maybe it was a power glitch'.)

By Alan at 2016-07-27 16:42:05:

This analysis looks novel to me. Thanks for writing it up.

It does feel like complainants will be pointed at KillUserProcesses.

I found a hole in the analysis. Consider the two complaints together. The analysis says that if you leave a tmux session open, it will cause this long shutdown delay. I found this surprising, so I tested it in a VM. It's not true. (Fedora 23, after installing the latest 300MB of uncached updates, as well as before. Starting tmux from a text virtual console and a GNOME Terminal respectively).

I can only think the tmux session is getting SIGTERM (or KILL). If programs involved in the delay do anything like what conky is accused of - ignoring SIGTERM in a race condition - it's harder to blame systemd people. (What do you want, a compat mode that matches historical sysvinit timings, or to send multiple SIGTERMs, when we're assuming software has ugly race conditions?) systemd-cgls still shows a detached tmux under my session scope, so it doesn't look like tmux is doing anything special (yet).

I think systemd is missing a trick though; it only shows the unit that's timing out, right? So in this case it just shows you the user scope. If systemd could highlight conky or whatever specifically, that could help everyone.

(I imagine the described behaviour e.g. of the KDE software would be undesirable for a public machine :p. Not very impressive, but as you say sometimes this stuff falls through ugly cracks).

By cks at 2016-07-27 18:08:37:

It turns out that systemd sends SIGTERM (and then SIGHUP) to lingering user processes at the start of shutdown. If they don't block SIGTERM (and I don't think things like tmux and screen do), they'll exit immediately or almost immediately and the shutdown will go on. Processes only stall shutdown if they ignore SIGTERM (and SIGHUP), which they really shouldn't. Unfortunately there are some very special programs out there that believe they are just that important.

(The Fedora version of kio_http_cache_cleaner is one of them; as you can see in /proc/<pid>/status if you carefully decode SigIgn, it ignores HUP, INT, QUIT, and TERM.)
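Decoding SigIgn by hand is tedious, so here is a small Python sketch that does it. It relies on the Linux convention that the mask is a hexadecimal bitmap in which bit N-1 stands for signal number N. The function names decode_sigmask and ignored_signals are made up for illustration, and the mask in the usage note below is a constructed example, not output captured from a real kio_http_cache_cleaner.

```python
import signal


def decode_sigmask(hex_mask):
    """Turn a hex signal mask from /proc/<pid>/status into signal names."""
    mask = int(hex_mask, 16)
    names = []
    for signum in range(1, mask.bit_length() + 1):
        # Bit (signum - 1) is set when signal number signum is in the mask.
        if mask & (1 << (signum - 1)):
            try:
                names.append(signal.Signals(signum).name)
            except ValueError:
                # Some numbers (e.g. most realtime signals) have no name
                # in Python's signal module.
                names.append(f"signal {signum}")
    return names


def ignored_signals(pid):
    """Return the names of the signals that process `pid` ignores (Linux only)."""
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            if line.startswith("SigIgn:"):
                return decode_sigmask(line.split()[1])
    return []


if __name__ == "__main__":
    # Show what this Python process itself is ignoring.
    print(ignored_signals("self"))
```

A process that ignores exactly HUP, INT, QUIT, and TERM would show a SigIgn of 0000000000004007, which decode_sigmask turns into SIGHUP, SIGINT, SIGQUIT, and SIGTERM.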

By Alan at 2016-08-07 05:18:58:

"systemd [231] will now log all service processes that it kills forcibly when they fail to shut down cleanly."

I don't know if it was designed for normal shutdowns (vs. /var/log becoming readonly) but it's a welcome improvement in troubleshooting.

Written on 25 July 2016.


Last modified: Mon Jul 25 23:35:47 2016