Systemd, NFS mounts, and shutting down your system

September 7, 2017

After writing about our systemd reboot problem, I decided that I was irritated enough to spend part of today trying to dig into the situation (partly because of all of the comments and reactions to my entry, since they raised good questions and suggestions). I don't have any definite answers, partly because it's quite hard to see the state of the system when this is happening, but I do have some observations and notes (and some potentially useful contributions from other people on Twitter).

Here is what I've seen during shutdowns:

Under some circumstances, systemd will fail to unmount a NFS filesystem because processes are holding it busy but will go on to take down networking.

This is a dangerous situation to wind up in. With networking down, any attempt by a process to do anything on the filesystem will almost certainly give you an unrecoverably hung process; it's waiting for NFS IO to complete, but NFS IO requires a network to talk to the server, and the network isn't there and isn't coming back. Unfortunately it's disturbingly easy to wind up in this situation, thanks to our friend cron and other similar things.

Systemd appears to terminate processes in user slices reasonably early in the shutdown process, definitely before it starts trying to unmount NFS filesystems. However, as we've seen, not all 'user' processes are under user slices; some of them are hanging out in places like cron.service and apache2.service. Now, you might think that cron jobs and CGI processes and so on should be killed when systemd shuts down cron and Apache (which it normally does before unmounting NFS filesystems), but unfortunately both cron and Apache are set to KillMode=process, where systemd only terminates the main process when it stops the service. So all of your cron jobs, CGI processes, and so on linger on until systemd gets around to killing them much later (I believe as part of running systemd-shutdown but I'm not sure).

(You can have this issue with any systemd service or daemon that starts multiple processes but uses KillMode=process. I believe that all System V init scripts handled through systemd's backwards compatibility implicitly run in this mode; certainly Ubuntu 16.04's /etc/init.d/apache2 does.)
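(If you want to check which of your services are affected, one rough way is to ask systemd for each unit's effective KillMode. What follows is a minimal sketch, not something from our actual setup: it assumes `systemctl show --property=KillMode` is available, and the helper names `parse_show_output` and `kill_mode` are made up for illustration.)

```python
import subprocess


def parse_show_output(text):
    """Turn the 'Key=Value' lines printed by `systemctl show` into a dict."""
    props = {}
    for line in text.splitlines():
        # Split on the first '=' only, since values may contain '=' too.
        key, sep, value = line.partition("=")
        if sep:
            props[key] = value
    return props


def kill_mode(unit):
    """Ask systemd for a unit's effective KillMode (requires systemctl)."""
    result = subprocess.run(
        ["systemctl", "show", "--property=KillMode", unit],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_show_output(result.stdout).get("KillMode")
```

Calling `kill_mode("cron.service")` on a machine set up as described above should report `process`; anything other than `control-group` or `mixed` means child processes can outlive the service being stopped.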

As it happens, our user-managed web servers mostly get run from cron with @reboot entries (as the simple way to start on system boot). I suspect that it's not a coincidence that our web server almost always experiences a hang during reboots. We have another server that often experiences reboot hangs, and there people use at and atd.

(The mere presence of lingering processes doesn't doom you, because they might not try to do any (NFS) IO when systemd sends them a SIGTERM. However, any number of things may react to SIGTERM by trying to do cleanups, for example by writing out a database or a log record, and if they are running from a NFS filesystem that is now cut off from the network, this is a problem.)

All of this description sounds very neat and pat, but it's clearly not the full story. I can consistently create cut-off NFS mounts with not yet killed processes holding them busy, but I can't consistently reproduce a shutdown hang (although I've got some more ideas to try). This gets me around to the things that don't work and one thing that might.

In comments, Alan noted that the stock systemd poweroff and reboot targets both have 30 minute timeouts, after which they force a poweroff or a reboot. Unfortunately these timeouts don't seem to be triggering in my tests, for whatever reason; I left a hung VM sitting there for well over half an hour at one point with its timeout clearly not triggering.

On Twitter, Mike Kazentsev mentioned that system.conf has a ShutdownWatchdogSec option to use a hardware watchdog to force a reboot if the system becomes sufficiently unresponsive. Unfortunately this watchdog doesn't limit the total reboot time, because systemd-shutdown pings it every time it loops around trying to unmount filesystems and turn off swap space and so on. As long as systemd-shutdown thinks it's making some progress, the watchdog won't fire. Setting the watchdog low will protect you against systemd-shutdown hanging, though, and that may be worthwhile.

(Also, as I found out last year and then forgot until I painfully rediscovered it today, you can't easily reduce the timeout on user slices so that lingering processes in user slices are terminated faster on shutdown. This means that on many machines, you can be stuck with a more than 90 second shutdown time in general.)

Sidebar: The obvious brute force solution for us

As far as we know, our problems come from processes run by actual real people, not from system daemons that are lingering around. These users exist in a defined UID range, so it wouldn't be particularly difficult to write a program that scanned /proc for not-yet-killed user processes and tried to terminate them all. We could try to be creative about the ordering of this program during shutdown (so it ran after systemd had already shut down as many user scopes and slices as possible), or just run it based on convenient dependencies and accept that it would kill processes that systemd would clean up on its own.
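A minimal sketch of such a program might look like the following. This is an illustration rather than something we run: the function names are invented, the UID range in the usage note is a placeholder, and it assumes the Linux /proc layout where the `Uid:` line of /proc/<pid>/status lists the real UID first. Reading /proc doesn't touch NFS, so the scan itself shouldn't hang on a cut-off mount.

```python
import os
import signal


def find_user_pids(uid_min, uid_max, proc_root="/proc"):
    """Return sorted PIDs whose real UID lies in [uid_min, uid_max]."""
    pids = []
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue  # skip non-process entries such as 'self' or 'meminfo'
        try:
            with open(os.path.join(proc_root, name, "status")) as status:
                for line in status:
                    if line.startswith("Uid:"):
                        # Fields are: real, effective, saved, filesystem UID.
                        real_uid = int(line.split()[1])
                        if uid_min <= real_uid <= uid_max:
                            pids.append(int(name))
                        break
        except OSError:
            continue  # the process exited while we were looking at it
    return sorted(pids)


def terminate_all(pids, sig=signal.SIGTERM):
    """Send sig to every PID, ignoring ones that have already exited."""
    for pid in pids:
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            pass
```

Usage would be something like `terminate_all(find_user_pids(1000, 60000))`, probably followed by a short wait and a second pass with `signal.SIGKILL` for anything that ignored the SIGTERM.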

Comments on this page:

From at 2017-09-07 01:58:25:

There have been a few recent commits which may be relevant (though not tested personally as I don't use NFS much): (don't remount network filesystems after bringing down network) (add support for umount -l) (avoid touching hung filesystems when unmounting) (as above)

By jinks at 2017-09-07 02:26:51:

This may be naïve and probably ignores a ton of real-world scenarios, but what about running some equivalent of fuser -mk <mountpoint> before trying to unmount any specific filesystem?
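A rough sketch of the "who is holding this mount busy" half of that idea, for illustration only: it is not how fuser itself works. It only compares the symlink targets of each process's cwd, exe, and open file descriptors against the mount point path, which avoids doing any IO on the (possibly hung) filesystem but misses things like memory-mapped files. The function name is made up.

```python
import os


def pids_using_mount(mountpoint, proc_root="/proc"):
    """Return sorted PIDs whose cwd, exe, or open fds point under mountpoint."""
    mountpoint = mountpoint.rstrip("/") or "/"
    prefix = mountpoint if mountpoint.endswith("/") else mountpoint + "/"
    found = set()
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue
        base = os.path.join(proc_root, name)
        links = [os.path.join(base, "cwd"), os.path.join(base, "exe")]
        fd_dir = os.path.join(base, "fd")
        try:
            links += [os.path.join(fd_dir, fd) for fd in os.listdir(fd_dir)]
        except OSError:
            pass  # no permission, or the process has gone away
        for link in links:
            try:
                # readlink only asks /proc for a path; it does not stat the target.
                target = os.readlink(link)
            except OSError:
                continue
            if target == mountpoint or target.startswith(prefix):
                found.add(int(name))
                break
    return sorted(found)
```

The resulting PIDs could then be sent SIGTERM or SIGKILL with `os.kill()` before the unmount is attempted.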

systemd-shutdown is definitely what ends up killing the lingering escapee processes.

systemd-shutdown isn't subject to the 30 minute timeout. That only applies to stopping/starting systemd units. The last systemd unit just sends a message to systemd, which causes it to exec() systemd-shutdown. (For readers not familiar, exec() transmogrifies the running program into another one. So PID 1 becomes systemd-shutdown).

(AIUI systemd-shutdown doesn't have an overall timeout. Instead, it basically gives up once it's done SIGKILL and unmount() everywhere it can, and it's not making any progress. SIGKILL is instantaneous, but unmount potentially wants a timeout, which is what the PR I linked to fixes AIUI. Of course if you time out an unmount, you're potentially dropping unsynced data, so you're in the land of tradeoffs that you really want to avoid in the first place.)

I would guess the apache unit is only expected to cause problems (with the KillMode=process horror show) when you have a CGI calling system() or something.

I guess a workaround would be to write a service override file to set KillMode=control-group. AFAIK it would make sense on most services... (the main exception being that when you restart the SSH daemon you might not appreciate your session being killed).
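As a sketch of what such an override could look like, the snippet below writes a drop-in file for a unit. It assumes the usual systemd drop-in location of /etc/systemd/system/<unit>.d/, and uses `control-group`, which is systemd's name for the kill-the-whole-cgroup mode. The function name and the `killmode.conf` file name are arbitrary choices for illustration.

```python
import os


def write_killmode_dropin(unit, kill_mode="control-group",
                          etc_root="/etc/systemd/system"):
    """Write a drop-in that overrides KillMode for unit; return its path."""
    dropin_dir = os.path.join(etc_root, unit + ".d")
    os.makedirs(dropin_dir, exist_ok=True)
    path = os.path.join(dropin_dir, "killmode.conf")
    with open(path, "w") as conf:
        # A drop-in only needs the section and the setting being overridden.
        conf.write("[Service]\nKillMode=%s\n" % kill_mode)
    return path
```

After something like `write_killmode_dropin("cron.service")`, systemd would need a `systemctl daemon-reload` before it notices the change.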

User scopes being shut down early sounds right. I can't see any Before= dependencies if I look at them. They have After= on various services... Ah. Importantly including After=systemd-user-sessions.service. That in turn comes after remote filesystems are mounted. Therefore, during shutdown, NFS filesystems are guaranteed not to be stopped until after user scopes have been stopped.

One potential reason for cron having remained as `KillMode=process` is it allows cron to be restarted by an automatic update run by cron itself. Package managers don't really appreciate being killed. It sounds like systems are being expected to 1) default to cron using pam_systemd (F26: good, Debian 9: bad), 2) provide a native systemd unit for cron. I expect atd wants to be handled the same way.

I assume apache needs converting to a systemd unit with the systemd default KillMode. (and users of `graceful` get to use apache2ctl directly)

By cks at 2017-09-08 12:16:09:

As a side note, it should be harmless to set ssh(d) to KillMode=control-group, because user processes from SSH logins and so on should never wind up running under the sshd.service cgroup; they should all be put into session scopes under user slices. This is what appears to happen on Ubuntu 16.04, CentOS 7, and Fedora 26; on all of them, the only process in sshd.service is a single lone sshd -D process. Of course this makes the KillMode basically irrelevant.

Written on 07 September 2017.

Last modified: Thu Sep 7 01:28:14 2017