Systemd, NFS mounts, and shutting down your system

September 7, 2017

After writing about our systemd reboot problem, I decided that I was irritated enough to spend part of today trying to dig into the situation (partly because of all of the comments and reactions to my entry, which raised good questions and suggestions). I don't have any definite answers, partly because it's quite hard to see the state of the system when this is happening, but I do have some observations and notes (and some potentially useful contributions from other people on Twitter).

Here is what I've seen during shutdowns:

Under some circumstances, systemd will fail to unmount a NFS filesystem because processes are holding it busy but will go on to take down networking.

This is a dangerous situation to wind up in. With networking down, any attempt by a process to do anything on the filesystem will almost certainly give you an unrecoverably hung process; it's waiting for NFS IO to complete, but NFS IO requires a network to talk to the server, and the network isn't there and isn't coming back. Unfortunately it's disturbingly easy to wind up in this situation, thanks to our friend cron and other similar things.

Systemd appears to terminate processes in user slices reasonably early in the shutdown process, definitely before it starts trying to unmount NFS filesystems. However, as we've seen, not all 'user' processes are under user slices; some of them are hanging out in places like cron.service and apache2.service. Now, you might think that cron jobs and CGI processes and so on should be killed when systemd shuts down cron and Apache (which it normally does before unmounting NFS filesystems), but unfortunately both cron and Apache are set to KillMode=process, where systemd only terminates the main process when it stops the service. So all of your cron jobs, CGI processes, and so on linger on until systemd gets around to killing them much later (I believe as part of running systemd-shutdown but I'm not sure).

(You can have this issue with any systemd service or daemon that starts multiple processes but uses KillMode=process. I believe that all System V init scripts handled through systemd's backwards compatibility implicitly run in this mode; certainly Ubuntu 16.04's /etc/init.d/apache2 does.)
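As an illustration, here is roughly what checking and overriding this looks like. This is a hedged sketch rather than a recommendation, since whether you actually want cron jobs and CGI processes killed along with their parent service is a local policy decision. 'systemctl show -p KillMode cron.service' will tell you what a service is currently set to, and a drop-in like the following switches it back to systemd's default of killing everything in the service's control group:

    # /etc/systemd/system/cron.service.d/killmode.conf
    [Service]
    # On 'stop', kill every process in cron.service's control group,
    # not just the main cron daemon.
    KillMode=control-group

(After creating the drop-in you need a 'systemctl daemon-reload' for systemd to pick it up.)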

As it happens, our user-managed web servers mostly get run from cron with @reboot entries (as the simple way to start them on system boot). I suspect that it's not a coincidence that our web server machine almost always experiences a hang during reboots. We have another server that often experiences reboot hangs, and there people use at jobs (run via atd).
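For reference, an @reboot crontab entry looks something like this; the path here is a made-up example:

    # in the user's crontab, via 'crontab -e'
    @reboot $HOME/www/run-webserver.sh

Cron runs the command once at boot, as the user, and then pays no further attention to it.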

(The mere presence of lingering processes doesn't doom you, because they might not try to do any (NFS) IO when systemd sends them a SIGTERM. However, any number of things may react to SIGTERM by trying to do cleanups, for example by writing out a database or a log record, and if they are running from a NFS filesystem that is now cut off from the network, this is a problem.)
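To make that failure mode concrete, here is a hypothetical daemon with a 'clean shutdown' SIGTERM handler, sketched in Python. If its working directory or log file lives on an NFS filesystem that has been cut off from the network, the cleanup write blocks forever and the process never exits:

    import signal, sys, time

    def on_sigterm(signum, frame):
        # This cleanup IO lands on the filesystem the program runs from.
        # If that's a cut-off NFS mount, the open()/write() hangs forever
        # in NFS IO and the process never gets to exit.
        with open("shutdown.log", "a") as f:
            f.write("terminating cleanly\n")
        sys.exit(0)

    signal.signal(signal.SIGTERM, on_sigterm)
    while True:
        time.sleep(60)      # stand-in for the program's real work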

All of this sounds very neat and pat, but it's clearly not the full story, because I can't consistently reproduce a shutdown hang even though I can consistently create cut-off NFS mounts with not-yet-killed processes holding them busy (I've got some more ideas to try, though). This brings me around to the things that don't work and one thing that might.

In comments, Alan noted that the stock systemd poweroff.target and reboot.target both have 30-minute timeouts, after which they force a poweroff or a reboot. Unfortunately these timeouts don't seem to trigger in my tests, for whatever reason; at one point I left a hung VM sitting there for well over half an hour and its reboot.target timeout clearly never fired.
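For reference, those timeouts come from JobTimeoutSec= and JobTimeoutAction= on the two targets, so in theory a drop-in like the following lowers them (given that the stock 30-minute timeout didn't fire for me, I can't promise it actually helps):

    # /etc/systemd/system/reboot.target.d/timeout.conf
    [Unit]
    JobTimeoutSec=10min
    JobTimeoutAction=reboot-force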

On Twitter, Mike Kazentsev mentioned that system.conf has a ShutdownWatchdogSec option to use a hardware watchdog to force a reboot if the system becomes sufficiently unresponsive. Unfortunately this watchdog doesn't limit the total reboot time, because systemd-shutdown pings it every time it loops around trying to unmount filesystems and turn off swap space and so on. As long as systemd-shutdown thinks it's making some progress, the watchdog won't fire. Setting the watchdog low will protect you against systemd-shutdown hanging, though, and that may be worthwhile.
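Setting it is a one-line change in /etc/systemd/system.conf, although it only does something if the machine has a hardware watchdog (or a software stand-in such as the softdog kernel module) for systemd to use:

    # /etc/systemd/system.conf
    [Manager]
    ShutdownWatchdogSec=5min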

(Also, as I found out last year and then forgot until I painfully rediscovered it today, you can't easily reduce the timeout on user slices so that lingering processes in user slices are terminated faster on shutdown. This means that on many machines, you can be stuck with a more than 90 second shutdown time in general.)

Sidebar: The obvious brute force solution for us

As far as we know, our problems come from processes run by actual real people, not from system daemons that are lingering around. These users exist in a defined UID range, so it wouldn't be particularly difficult to write a program that scanned /proc for not-yet-killed user processes and tried to terminate them all. We could try to be creative about the ordering of this program during shutdown (so it ran after systemd had already shut down as many user scopes and slices as possible), or just run it based on convenient dependencies and accept that it would kill processes that systemd would clean up on its own.
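A sketch of such a program might look like the following, in Python. The UID range here is a made-up example, it uses the ownership of /proc/<pid> as a close-enough stand-in for the process's UID, and a real version would probably follow up with SIGKILL for anything that ignores the SIGTERM:

    #!/usr/bin/python3
    # Send SIGTERM to surviving processes owned by UIDs in our 'real people' range.
    import os, signal

    LOW_UID, HIGH_UID = 5000, 60000    # hypothetical local UID range

    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        pid = int(entry)
        if pid == os.getpid():
            continue
        try:
            uid = os.stat("/proc/" + entry).st_uid
            if LOW_UID <= uid <= HIGH_UID:
                os.kill(pid, signal.SIGTERM)
        except OSError:
            pass    # the process already exited, or it's not ours to kill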
