Systemd, NFS mounts, and shutting down your system
After writing about our systemd reboot problem, I decided that I was irritated enough to spend part of today trying to dig into the situation (partly because of all of the comments and reactions to my entry, since they raised good questions and suggestions). I don't have any definite answers, partly because it's quite hard to see the state of the system when this is happening, but I do have some observations and notes (and some potentially useful contributions from other people on Twitter).
Here is what I've seen during shutdowns:
Under some circumstances, systemd will fail to unmount a NFS filesystem because processes are holding it busy but will go on to take down networking.
This is a dangerous situation to wind up in. With networking down, any attempt by a process to do anything on the filesystem will almost certainly give you an unrecoverably hung process; it's waiting for NFS IO to complete, but NFS IO requires a network to talk to the server, and the network isn't there and isn't coming back. Unfortunately it's disturbingly easy to wind up in this situation, thanks to our friend cron and other similar things.
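One way to see which processes are holding an NFS filesystem busy (while you can still look, before the network goes away) is fuser; a minimal example, where the mount point is just a placeholder:

    # list processes with open files on the filesystem mounted here
    fuser -vm /nfs/mountpoint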
Systemd appears to terminate processes in user slices reasonably early in the shutdown process, definitely before it starts trying to unmount NFS filesystems. However, as we've seen, not all 'user' processes are under user slices; some of them are hanging out in places like cron.service and apache2.service. Now, you might think that cron jobs and CGI processes and so on should be killed when systemd shuts down cron and Apache (which it normally does before unmounting NFS filesystems), but unfortunately both cron and Apache are set to KillMode=process, where systemd only terminates the main process when it stops the service. So all of your cron jobs, CGI processes, and so on linger on until systemd gets around to killing them much later (I believe as part of running systemd-shutdown, but I'm not sure).
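You can check what kill mode a service uses with systemctl; for instance (these are the unit names on our Ubuntu machines):

    systemctl show -p KillMode cron.service apache2.service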
(You can have this issue with any systemd service or daemon that starts multiple processes but uses KillMode=process. I believe that all System V init scripts handled through systemd's backwards compatibility implicitly run in this mode; certainly Ubuntu 16.04's /etc/init.d/apache2 does.)
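If you wanted a service's extra processes to be cleaned up when the service itself is stopped, one approach (a sketch, not something we've deployed) would be a drop-in override that puts the unit back in the default control-group kill mode:

    # /etc/systemd/system/cron.service.d/killmode.conf (hypothetical drop-in)
    [Service]
    KillMode=control-group

The tradeoff is that stopping or restarting cron would then also kill every running cron job, which is presumably part of why the packagers chose KillMode=process in the first place.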
As it happens, our user-managed web servers mostly get run from cron with @reboot entries (as the simple way to have them start on system boot). I suspect that it's not a coincidence that our web server almost always experiences a hang during reboots. We have another server that often experiences reboot hangs, and there people use at and atd.
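For illustration, such a crontab entry might look like the following (the script path is made up):

    # user crontab: start a personal web server after each boot
    @reboot $HOME/bin/start-webserver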
(The mere presence of lingering processes doesn't doom you, because they might not try to do any (NFS) IO when systemd sends them a SIGTERM. However, any number of things may react to SIGTERM by trying to do cleanups, for example by writing out a database or a log record, and if they are running from a NFS filesystem that is now cut off from the network, this is a problem.)
All of this sounds very neat and pat, but it's clearly not the full story, because I can't consistently reproduce a shutdown hang even though I can consistently create cut-off NFS mounts with not-yet-killed processes holding them busy (although I've got some more ideas to try). This brings me around to the things that don't work and one thing that might.
In comments, Alan noted that the stock systemd poweroff.target and reboot.target both have 30 minute timeouts, after which they force a poweroff or a reboot. Unfortunately these timeouts don't seem to be triggering in my tests, for whatever reason; I left a hung VM sitting there for well over half an hour at one point with its reboot.target timeout clearly not triggering.
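You can see what these timeouts are set to (and what they're supposed to do when they expire) by asking for the relevant unit properties; on my systems this is something like:

    systemctl show -p JobTimeoutUSec -p JobTimeoutAction reboot.target poweroff.target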
On Twitter, Mike Kazentsev mentioned that system.conf has a ShutdownWatchdogSec option to use a hardware watchdog to force a reboot if the system becomes sufficiently unresponsive. Unfortunately this watchdog doesn't limit the total reboot time, because systemd-shutdown pings it every time it loops around trying to unmount filesystems and turn off swap space and so on. As long as systemd-shutdown thinks it's making some progress, the watchdog won't fire. Setting the watchdog low will protect you against systemd-shutdown hanging, though, and that may be worthwhile.
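If you have a hardware watchdog and want that protection, the setting goes in the [Manager] section of /etc/systemd/system.conf; a minimal sketch (the five minute value is just an illustration, not a recommendation):

    # /etc/systemd/system.conf
    [Manager]
    ShutdownWatchdogSec=5min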
(Also, as I found out last year and then forgot until I painfully rediscovered it today, you can't easily reduce the timeout on user slices so that lingering processes in user slices are terminated faster on shutdown. This means that on many machines you can be stuck with a shutdown time of more than 90 seconds in general.)
Sidebar: The obvious brute force solution for us
As far as we know, our problems come from processes run by actual real people, not from system daemons that are lingering around. These users exist in a defined UID range, so it wouldn't be particularly difficult to write a program that scanned /proc for not-yet-killed user processes and tried to terminate them all. We could try to be creative about the ordering of this program during shutdown (so it ran after systemd had already shut down as many user scopes and slices as possible), or just run it based on convenient dependencies and accept that it would kill processes that systemd would clean up on its own.
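For concreteness, here's a rough sketch of what such a program could look like as a shell script. The UID range is a placeholder for our real one, and this hasn't been tested in an actual shutdown sequence:

    #!/bin/sh
    # Sketch: terminate any still-running processes owned by 'real' users.
    # The UID range below is a placeholder, not our actual range.
    MINUID=10000
    MAXUID=60000

    kill_users() {
        sig="$1"
        for status in /proc/[0-9]*/status; do
            pid=${status#/proc/}
            pid=${pid%/status}
            # The Uid: line lists real, effective, saved, and fs UIDs;
            # we look at the real UID (the first number).
            uid=$(awk '/^Uid:/ {print $2; exit}' "$status" 2>/dev/null)
            [ -n "$uid" ] || continue
            if [ "$uid" -ge "$MINUID" ] && [ "$uid" -le "$MAXUID" ]; then
                kill "-$sig" "$pid" 2>/dev/null
            fi
        done
    }

    kill_users TERM
    sleep 5
    kill_users KILL

The fiddly part, as mentioned, is arranging for this to run at the right point in the shutdown ordering.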