Our workaround for Ubuntu 16.04 and 18.04 failing to reliably reboot some of our servers
A few years ago I wrote about how and why systemd on Ubuntu 16.04 couldn't reliably reboot some of our servers. At the time I finished off the entry by suggesting that we'd live with the intermittent failures that caused some of our systems to hang during reboot attempts, forcing us to go power cycle them. Shortly afterward, we changed our minds and decided to work around the situation by resorting to a bigger hammer. These days we use our bigger hammer on both Ubuntu 16.04 and Ubuntu 18.04; the latter may have improved some aspects of the shutdown situation, but our experience is that it hasn't fixed all of them.
The fundamental problem is that systemd can leave descendant processes running even when it has nominally terminated a systemd service, such as Apache, cron, or Exim. systemd makes no attempt to kill these lingering processes until very late in shutdown, and in the meantime they can cause a variety of problems when unmounting NFS filesystems, turning off swap, or in various other parts of system shutdown. To deal with this, we use the big hammer of doing it ourselves; during shutdown, we run a script to kill lingering processes from various service units.
The script has a list of systemd services. For each service, it first
looks in the systemd cgroup hierarchy to see if there are still
processes associated with the service, by counting how many lines there
are in /sys/fs/cgroup/systemd/system.slice/<what>.service/tasks. If
there are processes still associated with the service, it kills them
with SIGTERM and then SIGKILL (if necessary), using systemd itself to
do the work with:
systemctl --kill-who=all --signal=SIG... kill <what>.service
(The actual implementation is slightly more complicated.)
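As a rough illustration of the check-then-kill steps described above, here is a minimal shell sketch. It is not our actual script: the function names, the CGROUP_ROOT, SYSTEMCTL and KILL_PAUSE variables, and the two-second pause between signals are all assumptions made for the example.

```shell
#!/bin/sh
# Sketch only: names and the pause length are illustrative assumptions.

# cgroup v1 location of systemd's per-service cgroups (overridable for testing).
CGROUP_ROOT="${CGROUP_ROOT:-/sys/fs/cgroup/systemd/system.slice}"
# Command used to talk to systemd (overridable, e.g. "echo" for a dry run).
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

# Print how many tasks are still in the cgroup of service $1 (0 if it is gone).
count_tasks() {
    tasks="$CGROUP_ROOT/$1.service/tasks"
    if [ -r "$tasks" ]; then
        wc -l < "$tasks" | tr -d ' '
    else
        echo 0
    fi
}

# Send SIGTERM to everything left in service $1, then SIGKILL if anything survives.
kill_lingering() {
    svc="$1"
    [ "$(count_tasks "$svc")" -gt 0 ] || return 0
    $SYSTEMCTL --kill-who=all --signal=SIGTERM kill "$svc.service"
    sleep "${KILL_PAUSE:-2}"
    [ "$(count_tasks "$svc")" -gt 0 ] || return 0
    $SYSTEMCTL --kill-who=all --signal=SIGKILL kill "$svc.service"
}
```

A caller would then loop over its list of services, for example with: for s in cron apache2 exim4 atd slurmd; do kill_lingering "$s"; done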
The script has a bunch of logging to report on whether it had to
do anything, what it did, and what the process tree looked like
before and after it did various killing (as reported through
systemd-cgls, because that will show us what systemd units the
stray processes are under).
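The before-and-after snapshots could be taken with something like the following sketch. The LOGFILE path, the CGLS variable, and the log_tree function name are hypothetical; only systemd-cgls itself comes from the description above.

```shell
#!/bin/sh
# Sketch only: LOGFILE, CGLS and log_tree are hypothetical names.
LOGFILE="${LOGFILE:-/var/log/shutdown-kill.log}"
CGLS="${CGLS:-systemd-cgls}"

# Append a labelled snapshot of the systemd process tree to the log.
log_tree() {
    {
        echo "=== $1 ($(date))"
        # --no-pager makes sure systemd-cgls never starts a pager.
        $CGLS --no-pager
    } >> "$LOGFILE" 2>&1
}
```

Calling log_tree "before kill" and log_tree "after kill" around the kill step would leave two labelled process trees in the log to compare.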
All of this is driven by a systemd .service unit with the following relevant bits:
[Unit]
After=remote-fs.target
Before=cron.service apache2.service exim4.service atd.service slurmd.service

[Service]
Type=oneshot
RemainAfterExit=True
ExecStop=/path/to/script
We set After so that our stop action is run before NFS unmounting
starts, and Before so that the stop action happens after those
listed services are shut down (systemd stops units in the reverse
of their start order, so the two settings swap roles at shutdown).
Not all of those services exist and are enabled on all machines,
but listing a Before service that isn't enabled is harmless. The
Before list is basically 'what has caused us problems'; we add
things to it as we run into problem services.
(Slurmd is a recent addition, for example.)
Right now the list of 'before' services is duplicated between the
script and the systemd unit. It feels tempting to try to eliminate
that, but on the other hand I'm not sure I want to be introspecting
systemd too much during shutdown. We could also try to be more
general by sniffing around the cgroup hierarchy to find stray
processes from any unit we don't whitelist (or at least any unit
that's theoretically been shut down). However, that might not be
very useful on modern systems, where 'KillMode=control-group' is
the default.
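A minimal sketch of what such a more general scan might look like, assuming the same cgroup v1 layout as the tasks path mentioned earlier. The ALLOWED_UNITS variable, its example entries, and the list_stray_units function are hypothetical, and this is an illustration rather than something we run.

```shell
#!/bin/sh
# Sketch only: ALLOWED_UNITS and list_stray_units are hypothetical names.
CGROUP_ROOT="${CGROUP_ROOT:-/sys/fs/cgroup/systemd/system.slice}"
# Space-separated units that may still have processes (example entries only).
ALLOWED_UNITS="${ALLOWED_UNITS:-systemd-journald.service systemd-udevd.service}"

# Print every .service unit that still has tasks and is not on the allowed list.
list_stray_units() {
    for dir in "$CGROUP_ROOT"/*.service; do
        # Read the file rather than testing its size, since cgroup
        # pseudo-files report a size of zero even when they have content.
        grep -q . "$dir/tasks" 2>/dev/null || continue
        unit="${dir##*/}"
        case " $ALLOWED_UNITS " in
            *" $unit "*) continue ;;   # explicitly allowed to linger
        esac
        echo "$unit"
    done
}
```

The output is one unit name per line, which could then be logged or fed to whatever does the killing.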
The good news is that the script's logging suggests that it usually
doesn't need to do anything during system shutdown on our 18.04
machines. But usually isn't always, which is what prompted the
addition of slurmd.service.
Sidebar: A potential alternate approach
Basically, what our script does is make these units behave as if
they were set to 'KillMode=control-group' during shutdown. You can
change systemd unit properties on the fly and only for the current
system boot (with 'systemctl --runtime set-property', which we use
for our per-user CPU and memory limits), so perhaps it would work to
switch to this KillMode on the relevant service units early in the
shutdown process.
This option didn't even occur to me until I wrote this entry, and in general it seems more uncertain and chancy than just killing things (even if we're killing things indirectly through systemd). But it'd give you a much smaller and simpler script.