Our workaround for Ubuntu 16.04 and 18.04 failing to reliably reboot some of our servers

September 25, 2019

A few years ago I wrote about how and why systemd on Ubuntu 16.04 couldn't reliably reboot some of our servers. At the time I finished off the entry by suggesting that we'd live with the intermittent failures that caused some of our systems to hang during reboot attempts, forcing us to go power cycle them. Shortly afterward, we changed our minds and decided to work around the situation by resorting to a bigger hammer. These days we use our bigger hammer on both Ubuntu 16.04 and Ubuntu 18.04; the latter may have improved some aspects of the shutdown situation, but our experience is that it hasn't fixed all of them.

The fundamental problem is that systemd can leave descendant processes running even when it has nominally terminated a systemd service, such as Apache, cron, or Exim. Systemd does not kill (or even try to kill) these lingering processes until very late in shutdown, and they can cause a variety of problems during NFS unmounts, turning off swap, and other portions of system shutdown. To deal with this, we use the big hammer of doing it ourselves: during shutdown, we run a script to kill lingering processes from various service units.

The script has a list of systemd services. For each service, it first looks in the systemd cgroup hierarchy to see if there are still processes associated with the service, by counting how many lines there are in /sys/fs/cgroup/systemd/system.slice/<what>.service/tasks. If there are processes still associated with the service, it kills them with SIGTERM and then SIGKILL (if necessary), using systemd itself to do the work with:

systemctl --kill-who=all --signal=SIG... kill <what>.service

(The actual implementation is slightly more complicated.)
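As a minimal sketch of this logic (the actual script is more complicated; the service list, the two-second grace period, and the overridable CGROUP_ROOT here are illustrative, and this assumes the cgroup v1 layout described above):

```shell
#!/bin/sh
# Sketch of the straggler-killing logic described above. Assumes the
# cgroup v1 hierarchy; the service names and grace period are made up.

# Overridable so the counting logic can be exercised against a fake
# hierarchy instead of the real /sys.
CGROUP_ROOT="${CGROUP_ROOT:-/sys/fs/cgroup/systemd/system.slice}"

# Count the processes still attached to a service's systemd cgroup by
# counting the lines in its tasks file; report 0 if the file is gone.
count_tasks() {
    tasksfile="$CGROUP_ROOT/$1.service/tasks"
    if [ -r "$tasksfile" ]; then
        wc -l < "$tasksfile"
    else
        echo 0
    fi
}

# SIGTERM a service's leftover processes through systemd, then SIGKILL
# them if any are still around after a short wait.
kill_stragglers() {
    svc="$1"
    n=$(count_tasks "$svc")
    [ "$n" -eq 0 ] && return 0
    echo "$svc: $n lingering processes, sending SIGTERM"
    systemctl --kill-who=all --signal=SIGTERM kill "$svc.service"
    sleep 2
    if [ "$(count_tasks "$svc")" -gt 0 ]; then
        echo "$svc: still has processes, sending SIGKILL"
        systemctl --kill-who=all --signal=SIGKILL kill "$svc.service"
    fi
}

# Only act when explicitly asked to, so this file can also be sourced.
if [ "${1:-}" = "--run" ]; then
    for svc in cron apache2 exim4 atd slurmd; do
        kill_stragglers "$svc"
    done
fi
```

Invoked as 'script --run' from the unit's ExecStop; without --run it only defines the functions.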

The script does a bunch of logging to report whether it had to do anything, what it did, and what the process tree looked like before and after any killing (as reported through systemd-cgls, because that shows us which systemd units the stray processes are under).

All of this is driven by a systemd .service unit with the following relevant bits:

[Unit]
After=remote-fs.target
Before=cron.service apache2.service exim4.service atd.service slurmd.service

[Service]
Type=oneshot
RemainAfterExit=True
ExecStop=/path/to/script

Systemd stops units in the reverse of their start ordering, so setting After=remote-fs.target means that our stop action runs before NFS unmounting starts, and Before= means that it runs after the listed services have been shut down. Not all of those services exist and are enabled on all machines, but listing a Before= service that isn't enabled is harmless. The Before= list is basically 'what has caused us problems'; we add things to it as we run into problem services.

(Slurmd is a recent addition, for example.)

Right now the list of 'before' services is duplicated between the script and the systemd unit. It feels tempting to try to eliminate that, but on the other hand I'm not sure I want to be introspecting systemd too much during shutdown. We could also try to be more general by sniffing around the cgroup hierarchy to find stray processes from any unit we don't whitelist (or at least any unit that's theoretically been shut down). However, that might not be very useful on modern systems, where 'KillMode=control-group' is the default.
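A hypothetical sketch of that more general sniffing (the whitelist contents and the function name are invented for illustration, and this again assumes the cgroup v1 hierarchy):

```shell
#!/bin/sh
# Hypothetical sketch: walk a systemd cgroup hierarchy and report any
# service that still has processes but is not on a whitelist of
# services expected to be running. The whitelist is illustrative.
WHITELIST="ssh.service systemd-journald.service"

report_strays() {
    root="$1"
    find "$root" -name tasks 2>/dev/null | while read -r t; do
        unit=$(basename "$(dirname "$t")")
        # Skip whitelisted units.
        case " $WHITELIST " in
            *" $unit "*) continue ;;
        esac
        n=$(wc -l < "$t" | tr -d ' ')
        [ "$n" -gt 0 ] && echo "$unit: $n stray processes"
    done
}

# On a real system one would run:
#   report_strays /sys/fs/cgroup/systemd/system.slice
```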

The good news is that the script's logging suggests that it usually doesn't need to do anything during system shutdown on our 18.04 machines. But usually isn't always, which is what prompted the addition of slurmd.service.

Sidebar: A potential alternate approach

Basically this is making these units behave as if they were set to 'KillMode=control-group' during shutdown. You can change systemd unit properties on the fly and only for the current system boot (with 'systemctl --runtime set-property', which we use for our per-user CPU and memory limits), so perhaps it would work to switch to this KillMode on the relevant service units early in the shutdown process.

This option didn't even occur to me until I wrote this entry, and in general it seems more uncertain and chancy than just killing things (even if we're killing things indirectly through systemd). But it'd give you a much smaller and simpler script.
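As a sketch of that idea (the unit name, service list, and ordering here are hypothetical and untested, and it assumes KillMode is among the properties systemd allows changing at runtime), the switching unit might look something like this. Because stop ordering is the reverse of start ordering, After= here makes the property change happen before the listed services are stopped:

```ini
# Hypothetical 'killmode-switch.service'; After= means our stop action
# runs before these services are stopped.
[Unit]
After=cron.service slurmd.service

[Service]
Type=oneshot
RemainAfterExit=True
ExecStop=/bin/sh -c 'for u in cron.service slurmd.service; do systemctl --runtime set-property "$u" KillMode=control-group; done'
```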


Comments on this page:

By aioeu at 2019-09-25 01:23:52:

This is presumably only a problem with services still using initscripts, since systemd will generate units for those with KillMode=process to match pre-systemd behaviour.

Why not just use systemctl edit on those generated units and set KillMode=control-group permanently?

By cks at 2019-09-25 08:32:37:

A number of standard Ubuntu systemd units are explicitly set to KillMode=process, most notably cron. This is sensible and even desirable behavior for something like cron, where you don't normally want currently running cron jobs being killed if you restart the daemon itself (as the system might automatically do during things like package updates).

Now that I look at our systems, 18.04 has fewer of them than I thought, which would explain why 18.04 has been less problematic here for us. The only two that stand out as dangerous are cron and slurmd. Apache, Exim, and even atd now leave it at the default and so terminate everything when the service is shut down.
