Using a watchdog timer in system shutdown with systemd (on Ubuntu 16.04)

September 22, 2017

In "Systemd, NFS mounts, and shutting down your system", I covered how Mike Kazantsev pointed me at the ShutdownWatchdogSec setting in system.conf as a way of dealing with our reboot hang issues. I also alluded to some issues with it. We've now tested and deployed a setup using this, so I want to walk through how it works and what its limitations are. As part of that I need to talk about how systemd actually shuts down your system.

Under systemd, system shutdown happens in two stages. The first stage is systemd stopping all of the system units that it can, in whatever way or ways they're configured to stop. Some units may fail to stop here and some processes may not be killed by their unit's 'stop' action(s), for example processes run by cron. This stage is the visible part of system shutdown, the bit that causes systemd to print out all of its console messages. It ends when systemd reaches shutdown.target, which is when you get console messages like:

[...]
[ OK ] Stopped Remount Root and Kernel File Systems.
[ OK ] Stopped Create Static Device Nodes in /dev.
[ OK ] Reached target Shutdown.

(There are apparently a few more magic systemd targets and services that get invoked here without producing any console messages.)

The second stage starts when systemd transfers control (and the role of PID 1) to the special systemd-shutdown program in order to do the final cleanup and shutdown of the system (its manual page describes why it exists, and the actual core code is readable in the systemd source). Simplified, systemd-shutdown SIGTERMs and then SIGKILLs all remaining processes and then enters a loop where it attempts to unmount any remaining filesystems, deactivate any remaining swap devices, and shut down remaining loop and DM devices. If all such things are gone or systemd-shutdown makes no progress at all, it goes on to do the actual reboot. Unless you turn on systemd debugging (and direct it to the console), systemd-shutdown is completely silent about all of this; it prints nothing when it starts and nothing as it runs. Normally this doesn't matter because it finishes immediately and without problems.

Based on the manpage, you might think that ShutdownWatchdogSec limits the total amount of time a shutdown can take and thus covers both of these stages. This is not the case; the only thing that ShutdownWatchdogSec does is put a watchdog timer on systemd-shutdown's end-of-things work in the second stage. Well, sort of. If you read the manpage, you'd probably think that the time you configure here is the time limit on the second stage as a whole, but actually it's only the time limit on each of those 'try to clean up remaining things' loops. systemd-shutdown resets the watchdog every time it starts a trip through the loop, so as long as it thinks it's making some progress, your shutdown can take much longer than you expect in sufficiently perverse situations. Or rather I should say your reboot. As the manual page specifically notes, the watchdog shutdown timer only applies to reboots, not to powering the system off.

(One consequence of what ShutdownWatchdogSec does and doesn't cover is that for most systems it's safe to set it to a very low timeout. If you get to the systemd-shutdown stage with any processes left, so many things have already been shut down that those processes are probably not going to manage an orderly shutdown in any case. We currently use 30 seconds and that's probably far too generous.)
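To make the setting concrete, here is a minimal sketch of the relevant part of /etc/systemd/system.conf. The 30 second value is the one mentioned above; the comments are added for illustration and everything else in the file is assumed to be left at its defaults.

```ini
# /etc/systemd/system.conf
[Manager]
# Watchdog timeout applied to systemd-shutdown's cleanup work during a
# reboot. It is reset on each pass through the cleanup loop, so it is
# not a limit on total shutdown time.
ShutdownWatchdogSec=30s
```

The setting should be picked up the next time systemd starts, for instance after the next boot, and it only does anything if a kernel watchdog device is present.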

To use ShutdownWatchdogSec, you need a kernel watchdog timer; you can tell if you have one by looking for /dev/watchdog and /dev/watchdogN devices. Kernel watchdog timers are created by a variety of modules that support a variety of hardware watchdogs, such as iTCO_wdt for the Intel TCO WatchDog that you probably have on your Intel-based server hardware. For our purposes here, the simplest and easiest to use kernel watchdog module is softdog, a software watchdog implemented at the kernel level. Softdog has the limitation that it doesn't help if the kernel itself hangs, which we don't really care about, but the advantage that it works everywhere (including in VMs) and seems to be quite reliable and predictable.

Some Linux distributions (such as Fedora) automatically load an appropriate kernel watchdog module depending on what hardware is available. Ubuntu 16.04 goes to the other extreme; it extensively blacklists all kernel watchdog modules, softdog included, so you can't even stick something in /etc/modules-load.d. To elide a long discussion, our solution to this was a new cslab-softdog.service systemd service that explicitly loaded the module using the following:

[Service]
Type=oneshot
RemainAfterExit=True
ExecStart=/sbin/modprobe softdog
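For context, a complete unit file built around that fragment might look like the following sketch. Only the [Service] section comes from this entry; the file location, the [Unit] description, and the [Install] section are illustrative assumptions, not necessarily what the real cslab-softdog.service contains.

```ini
# /etc/systemd/system/cslab-softdog.service  (assumed location)
[Unit]
# Assumed description, not from the original unit.
Description=Load the softdog kernel watchdog module

[Service]
Type=oneshot
RemainAfterExit=True
# Explicitly load the module by name, since the blacklist rules out
# using /etc/modules-load.d.
ExecStart=/sbin/modprobe softdog

[Install]
# Assumed target; any target reached during normal boot should do.
WantedBy=multi-user.target
```

With an [Install] section like this one, the service would be turned on with 'systemctl enable cslab-softdog.service' so that the module is loaded on every boot.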

With softdog loaded and ShutdownWatchdogSec configured, systemd appears to reliably reboot my test VMs and test hardware in situations where systemd-shutdown previously hung. It takes somewhat longer than my configured ShutdownWatchdogSec, apparently because softdog gives you an extra margin of time just in case, probably 60 seconds based on what modinfo says.

Sidebar: Limiting total shutdown time (perhaps)

As noted in comments on my first entry on our reboot problems, reboot.target and poweroff.target both normally have a JobTimeoutSec of 30 minutes. If my understanding of systemd is correct, setting any JobTimeoutSec here is supposed to force a reboot or poweroff if the first stage of shutdown takes that long (because rebooting is done by attempting to activate reboot.target, which is a systemd 'job', which causes the job timeout to matter).

Although I haven't tested it yet, this suggests that combining a suitably short JobTimeoutSec on reboot.target with ShutdownWatchdogSec would limit the total time your system will ever spend rebooting. Picking a good JobTimeoutSec value is not obvious; you want it long enough that daemons have time to shut down in an orderly way, but not so long that you go off to the machine room. 30 minutes is clearly too long for us, but 30 seconds would probably be too short for most servers.
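One way to set a shorter timeout without editing the packaged unit would be a drop-in file, sketched below. This is as untested as the idea itself; the file name and the five minute value are hypothetical placeholders, not recommendations.

```ini
# /etc/systemd/system/reboot.target.d/job-timeout.conf  (hypothetical name)
[Unit]
# Hypothetical value: somewhere between "too short for daemons to stop
# cleanly" and the stock 30 minutes.
JobTimeoutSec=5min
```

This sketch assumes that the stock reboot.target already sets a JobTimeoutAction that forces the reboot, so that only the duration needs overriding; if it does not, hitting the timeout would presumably just fail the job. A 'systemctl daemon-reload' would be needed after adding the file.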


Comments on this page:

By StephenGregory at 2017-09-25 10:21:50:

We are seeing the same problem with CentOS. I am hoping we can use your workaround. May I copy your blog, fully attributed, and post to our internal wiki?

By cks at 2017-09-25 13:19:26:

Certainly. If you try the full version of the workaround (with JobTimeoutSec as well), I'd be interested to hear how well it works.

Written on 22 September 2017.
Last modified: Fri Sep 22 02:28:17 2017