2017-09-29
More on systemd on Ubuntu 16.04 failing to reliably reboot some of our servers
I wrote about how Ubuntu 16.04 can't reliably reboot some of our servers, then discovered that systemd can shut down the network with NFS mounts still present and speculated this was (and is) one of our problems. I've now been able to reliably produce such a reboot failure on a test VM and narrow down the specific component involved.
Systemd shuts down your system in two stages: the main stage, which stops systemd units, and the final stage, done with systemd-shutdown, which kills the remaining processes, fiddles around with the remaining mounts, and theoretically eventually reboots the system. In the Ubuntu 16.04 version of systemd-shutdown, part of what it tries to do with NFS filesystems is to remount them read-only, and for us this sometimes hangs. With suitable logging enabled in systemd so that systemd-shutdown runs with it, we see:
Sending SIGTERM to remaining processes...
Sending SIGKILL to remaining processes...
Sending SIGKILL to PID <nnn> (<command>)
Unmounting file systems.
Remounting '/var/mail' read-only with options '<many of them>'.
At this point things hang, although if you have one set up, a shutdown watchdog will force a reboot and recover the system. Based on comments on my second entry, systemd-shutdown doing this is (now) seen as a problem and it's been changed in the upstream version of systemd, although only very recently (eg this commit only landed at the end of August).
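(For the record, the logging in question can be turned on through systemd's system.conf or a drop-in; a minimal sketch of the general mechanism, not necessarily our exact settings:

    [Manager]
    LogLevel=debug
    LogTarget=kmsg

Setting systemd.log_level=debug systemd.log_target=kmsg on the kernel command line has the same effect and is easier for a one-off boot.)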
Unfortunately this doesn't seem to be the sole cause of our shutdown hangs. We appear to have had at least one reboot hang while systemd attempted to swapoff the server's swap space, before it entered late-stage reboot. This particular server has a lot of inactive user processes because it hosts our user-managed web servers, and (at the time) they weren't being killed early in system shutdown, so turning off swap space presumably had to page a lot of things back into RAM. This may not have actually hung as such, but if so it was sufficiently slow as to be unacceptable, and we force-rebooted the server in question after a minute or two.
We're currently using multiple ways to hopefully reduce the chances of hangs at reboot time. We've put all user cron jobs into systemd user slices so that systemd will kill them early, although this doesn't always work and we may need some clever way of dealing with the remaining processes. We've enabled a shutdown watchdog timer with a relatively short timeout, although this only helps if the system makes it to the second stage, when it runs systemd-shutdown; a 'hang' before then in swapoff won't be interrupted.
In the future we may enable a relatively short JobTimeoutSec on reboot.target, in the hopes that this does some good. I've considered changing Ubuntu's cron.service to KillMode=control-group and then holding the package to prevent surprise carnage during package upgrades, but this seems to be a little bit too much hassle and danger for an infrequent thing that is generally merely irritating.
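(For what it's worth, the change itself would be small; an untested sketch of a drop-in, where the file name is arbitrary:

    # /etc/systemd/system/cron.service.d/killmode.conf
    [Service]
    KillMode=control-group

The hassle is less writing this and more being confident about what would now get killed, and when.)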
As a practical matter, this entry is probably the end of the saga. This is not a particularly important thing for us and I've already discovered that there are no simple, straightforward, bug-free fixes (and as usual the odds are basically zero that Ubuntu will fix bugs here). If we're lucky, Ubuntu 18.04 will include a version of systemd with the systemd-shutdown NFS mount fixes in it, and perhaps pam_systemd will be more reliable for @reboot cron jobs. If we're not lucky, well, we'll keep having to trek down to the machine room when we reboot servers. Fortunately it's not something we do very often.
2017-09-27
Putting cron jobs into systemd user slices doesn't always work (on Ubuntu 16.04)
As part of dealing with our Ubuntu 16.04 shutdown problem, we now have our systems set up to put all user cron jobs into systemd user slices so that systemd will terminate them before it starts unmounting NFS filesystems. Since we made this change, we've rebooted all of our systems and thus had an opportunity to see how it works in practice in our environment.
Unfortunately, what we've discovered is that pam_systemd apparently doesn't always work right. Specifically, we've seen some user cron @reboot entries create processes that wound up still under cron.service, although other @reboot entries for the same user on the same machine wound up with their processes in user slices. When things fail, pam_systemd doesn't log any errors that I can see in the systemd journal.
(Since no failures are logged, this doesn't seem related to the famous systemd issue where pam_systemd can't talk to systemd, eg systemd issue 2863 or this Ubuntu issue.)
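(What is easy is seeing where a given process wound up. For a hypothetical PID:

    $ systemctl status 12345
    $ cat /proc/12345/cgroup

The first reports the unit, and thus the slice, that the process belongs to; the second shows its raw cgroup memberships.)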
The pam_systemd source code isn't very long and doesn't do very much itself. The most important function here appears to be pam_sm_open_session, and reading the code I can't spot a failure path that doesn't cause pam_systemd to log an error. The good news is that turning on debugging for pam_systemd doesn't appear to result in an overwhelming volume of extra messages, so we can probably do this on the machines where we've seen the problem, in the hopes that something useful shows up.
(It will probably take a while, since we don't reboot these machines very often. I have not seen or reproduced this on test machines, at least so far.)
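Turning that debugging on is a matter of adding the module's debug option in the relevant PAM file, for example:

    session optional pam_systemd.so debug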
Looking through what 'systemctl list-dependencies' with various options says for cron.service, it's possible that we need an explicit dependency on systemd-logind.service (although systemd-analyze on one system says that systemd-logind started well before crond). In theory it looks like pam_systemd should be reporting errors if systemd-logind hasn't started, but in practice, who knows. We might as well adopt a cargo cult 'better safe than sorry' approach to unit dependencies, even if it feels like a very long shot.
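If we do add the dependency, a drop-in is the natural mechanism; an untested sketch (the file name is arbitrary):

    # /etc/systemd/system/cron.service.d/logind.conf
    [Unit]
    Wants=systemd-logind.service
    After=systemd-logind.service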
(Life would be simpler if systemd had a simple way of discovering the relationship, if any, between two units.)
2017-09-22
Using a watchdog timer in system shutdown with systemd (on Ubuntu 16.04)
In Systemd, NFS mounts, and shutting down your system, I covered how Mike Kazantsev pointed me at the ShutdownWatchdogSec setting in system.conf as a way of dealing with our reboot hang issues. I also alluded to some issues with it. We've now tested and deployed a setup using this, so I want to walk through how it works and what its limitations are. As part of that I need to talk about how systemd actually shuts down your system.
Under systemd, system shutdown happens in two stages. The first stage is systemd stopping all of the system units that it can, in whatever way or ways they're configured to stop. Some units may fail to stop here, and some processes may not be killed by their unit's 'stop' action(s), for example processes run by cron. This stage is the visible part of system shutdown, the bit that causes systemd to print out all of its console messages. It ends when systemd reaches shutdown.target, which is when you get console messages like:
[...]
[ OK ] Stopped Remount Root and Kernel File Systems.
[ OK ] Stopped Create Static Device Nodes in /dev.
[ OK ] Reached target Shutdown.
(There are apparently a few more magic systemd targets and services that get invoked here without producing any console messages.)
The second stage starts when systemd transfers control (and the role of PID 1) to the special systemd-shutdown program in order to do the final cleanup and shutdown of the system (the manual page describes why it exists, and you can read the actual core code here). Simplified, systemd-shutdown SIGTERMs and then SIGKILLs all remaining processes and then enters a loop where it attempts to unmount any remaining filesystems, deactivate any remaining swap devices, and shut down remaining loop and DM devices. If all such things are gone, or systemd-shutdown makes no progress at all, it goes on to do the actual reboot. Unless you turn on systemd debugging (and direct it to the console), systemd-shutdown is completely silent about all of this; it prints nothing when it starts and nothing as it runs. Normally this doesn't matter because it finishes immediately and without problems.
Based on the manpage, you might think that ShutdownWatchdogSec limits the total amount of time a shutdown can take and thus covers both of these stages. This is not the case; the only thing that ShutdownWatchdogSec does is put a watchdog timer on systemd-shutdown's end-of-things work in the second stage. Well, sort of. If you read the manpage, you'd probably think that the time you configure here is the time limit on the second stage as a whole, but actually it's only the time limit on each of those 'try to clean up remaining things' loops. systemd-shutdown resets the watchdog every time it starts a trip through the loop, so as long as it thinks it's making some progress, your shutdown can take much longer than you expect in sufficiently perverse situations. Or rather I should say your reboot; as the manual page specifically notes, the watchdog shutdown timer only applies to reboots, not to powering the system off.
(One consequence of what ShutdownWatchdogSec does and doesn't cover is that for most systems it's safe to set it to a very low timeout. If you get to the systemd-shutdown stage with any processes left, so many things have already been shut down that those processes are probably not going to manage an orderly shutdown in any case. We currently use 30 seconds and that's probably far too generous.)
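Concretely, this is one setting in the [Manager] section of system.conf (or a drop-in, if you prefer); our 30 seconds is equivalent to:

    [Manager]
    ShutdownWatchdogSec=30s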
To use ShutdownWatchdogSec, you need a kernel watchdog timer; you can tell if you have one by looking for /dev/watchdog and /dev/watchdogN devices. Kernel watchdog timers are created by a variety of modules that support a variety of hardware watchdogs, such as iTCO_wdt for the Intel TCO WatchDog that you probably have on your Intel-based server hardware. For our purposes here, the simplest and easiest to use kernel watchdog module is softdog, a software watchdog implemented at the kernel level. Softdog has the limitation that it doesn't help if the kernel itself hangs, which we don't really care about, but the advantage that it works everywhere (including in VMs) and seems to be quite reliable and predictable.
Some Linux distributions (such as Fedora) automatically load an appropriate kernel watchdog module depending on what hardware is available. Ubuntu 16.04 goes to the other extreme; it extensively blacklists all kernel watchdog modules, softdog included, so you can't even stick something in /etc/modules-load.d. To elide a long discussion, our solution to this was a new cslab-softdog.service systemd service that explicitly loaded the module using the following:
[Service]
Type=oneshot
RemainAfterExit=True
ExecStart=/sbin/modprobe softdog
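(That's just the [Service] section; the real unit presumably also has a [Unit] description and an [Install] section so that it can be enabled, something like:

    [Install]
    WantedBy=multi-user.target

but those parts are boilerplate.)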
With softdog loaded and ShutdownWatchdogSec configured, systemd appears to reliably reboot my test VMs and test hardware in situations where systemd-shutdown previously hung. It takes somewhat longer than my configured ShutdownWatchdogSec, apparently because softdog gives you an extra margin of time just in case, probably 60 seconds based on what modinfo says.
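(You can check the margin yourself; the relevant module parameter appears to be soft_margin, and paraphrasing the output from memory:

    $ modinfo -p softdog
    soft_margin: Watchdog soft_margin in seconds (default 60)
    ...)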
Sidebar: Limiting total shutdown time (perhaps)
As noted in comments on my first entry on our reboot problems, reboot.target and poweroff.target both normally have a JobTimeoutSec of 30 minutes. If my understanding of systemd is correct, setting any JobTimeoutSec here is supposed to force a reboot or poweroff if the first stage of shutdown takes that long (because rebooting is done by attempting to activate reboot.target, which is a systemd 'job', which causes the job timeout to matter).
Although I haven't tested it yet, this suggests that combining a suitably short JobTimeoutSec on reboot.target with ShutdownWatchdogSec would limit the total time your system will ever spend rebooting. Picking a good JobTimeoutSec value is not obvious; you want it long enough that daemons have time to shut down in an orderly way, but not so long that you wind up going off to the machine room. 30 minutes is clearly too long for us, but 30 seconds would probably be too short for most servers.
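If we ever try it, the natural mechanism is a drop-in on reboot.target; an untested sketch with a guessed-at value:

    # /etc/systemd/system/reboot.target.d/timeout.conf
    [Unit]
    JobTimeoutSec=5min

(The stock reboot.target pairs its 30-minute JobTimeoutSec with JobTimeoutAction=reboot-force, which is what actually forces the reboot; a drop-in should only need to shorten the timeout.)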
2017-09-14
Sorting out systemd's system.conf, user.conf, and logind.conf
Here's a mistake that I've made more than once and that I'm going to try to get rid of by writing it down.
Systemd organizes running processes into a tree of, well, let's call them units for now (mechanically they're control groups), which partly manifests in the form of slice units. One of the big divisions in this hierarchy is between processes involved in services, which are put under system.slice, and user session processes, which are under user.slice. There are many situations where you would like to apply different settings to user processes than to system ones, partly because these processes are fundamentally different in several respects.
(For example, all services should normally have some way to explicitly stop them and this will normally do some sort of orderly shutdown of the service involved. User slices, sessions, and scopes have no such thing and thus no real concept of an 'orderly shutdown'; all you can do is hit them with various Unix signals until they go away. For user stuff, the orderly shutdown was generally supposed to happen when the user logged off.)
Systemd has two configuration files, system.conf and user.conf. One of the things system.conf can do is set global defaults for all units and all processes, both system processes (things under system.slice) and user processes (things under user.slice), for example DefaultTimeoutStopSec and DefaultCPUAccounting. As mentioned, there are plenty of times when you'd like to set or change these things only for user processes. You would think that systemd would provide a way to do this, and further, if you're irritated with systemd and not paying close attention, you might think that user.conf can be used to set these things just for user processes. After all, surely systemd provides a way to do this obvious thing, and 'user' is right there in the file's name.
This is wrong.
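For the record, the global versions of these settings live in the [Manager] section of system.conf and apply across the board (values below picked purely for illustration):

    [Manager]
    DefaultTimeoutStopSec=30s
    DefaultCPUAccounting=yes

There is no equivalent that's scoped to user.slice alone.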
What user.conf is for is covered in the manpage for both files; it sets these values for systemd user instances, which are per-user systemd instances that the user can control and do things with. Systemd user instances can be used for interesting things (see the Arch wiki on them), but I don't currently deal with any systems that use them actively, so they're not on my mind much.
(Both Ubuntu 16.04 and Fedora 26 do start systemd user instances for people, but I don't think anyone on our systems uses them for anything; right now, they're just there.)
If systemd ever allows you to set things like DefaultCPUAccounting only for user processes, instead of globally, the place it might wind up is logind.conf, which configures systemd-logind, the systemd bit that actually sets up user slices, sessions, scopes, and so on (often in part through pam_systemd). This seems a logical location to me because systemd-logind is where user stuff is controlled in general and logind.conf already has the UserTasksMax setting. I don't know if anything like this is being contemplated by the systemd people, though, and there are alternate approaches, such as allowing user-${UID}.slice slices to be templated (although in the current setup, this would require renaming them to have an @ in their name, eg user@${UID}.slice).
(I'm sure this seems like a silly mistake to make, and it certainly sounds like it when I've written it out like this. All I can say is that I've already made this mistake at least twice that I can remember; the most recent time made it into an irritated tweet that exhibited my misunderstanding.)
2017-09-07
Systemd, NFS mounts, and shutting down your system
After writing about our systemd reboot problem, I decided that I was irritated enough to spend part of today trying to dig into the situation (partly because of all of the comments and reactions to my entry, since they raised good questions and suggestions). I don't have any definite answers, partly because it's quite hard to see the state of the system when this is happening, but I do have some observations and notes (and some potentially useful contributions from other people on Twitter).
Here is what I've seen during shutdowns:
Under some circumstances, systemd will fail to unmount a NFS filesystem because processes are holding it busy but will go on to take down networking.
This is a dangerous situation to wind up in. With networking down, any attempt by a process to do anything on the filesystem will almost certainly give you an unrecoverably hung process; it's waiting for NFS IO to complete, but NFS IO requires a network to talk to the server, and the network isn't there and isn't coming back. Unfortunately it's disturbingly easy to wind up in this situation, thanks to our friend cron and other similar things.
Systemd appears to terminate processes in user slices reasonably early in the shutdown process, definitely before it starts trying to unmount NFS filesystems. However, as we've seen, not all 'user' processes are under user slices; some of them are hanging out in places like cron.service and apache2.service. Now, you might think that cron jobs and CGI processes and so on should be killed when systemd shuts down cron and Apache (which it normally does before unmounting NFS filesystems), but unfortunately both cron and Apache are set to KillMode=process, where systemd only terminates the main process when it stops the service. So all of your cron jobs, CGI processes, and so on linger on until systemd gets around to killing them much later (I believe as part of running systemd-shutdown, but I'm not sure).
(You can have this issue with any systemd service or daemon that starts multiple processes but uses KillMode=process. I believe that all System V init scripts handled through systemd's backwards compatibility implicitly run in this mode; certainly Ubuntu 16.04's /etc/init.d/apache2 does.)
As it happens, our user-managed web servers mostly get run from cron with @reboot entries (as the simple way to start on system boot). I suspect that it's not a coincidence that our web server almost always experiences a hang during reboots. We have another server that often experiences reboot hangs, and there people use at and atd.
(The mere presence of lingering processes doesn't doom you, because they might not try to do any (NFS) IO when systemd sends them a SIGTERM. However, any number of things may react to SIGTERM by trying to do cleanups, for example by writing out a database or a log record, and if they are running from a NFS filesystem that is now cut off from the network, this is a problem.)
All of this description sounds very neat and pat, but it's clearly not the full story, because I can't consistently reproduce a shutdown hang even though I can consistently create cut-off NFS mounts with not-yet-killed processes holding them busy (I've got some more ideas to try, though). This gets me around to the things that don't work and one thing that might.
In comments, Alan noted that the stock systemd poweroff.target and reboot.target both have 30-minute timeouts, after which they force a poweroff or a reboot. Unfortunately these timeouts don't seem to be triggering in my tests, for whatever reason; I left a hung VM sitting there for well over half an hour at one point with its reboot.target timeout clearly not triggering.
On Twitter, Mike Kazantsev mentioned that system.conf has a ShutdownWatchdogSec option to use a hardware watchdog to force a reboot if the system becomes sufficiently unresponsive. Unfortunately this watchdog doesn't limit the total reboot time, because systemd-shutdown pings it every time it loops around trying to unmount filesystems and turn off swap space and so on. As long as systemd-shutdown thinks it's making some progress, the watchdog won't fire. Setting the watchdog low will protect you against systemd-shutdown hanging, though, and that may be worthwhile.
(Also, as I found out last year and then forgot until I painfully rediscovered it today, you can't easily reduce the timeout on user slices so that lingering processes in user slices are terminated faster on shutdown. This means that on many machines, you can be stuck with a more than 90 second shutdown time in general.)
Sidebar: The obvious brute force solution for us
As far as we know, our problems come from processes run by actual real people, not from system daemons that are lingering around. These users exist in a defined UID range, so it wouldn't be particularly difficult to write a program that scanned /proc for not-yet-killed user processes and tried to terminate them all. We could try to be creative about the ordering of this program during shutdown (so it ran after systemd had already shut down as many user scopes and slices as possible), or just run it based on convenient dependencies and accept that it would kill processes that systemd would clean up on its own.
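A minimal sketch of the core of such a program, in shell and assuming a hypothetical local user UID range starting at 10000:

    #!/bin/sh
    # SIGTERM every process whose real UID is in our user range.
    for pid in /proc/[0-9]*; do
        uid=$(awk '/^Uid:/ {print $2}' "$pid/status" 2>/dev/null)
        [ -n "$uid" ] && [ "$uid" -ge 10000 ] && kill -TERM "${pid#/proc/}"
    done

A real version would presumably follow up with SIGKILL after a short grace period.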
2017-09-06
Systemd on Ubuntu 16.04 can't (or won't) reliably reboot your server
We just went through a periodic exercise of rebooting all of our Ubuntu servers in order to get up to date on kernels and so on. By now almost all of our servers are running Ubuntu 16.04, which means that they're using systemd. Unfortunately this gives us a real problem, because on Ubuntu 16.04, systemd won't reliably reboot your system. On some servers, usually the busiest and most important ones, the system will just stop during the shutdown process and sit there. And sit there. And sit there. Perhaps it would eventually recover after tens of minutes, but as mentioned these are generally our busiest and most important servers, so we're not exactly going to let them sit there to find out what happens eventually.
(There also probably isn't much point to finding out. It's unlikely that there's some miracle cure we can do ourselves, and making a bug report to Ubuntu is almost completely pointless since Ubuntu only fixes security issues and things that are actively on fire. My previous experience wasn't productive and produced no solutions from anyone.)
This goes well beyond my previous systemd reboot irritation. Reliably rebooting servers despite what users are doing to them is a fairly foundational thing, yet Ubuntu's systemd not only can't get this right but doesn't even tell us what's wrong (in the sense of 'what is keeping me from rebooting'). The net effect is to turn rebooting many of our servers into a minefield. Not only may a reboot require in-person intervention in our machine room, but because we can't count on a reboot just working, we have to actively pay attention to the state of every machine when we reboot it; we can't just assume that machines will come back up on their own unless something is fairly wrong. The whole experience angers me every time I have to go through it.
By now we've enabled persistent systemd journals on most everything in the hopes of capturing useful information so we can perhaps guess why this is happening. Unfortunately so far we've gotten nothing useful; systemd has yet to log or display on the screen, say, 'still waiting N seconds for job X'. I'm not even convinced that the systemd journal has captured all of the log messages that it should from an unsuccessful shutdown, as what 'journalctl -b-1' shows is much less than I'd expect and just stops abruptly.
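(Enabling a persistent journal on Ubuntu 16.04 is just a matter of creating the directory that journald looks for and then restarting it:

    # mkdir /var/log/journal
    # systemctl restart systemd-journald

Alternatively, you can set Storage=persistent in /etc/systemd/journald.conf.)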
(Without an idea of how and why systemd is screwing up, I'm reluctant to change DefaultTimeoutStopSec from its Ubuntu default, as I once discussed here, or make other changes like forcing all user cron jobs to run under user slices.)
(This Ubuntu bug matches one set of symptoms we see, but not all of them. Note that our problem is definitely not the Linux kernel having problems rebooting the hardware; the same Dell servers were previously running Ubuntu 14.04 and rebooting fine, and Magic SysRQ will force reboots without problems. There's also this Ubuntu bug and this report of problems with shutting down when you have NFS mounts, which certainly could be part of our problems.)
2017-09-03
A fundamental limitation of systemd's per-user fair share scheduling
Up until now, I've been casually talking about systemd supporting per-user fair share scheduling, when writing about the basic mechanics and in things like getting cron jobs to cooperate. But really both of these point out a fundamental limitation, which is that systemd doesn't have per-user fair share scheduling; what it really has is per-slice fair share scheduling. You can create per-user fair share scheduling from this only to the extent that you can arrange for a given user's processes to all wind up somewhere under their user-${UID}.slice. If you can't arrange for all of the significant processes to get put under user-${UID}.slice, you don't get complete per-user fair share scheduling; some processes will escape to be scheduled separately and possibly (very) unfairly.
This may sound like an abstract limitation, so let me give you a concrete case where it matters. We run a departmental web server, where users can run processes to handle web requests in various ways, both via CGIs and via user-managed web servers. Both of these can experience load surges of various sorts and sometimes this can result in them eating a bunch of CPU. It would be nice if user processes could have their CPU usage shared fairly among everyone, so that one user with a bunch of CPU-heavy requests wouldn't starve everyone else out of the CPU.
User-managed web servers are run either from cron with @reboot entries or manually by the user logging in and (re)starting them; in both cases we can arrange for the processes to be under user-${UID}.slice and so be subject to per-user fair share scheduling. However, user CGIs are run via suexec, and suexec doesn't use PAM (unlike cron); it just directly changes UID to the target user. As a result, all suexec CGI processes are found in apache2.service under the system slice, and so will never be part of per-user fair share scheduling.
(Even if you could make suexec use PAM and so set up systemd sessions for CGIs it runs if you wanted to, it's not clear that you'd want to be churning through that many session scopes and perhaps user slice creations and removals. I'm honestly not sure I'd trust systemd to be resilient in the face of creating huge numbers of very short-lived sessions, especially many at once if you get a load surge against some CGIs.)
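You can see this directly on a live system:

    $ systemd-cgls /system.slice/apache2.service

Everything listed there, suexec CGI processes included, is scheduled as part of apache2.service rather than as part of any user's slice.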
As far as I can see, there's no way to solve this within the current state of systemd, especially for the case of CGIs. Systemd would probably need a whole new raft of features (likely including having the user-${UID}.slice linger around even with no processes under it). Plus we'd need a new version of suexec that explicitly got systemd to put new processes in the right slices (or used PAM, so a PAM module could do this).
Sidebar: This is also a general limitation of Linux
Linux has chosen to implement per-user fair share scheduling through a general mechanism to do fair share scheduling of (c)groups. Doing it this way has always required that you somehow arranged for all user processes to wind up in a per-user cgroup (whether through PAM modules, hand manipulation when creating processes, or a daemon that watched for processes that were in the wrong spot and moved them). If and when processes fell through the cracks, they wouldn't be scheduled appropriately. If anything, systemd makes it easier to get close to full per-user fair share scheduling than previous tools did.
2017-09-02
Putting cron jobs into systemd user slices
In my last installment on fair share scheduling with systemd and Ubuntu 16.04, I succeeded in working out how to get ordinary user processes (ones spawned from people logging in or sshing in or the like) organized into the right cgroup hierarchy so they would be subjected to per-user fair share scheduling. However, I discovered and noted a limitation that is relevant for our environment, which is that on a standard Ubuntu 16.04 system, processes started by cron are not put into user slices; instead they all run under cron.service in the system slice. A commentator suggested that this could probably be fixed with the PAM systemd module, and I got sufficiently interested in this to work out how to do it.
The important bit of PAM magic is the pam_systemd PAM module. The manpage writeup implicitly focuses on actual login sessions of some form (including ssh command execution), but in fact it works for everything and does what you'd expect. If pam_systemd is one of the session modules, whatever 'session' is created through that PAM service will put processes into a session scope inside a user-${UID} slice that is itself under user.slice. If general per-user fair share scheduling is enabled, this will cause these processes to be part of the user's fair-share scheduling.
(As the pam_systemd manpage implies in passing, this may also have some side effects depending on logind.conf settings. This may constrain your ability to use this for, say, cron jobs in some environments.)
One of the things that happens in our environment is that we run a lot of root cron jobs for things that need to run frequently, like our password propagation system. Unfortunately pam_systemd seems to cause a small burst of logging every time it's used, at least on Ubuntu 16.04, so having root cron jobs spawn new session scopes every time they run may be a pain (and you may not want some of the side effects for root jobs, like having them be per-user fair-share scheduled). Helpfully, PAM provides us a way around this via the pam_succeed_if module. So we can put the following in /etc/pam.d/cron to only force use of systemd session scopes and user slices for cron jobs run by actual users:
session [default=1 success=ignore] pam_succeed_if.so quiet uid > 999
session optional pam_systemd.so
(The normal starting user UID on Ubuntu 16.04 is UID 1000. Your local first user UID may be different, and I confess that ours certainly is.)
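Once this is in place, it's easy to check where things landed after a reboot:

    $ systemd-cgls /user.slice

User @reboot cron jobs should now show up under their owner's user-${UID}.slice rather than under cron.service.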
A daring person could put this in /etc/pam.d/common-session-noninteractive instead, which on a standard Ubuntu 16.04 machine is included by the PAM files for atd, cron, samba, sudo, and systemd-user (which is used when you run 'systemd --user', not that you normally do). Having looked at this list, I think I would only put it in cron and atd.
(Yes, we have some users who (still) use at.)
All of this implicitly exposes a fundamental limitation of systemd per-user fair share scheduling, but that's going to have to be another entry.