Wandering Thoughts archives

2017-09-29

More on systemd on Ubuntu 16.04 failing to reliably reboot some of our servers

I wrote about how Ubuntu 16.04 can't reliably reboot some of our servers, then discovered that systemd can shut down the network with NFS mounts still present and speculated this was (and is) one of our problems. I've now been able to reliably produce such a reboot failure on a test VM and narrow down the specific component involved.

Systemd shuts down your system in two stages: the main stage, which stops systemd units, and the final stage, done by systemd-shutdown, which kills the remaining processes, fiddles around with the remaining mounts, and theoretically winds up rebooting the system. In the Ubuntu 16.04 version of systemd-shutdown, part of what it tries to do with NFS filesystems is to remount them read-only, and for us this sometimes hangs. With suitable logging enabled in systemd (so that systemd-shutdown inherits it), we see:

Sending SIGTERM to remaining processes...
Sending SIGKILL to remaining processes...
Sending SIGKILL to PID <nnn> (<command>)
Unmounting file systems.
Remounting '/var/mail' read-only with options '<many of them>'.

At this point things hang, although if you have a shutdown watchdog set up, it will force a reboot and recover the system. Based on comments on my second entry, systemd-shutdown doing this is (now) seen as a problem and it's been changed in the upstream version of systemd, although only very recently (eg this commit only landed at the end of August).
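For reference, systemd-shutdown inherits the main systemd's log settings, so the easiest way I know of to get this sort of logging out of it is to boot with systemd's debug logging pointed at the kernel log, by adding something like this to the kernel command line (eg through GRUB):

systemd.log_level=debug systemd.log_target=kmsg

There may well be other ways; this is just a sketch of the approach.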

Unfortunately this doesn't seem to be the sole cause of our shutdown hangs. We appear to have had at least one reboot hang while systemd was running swapoff on the server's swap space, before it entered late-stage reboot. This particular server has a lot of inactive user processes because it hosts our user-managed web servers, and (at the time) they weren't being killed early in system shutdown, so turning off swap presumably had to page a lot of things back into RAM. This may not have been an actual hang, but if so it was slow enough to be unacceptable, and we force-rebooted the server in question after a minute or two.

We're currently using multiple ways to hopefully reduce the chances of hangs at reboot time. We've put all user cron jobs into systemd user slices so that systemd will kill them early, although this doesn't always work and we may need some clever way of dealing with the remaining processes. We've enabled a shutdown watchdog timer with a relatively short timeout, although this only helps if the system makes it to the second stage of shutdown, where systemd-shutdown runs; a 'hang' before then in swapoff won't be interrupted.

In the future we may enable a relatively short JobTimeoutSec on reboot.target, in the hopes that this does some good. I've considered changing Ubuntu's cron.service to KillMode=control-group and then holding the package to prevent surprise carnage during package upgrades, but this seems to be a little bit too much hassle and danger for an infrequent thing that is generally merely irritating.
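For concreteness, the mechanics of the cron change would amount to something like this hypothetical drop-in override (untested by us):

# /etc/systemd/system/cron.service.d/killmode.conf (hypothetical)
[Service]
KillMode=control-group

The hassle and danger isn't in making the change; it's in what happens afterward, for example a package upgrade restarting cron and thereby killing every running cron job.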

As a practical matter, this entry is probably the end of the saga. This is not a particularly important thing for us and I've already discovered that there are no simple, straightforward, bug-free fixes (and as usual the odds are basically zero that Ubuntu will fix bugs here). If we're lucky, Ubuntu 18.04 will include a version of systemd with the systemd-shutdown NFS mount fixes in it and perhaps pam_systemd will be more reliable for @reboot cron jobs. If we're not lucky, well, we'll keep having to trek down to the machine room when we reboot servers. Fortunately it's not something we do very often.

SystemdUbuntuRebootFailureII written at 00:35:45

2017-09-27

Putting cron jobs into systemd user slices doesn't always work (on Ubuntu 16.04)

As part of dealing with our Ubuntu 16.04 shutdown problem, we now have our systems set up to put all user cron jobs into systemd user slices so that systemd will terminate them before it starts unmounting NFS filesystems. Since we made this change, we've rebooted all of our systems and thus had an opportunity to see how it works in practice in our environment.

Unfortunately, what we've discovered is that pam_systemd apparently doesn't always work right. Specifically, we've seen some user cron @reboot entries create processes that wound up still under cron.service, although other @reboot entries for the same user on the same machine wound up with their processes in user slices. When things fail, pam_systemd doesn't log any sort of errors that I can see in the systemd journal.

(Since no failures are logged, this doesn't seem related to the famous systemd issue where pam_systemd can't talk to systemd, eg systemd issue 2863 or this Ubuntu issue.)

The pam_systemd source code isn't very long and doesn't do very much itself. The most important function here appears to be pam_sm_open_session, and reading the code I can't spot a failure path that doesn't cause pam_systemd to log an error. The good news is that turning on debugging for pam_systemd doesn't appear to result in an overwhelming volume of extra messages, so we can probably do this on the machines where we've seen the problem in the hopes that something useful shows up.
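Turning that debugging on is just a matter of adding the debug option to pam_systemd's session line in the relevant PAM file (for us, /etc/pam.d/cron):

session optional pam_systemd.so debug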

(It will probably take a while, since we don't reboot these machines very often. I have not seen or reproduced this on test machines, at least so far.)

Looking through what 'systemctl list-dependencies' with various options says for cron.service, it's possible that we need an explicit dependency on systemd-logind.service (although systemd-analyze on one system says that systemd-logind started well before crond). In theory it looks like pam_systemd should be reporting errors if systemd-logind hasn't started, but in practice, who knows. We might as well adopt a cargo cult 'better safe than sorry' approach to unit dependencies, even if it feels like a very long shot.
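If we do add such a dependency, the low-effort way is a drop-in rather than editing Ubuntu's unit file; a hypothetical sketch (the drop-in's file name is arbitrary):

# /etc/systemd/system/cron.service.d/logind-dep.conf (hypothetical)
[Unit]
Wants=systemd-logind.service
After=systemd-logind.service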

(Life would be simpler if systemd had a simple way of discovering the relationship, if any, between two units.)

SystemdCronUserSlicesII written at 23:58:12

2017-09-22

Using a watchdog timer in system shutdown with systemd (on Ubuntu 16.04)

In Systemd, NFS mounts, and shutting down your system, I covered how Mike Kazantsev pointed me at the ShutdownWatchdogSec setting in system.conf as a way of dealing with our reboot hang issues. I also alluded to some issues with it. We've now tested and deployed a setup using this, so I want to walk through how it works and what its limitations are. As part of that I need to talk about how systemd actually shuts down your system.

Under systemd, system shutdown happens in two stages. The first stage is systemd stopping all of the system units that it can, in whatever way or ways they're configured to stop. Some units may fail to stop here and some processes may not be killed by their unit's 'stop' action(s), for example processes run by cron. This stage is the visible part of system shutdown, the bit that causes systemd to print out all of its console messages. It ends when systemd reaches shutdown.target, which is when you get console messages like:

[...]
[ OK ] Stopped Remount Root and Kernel File Systems.
[ OK ] Stopped Create Static Device Nodes in /dev.
[ OK ] Reached target Shutdown.

(There are apparently a few more magic systemd targets and services that get invoked here without producing any console messages.)

The second stage starts when systemd transfers control (and its role as PID 1) to the special systemd-shutdown program in order to do the final cleanup and shutdown of the system (the manual page describes why it exists and you can read the actual core code here). Simplified, systemd-shutdown SIGTERMs and then SIGKILLs all remaining processes and then enters a loop where it attempts to unmount any remaining filesystems, deactivate any remaining swap devices, and shut down remaining loop and DM devices. If all such things are gone, or if systemd-shutdown is making no progress at all, it goes on to do the actual reboot. Unless you turn on systemd debugging (and direct it to the console), systemd-shutdown is completely silent about all of this; it prints nothing when it starts and nothing as it runs. Normally this doesn't matter because it finishes immediately and without problems.

Based on the manpage, you might think that ShutdownWatchdogSec limits the total amount of time a shutdown can take and thus covers both of these stages. This is not the case; the only thing that ShutdownWatchdogSec does is put a watchdog timer on systemd-shutdown's end-of-things work in the second stage. Well, sort of. If you read the manpage, you'd probably think that the time you configure here is the time limit on the second stage as a whole, but actually it's only the time limit on each pass through that 'try to clean up remaining things' loop. systemd-shutdown resets the watchdog every time it starts a trip through the loop, so as long as it thinks it's making some progress, your shutdown can take much longer than you expect in sufficiently perverse situations. Or rather I should say your reboot. As the manual page specifically notes, the watchdog shutdown timer only applies to reboots, not to powering the system off.

(One consequence of what ShutdownWatchdogSec does and doesn't cover is that for most systems it's safe to set it to a very low timeout. If you get to the systemd-shutdown stage with any processes left, so many things have already been shut down that those processes are probably not going to manage an orderly shutdown in any case. We currently use 30 seconds and that's probably far too generous.)
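For reference, this is set in the [Manager] section of /etc/systemd/system.conf (or a file under system.conf.d), so our current 30 seconds amounts to:

[Manager]
ShutdownWatchdogSec=30s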

To use ShutdownWatchdogSec, you need a kernel watchdog timer; you can tell if you have one by looking for /dev/watchdog and /dev/watchdogN devices. Kernel watchdog timers are created by a variety of modules supporting different hardware watchdogs, such as iTCO_wdt for the Intel TCO WatchDog that you probably have on your Intel-based server hardware. For our purposes here, the simplest and easiest kernel watchdog module to use is softdog, a software watchdog implemented at the kernel level. Softdog has the limitation that it doesn't help if the kernel itself hangs, which we don't really care about, but the advantage that it works everywhere (including in VMs) and seems to be quite reliable and predictable.

Some Linux distributions (such as Fedora) automatically load an appropriate kernel watchdog module depending on what hardware is available. Ubuntu 16.04 goes to the other extreme; it extensively blacklists all kernel watchdog modules, softdog included, so you can't even stick something in /etc/modules-load.d. To elide a long discussion, our solution to this was a new cslab-softdog.service systemd service that explicitly loaded the module using the following:

[Unit]
Description=Load the softdog kernel watchdog module

[Service]
Type=oneshot
RemainAfterExit=True
ExecStart=/sbin/modprobe softdog

[Install]
WantedBy=multi-user.target

With softdog loaded and ShutdownWatchdogSec configured, systemd appears to reliably reboot my test VMs and test hardware in situations where systemd-shutdown previously hung. It takes somewhat longer than my configured ShutdownWatchdogSec, apparently because softdog gives you an extra margin of time just in case, probably 60 seconds based on what modinfo says.
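If you're curious, softdog's module parameters (including its margin) are visible the usual way:

modinfo -p softdog

which is where my 60 second figure comes from.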

Sidebar: Limiting total shutdown time (perhaps)

As noted in comments on my first entry on our reboot problems, reboot.target and poweroff.target both normally have a JobTimeoutSec of 30 minutes. If my understanding of systemd is correct, setting any JobTimeoutSec here is supposed to force a reboot or poweroff if the first stage of shutdown takes that long (because rebooting is done by attempting to activate reboot.target, which is a systemd 'job', which causes the job timeout to matter).

Although I haven't tested it yet, this suggests that combining a suitably short JobTimeoutSec on reboot.target with ShutdownWatchdogSec would limit the total time your system can ever spend rebooting. Picking a good JobTimeoutSec value is not obvious; you want it long enough that daemons have time to shut down in an orderly way, but not so long that you wind up heading to the machine room anyway. 30 minutes is clearly too long for us, but 30 seconds would probably be too short for most servers.
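If we try this, it might look like the following hypothetical drop-in; JobTimeoutAction=reboot-force is what makes the timeout force a reboot instead of merely cancelling the job:

# /etc/systemd/system/reboot.target.d/timeout.conf (untested sketch)
[Unit]
JobTimeoutSec=10min
JobTimeoutAction=reboot-force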

SystemdShutdownWatchdog written at 02:28:17

2017-09-14

Sorting out systemd's system.conf, user.conf, and logind.conf

Here's a mistake that I've made more than once and that I'm going to try to get rid of by writing it down.

Systemd organizes running processes into a tree of, well, let's call them units for now (mechanically they're control groups), which partly manifests in the form of slice units. One of the big divisions in this hierarchy is between processes involved in services, which are put under system.slice, and user session processes, which are under user.slice. There are many situations where you would like to apply different settings to user processes than to system ones, partly because these processes are fundamentally different in several respects.

(For example, all services should normally have some way to explicitly stop them and this will normally do some sort of orderly shutdown of the service involved. User slices, sessions, and scopes have no such thing and thus no real concept of an 'orderly shutdown'; all you can do is hit them with various Unix signals until they go away. For user stuff, the orderly shutdown was generally supposed to happen when the user logged off.)

Systemd has two configuration files, system.conf and user.conf. One of the things system.conf can do is set global defaults for all units and all processes, both system processes (things under system.slice) and user processes (things under user.slice); examples include DefaultTimeoutStopSec and DefaultCPUAccounting. As mentioned, there are plenty of times when you'd like to set or change these things only for user processes. You would think that systemd would provide a way to do this, and further, if you're irritated with systemd and not paying close attention, you might think that user.conf is how you set these things just for user processes. After all, surely systemd provides a way to do this obvious thing, and 'user' is right there in the file's name. This is wrong.
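To make this concrete, these global defaults live in the [Manager] section of /etc/systemd/system.conf and apply across the board; for example:

[Manager]
DefaultTimeoutStopSec=90s
DefaultCPUAccounting=yes

There is no version of these settings here that applies only to things under user.slice.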

What user.conf is for is covered in the manpage for both files; it sets these values for systemd user instances, which are per-user systemd instances that the user can control and do things with. Systemd user instances can be used for interesting things (see the Arch wiki on them), but I don't currently deal with any systems that use them actively so they're not on my mind much.

(Both Ubuntu 16.04 and Fedora 26 do start systemd user instances for people, but I don't think anyone on our systems uses them for anything; right now, they're just there.)

If systemd ever allows you to set things like DefaultCPUAccounting only for user processes, instead of globally, the place it might wind up is logind.conf, which configures systemd-logind, which is the systemd bit that actually sets up user slices, sessions, scopes, and so on (often in part through pam_systemd). This seems a logical location to me because systemd-logind is where user stuff is controlled in general and logind.conf already has the UserTasksMax setting. I don't know if anything like this is being contemplated by the systemd people, though, and there are alternate approaches such as allowing user-${UID}.slice slices to be templated (although in the current setup, this would require renaming them to have an @ in their name, eg user@${UID}.slice).

(I'm sure this seems like a silly mistake to make, and it certainly sounds like it when I've written it out like this. All I can say is that I've already made this mistake at least twice that I can remember; the most recent time made it into an irritated tweet that exhibited my misunderstanding.)

SystemdUserAndSystemConf written at 00:34:52

2017-09-07

Systemd, NFS mounts, and shutting down your system

After writing about our systemd reboot problem, I decided that I was irritated enough to spend part of today trying to dig into the situation (partly because of all of the comments and reactions to my entry, since they raised good questions and suggestions). I don't have any definite answers, partly because it's quite hard to see the state of the system when this is happening, but I do have some observations and notes (and some potentially useful contributions from other people on Twitter).

Here is what I've seen during shutdowns:

Under some circumstances, systemd will fail to unmount a NFS filesystem because processes are holding it busy but will go on to take down networking.

This is a dangerous situation to wind up in. With networking down, any attempt by a process to do anything on the filesystem will almost certainly give you an unrecoverably hung process; it's waiting for NFS IO to complete, but NFS IO requires a network to talk to the server, and the network isn't there and isn't coming back. Unfortunately it's disturbingly easy to wind up in this situation, thanks to our friend cron and other similar things.

Systemd appears to terminate processes in user slices reasonably early in the shutdown process, definitely before it starts trying to unmount NFS filesystems. However, as we've seen, not all 'user' processes are under user slices; some of them are hanging out in places like cron.service and apache2.service. Now, you might think that cron jobs and CGI processes and so on should be killed when systemd shuts down cron and Apache (which it normally does before unmounting NFS filesystems), but unfortunately both cron and Apache are set to KillMode=process, where systemd only terminates the main process when it stops the service. So all of your cron jobs, CGI processes, and so on linger on until systemd gets around to killing them much later (I believe as part of running systemd-shutdown but I'm not sure).

(You can have this issue with any systemd service or daemon that starts multiple processes but uses KillMode=process. I believe that all System V init scripts handled through systemd's backwards compatibility implicitly run in this mode; certainly Ubuntu 16.04's /etc/init.d/apache2 does.)
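You can see this for yourself by asking systemctl:

systemctl show -p KillMode cron.service
systemctl show -p KillMode apache2.service

On our Ubuntu 16.04 machines, both report 'KillMode=process'.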

As it happens, our user-managed web servers mostly get run from cron with @reboot entries (as the simple way to start on system boot). I suspect that it's not a coincidence that our web server almost always experiences a hang during reboots. We have another server that often experiences reboot hangs, and there people use at and atd.

(The mere presence of lingering processes doesn't doom you, because they might not try to do any (NFS) IO when systemd sends them a SIGTERM. However, any number of things may react to SIGTERM by trying to do cleanups, for example by writing out a database or a log record, and if they are running from a NFS filesystem that is now cut off from the network, this is a problem.)

All of this description sounds very neat and pat, but it's clearly not the full story, because I can't consistently reproduce a shutdown hang even though I can consistently create cut-off NFS mounts with not-yet-killed processes holding them busy (although I've got some more ideas to try). This gets me around to the things that don't work and one thing that might.

In comments, Alan noted that the stock systemd poweroff.target and reboot.target both have 30 minute timeouts, after which they force a poweroff or a reboot. Unfortunately these timeouts don't seem to be triggering in my tests, for whatever reason; I left a hung VM sitting there for well over half an hour at one point with its reboot.target timeout clearly not triggering.

On Twitter, Mike Kazantsev mentioned that system.conf has a ShutdownWatchdogSec option to use a hardware watchdog to force a reboot if the system becomes sufficiently unresponsive. Unfortunately this watchdog doesn't limit the total reboot time, because systemd-shutdown pings it every time it loops around trying to unmount filesystems, turn off swap space, and so on. As long as systemd-shutdown thinks it's making some progress, the watchdog won't fire. Setting the watchdog low will protect you against systemd-shutdown hanging, though, and that may be worthwhile.

(Also, as I found out last year and then forgot until I painfully rediscovered it today, you can't easily reduce the timeout on user slices so that lingering processes in user slices are terminated faster on shutdown. This means that on many machines, you can be stuck with a more than 90 second shutdown time in general.)

Sidebar: The obvious brute force solution for us

As far as we know, our problems come from processes run by actual real people, not from system daemons that are lingering around. These users exist in a defined UID range, so it wouldn't be particularly difficult to write a program that scanned /proc for not-yet-killed user processes and tried to terminate them all. We could try to be creative about the ordering of this program during shutdown (so it ran after systemd had already shut down as many user scopes and slices as possible), or just run it based on convenient dependencies and accept that it would kill processes that systemd would clean up on its own.
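A first version of such a program could be no more than a shell script. Here's an untested sketch; it assumes a minimum UID of 1000 for real users, which is a simplification (our actual UID range is different):

#!/bin/sh
# Sketch: terminate lingering processes owned by real users at shutdown.
MINUID=1000    # hypothetical cutoff; adjust for the local UID range
for p in /proc/[0-9]*; do
    uid=$(awk '/^Uid:/ {print $2}' "$p/status" 2>/dev/null)
    [ -n "$uid" ] && [ "$uid" -ge "$MINUID" ] && kill -TERM "${p#/proc/}"
done
sleep 5
# Anything that ignored the SIGTERM gets SIGKILL.
for p in /proc/[0-9]*; do
    uid=$(awk '/^Uid:/ {print $2}' "$p/status" 2>/dev/null)
    [ -n "$uid" ] && [ "$uid" -ge "$MINUID" ] && kill -KILL "${p#/proc/}"
done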

SystemdNFSMountShutdown written at 01:28:14

2017-09-06

Systemd on Ubuntu 16.04 can't (or won't) reliably reboot your server

We just went through a periodic exercise of rebooting all of our Ubuntu servers in order to get up to date on kernels and so on. By now almost all of our servers are running Ubuntu 16.04, which means that they're using systemd. Unfortunately this gives us a real problem, because on Ubuntu 16.04, systemd won't reliably reboot your system. On some servers, usually the busiest and most important ones, the system will just stop during the shutdown process and sit there. And sit there. And sit there. Perhaps it would eventually recover after tens of minutes, but as mentioned these are generally our busiest and most important servers, so we're not exactly going to leave them sitting there to find out.

(There also probably isn't much point to finding out. It's unlikely that there's some miracle cure we can do ourselves, and making a bug report to Ubuntu is almost completely pointless since Ubuntu only fixes security issues and things that are actively on fire. My previous experience wasn't productive and produced no solutions from anyone.)

This goes well beyond my previous systemd reboot irritation. Reliably rebooting servers despite what users are doing to them is a fairly foundational thing, yet Ubuntu's systemd not only can't get this right but doesn't even tell us what's wrong (in the sense of 'what is keeping me from rebooting'). The net effect is to turn rebooting many of our servers into a minefield. Not only may a reboot require in-person intervention in our machine room, but because we can't count on a reboot just working, we actively have to pay attention to the state of every machine as we reboot it; we can't just assume that machines will come back up on their own unless something is fairly wrong. The whole experience angers me every time I have to go through it.

By now we've enabled persistent systemd journals on most everything in the hopes of capturing useful information so we can perhaps guess why this is happening. Unfortunately so far we've gotten nothing useful; systemd has yet to log or display on the screen, say, 'still waiting N seconds for job X'. I'm not even convinced that the systemd journal has captured all of the log messages that it should from an unsuccessful shutdown, as what 'journalctl -b-1' shows is much less than I'd expect and just stops abruptly.

(Without an idea of how and why systemd is screwing up, I'm reluctant to change DefaultTimeoutStopSec from its Ubuntu default, as I once discussed here, or make other changes like forcing all user cron jobs to run under user slices.)

(This Ubuntu bug matches one set of symptoms we see, but not all of them. Note that our problem is definitely not the Linux kernel having problems rebooting the hardware; the same Dell servers were previously running Ubuntu 14.04 and rebooting fine, and Magic SysRQ will force reboots without problems. There's also this Ubuntu bug and this report of problems with shutting down when you have NFS mounts, which certainly could be part of our problems.)

SystemdUbuntuRebootFailure written at 02:27:09

2017-09-03

A fundamental limitation of systemd's per-user fair share scheduling

Up until now, I've been casually talking about systemd supporting per-user fair share scheduling, when writing about the basic mechanics and in things like getting cron jobs to cooperate. But really both of these point out a fundamental limitation, which is that systemd doesn't have per-user fair share scheduling; what it really has is per-slice fair share scheduling. You can create per-user fair share scheduling from this only to the extent that you can arrange for a given user's processes to all wind up somewhere under their user-${UID}.slice. If you can't arrange for all of the significant processes to get put under user-${UID}.slice, you don't get complete per-user fair share scheduling; some processes will escape to be scheduled separately and possibly (very) unfairly.

This may sound like an abstract limitation, so let me give you a concrete case where it matters. We run a departmental web server, where users can run processes to handle web requests in various ways, both via CGIs and via user-managed web servers. Both of these can experience load surges of various sorts and sometimes this can result in them eating a bunch of CPU. It would be nice if user processes could have their CPU usage shared fairly among everyone, so that one user with a bunch of CPU-heavy requests wouldn't starve everyone else out of the CPU.

User-managed web servers run either from cron with @reboot entries or manually by the user logging in and (re)starting them; in both cases we can arrange for the processes to be under user-${UID}.slice and so be subject to per-user fair share scheduling. However, user CGIs are run via suexec and suexec doesn't use PAM (unlike cron); it just directly changes UID to the target user. As a result, all suexec CGI processes are found in apache2.service under the system slice, and so will never be part of per-user fair share scheduling.
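Verifying where a process has wound up is straightforward if you have its PID; either of these will do:

systemctl status <PID>
cat /proc/<PID>/cgroup

The first reports the owning unit (here, apache2.service), and the second shows the raw cgroup memberships.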

(Even if you could make suexec use PAM and so set up systemd sessions for CGIs it runs if you wanted to, it's not clear that you'd want to be churning through that many session scopes and perhaps user slice creations and removals. I'm honestly not sure I'd trust systemd to be resilient in the face of creating huge numbers of very short-lived sessions, especially many at once if you get a load surge against some CGIs.)

As far as I can see, there's no way to solve this within the current state of systemd, especially for the case of CGIs. Systemd would probably need a whole new raft of features (likely including having the user-${UID}.slice linger around even with no processes under it). Plus we'd need a new version of suexec that explicitly got systemd to put new processes in the right slices (or used PAM so a PAM module could do this).

Sidebar: This is also a general limitation of Linux

Linux has chosen to implement per-user fair share scheduling through a general mechanism to do fair share scheduling of (c)groups. Doing it this way has always required that you somehow arranged for all user processes to wind up in a per-user cgroup (whether through PAM modules, hand manipulation when creating processes, or a daemon that watched for processes that were in the wrong spot and moved them). If and when processes fell through the cracks, they wouldn't be scheduled appropriately. If anything, systemd makes it easier to get close to full per-user fair share scheduling than previous tools did.

SystemdFairShareLimitation written at 01:40:44

2017-09-02

Putting cron jobs into systemd user slices

In my last installment on fair share scheduling with systemd and Ubuntu 16.04, I succeeded in working out how to get ordinary user processes (ones spawned from people logging in or sshing in or the like) organized into the right cgroup hierarchy so they would be subjected to per-user fair share scheduling. However, I discovered and noted a limitation that is relevant for our environment, which is that in a standard Ubuntu 16.04 system, processes started by cron are not put into user slices; instead they all run under the cron.service system slice. A commentator suggested that this could probably be fixed with the PAM systemd module, and I got sufficiently interested in this to work out how to do it.

The important bit of PAM magic is the pam_systemd PAM module. The manpage writeup implicitly focuses on actual login sessions of some form (including ssh command execution), but in fact it works for everything and does what you'd expect. If pam_systemd is one of the session modules, whatever 'session' is created through that PAM service will put processes into a session scope inside a user-${UID} slice that is itself under user.slice. If general per-user fair share scheduling is enabled, this will cause these processes to be part of the user's fair-share scheduling.

(As the pam_systemd manpage implies in passing, this may also have some side effects depending on logind.conf settings. This may constrain your ability to use this for, say, cron jobs in some environments.)

One of the things that happens in our environment is that we run a lot of root cron jobs for things that need to run frequently, like our password propagation system. Unfortunately pam_systemd seems to cause a small burst of logging every time it's used, at least on Ubuntu 16.04, so having root cron jobs spawn new session scopes every time they run may be a pain (and you may not want some of the side effects for root jobs, like having them be per-user fair-share scheduled). Helpfully, PAM provides us with a way around this via the pam_succeed_if module. So we can put the following in /etc/pam.d/cron to only force use of systemd session scopes and user slices for cron jobs run by actual users:

session [default=1 success=ignore] pam_succeed_if.so quiet uid > 999
session optional     pam_systemd.so

(The normal starting user UID on Ubuntu 16.04 is UID 1000. Your local first user UID may be different, and I confess that ours certainly is.)
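Checking that this works after a reboot is simple enough; something like:

systemd-cgls /user.slice

should show the @reboot processes under the appropriate user-${UID}.slice instead of leaving them under cron.service.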

A daring person could put this in /etc/pam.d/common-session-noninteractive instead, which on a standard Ubuntu 16.04 machine is included by the PAM files atd, cron, samba, sudo, and systemd-user (which is used when you run 'systemd --user', not that you normally do). Having looked at this list, I think I would only put it in cron and atd.

(Yes, we have some users who (still) use at.)

All of this implicitly exposes a fundamental limitation of systemd per-user fair share scheduling, but that's going to have to be another entry.

SystemdCronUserSlices written at 00:42:08

