Wandering Thoughts

2017-10-16

Getting ssh-agent working in Fedora 26's Cinnamon desktop environment

I tweeted:

I have just been through an extensive yak-shaving exercise to use ssh-agent with Cinnamon and have it actually work reliably on Fedora 26.

The first question you might ask is why even use ssh-agent instead of the default of gnome-keyring-daemon. That's straightforward; gnome-keyring-daemon still doesn't support ed25519 keys, despite a very long-standing open bug about it (and another bug for ECDSA keys). I'm also not sure if current versions support Yubikey-based SSH keys, which I care about as well, and apparently there are other issues with it.

(One charming detail from the GNOME ed25519 bug is that apparently there is no maintainer for either gnome-keyring-daemon as a whole or perhaps just the SSH keys portions of it. This situation doesn't inspire any great fondness in me for gnome-keyring-daemon, to put it one way.)

In Fedora 26, I ran into two problems with my previously-working ssh-agent environment. The first problem is that gnome-terminal doesn't inherit the correct $SSH_AUTH_SOCK setting, even if it's set in the general environment and is seen by other programs in my Cinnamon environment. The core problem seems to be that these days, all your gnome-terminal windows are actually created by a single master process, and in Fedora 26 this is started through a separate systemd user .service. I don't know how that service is supposed to inherit environment variables, but it doesn't get the correct $SSH_AUTH_SOCK; instead it always winds up with /run/user/${UID}/keyring/ssh, which is the gnome-keyring-daemon setting. My solution to this is pretty brute force; I added a little stanza to my session setup script that symlinked this path to the real $SSH_AUTH_SOCK.

(This implies that other systemd user .service units also probably have the wrong $SSH_AUTH_SOCK value, but they're all 'fixed' by my hack.)
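
(For concreteness, the stanza is a small bit of shell in my session setup script, roughly like the following; this is a sketch rather than the exact code, but the path juggling is the important part:

# Make the gnome-keyring socket path point at the real agent socket.
KR="/run/user/$(id -u)/keyring"
if [ -n "$SSH_AUTH_SOCK" ] && [ "$SSH_AUTH_SOCK" != "$KR/ssh" ]; then
    mkdir -p "$KR"
    ln -sf "$SSH_AUTH_SOCK" "$KR/ssh"
fi

Anything that winds up with the gnome-keyring value of $SSH_AUTH_SOCK then talks to the real ssh-agent through the symlink.)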

The larger issue is that an ssh-agent process was only started the first time I logged in after a system reboot. If I logged out and then logged back in again, my session had a $SSH_AUTH_SOCK value set but no ssh-agent process. In fact, it had the first session's $SSH_AUTH_SOCK value, which pointed to a socket that no longer existed because it had been cleaned up on session exit. I'm not sure what causes this, but I have noticed that there is a whole collection of systemd user .service units under user@${UID}.service that linger around even after I've logged out of my session. It certainly appears that while these exist, new Cinnamon sessions inherit the old session's $SSH_AUTH_SOCK value. This inheritance is a problem because of a snippet in /etc/X11/xinit/xinitrc-common:

if [ -z "$SSH_AGENT" ] && [ -z "$SSH_AUTH_SOCK" ] && [ -z "$SSH_AGENT_PID" ] && [ -x /usr/bin/ssh-agent ]; then
    if [ "x$TMPDIR" != "x" ]; then
        SSH_AGENT="/usr/bin/ssh-agent /bin/env TMPDIR=$TMPDIR"
    else
        SSH_AGENT="/usr/bin/ssh-agent"
fi

This starts ssh-agent only if $SSH_AUTH_SOCK is unset. If it's set to a bad value, no new ssh-agent is started and your entire session inherits the bad value and nothing works. My workaround was to change xinitrc-common to clear $SSH_AUTH_SOCK and all associated environment variables if it was set but pointed to something that didn't exist:

if [ -n "$SSH_AUTH_SOCK" ] && [ ! -S "$SSH_AUTH_SOCK" ]; then
   unset SSH_AGENT
   unset SSH_AUTH_SOCK
   unset SSH_AGENT_PID
fi

This appears to make everything work.

After I had worked all of this out and set it up, Jordan Sissel shared a much simpler workaround:

I used a oneliner that would kill gnome-keyring and replace it with ssh-agent on the same $SSH_AUTH_SOCK :\ Super annoying, though.

If I were doing this I wouldn't kill gnome-keyring-daemon entirely; I would just make my session startup script run an ssh-agent on /run/user/${UID}/keyring/ssh (using ssh-agent's -a command line argument).

(It's likely that gnome-keyring-daemon does other magic things that my Cinnamon session cares about. I'd rather not find out what other bits break if it's not running, or have it restart on me and perhaps take over the SSH agent socket again.)
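
(Roughly, I mean something like this, although I haven't actually run things this way:

kr="/run/user/$(id -u)/keyring"
mkdir -p "$kr" && rm -f "$kr/ssh"
eval $(/usr/bin/ssh-agent -a "$kr/ssh")

ssh-agent's -a argument makes it listen on the given socket path instead of inventing its own, so everything that inherited the gnome-keyring-daemon value of $SSH_AUTH_SOCK would wind up talking to the real agent.)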

PS: I'd file bug reports with Fedora except that I suspect they'd consider this an unsupported environment, and my track record with Fedora bug reports is not great in general. And filing bug reports with Fedora against gnome-keyring-daemon is pointless; if it's not getting action upstream, there's not much Fedora can do about it.

Fedora26CinnamonSSHAgent written at 00:04:20; Add Comment

2017-10-15

Unbalanced reads from SSDs in software RAID mirrors in Linux

When I was looking at the write volume figures for yesterday's entry, one additional thing that jumped out at me is that on our central mail server, reads were very unbalanced between its two system SSDs. This machine, as with many of our important servers, has a pair of SSDs set up as mirrors with Linux software RAID. In theory I'd expect reads to be about evenly distributed across each side of the mirror; in practice, well:

242 Total_LBAs_Read [...]  16838224623
242 Total_LBAs_Read [...]  1698394290

That's almost a factor of ten difference. Over 90% of the reads have gone to the first SSD, and it's not an anomaly or a one-time thing; I could watch live IO rates and see that much of the time only the first disk experienced any read traffic.
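
These numbers are SMART attribute 242, as reported by smartctl. A quick way to compare the two sides of a mirror is something like the following, assuming the mirror's components are sda and sdb:

for d in sda sdb; do smartctl -A /dev/$d | grep Total_LBAs_Read; done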

It turns out that this is more or less expected behavior in Linux software RAID, especially on SSDs, and has been for a while. It appears that the core change for this was made to the software RAID code in 2012, and then an important related change was made in late 2016 (and may not be in long-term distribution kernels). The current state of RAID1 read balancing is kind of complex, but the important thing here in all kernels since 2012 is that if you have SSDs and at least one disk is idle, the first idle disk will be chosen. In general the read balancing code will use the (first) disk with the least pending IO, so the case of idle disks is just the limit case.

(In kernels with the late 2016 change, this widens so that if at least one disk is idle, the first idle disk will be chosen even when all of the mirror's devices are HDs.)

SSDs are very fast in general and they have no seek delays for non-sequential IO. The result is that under casual read loads, most of the time both SSDs in a mirror are idle and so the RAID1 read balancing code will always choose to read from the first SSD. Reads spill over to the second SSD only if the first SSD is already handling a read at the time that an unrelated second read comes in. As we can see here, that doesn't happen all that frequently.

(Although our central mail server is an outlier as far as how unbalanced it is, other servers with mirrored SSDs also have unbalanced reads with the first disk in the mirror seeing far more than the second disk.)

UnbalancedSSDMirrorReads written at 02:39:17; Add Comment

2017-10-12

I'm looking forward to using systemd's new IP access control features

These days, my reaction to hearing about new systemd features is usually somewhere between indifference and irritation (I'm going to avoid giving examples, for various reasons). The new IP access lists feature is a rare exception; as a sysadmin, I'm actually reasonably enthused about it. What makes systemd's version of IP access restrictions special and interesting is that they can be imposed per service, not just globally (and the fact that socket units can have different IP access restrictions than the services behind them adds extra possibilities).

As a sysadmin, I not infrequently deal with services that either use random ports by default (such as many NFS related programs) or have an irritating habit of opening up 'control' ports that provide extra access to themselves (looking at what processes are listening on what ports on a typical modern machine can be eye-opening and alarming, especially since many programs don't document their port usage). Dealing with this through general iptables rules is usually too much work to be worth it, even when things don't go wrong; you have to chase down programs, try to configure some of them to use specific ports, hope that the other ports you're blocking are fixed and aren't going to change, and so on.

Because systemd can do these IP access controls on a per service basis, it promises a way out from all of this hassle. With per-service IP access controls, I can easily configure my NFS services so that regardless of what ports they decide to wander off and use, they're only going to be accessible to our NFS clients (or servers, for client machines). Other services can be locked down so that even if they go wild and decide to open up random control ports, nothing is going to happen because no one can talk to them. And the ability to set separate IP access controls on .socket units and .service units opens up the possibility of doing something close to per-port access control for specific services. CUPS already uses socket activation on our Ubuntu 16.04 machines, so we could configure the IPP port to be generally accessible but then lock down the CUPS .service and daemon so we don't have to worry that someday it will sprout an accessible control port somewhere.

(There are also uses for denying outbound traffic to some or many destinations but only for some services. This is much harder to do with iptables, and sometimes not possible at all.)
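
For illustration, here is a sketch of the kind of drop-in I have in mind for an NFS-related service; I haven't deployed this yet, the service and network here are just examples, and it needs a recent enough systemd. Something like /etc/systemd/system/rpc-statd.service.d/ipaccess.conf:

[Service]
# Drop all IP traffic by default, then allow localhost and our NFS clients.
IPAddressDeny=any
IPAddressAllow=127.0.0.0/8
IPAddressAllow=10.0.0.0/16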

SystemdComingIPAccessControl written at 01:15:16; Add Comment

2017-10-02

My experience with using Fedora 26's standard font rendering (and fonts)

A bit over a month ago I wrote about my font rendering dilemma in Fedora 26, where my fontconfig user tweaks basically stopped working and I considered switching to the standard FreeType rendering rather than try to fix them. Leah Neukirchen solved one side of the dilemma for me on the spot in the comments, by telling me how to force FreeType to revert to my Fedora 25 rendering, but in the end I decided to stay with the standard system rendering as an experiment. At this point I consider the results of the experiment to be in, and I think the standard system rendering is the better, more readable one.
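
(For the record, the way to force the old rendering is, as I understand it, FreeType's FREETYPE_PROPERTIES environment variable; setting it to something like the following in your session environment reverts to the old v35 TrueType interpreter and thus the Fedora 25 look:

export FREETYPE_PROPERTIES="truetype:interpreter-version=35"

I'm noting this as a sketch of the mechanism, not something I'm using.)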

I rapidly got used to the new look of my xterms and so on, as I expected that I would. Some of our older systems are still using older FreeType versions and on these, even the default font rendering comes out basically the same as my old Fedora 25 one. On the infrequent occasions that I use these systems, their xterms now both look odd to me and seem less readable than the regular xterms beside them, which have the darker, thicker font rendering of modern FreeType versions. This is only anecdotal, but looking at the old rendering periodically makes me happier to have switched to FreeType's modern rendering. I feel that I made the right choice.

The comments on my original article pointed me to this article on FreeType's new v40 interpreter; this interpreter change is the difference between Fedora 25's rendering and Fedora 26's. That article caused a cascade of yak shaving when I decided to switch to Fedora 26's standard rendering, because it got me to change my Firefox from using Georgia (at 16 points, I believe) to using the Fedora standard sans serif font at 15 points. This change in fonts and font sizes has wound up with me shuffling around the text zoom level on any number of sites, and not always in predictable directions. Some sites that I had increased the size on now don't need it any more; other sites now need it when they didn't need it before. The result is probably more readable, partly because I've been biased towards 'if in doubt, increase the text size'.

(A huge number of websites believe in tiny fonts for reasons that I don't understand. It's certainly not good typography, since the websites of typographers and many design people that I've seen tend to have fairly large type sizes, larger even than I'd pick.)

Although I haven't dug into it in depth, my impression is that this FreeType font rendering change has caused a number of other programs to change their text sizing and text rendering. I think Chrome now uses slightly different text sizes on web pages, for example; perhaps the FreeType v40 engine spaces things slightly differently. Or perhaps I'm just less willing to accept marginally small font sizes these days, so I'm being more picky.

(I may need to reset font preferences in other programs, such as Chrome, as I probably set any number of things to use Georgia a long time ago. For a while it was my default proportional spaced font, especially for web related things.)

Fedora26StandardFontRendering written at 01:22:30; Add Comment

2017-09-29

More on systemd on Ubuntu 16.04 failing to reliably reboot some of our servers

I wrote about how Ubuntu 16.04 can't reliably reboot some of our servers, then discovered that systemd can shut down the network with NFS mounts still present and speculated this was (and is) one of our problems. I've now been able to reliably produce such a reboot failure on a test VM and narrow down the specific component involved.

Systemd shuts down your system in two stages: the main stage, which stops systemd units, and the final stage, done with systemd-shutdown, which kills the remaining processes, fiddles around with the remaining mounts, and theoretically eventually reboots the system. In the Ubuntu 16.04 version of systemd-shutdown, part of what it tries to do with NFS filesystems is to remount them read-only, and for us this sometimes hangs. With suitable debug logging enabled in systemd (so that systemd-shutdown inherits it), we see:

Sending SIGTERM to remaining processes...
Sending SIGKILL to remaining processes...
Sending SIGKILL to PID <nnn> (<command>)
Unmounting file systems.
Remounting '/var/mail' read-only with options '<many of them>'.

At this point things hang, although if you have a shutdown watchdog set up, it will force a reboot and recover the system. Based on comments on my second entry, systemd-shutdown doing this is (now) seen as a problem and it's been changed in the upstream version of systemd, although only very recently (eg this commit only landed at the end of August).

Unfortunately this doesn't seem to be the sole cause of our shutdown hangs. We appear to have had at least one reboot hang while systemd was turning off the server's swap space with swapoff, before it entered late-stage reboot. This particular server has a lot of inactive user processes because it hosts our user-managed web servers, and (at the time) they weren't being killed early in system shutdown, so turning off swap space presumably had to page a lot of things back into RAM. This may not have actually hung as such, but if so it was sufficiently slow as to be unacceptable and we force-rebooted the server in question after a minute or two.

We're currently using multiple ways to hopefully reduce the chances of hangs at reboot times. We've put all user cron jobs into systemd user slices so that systemd will kill them early, although this doesn't always work and we may need some clever way of dealing with the remaining processes. We've enabled a shutdown watchdog timer with a relatively short timeout, although this only helps if the system makes it to the second stage when it runs systemd-shutdown; a 'hang' before then in swapoff won't be interrupted.

In the future we may enable a relatively short JobTimeoutSec on reboot.target, in the hopes that this does some good. I've considered changing Ubuntu's cron.service to KillMode=control-group and then holding the package to prevent surprise carnage during package upgrades, but this seems to be a little bit too much hassle and danger for an infrequent thing that is generally merely irritating.

As a practical matter, this entry is probably the end of the saga. This is not a particularly important thing for us and I've already discovered that there are no simple, straightforward, bug-free fixes (and as usual the odds are basically zero that Ubuntu will fix bugs here). If we're lucky, Ubuntu 18.04 will include a version of systemd with the systemd-shutdown NFS mount fixes in it and perhaps pam_systemd will be more reliable for @reboot cron jobs. If we're not lucky, well, we'll keep having to trek down to the machine room when we reboot servers. Fortunately it's not something we do very often.

SystemdUbuntuRebootFailureII written at 00:35:45; Add Comment

2017-09-27

Putting cron jobs into systemd user slices doesn't always work (on Ubuntu 16.04)

As part of dealing with our Ubuntu 16.04 shutdown problem, we now have our systems set up to put all user cron jobs into systemd user slices so that systemd will terminate them before it starts unmounting NFS filesystems. Since we made this change, we've rebooted all of our systems and thus had an opportunity to see how it works in practice in our environment.
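
The mechanics here are that cron runs jobs through PAM, so adding pam_systemd to cron's PAM session stack gets each job registered with systemd-logind and placed under the owner's user slice. On Ubuntu 16.04 this amounts to a line in /etc/pam.d/cron roughly like the following (the exact placement and options are a local choice):

session    optional    pam_systemd.so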

Unfortunately, what we've discovered is that pam_systemd apparently doesn't always work right. Specifically, we've seen some user cron @reboot entries create processes that wound up still under cron.service, although other @reboot entries for the same user on the same machine wound up with their processes in user slices. When things fail, pam_systemd doesn't log any sort of errors that I can see in the systemd journal.

(Since no failures are logged, this doesn't seem related to the famous systemd issue where pam_systemd can't talk to systemd, eg systemd issue 2863 or this Ubuntu issue.)

The pam_systemd source code isn't very long and doesn't do very much itself. The most important function here appears to be pam_sm_open_session, and reading the code I can't spot a failure path that doesn't cause pam_systemd to log an error. The good news is that turning on debugging for pam_systemd doesn't appear to result in an overwhelming volume of extra messages, so we can probably do this on the machines where we've seen the problem in the hopes that something useful shows up.

(It will probably take a while, since we don't reboot these machines very often. I have not seen or reproduced this on test machines, at least so far.)

Looking through what 'systemctl list-dependencies' with various options says for cron.service, it's possible that we need an explicit dependency on systemd-logind.service (although systemd-analyze on one system says that systemd-logind started well before crond). In theory it looks like pam_systemd should be reporting errors if systemd-logind hasn't started, but in practice, who knows. We might as well adopt a cargo cult 'better safe than sorry' approach to unit dependencies, even if it feels like a very long shot.

(Life would be simpler if systemd had a simple way of discovering the relationship, if any, between two units.)

SystemdCronUserSlicesII written at 23:58:12; Add Comment

2017-09-22

Using a watchdog timer in system shutdown with systemd (on Ubuntu 16.04)

In Systemd, NFS mounts, and shutting down your system, I covered how Mike Kazantsev pointed me at the ShutdownWatchdogSec setting in system.conf as a way of dealing with our reboot hang issues. I also alluded to some issues with it. We've now tested and deployed a setup using this, so I want to walk through how it works and what its limitations are. As part of that I need to talk about how systemd actually shuts down your system.

Under systemd, system shutdown happens in two stages. The first stage is systemd stopping all of the system units that it can, in whatever way or ways they're configured to stop. Some units may fail to stop here and some processes may not be killed by their unit's 'stop' action(s), for example processes run by cron. This stage is the visible part of system shutdown, the bit that causes systemd to print out all of its console messages. It ends when systemd reaches shutdown.target, which is when you get console messages like:

[...]
[ OK ] Stopped Remount Root and Kernel File Systems.
[ OK ] Stopped Create Static Device Nodes in /dev.
[ OK ] Reached target Shutdown.

(There are apparently a few more magic systemd targets and services that get invoked here without producing any console messages.)

The second stage starts when systemd transfers control (and the role of PID 1) to the special systemd-shutdown program in order to do the final cleanup and shutdown of the system (the manual page describes why it exists and you can read the actual core code here). Simplified, systemd-shutdown SIGTERMs and then SIGKILLs all remaining processes and then enters a loop where it attempts to unmount any remaining filesystems, deactivate any remaining swap devices, and shut down remaining loop and DM devices. If all such things are gone or systemd-shutdown makes no progress at all, it goes on to do the actual reboot. Unless you turn on systemd debugging (and direct it to the console), systemd-shutdown is completely silent about all of this; it prints nothing when it starts and nothing as it runs. Normally this doesn't matter because it finishes immediately and without problems.

Based on the manpage, you might think that ShutdownWatchdogSec limits the total amount of time a shutdown can take and thus covers both of these stages. This is not the case; the only thing that ShutdownWatchdogSec does is put a watchdog timer on systemd-shutdown's end-of-things work in the second stage. Well, sort of. If you read the manpage, you'd probably think that the time you configure here is the time limit on the second stage as a whole, but actually it's only the time limit on each of those 'try to clean up remaining things' loops. systemd-shutdown resets the watchdog every time it starts a trip through the loop, so as long as it thinks it's making some progress, your shutdown can take much longer than you expect in sufficiently perverse situations. Or rather I should say your reboot. As the manual page specifically notes, the watchdog shutdown timer only applies to reboots, not to powering the system off.

(One consequence of what ShutdownWatchdogSec does and doesn't cover is that for most systems it's safe to set it to a very low timeout. If you get to the systemd-shutdown stage with any processes left, so many things have already been shut down that those processes are probably not going to manage an orderly shutdown in any case. We currently use 30 seconds and that's probably far too generous.)
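
Concretely, ShutdownWatchdogSec is set in the [Manager] section of /etc/systemd/system.conf; our current setting is more or less:

[Manager]
ShutdownWatchdogSec=30s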

To use ShutdownWatchdogSec, you need a kernel watchdog timer; you can tell if you have one by looking for /dev/watchdog and /dev/watchdogN devices. Kernel watchdog timers are created by a variety of modules that support a variety of hardware watchdogs, such as iTCO_wdt for the Intel TCO WatchDog that you probably have on your Intel-based server hardware. For our purposes here, the simplest and easiest to use kernel watchdog module is softdog, a software watchdog implemented at the kernel level. Softdog has the limitation that it doesn't help if the kernel itself hangs, which we don't really care about, but the advantage that it works everywhere (including in VMs) and seems to be quite reliable and predictable.

Some Linux distributions (such as Fedora) automatically load an appropriate kernel watchdog module depending on what hardware is available. Ubuntu 16.04 goes to the other extreme; it extensively blacklists all kernel watchdog modules, softdog included, so you can't even stick something in /etc/modules-load.d. To elide a long discussion, our solution to this was a new cslab-softdog.service systemd service that explicitly loaded the module using the following:

[Service]
Type=oneshot
RemainAfterExit=True
ExecStart=/sbin/modprobe softdog

With softdog loaded and ShutdownWatchdogSec configured, systemd appears to reliably reboot my test VMs and test hardware in situations where systemd-shutdown previously hung. It takes somewhat longer than my configured ShutdownWatchdogSec, apparently because softdog gives you an extra margin of time just in case, probably 60 seconds based on what modinfo says.

Sidebar: Limiting total shutdown time (perhaps)

As noted in comments on my first entry on our reboot problems, reboot.target and poweroff.target both normally have a JobTimeoutSec of 30 minutes. If my understanding of systemd is correct, setting any JobTimeoutSec here is supposed to force a reboot or poweroff if the first stage of shutdown takes that long (because rebooting is done by attempting to activate reboot.target, which is a systemd 'job', which causes the job timeout to matter).

Although I haven't tested it yet, this suggests that combining a suitably short JobTimeoutSec on reboot.target with ShutdownWatchdogSec would limit the total time your system will ever spend rebooting. Picking a good JobTimeoutSec value is not obvious; you want it long enough that daemons have time to shut down in an orderly way, but not so long that you wind up going off to the machine room anyway. 30 minutes is clearly too long for us, but 30 seconds would probably be too short for most servers.
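
If we try this, the natural mechanism would be a drop-in for reboot.target, say /etc/systemd/system/reboot.target.d/timeout.conf, along these lines (the five minutes here is a placeholder, not a recommendation):

[Unit]
JobTimeoutSec=5min
JobTimeoutAction=reboot-force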

SystemdShutdownWatchdog written at 02:28:17; Add Comment

2017-09-14

Sorting out systemd's system.conf, user.conf, and logind.conf

Here's a mistake that I've made more than once and that I'm going to try to get rid of by writing it down.

Systemd organizes running processes into a tree of, well, let's call them units for now (mechanically they're control groups), which partly manifests in the form of slice units. One of the big divisions in this hierarchy is between processes involved in services, which are put under system.slice, and user session processes, which are under user.slice. There are many situations where you would like to apply different settings to user processes than to system ones, partly because these processes are fundamentally different in several respects.

(For example, all services should normally have some way to explicitly stop them and this will normally do some sort of orderly shutdown of the service involved. User slices, sessions, and scopes have no such thing and thus no real concept of an 'orderly shutdown'; all you can do is hit them with various Unix signals until they go away. For user stuff, the orderly shutdown was generally supposed to happen when the user logged off.)

Systemd has two configuration files, system.conf and user.conf. One of the things system.conf can do is set global defaults for all units and all processes, both system processes (things under system.slice) and user processes (things under user.slice), for example DefaultTimeoutStopSec and DefaultCPUAccounting. As mentioned, there are plenty of times when you'd like to set or change these things only for user processes. You would think that systemd would provide a way to do this, and further if you're irritated with systemd and not paying close attention, you might think that user.conf can be used to set these things just for user processes. After all, surely systemd provides a way to do this obvious thing and 'user' is right there in the file's name. This is wrong.
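
As a concrete example of the trap, settings like the following in system.conf's [Manager] section apply to everything, system and user processes alike; there is no user-only variant of them to be had (the particular values here are just examples):

[Manager]
DefaultTimeoutStopSec=30s
DefaultCPUAccounting=yes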

What user.conf is for is covered in the manpage for both files; it sets these values for systemd user instances, which are per-user systemd instances that the user can control and do things with. Systemd user instances can be used for interesting things (see the Arch wiki on them), but I don't currently deal with any systems that use them actively so they're not on my mind much.

(Both Ubuntu 16.04 and Fedora 26 do start systemd user instances for people, but I don't think anyone on our systems uses them for anything; right now, they're just there.)

If systemd ever allows you to set things like DefaultCPUAccounting only for user processes, instead of globally, the place it might wind up is logind.conf, which configures systemd-logind, which is the systemd bit that actually sets up user slices, sessions, scopes, and so on (often in part through pam_systemd). This seems a logical location to me because systemd-logind is where user stuff is controlled in general and logind.conf already has the UserTasksMax setting. I don't know if anything like this is being contemplated by the systemd people, though, and there are alternate approaches such as allowing user-${UID}.slice slices to be templated (although in the current setup, this would require renaming them to have an @ in their name, eg user@${UID}.slice).

(I'm sure this seems like a silly mistake to make, and it certainly sounds like it when I've written it out like this. All I can say is that I've already made this mistake at least twice that I can remember; the most recent time made it into an irritated tweet that exhibited my misunderstanding.)

SystemdUserAndSystemConf written at 00:34:52; Add Comment

2017-09-07

Systemd, NFS mounts, and shutting down your system

After writing about our systemd reboot problem, I decided that I was irritated enough to spend part of today trying to dig into the situation (partly because of all of the comments and reactions to my entry, since they raised good questions and suggestions). I don't have any definite answers, partly because it's quite hard to see the state of the system when this is happening, but I do have some observations and notes (and some potentially useful contributions from other people on Twitter).

Here is what I've seen during shutdowns:

Under some circumstances, systemd will fail to unmount an NFS filesystem because processes are holding it busy, but will go on to take down networking anyway.

This is a dangerous situation to wind up in. With networking down, any attempt by a process to do anything on the filesystem will almost certainly give you an unrecoverably hung process; it's waiting for NFS IO to complete, but NFS IO requires a network to talk to the server, and the network isn't there and isn't coming back. Unfortunately it's disturbingly easy to wind up in this situation, thanks to our friend cron and other similar things.

Systemd appears to terminate processes in user slices reasonably early in the shutdown process, definitely before it starts trying to unmount NFS filesystems. However, as we've seen, not all 'user' processes are under user slices; some of them are hanging out in places like cron.service and apache2.service. Now, you might think that cron jobs and CGI processes and so on should be killed when systemd shuts down cron and Apache (which it normally does before unmounting NFS filesystems), but unfortunately both cron and Apache are set to KillMode=process, where systemd only terminates the main process when it stops the service. So all of your cron jobs, CGI processes, and so on linger on until systemd gets around to killing them much later (I believe as part of running systemd-shutdown but I'm not sure).

(You can have this issue with any systemd service or daemon that starts multiple processes but uses KillMode=process. I believe that all System V init scripts handled through systemd's backwards compatibility implicitly run in this mode; certainly Ubuntu 16.04's /etc/init.d/apache2 does.)
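
For what it's worth, if you wanted cron's child processes to be cleaned up when cron.service is stopped, the usual mechanism would be an override drop-in, for example /etc/systemd/system/cron.service.d/killmode.conf with:

[Service]
KillMode=control-group

Whether this is wise is another question; KillMode=process was presumably chosen for a reason, and changing it means running cron jobs get killed whenever cron is restarted, not just at shutdown.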

As it happens, our user-managed web servers mostly get run from cron with @reboot entries (as the simple way to start on system boot). I suspect that it's not a coincidence that our web server almost always experiences a hang during reboots. We have another server that often experiences reboot hangs, and there people use at and atd.

(The mere presence of lingering processes doesn't doom you, because they might not try to do any (NFS) IO when systemd sends them a SIGTERM. However, any number of things may react to SIGTERM by trying to do cleanups, for example by writing out a database or a log record, and if they are running from a NFS filesystem that is now cut off from the network, this is a problem.)

All of this description sounds very neat and pat, but it's clearly not the full story, because I can't consistently reproduce a shutdown hang even though I can consistently create cut-off NFS mounts with not-yet-killed processes holding them busy (although I've got some more ideas to try). This gets me around to the things that don't work and one thing that might.

In comments, Alan noted that the stock systemd poweroff.target and reboot.target both have 30 minute timeouts, after which they force a poweroff or a reboot. Unfortunately these timeouts don't seem to be triggering in my tests, for whatever reason; I left a hung VM sitting there for well over half an hour at one point with its reboot.target timeout clearly not triggering.

On Twitter, Mike Kazantsev mentioned that system.conf has a ShutdownWatchdogSec option to use a hardware watchdog to force a reboot if the system becomes sufficiently unresponsive. Unfortunately this watchdog doesn't limit the total reboot time, because systemd-shutdown pings it every time it loops around trying to unmount filesystems and turn off swap space and so on. As long as systemd-shutdown thinks it's making some progress, the watchdog won't fire. Setting the watchdog low will protect you against systemd-shutdown hanging, though, and that may be worthwhile.

(Also, as I found out last year and then forgot until I painfully rediscovered it today, you can't easily reduce the timeout on user slices so that lingering processes in user slices are terminated faster on shutdown. This means that on many machines, you can be stuck with a more than 90 second shutdown time in general.)

Sidebar: The obvious brute force solution for us

As far as we know, our problems come from processes run by actual real people, not from system daemons that are lingering around. These users exist in a defined UID range, so it wouldn't be particularly difficult to write a program that scanned /proc for not-yet-killed user processes and tried to terminate them all. We could try to be creative about the ordering of this program during shutdown (so it ran after systemd had already shut down as many user scopes and slices as possible), or just run it based on convenient dependencies and accept that it would kill processes that systemd would clean up on its own.
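
A sketch of such a program, assuming our real users start at UID 1000 (the actual cutoff is a local detail), might look like:

#!/bin/sh
# Terminate leftover processes owned by real users (UID >= MINUID).
# A real version would probably follow up with SIGKILL after a pause.
MINUID=1000
for p in /proc/[0-9]*; do
    uid=$(awk '/^Uid:/ {print $2}' "$p/status" 2>/dev/null)
    [ -n "$uid" ] && [ "$uid" -ge "$MINUID" ] && kill -TERM "${p#/proc/}" 2>/dev/null
done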

SystemdNFSMountShutdown written at 01:28:14; Add Comment

2017-09-06

Systemd on Ubuntu 16.04 can't (or won't) reliably reboot your server

We just went through a periodic exercise of rebooting all of our Ubuntu servers in order to get up to date on kernels and so on. By now almost all of our servers are running Ubuntu 16.04, which means that they're using systemd. Unfortunately this gives us a real problem, because on Ubuntu 16.04, systemd won't reliably reboot your system. On some servers, usually the busiest and most important ones, the system will just stop during the shutdown process and sit there. And sit there. And sit there. Perhaps it would eventually recover after tens of minutes, but as mentioned these are generally our busiest and most important servers, so we're not exactly going to let them sit there to find out what happens eventually.

(There also probably isn't much point to finding out. It's unlikely that there's some miracle cure we can do ourselves, and making a bug report to Ubuntu is almost completely pointless since Ubuntu only fixes security issues and things that are actively on fire. My previous experience wasn't productive and produced no solutions from anyone.)

This goes well beyond my previous systemd reboot irritation. Reliably rebooting servers despite what users are doing to them is a fairly foundational thing, yet Ubuntu's systemd not only can't get this right but doesn't even tell us what's wrong (in the sense of 'what is keeping me from rebooting'). The net effect is to turn rebooting many of our servers into a minefield. Not only may a reboot require in-person intervention in our machine room, but the fact that we can't count on a reboot just working means that we actively have to pay attention to the state of every machine when we reboot it; we can no longer just assume that machines will come back up on their own unless something is fairly wrong. The whole experience angers me every time I have to go through it.

By now we've enabled persistent systemd journals on most everything in the hopes of capturing useful information so we can perhaps guess why this is happening. Unfortunately so far we've gotten nothing useful; systemd has yet to log or display on the screen, say, 'still waiting N seconds for job X'. I'm not even convinced that the systemd journal has captured all of the log messages that it should from an unsuccessful shutdown, as what 'journalctl -b-1' shows is much less than I'd expect and just stops abruptly.

(Without an idea of how and why systemd is screwing up, I'm reluctant to change DefaultTimeoutStopSec from its Ubuntu default, as I once discussed here, or make other changes like forcing all user cron jobs to run under user slices.)

(This Ubuntu bug matches one set of symptoms we see, but not all of them. Note that our problem is definitely not the Linux kernel having problems rebooting the hardware; the same Dell servers were previously running Ubuntu 14.04 and rebooting fine, and Magic SysRQ will force reboots without problems. There's also this Ubuntu bug and this report of problems with shutting down when you have NFS mounts, which certainly could be part of our problems.)

SystemdUbuntuRebootFailure written at 02:27:09; Add Comment
