Wandering Thoughts archives

2017-09-29

Shell builtin versions of standard commands have drawbacks

I'll start with a specific illustration of the general problem:

bash# kill -SIGRTMIN+22 1
bash: kill: SIGRTMIN+22: invalid signal specification
bash# /bin/kill -SIGRTMIN+22 1
bash#

The first thing is that yes, this is Linux being a bit unusual. Linux has significantly extended the usual range of Unix signal numbers to include POSIX.1-2001 realtime signals, and what SIGRTMIN actually is can vary depending on how a system is set up. Once Linux had these extra signals (defined the way they are), people sensibly added support for them to versions of kill. All of this is perfectly in accord with the broad Unix philosophy; of course if you add a new facility to the system, you want to expose it to shell scripts when that's possible.

Then along came Bash. Bash is cross-Unix, and it has a builtin kill command, and for whatever reason the Bash people didn't modify Bash so that on Linux it would support the SIGRTMIN+<n> syntax (some possible reasons for that are contained in this sentence). The result is a divergence between the behavior of Bash's kill builtin and the real kill program, one that has become increasingly relevant now that programs like systemd are taking advantage of the extra signals to let you control more of their operations by sending them signals.

Of course, this is a generic problem with shell builtins that shadow real programs in any (and all) shells; it's not particularly specific to Bash (zsh also has this issue on Linux, for example). There are advantages to having builtins, including builtins of things like kill, but there are also drawbacks. How best to fix or work around them isn't clear.

(kill is often a builtin in shells with job control, Bash included, so that you can do 'kill %<n>' and the like. Things like test are often made builtins for shell script speed, although Unixes can take that too far.)
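As a per-command workaround (it does nothing for the general problem), you can force the shell to run the real program. The transcript above does this with a full path; a minimal sketch of a more generic version, assuming env and the external command are on your $PATH, goes through env. The helper name `external` is made up for illustration.

```shell
# `external` is a hypothetical helper name used only in this sketch.
# env is a separate program, so it knows nothing about the shell's
# builtins or functions and has to find the command on $PATH.
external() {
    env "$@"
}

# The transcript's failing case would become the following (left
# commented out because it sends a signal to PID 1):
#   external kill -SIGRTMIN+22 1

# Harmless demonstration using printf, which is commonly both a shell
# builtin and a separate program:
external printf '%s\n' "printed by the external printf"
```

Note that 'command kill' is not a substitute here; command only skips shell functions and will still run the builtin. In Bash specifically, 'enable -n kill' disables the builtin for the current shell, at the cost of losing 'kill %<n>'.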

PS: certainly one answer is 'have Bash implement the union of all special kill, test, and so on features from all Unixes it runs on', but I'm not sure that's going to work in practice. And Bash is just one of several popular shells, all of which would need to keep up with things (or at least people probably want them to do so).

unix/BashKillBuiltinDrawback written at 21:40:28

More on systemd on Ubuntu 16.04 failing to reliably reboot some of our servers

I wrote about how Ubuntu 16.04 can't reliably reboot some of our servers, then discovered that systemd can shut down the network with NFS mounts still present and speculated this was (and is) one of our problems. I've now been able to reliably produce such a reboot failure on a test VM and narrow down the specific component involved.

Systemd shuts down your system in two stages: the main stage that stops systemd units, and the final stage, done with systemd-shutdown, which kills the remaining processes, fiddles around with the remaining mounts, and theoretically eventually reboots the system. In the Ubuntu 16.04 version of systemd-shutdown, part of what it tries to do with NFS filesystems is to remount them read-only, and for us this sometimes hangs. With suitable logging enabled in systemd (so that systemd-shutdown is run with it too), we see:

Sending SIGTERM to remaining processes...
Sending SIGKILL to remaining processes...
Sending SIGKILL to PID <nnn> (<command>)
Unmounting file systems.
Remounting '/var/mail' read-only with options '<many of them>'.

At this point things hang, although if you have one set up, a shutdown watchdog will force a reboot and recover the system. Based on comments on my second entry, systemd-shutdown doing this is (now) seen as a problem and it's been changed in the upstream version of systemd, although only very recently (eg this commit only landed at the end of August).

Unfortunately this doesn't seem to be the sole cause of our shutdown hangs. We appear to have had at least one reboot hang while systemd attempts to swapoff the server's swap space, before it enters late-stage reboot. This particular server has a lot of inactive user processes because it hosts our user-managed web servers, and (at the time) they weren't being killed early in system shutdown, so turning off swap space presumably had to page a lot of things back into RAM. This may not have actually hung as such, but if so it was sufficiently slow as to be unacceptable and we force-rebooted the server in question after a minute or two.

We're currently using multiple ways to hopefully reduce the chances of hangs at reboot times. We've put all user cron jobs into systemd user slices so that systemd will kill them early, although this doesn't always work and we may need some clever way of dealing with the remaining processes. We've enabled a shutdown watchdog timer with a relatively short timeout, although this only helps if the system makes it to the second stage when it runs systemd-shutdown; a 'hang' before then in swapoff won't be interrupted.
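For illustration, the shutdown watchdog is controlled by the ShutdownWatchdogSec= setting in the [Manager] section of systemd's system.conf. What follows is only a sketch, not our actual configuration: the helper name, the 2min value, and the drop-in file name are all placeholders, and it assumes the machine has a hardware watchdog for systemd to use.

```shell
# watchdog_dropin is a hypothetical helper; it only prints the config
# fragment so it can be reviewed before being installed anywhere.
# The default of 2min is an arbitrary placeholder, not a recommendation.
watchdog_dropin() {
    printf '[Manager]\nShutdownWatchdogSec=%s\n' "${1:-2min}"
}

# Preview the fragment on standard output.
watchdog_dropin 2min

# To install it (as root; the drop-in file name is an assumption):
#   mkdir -p /etc/systemd/system.conf.d
#   watchdog_dropin 2min > /etc/systemd/system.conf.d/shutdown-watchdog.conf
```

As the entry says, this only covers the systemd-shutdown stage; nothing here interrupts a stall earlier in shutdown.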

In the future we may enable a relatively short JobTimeoutSec on reboot.target, in the hopes that this does some good. I've considered changing Ubuntu's cron.service to KillMode=control-group and then holding the package to prevent surprise carnage during package upgrades, but this seems to be a little bit too much hassle and danger for an infrequent thing that is generally merely irritating.
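A JobTimeoutSec override of this sort would normally go in a drop-in file for reboot.target rather than in an edited copy of the unit. This is a minimal sketch under assumptions: the helper name, the drop-in file name, and the 5min value are placeholders, and JobTimeoutAction=reboot-force is one plausible choice of what to do when the timeout fires. Whether this really cuts short a slow swapoff is untested here.

```shell
# write_reboot_timeout is a hypothetical helper. It writes a drop-in
# under the given unit directory (default /etc/systemd/system) so the
# result can be inspected in a scratch directory first.
write_reboot_timeout() {
    unit_dir=${1:-/etc/systemd/system}
    mkdir -p "$unit_dir/reboot.target.d" || return 1
    # 5min and reboot-force are placeholder choices for this sketch.
    cat > "$unit_dir/reboot.target.d/timeout.conf" <<'EOF'
[Unit]
JobTimeoutSec=5min
JobTimeoutAction=reboot-force
EOF
}

# Real use (as root), followed by a reload so systemd sees the drop-in:
#   write_reboot_timeout && systemctl daemon-reload
```

Running it against a temporary directory first (for example 'write_reboot_timeout /tmp/scratch') shows exactly what would be written without touching /etc.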

As a practical matter, this entry is probably the end of the saga. This is not a particularly important thing for us and I've already discovered that there are no simple, straightforward, bug-free fixes (and as usual the odds are basically zero that Ubuntu will fix bugs here). If we're lucky, Ubuntu 18.04 will include a version of systemd with the systemd-shutdown NFS mount fixes in it and perhaps pam_systemd will be more reliable for @reboot cron jobs. If we're not lucky, well, we'll keep having to trek down to the machine room when we reboot servers. Fortunately it's not something we do very often.

linux/SystemdUbuntuRebootFailureII written at 00:35:45

