What happens when a modern Linux system boots without /bin/sh

March 5, 2016

So this happened to me:

It turns out that a whole lot of things explode when your system boots up with /bin/sh not working for some mysterious reason.

Here the mysterious reason was that there was an unresolved dynamic library symbol, so any attempt to run /bin/sh or /bin/bash died with an error message from the ELF interpreter.

The big surprise for me was just how far my systemd-based Fedora 23 machine managed to get despite this handicap. I certainly saw a cascade of unit failures in the startup messages, so I knew that something bad had happened, but the first inkling I had of just how bad it was came when I tried to log in as root on the (text) console and the system just dumped me back at the login: prompt. Most of the system services had managed to start because their systemd .service files did not need /bin/sh to run. Only a few things (some of them surprising) had failed, although a dependency chain for one of them wound up blocking the local resolving DNS server from starting.

The unpleasant surprise was how much depends on /bin/sh and /bin/bash working. I was able to log in as myself because I use a different shell, but root was obviously inaccessible, my own environment relies on a certain number of shell scripts to be really functional, and a surprising number of standard binaries are shell scripts these days (/usr/bin/fgrep, for example). In the end I got somewhat lucky in that my regular account had sudo access and sudo can be used to run things directly, without needing /bin/sh or root's shell to be functioning.
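To get a rough feel for how many "standard binaries" are really shell scripts, one could look at the '#!' line of everything in a directory such as /usr/bin. The following is only a sketch: the function names are made up for illustration, it only recognizes sh and bash (directly or via env), and it does not try to handle env options or anything clever.

```python
import os

# Interpreters that would have been broken in this situation.
SHELL_INTERPRETERS = {"sh", "bash"}


def shebang_interpreter(path):
    """Return the interpreter named on a file's '#!' line, or None."""
    try:
        with open(path, "rb") as f:
            first = f.readline(256)
    except OSError:
        return None
    if not first.startswith(b"#!"):
        return None
    parts = first[2:].decode("utf-8", "replace").split()
    if not parts:
        return None
    interp = parts[0]
    # Treat '#!/usr/bin/env bash' as meaning 'bash'.
    if os.path.basename(interp) == "env" and len(parts) > 1:
        interp = parts[1]
    return interp


def needs_bin_sh(path):
    """True if the file looks like a script run by sh or bash."""
    interp = shebang_interpreter(path)
    return interp is not None and os.path.basename(interp) in SHELL_INTERPRETERS


def shell_scripts_in(directory):
    """Sorted names of the entries in directory that look like sh/bash scripts."""
    return sorted(
        name
        for name in os.listdir(directory)
        if needs_bin_sh(os.path.join(directory, name))
    )


if __name__ == "__main__":
    for name in shell_scripts_in("/usr/bin"):
        print(name)
```

What this prints will vary from system to system and release to release, so it is best treated as a quick survey rather than a definitive list.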

(I mostly wound up using this to run less to read logs and eventually reboot. If I'd been thinking more carefully, I could have used sudo to run an alternate shell as root, which would have been almost as good as being able to log in directly.)

Another pretty useful thing here is how systemd captured a great deal of the error output from startup services and recorded it in the systemd journal. This gave me the exact error messages, for example, which is at least reassuring to have even if I don't understand what went wrong.

What I don't have here is an exciting story of how I revived a system despite its /bin/sh being broken. In the end the problem went away after I rebooted and then power cycled my workstation. Based on the symptoms I suspect that a page in RAM got scrambled somehow (which honestly is a bit unnerving).

As a side note, the most surprising thing that failed to start was udev trying to run the install command for the sound card drivers (specifically snd_pcm). I suspect that this is used to restore the sound volume settings to whatever they were the last time the system was shut down, but I don't know for sure because nothing reported the exact command being executed.

(My system has a 90-alsa-restore.rules udev rules file that tries to run alsactl. It's not clear to me if udev executes RUN+= commands via system(), which would have hit the issue, or in some more direct way. Maybe it depends on whether the RUN command seems to have anything that needs interpretation by the shell. I'm pretty certain that at least some udev RUN actions succeeded.)
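The difference between the two ways of running a command can be illustrated in a few lines. This is a sketch of the general distinction, not of what udev or systemd actually do internally. The helper names are invented for the example, and a nonexistent shell path stands in for the broken /bin/sh (the real failure was a shell that existed but died in the dynamic linker, which would show up as a failing exit status rather than a failed exec).

```python
import subprocess
import sys


def run_direct(argv):
    """Run argv with a plain fork and exec; no shell is involved.

    Returns the exit status, or None if the program could not be started.
    """
    try:
        return subprocess.run(argv, shell=False).returncode
    except OSError:
        return None


def run_via_shell(command, shell_path="/bin/sh"):
    """Run a command string as 'shell_path -c command', system()-style.

    Returns the exit status, or None if the shell itself could not be started.
    """
    try:
        return subprocess.run(command, shell=True, executable=shell_path).returncode
    except OSError:
        return None


if __name__ == "__main__":
    # Hypothetical stand-in for a shell that cannot be run at all.
    broken_shell = "/nonexistent/sh"
    print("direct exec:     ", run_direct([sys.executable, "-c", "pass"]))
    print("via broken shell:", run_via_shell("exit 0", broken_shell))
```

Anything started the first way keeps working when the shell is unusable, which matches what I saw with most .service files and with running commands directly through sudo. Anything that goes through the second path fails before the actual command ever gets a chance to run.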

Sidebar: What exactly was wrong

This was on my Fedora 23 office machine, where /bin/sh is bash, and bash was failing to start with a message to the effect of:

symbol lookup error: /bin/bash: undefined symbol: rl_unix_line_disc<binary garbage>

Bash does not mention a symbol with that exact name, but it does want to resolve and use rl_unix_line_discard. Interestingly, this is an internal symbol (it's both used and defined in bash); despite this, looking it up goes via the full dynamic linker symbol resolution process (as determined with the help of LD_DEBUG). My guess is that the end of the symbol name was overwritten in RAM with some garbage and that this probably happened in the Linux kernel page cache (since it kept reappearing with the same message, it can't have been in a per-process page).

Assuming I'm reading things correctly, the bytes of garbage are (in hex):

ae 37 d8 5f bf 6b d1 45 3a c0 d9 93 1b 44 12 2d 68 74

(less displays this as '<AE>7<D8>_<BF>k<D1>E:<C0>ٓ^R-ht', which doesn't fully capture it. I had to run a snippet of journalctl's raw output through 'od -t c -t x1' to get the exact hex.)
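For anyone who wants to poke at the garbage bytes without od, here is a small sketch that turns the hex above back into bytes and renders them one byte at a time. The rendering rules are a simplification chosen for this example (printable ASCII as-is, control characters in caret notation, everything else as <XX>). They are not less's actual algorithm, which among other things will decode valid UTF-8 sequences instead of showing each byte separately.

```python
# The garbage bytes as reported above.
GARBAGE_HEX = "ae 37 d8 5f bf 6b d1 45 3a c0 d9 93 1b 44 12 2d 68 74"


def hex_dump(data):
    """Render bytes as space-separated two-digit lowercase hex."""
    return " ".join(f"{b:02x}" for b in data)


def render_bytes(data):
    """Render bytes one at a time in a simplified, less-like style."""
    out = []
    for b in data:
        if 0x20 <= b < 0x7F:
            out.append(chr(b))            # printable ASCII
        elif b < 0x20:
            out.append("^" + chr(b + 0x40))  # control chars, e.g. 0x12 -> ^R
        elif b == 0x7F:
            out.append("^?")              # DEL
        else:
            out.append(f"<{b:02X}>")      # high-bit bytes
    return "".join(out)


if __name__ == "__main__":
    garbage = bytes.fromhex(GARBAGE_HEX)
    print(len(garbage), "bytes")
    print(hex_dump(garbage))
    print(render_bytes(garbage))
```

Rendering every byte separately makes it easier to see the d9 93 pair and the 1b 44 pair individually than it is in the less output.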
