2016-03-05
What happens when a modern Linux system boots without /bin/sh
It turns out that a whole lot of things explode when your system boots up with /bin/sh not working for some mysterious reason.
Here the mysterious reason was that there was an unresolved dynamic
library symbol, so any attempt to run /bin/sh or /bin/bash died
with an error message from the ELF interpreter.
The big surprise for me was just how far my systemd-based Fedora
23 machine managed to get despite this handicap. I certainly saw a
cascade of unit failures in the startup messages so I knew that
something bad had happened, but the first inkling I had of just how
bad it was came when I tried to log in as root on the (text)
console and the system just dumped me back at the login: prompt.
Most of the system services had managed to start because their
systemd .service files did not need /bin/sh to run; only a few
things (some of them surprising) had failed, although a dependency
chain for one of them wound up blocking the local resolving DNS
server from starting.
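To illustrate why services that are started directly can survive a broken shell, here is a minimal Python sketch of the two ways of launching a program. The function names and the /nonexistent/sh path are hypothetical stand-ins of mine. In the real incident /bin/sh existed but died at dynamic link time, so the failure would have shown up as an error exit rather than a missing file.

```python
import subprocess
import sys


def run_direct(argv):
    """Run a program by exec'ing it directly; no shell is involved."""
    return subprocess.run(argv, capture_output=True, text=True)


def run_via_shell(command, shell_path="/bin/sh"):
    """Run a command line through a shell, roughly as system(3) would."""
    return subprocess.run([shell_path, "-c", command],
                          capture_output=True, text=True)


if __name__ == "__main__":
    # Direct execution only needs the target program itself to work.
    result = run_direct([sys.executable, "-c", "print('direct exec works')"])
    print(result.stdout.strip())

    # Anything routed through the shell fails as soon as the shell does.
    # A missing path stands in for the broken shell here (an assumption).
    try:
        run_via_shell("echo via shell", shell_path="/nonexistent/sh")
    except FileNotFoundError as exc:
        print("shell-based launch failed:", exc)
```

Only the second form has the shell as a single point of failure, which matches the pattern of what did and did not start.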
The unpleasant surprise was how much depends on /bin/sh and
/bin/bash working. I was able to log in as myself because I use
a different shell, but obviously root was inaccessible, my own
environment relies on a certain amount of shell scripts to be really
functional, and a surprising number of standard binaries are shell
scripts these days (/usr/bin/fgrep, for example). In the end I got
somewhat lucky in that my regular account had sudo access and sudo
can be used to run things directly, without needing /bin/sh or
root's shell to be functioning.
(I mostly wound up using this to run less to read logs and
eventually reboot. If I'd been thinking more carefully, I could
have used sudo to run an alternate shell as root, which would
have been almost as good as being able to log in directly.)
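One way to get a feel for how many standard commands would break along with the shell is to look at their #! lines. The following is a rough Python sketch under my own assumptions: the list of shell paths and the function names are illustrative, and it deliberately ignores scripts started through env or other indirect routes.

```python
import os

# Interpreter paths treated as "the shell" (an assumption for this sketch).
SHELLS = ("/bin/sh", "/bin/bash", "/usr/bin/sh", "/usr/bin/bash")


def shell_interpreter(path):
    """Return the shell named on the file's #! line, or None."""
    try:
        with open(path, "rb") as f:
            first = f.readline(256)
    except OSError:
        return None
    if not first.startswith(b"#!"):
        return None
    parts = first[2:].decode("utf-8", "replace").split()
    if parts and parts[0] in SHELLS:
        return parts[0]
    return None


def shell_scripts_in(directory):
    """List the names in a directory that are scripts run by a shell."""
    found = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and shell_interpreter(path):
            found.append(name)
    return found


if __name__ == "__main__":
    scripts = shell_scripts_in("/usr/bin")
    print(len(scripts), "shell scripts in /usr/bin")
```

The count varies from system to system, so no expected output is given here.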
Another pretty useful thing here is how systemd captured a great deal of the error output from startup services and recorded it in the systemd journal. This gave me the exact error messages, for example, which is at least reassuring to have even if I don't understand what went wrong.
What I don't have here is an exciting story of how I revived a
system despite its /bin/sh being broken. In the end the problem
went away after I rebooted and then power cycled my workstation.
Based on the symptoms I suspect that a page in RAM got scrambled
somehow (which honestly is a bit unnerving).
As a side note, the most surprising thing that failed to start was
udev trying to run the install command for the sound card drivers
(specifically snd_pcm). I suspect that this is used to restore
the sound volume settings to whatever they were the last time the
system was shut down, but I don't know for sure because nothing
reported the exact command being executed.
(My system has a 90-alsa-restore.rules udev rules file that tries
to run alsactl. It's not clear to me if udev executes RUN+=
commands via system(), which would have hit the issue, or in
some more direct way. Maybe it depends on whether the RUN command
seems to have anything that needs interpretation by the shell. I'm
pretty certain that at least some udev RUN actions succeeded.)
Sidebar: What exactly was wrong
This was on my Fedora 23 office machine, where /bin/sh is bash, and
bash was failing to start with a message to the effect of:
symbol lookup error: /bin/bash: undefined symbol: rl_unix_line_disc<binary garbage>
Bash does not mention a symbol with that exact name, but it does
want to resolve and use rl_unix_line_discard. Interestingly,
this is an internal symbol (it's both used and defined in bash);
despite this, looking it up goes via the full dynamic linker symbol
resolution process (as determined with the help of LD_DEBUG).
My guess is that the end of the symbol name was overwritten in RAM
with some garbage and that this probably happened in the Linux
kernel page cache (since it kept reappearing with the same message,
it can't have been in a per-process page).
Assuming I'm reading things correctly, the bytes of garbage are (in hex):
ae 37 d8 5f bf 6b d1 45 3a c0 d9 93 1b 44 12 2d 68 74
(less displays this as '<AE>7<D8>_<BF>k<D1>E:<C0>ٓ^R-ht', which
doesn't fully capture it. I had to run a snippet of journalctl's
raw output through 'od -t c -t x1' to get the exact hex.)
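For anyone who wants to poke at those bytes without od, here is a small Python sketch that renders each byte in roughly the notation less uses. The describe() helper is a hypothetical name of mine and only approximates how less displays things. It does show that d9 93 is a valid UTF-8 sequence (U+0653, a combining mark), and that the 1b 44 pair is an escape character followed by 'D', which leaves no visible trace in the quoted less rendering.

```python
GARBAGE_HEX = "ae 37 d8 5f bf 6b d1 45 3a c0 d9 93 1b 44 12 2d 68 74"


def describe(data):
    """Return (hex, display) pairs, one per byte.

    Printable ASCII is shown as itself, control characters in caret
    notation, and everything else as <XX>. This treats every byte on
    its own, so it does not group multi-byte UTF-8 sequences.
    """
    rows = []
    for b in data:
        if 0x20 <= b < 0x7F:
            display = chr(b)
        elif b < 0x20:
            display = "^" + chr(b + 0x40)
        else:
            display = "<%02X>" % b
        rows.append(("%02x" % b, display))
    return rows


if __name__ == "__main__":
    data = bytes.fromhex(GARBAGE_HEX)
    for hex_byte, display in describe(data):
        print(hex_byte, display)
    # The d9 93 pair decodes cleanly as UTF-8 on its own.
    print("d9 93 as UTF-8: U+%04X" % ord(data[10:12].decode("utf-8")))
```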