A surprise to remember about starting modern machines
In the very old days you connected to Unix systems through serial
terminals, which only had their getty processes started once init
had finished processing /etc/rc.
In the old days you connected to Unix systems through rlogin(d),
which was started through inetd, which was itself started pretty
much at the end of system startup.
These days you connect to Unix systems through sshd, which is often
started relatively early in the system boot sequence. This means that
you can easily wind up logging into a machine that hasn't finished
booting, and conversely that just because you can ssh into a machine
doesn't mean that it's finished booting.
This mistake was at the root of my debugging adventure today. We're
switching to a new system of managing NFS mounts on our Ubuntu machines,
and I was seeing a mysterious problem where the test machine would boot
up with its NFS mounts partially or almost completely missing. Due to
local needs we start sshd before doing our NFS mounts, which we have
a lot of, so what was really going on was that I was logging in to the
machine while it was grinding through the NFS mounts. Once I realized
what was actually going on it was a definite forehead-slapping moment
(although a reassuring one, apart from the wasted time, since nothing
was actually wrong).
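
A quick way to tell "the mounts are broken" apart from "the mounts aren't done yet" is to compare what should be mounted against what currently is, and then look again a bit later. The following is a minimal sketch of that check, not how our mount system actually works: the function names and the list of expected mount points are hypothetical, and it assumes a Linux-style /proc/mounts where the third field is the filesystem type.

```python
def mounted_nfs_points(proc_mounts_text):
    """Return the set of mount points whose filesystem type is nfs or nfs4."""
    points = set()
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fstype, options, dump, pass
        if len(fields) >= 3 and fields[2] in ("nfs", "nfs4"):
            points.add(fields[1])
    return points


def missing_mounts(expected, proc_mounts_text):
    """Return the expected mount points that are not mounted yet, sorted.

    'expected' is a hypothetical list supplied by the caller; where it
    comes from depends on how the machine's mounts are managed.
    """
    return sorted(set(expected) - mounted_nfs_points(proc_mounts_text))
```

On a live machine you would call something like missing_mounts(expected, open("/proc/mounts").read()). If the missing list shrinks between two runs, the machine is most likely still working through its mounts rather than failing at them. Note that /proc/mounts escapes spaces in mount points (as \040), which this sketch does not undo.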
You can get into really weird states because of this. In the past I've
managed to have init.d scripts hang trying to start something; if
they ran after sshd started you could still log in to the system, poke
around, and have everything look pretty normal (depending on what was
left in the boot sequence). Except that things like reboot wouldn't
do anything, because as far as init was concerned it was only part way
through transitioning into a runlevel and it wasn't about to let you
change to another one just yet. The whole experience can make you think
that the machine is badly broken, because reboot doesn't complain and
a machine that doesn't reboot on command is usually in serious trouble
(often with things like kernel panics, unkillable stuck processes, and
so on).
(I think what tipped me off back then was the same thing as this time around; I got a process tree dump and saw the startup script still running.)
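
The "look at the process tree for a startup script" check can be roughly automated. This is only a sketch under some assumptions: it relies on the output of ps -eo pid,ppid,args, and it guesses that startup scripts can be recognized by /etc/init.d/ or /etc/rc appearing in the command line. Those markers are my assumption and will differ between init systems; the function names are made up for illustration.

```python
import subprocess

# Assumed markers for "this looks like part of system startup".
# This is a heuristic; adjust it for the init system actually in use.
STARTUP_MARKERS = ("/etc/init.d/", "/etc/rc")


def parse_ps(ps_text):
    """Parse 'ps -eo pid,ppid,args' output into (pid, ppid, args) tuples."""
    procs = []
    for line in ps_text.splitlines():
        parts = line.strip().split(None, 2)
        if len(parts) < 3:
            continue
        pid, ppid, args = parts
        # Skips the header line and anything else that isn't a process row.
        if not (pid.isdigit() and ppid.isdigit()):
            continue
        procs.append((int(pid), int(ppid), args))
    return procs


def lingering_startup(procs, markers=STARTUP_MARKERS):
    """Return (pid, args) for processes that look like startup scripts."""
    return [(pid, args) for pid, ppid, args in procs
            if any(marker in args for marker in markers)]


def dump_ps():
    """Run ps and return its output as text (needs a Unix-like system)."""
    result = subprocess.run(["ps", "-eo", "pid,ppid,args"],
                            capture_output=True, text=True, check=True)
    return result.stdout
```

Running lingering_startup(parse_ps(dump_ps())) right after logging in gives a rough answer to "is this machine still booting?". Since it only matches on substrings, something like an editor opened on a file under /etc/init.d/ will also show up, so treat a hit as a hint to go look at the full process tree, not as proof.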