A surprise to remember about starting modern machines

March 28, 2007

In the very old days you connected to Unix systems through serial terminals, which only had their getty processes started once init had finished processing /etc/rc.

In the old days you connected to Unix systems through rlogin(d), which was started through inetd, which was still started pretty much at the end of system startup.

These days you connect to Unix systems through sshd, which is often started relatively early in the system boot sequence. This means that you can easily wind up logging into a machine that hasn't finished booting, and conversely that just because you can ssh into a machine doesn't mean that it's finished booting.

This mistake was at the root of my debugging adventure today. We're switching to a new system of managing NFS mounts on our Ubuntu machines, and I was seeing a mysterious problem where the test machine would boot up with its NFS mounts partially or almost completely missing. Due to local needs we start sshd before doing our NFS mounts, and we have a lot of NFS mounts, so what was really going on was that I was logging in to the machine while it was still grinding through them. Once I realized this it was a definite forehead-slapping moment (although a reassuring one, apart from the wasted time, since nothing was actually wrong).
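(A quick way to see how far along the mounts are is to compare what should be mounted against what actually is. Here is a minimal sketch in Python. It assumes the expected mounts are available in fstab-style text, which may not match how your mount management works, and the function names are made up for illustration.)

```python
def nfs_mount_points(table_text):
    """Return the set of NFS mount points found in fstab-style text.

    Works on anything laid out as 'device mountpoint fstype options ...',
    which covers both /etc/fstab and /proc/mounts on Linux.
    """
    points = set()
    for line in table_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) >= 3 and fields[2] in ("nfs", "nfs4"):
            points.add(fields[1])
    return points


def missing_nfs_mounts(expected_text, mounted_text):
    """Return the sorted mount points that are expected but not yet mounted."""
    expected = nfs_mount_points(expected_text)
    mounted = nfs_mount_points(mounted_text)
    return sorted(expected - mounted)
```

On a live machine you would feed it the contents of /etc/fstab (or whatever lists your expected mounts) and /proc/mounts. A list that shrinks every time you rerun it is a good sign that the machine is still booting, not broken.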

You can get into really weird states because of this. In the past I've managed to have init.d scripts hang trying to start something; if they run after sshd starts, you can still log in to the system, poke around, and have everything look pretty normal (depending on what is left in the boot sequence). Except that things like reboot won't do anything, because as far as init is concerned it is only part way through transitioning into a runlevel and it isn't about to let you change to another one just yet. The whole experience can make you think that the machine is badly broken, because reboot doesn't complain, and a machine that doesn't reboot on command is usually in serious trouble (often with things like kernel panics, unkillable stuck processes, and so on).

(I think what tipped me off back then was the same thing as this time around; I got a process tree dump and saw the startup script still running.)
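(The process tree check can be roughly automated. The following is a sketch in Python, not something I ran back then. It assumes a Linux-style /proc and assumes that startup scripts show up with /etc/init.d/ or /etc/rc somewhere in their command line, which depends on how your init system runs them. The names here are hypothetical.)

```python
import os

# Assumption: boot-time scripts are run from these locations
# (this also matches /etc/rc2.d/, /etc/rcS.d/, and so on).
BOOT_SCRIPT_MARKERS = ("/etc/init.d/", "/etc/rc")


def boot_scripts_running(processes):
    """Filter (pid, command line) pairs down to likely startup scripts."""
    return [
        (pid, cmd)
        for pid, cmd in processes
        if any(marker in cmd for marker in BOOT_SCRIPT_MARKERS)
    ]


def read_processes(proc_root="/proc"):
    """Collect (pid, command line) pairs from a Linux-style /proc."""
    processes = []
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, name, "cmdline"), "rb") as f:
                raw = f.read()
        except OSError:
            # The process exited, or we are not allowed to look at it.
            continue
        # Arguments in cmdline are separated by NUL bytes.
        cmd = raw.replace(b"\0", b" ").decode("utf-8", "replace").strip()
        if cmd:
            processes.append((int(name), cmd))
    return processes
```

Calling boot_scripts_running(read_processes()) shortly after logging in and getting a non-empty list back suggests the boot sequence is still in progress (or stuck). A match is only a hint, since a script that lives under /etc/init.d/ can also be run by hand long after boot.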

Written on 28 March 2007.

Last modified: Wed Mar 28 23:49:14 2007