A surprise to remember about starting modern machines

March 28, 2007

In the very old days you connected to Unix systems through serial terminals, which only had their getty processes started once init had finished processing /etc/rc.

In the old days you connected to Unix systems through rlogin(d), which was started through inetd, which was still started pretty much at the end of system startup.

These days you connect to Unix systems through sshd, which is often started relatively early in the system boot sequence. This means that you can easily wind up logging into a machine that hasn't finished booting, and conversely that just because you can ssh into a machine doesn't mean that it's finished booting.

This mistake was at the root of my debugging adventure today. We're switching to a new system for managing NFS mounts on our Ubuntu machines, and I was seeing a mysterious problem where the test machine would boot up with its NFS mounts partially or almost completely missing. Due to local needs we start sshd before doing our NFS mounts, and we have a lot of them, so what was really happening was that I was logging in to the machine while it was still grinding through the mounts. Once I realized this it was a definite forehead-slapping moment (although a reassuring one, apart from the wasted time, since nothing was actually wrong).

You can get into really weird states because of this. In the past I've managed to have init.d scripts hang trying to start something; if they run after sshd starts, you can still log in to the system, poke around, and have everything look pretty normal (depending on what was left in the boot sequence). Except that things like reboot won't do anything, because as far as init is concerned it is only part way through transitioning into a runlevel and it isn't about to let you change to another one just yet. The whole experience can make you think that the machine is badly broken, because reboot doesn't complain and a machine that doesn't reboot on command is usually in serious trouble (often with things like kernel panics, unkillable stuck processes, and so on).

(I think what tipped me off back then was the same thing as this time around; I got a process tree dump and saw the startup script still running.)
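
A rough sketch of that kind of check, in Python, for anyone who wants to automate it: scan the process list for command lines that mention a boot script. The names find_boot_scripts, read_processes and BOOT_SCRIPT_HINTS are made up for this illustration, the path fragments assume a SysV-style init layout, and read_processes assumes a Linux /proc. A substring match like this can produce false positives, so treat a hit as a hint to go look at the process tree, not as proof.

```python
import os

# Path fragments that suggest a boot-time script (an assumption about a
# SysV-style layout; adjust for your own system).
BOOT_SCRIPT_HINTS = ("/etc/init.d/", "/etc/rc")


def find_boot_scripts(processes, hints=BOOT_SCRIPT_HINTS):
    """Return the (pid, cmdline) pairs whose command line mentions a boot script."""
    return [
        (pid, cmdline)
        for pid, cmdline in processes
        if any(hint in cmdline for hint in hints)
    ]


def read_processes(proc_root="/proc"):
    """Yield (pid, cmdline) for every process visible under a Linux /proc."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as f:
                raw = f.read()
        except OSError:
            continue  # the process exited while we were looking
        # Arguments are NUL-separated; join them with spaces for matching.
        cmdline = raw.replace(b"\0", b" ").decode("utf-8", "replace").strip()
        if cmdline:  # kernel threads have an empty cmdline; skip them
            yield int(entry), cmdline


if __name__ == "__main__":
    for pid, cmdline in find_boot_scripts(read_processes()):
        print(pid, cmdline)
```

If this prints anything shortly after you have logged in, the machine may well still be working through its startup scripts.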


Comments on this page:

From 199.172.169.7 at 2007-04-10 12:58:32:

This is very easy to fix: at the start of the init process, echo "System currently booting please wait" or similar to /etc/nologin. Then, as the last init script, rm the file. Root can still log in, so if things go wrong you can fix them at the (serial) console.
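
The two steps suggested here could be sketched as a pair of small Python helpers, one called from an early boot script and one from the last. The function names block_logins and allow_logins are invented for illustration, and how they get hooked into the boot sequence is left open because it depends on the init system in use.

```python
import os

NOLOGIN_PATH = "/etc/nologin"


def block_logins(path=NOLOGIN_PATH,
                 message="System currently booting, please wait.\n"):
    """Create the nologin file; meant to be called early in boot."""
    with open(path, "w") as f:
        f.write(message)


def allow_logins(path=NOLOGIN_PATH):
    """Remove the nologin file; meant to be called by the last boot script."""
    try:
        os.remove(path)
    except FileNotFoundError:
        pass  # already gone, so there is nothing to do
```

Making allow_logins tolerate a missing file means it is harmless to run it twice, which is convenient if the final boot script is ever re-run by hand.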

By cks at 2007-04-10 13:20:43:

Part of our problem is that we can't actually block ssh logins during boot, because we use an ssh login as part of our NFS mount authentication scheme. (This causes all sorts of fun on Ubuntu machines, which normally want to do NFS mounts the moment the network comes up.)

If we have to do anything, I suspect we'll wind up using sshd's banner option to print a warning message for connections made during boot. Right now we're hoping that users never wind up noticing it so we don't have to do anything about it.

Written on 28 March 2007.

Last modified: Wed Mar 28 23:49:14 2007