The problem of machine startup order dependencies

March 14, 2007

One of the tricky bits of organizing a sufficiently large group of machines is avoiding circular dependencies in the machine startup order, so that you can actually bring your systems up after things like a complete machine room power outage.

(In our case it was planned; the electricians wanted the master breakers off before they played around in our breaker panel to give us more usable circuits.)

Startup order dependencies come in a variety of flavours. The simple one is a startup script that depends on another machine being up, for example trying to NFS mount filesystems; more advanced, more dangerous, and fortunately much rarer is the sort where a machine will start but malfunction (for example, bounce all email) unless another machine is already up. Things like NFS mounts are easy to see, but sometimes the dependency is more indirect and much less obvious.
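For the simple flavour, one hedge is to have the startup script wait for the other machine only up to a fixed deadline instead of blocking forever. What follows is a minimal sketch, not anything we actually run; the function name `wait_for_service` and all of its parameters are made up for illustration.

```python
import socket
import time


def wait_for_service(host, port, deadline_s=60.0, interval_s=2.0,
                     connect_timeout_s=3.0):
    """Return True once a TCP connection to host:port succeeds.

    Return False if no attempt succeeds before roughly deadline_s seconds
    have passed. All names and defaults here are illustrative assumptions.
    """
    give_up_at = time.monotonic() + deadline_s
    while True:
        try:
            # Each attempt is bounded by its own connect timeout.
            with socket.create_connection((host, port),
                                          timeout=connect_timeout_s):
                return True
        except OSError:
            pass  # refused, unreachable, or timed out: retry below
        if time.monotonic() + interval_s > give_up_at:
            return False
        time.sleep(interval_s)
```

A startup script could call something like `wait_for_service("192.0.2.10", 2049)` before attempting an NFS mount (the address is a placeholder; 2049 is the standard NFS port) and carry on without the mount if it returns False. Passing an IP address instead of a hostname keeps DNS out of the wait.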

Part of the problem is that it's easy for this sort of dependency to creep in unnoticed. A complete ground-up restart of all of your machines is hopefully a rare event, and testing for this sort of thing is difficult to do, especially for machines in the middle of the startup order (where they depend on some other machines but not on everything).

(You can always do a test ground-up restart of everything, but this is sufficiently disruptive that you're probably not going to get to do it very often.)
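Short of a real restart, the dependencies you do know about can at least be written down and checked for cycles. Here is a rough sketch using Python's standard `graphlib` module (Python 3.9 and later). The machine names and the dependency table are invented for illustration, and a check like this only catches dependencies that someone remembered to record.

```python
from graphlib import CycleError, TopologicalSorter


def startup_order(depends_on):
    """Return machines in an order where each one's dependencies come first.

    depends_on maps a machine name to the machines that must already be up.
    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(depends_on).static_order())


def find_cycle(depends_on):
    """Return one dependency cycle as a list of names, or None if there is none."""
    try:
        startup_order(depends_on)
    except CycleError as err:
        # The second argument of CycleError holds the nodes of the cycle,
        # with the first node repeated at the end.
        return err.args[1]
    return None


# Hypothetical machines, not our real configuration.
deps = {
    "console": [],
    "dns1": ["console"],
    "ntp1": ["dns1"],
    "fileserver": ["dns1", "ntp1"],
    "mail": ["fileserver", "dns1"],
}
order = startup_order(deps)
```

With this table, `order` lists `console` before `dns1`, `dns1` before `ntp1`, and so on. If `dns1` were later made to depend on `fileserver`, `find_cycle` would return the loop instead of `None`.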

The interesting case that we found recently was machines that try to set their time on startup with ntpdate, especially our console server (which is the first machine we start). That early in the startup order, none of our time server machines are alive to respond to ntpdate; fortunately it has a timeout. But up until that point I hadn't thought of NTP as a vital core service.

(For bonus fun, what actually timed out on the console server was ntpdate's DNS lookups, because all of the time servers to synchronize with had been specified as hostnames instead of IP addresses. Since the machine had three time servers to look up and two DNS servers listed in /etc/resolv.conf, each of which had to time out in turn, this took significantly longer than ntpdate's own query timeout.)
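To get a feel for why the DNS lookups dominate, here is a back-of-the-envelope model, assuming every lookup tries every listed nameserver and waits out the full timeout each time. The default values of 5 seconds and 2 attempts are the ones documented in the glibc resolv.conf(5) manual page; whether they match what our console server used is an assumption, and the model ignores search-domain expansion and any other extra queries.

```python
def worst_case_dns_wait(hostnames, nameservers, timeout_s=5, attempts=2):
    """Rough upper bound, in seconds, on time spent in failed DNS lookups.

    Assumes each hostname is tried against every nameserver, that every
    try waits out the full timeout, and that the whole nameserver list is
    walked `attempts` times. A simplification, not a resolver simulation.
    """
    return hostnames * nameservers * timeout_s * attempts


# Three time servers given as hostnames, two nameservers in resolv.conf.
estimate = worst_case_dns_wait(hostnames=3, nameservers=2)
print(estimate)  # 60
```

Under those assumptions the machine could spend about a minute waiting on DNS alone, before ntpdate sends a single NTP query. Listing the time servers by IP address takes this term out entirely.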
