The problem of machine startup order dependencies
One of the tricky bits of organizing a sufficiently large group of machines is avoiding circular dependencies in the machine startup order, so that you can actually bring your systems up after things like a complete machine room power outage.
(In our case it was planned; the electricians wanted the master breakers off before they played around in our breaker panel to give us more usable circuits.)
Startup order dependencies come in a variety of flavours. The simple one is a startup script that depends on another machine being up, for example trying to NFS mount filesystems; more advanced, more dangerous, and fortunately much rarer is the sort where a machine will start but malfunction (for example, bounce all email) unless another machine is already up. Things like NFS mounts are easy to see, but sometimes the dependency is more indirect and much less obvious.
Part of the problem is that it's easy for this sort of dependency to creep in unnoticed. Not only is a complete ground-up restart of all of your machines hopefully a rare event, but testing for this sort of thing is also difficult to do, especially for machines in the middle of the startup order (where they depend on some other machines but not everything).
(You can always do a testing ground-up restart of everything, but this is sufficiently disruptive that you're probably not going to get to do it very often.)
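Short of a full restart, one cheap partial check is to write down the dependencies you do know about and see whether they admit any startup order at all. The following is a minimal sketch of that idea using Python's standard graphlib module; the machine names and the dependency table are hypothetical, made up for illustration. It can only catch dependencies that someone has recorded, so it does nothing for the indirect, unnoticed kind until you discover them.

```python
from graphlib import TopologicalSorter, CycleError

def startup_order(depends_on):
    """Return one valid boot order; raises CycleError if none exists."""
    return list(TopologicalSorter(depends_on).static_order())

# Hypothetical machines; each maps to the machines that must be up first.
deps = {
    "console": set(),
    "dns1": {"console"},
    "ntp1": {"dns1"},
    "fileserver": {"dns1", "ntp1"},
    "mail": {"fileserver", "dns1"},
}
order = startup_order(deps)

# Record a newly discovered dependency: the console server wants a time
# server at boot. This makes the startup order circular.
bad = dict(deps, console={"ntp1"})
try:
    startup_order(bad)
    cycle = None
except CycleError as e:
    # args[1] is the list of nodes in the cycle (first and last are the same)
    cycle = e.args[1]
```

Here order starts with the console server, and cycle names the machines involved in the loop once the extra dependency is recorded.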
The interesting case that we found recently was machines that try to
set their time on startup with
ntpdate, especially our console server
(which is the first machine we start). In the early boot order, none of
our time server machines are alive to respond to ntpdate; fortunately
it has a timeout. But up until that point I hadn't thought of NTP as a
vital core service.
(For bonus fun, what actually timed out on the console server was
ntpdate's DNS lookups, because all of the time servers to synchronize
with had been specified as hostnames instead of IP addresses. Since
the machine had three time servers and two DNS servers listed in
/etc/resolv.conf, this actually took significantly longer than
ntpdate's actual query timeout.)
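To make the aside above concrete, here is a small sketch that flags time server entries which need DNS before they can be used, along with a very rough estimate of how long failed lookups could take. The server list, the per-try timeout and the attempt count are all assumptions for illustration, not values taken from the machine in question, and the estimate ignores things like search domains.

```python
import ipaddress

def hostname_entries(servers):
    """Return the entries that are not literal IP addresses and so need DNS."""
    needs_dns = []
    for s in servers:
        try:
            ipaddress.ip_address(s)
        except ValueError:
            needs_dns.append(s)
    return needs_dns

def rough_dns_wait(hosts, dns_servers, per_try_seconds, attempts):
    """Crude estimate: every host tries every DNS server, every attempt."""
    return hosts * dns_servers * per_try_seconds * attempts

# Hypothetical time server list mixing a hostname with literal addresses.
servers = ["ntp1.example.com", "192.0.2.10", "2001:db8::1"]
flagged = hostname_entries(servers)

# Three time servers and two DNS servers, with an ASSUMED 5 second
# per-try timeout and 2 attempts (illustrative resolver settings).
wait = rough_dns_wait(3, 2, per_try_seconds=5, attempts=2)
```

With these assumed numbers the estimate comes to a full minute of waiting on DNS alone, which shows how the lookups can dwarf the actual time query timeout.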