2016-04-09
Why your Ubuntu server stalls a while on boot if networking has problems
Yesterday I wrote on how to shoot yourself in the foot by making
a mistake in /etc/network/interfaces.
I kept digging into this today, and so now I can tell you why this
happens and what you can do about it. The simple answer is that it
comes from /etc/init/failsafe.conf.
What failsafe.conf is trying to do is kind of hard to explain
without a background in Upstart (Ubuntu's 'traditional' init system).
A real System V init system is always in a 'runlevel', and this
drives what it does (eg it determines which /etc/rcN.d directory
to process). Upstart sort of half abandons runlevels; they are not
built into Upstart itself and some /etc/init jobs don't use them,
but there's a standard Upstart event to set the runlevel and
many /etc/init jobs are started and stopped based on this runlevel
event.
Let's simplify that: Upstart's runlevel stuff is a way of avoiding
specifying real dependencies for /etc/init jobs and handling them
for /etc/rcN.d scripts. Instead jobs can just say 'start on
runlevel [2345]' and get started once the system has finished its
basic boot processing, whatever that is and whatever it takes.
Since the Upstart runlevel is not built in, something must generate
an appropriate 'runlevel N' event during boot at an appropriate
time. That thing is /etc/init/rc-sysinit.conf, which in turn
must be careful to run only at some appropriate point in Upstart's
boot process, once this basic boot processing is done. When is basic
boot processing done? Well, the rc-sysinit.conf answer is 'when
filesystems are there and static networking is up', which in Upstart
terms means when the filesystem(7) and static-network-up Upstart
events are emitted by something.
So what happens if networking doesn't come fully up, for instance
if your /etc/network/interfaces has a mistake in it? If Upstart
left things as they were, your system would just hang in early boot;
rc-sysinit.conf would be left waiting for an Upstart event that
would never happen. This is what failsafe.conf is there for. It
waits a while for networking to come up, and if that doesn't happen
it emits a special Upstart event that tells rc-sysinit.conf to
go on anyways.
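The general shape of this 'wait a while, then give up and go on' logic can be sketched in plain shell. This is only an illustration, not what is actually in failsafe.conf: the function name wait_or_failsafe and the once-a-second polling loop are my assumptions for the example (real Upstart jobs react to events instead of polling), and network_is_up in the usage comment is a hypothetical stand-in for whatever check you trust.

```shell
#!/bin/sh
# Sketch only: wait up to TIMEOUT seconds for a check command to succeed.
# Usage: wait_or_failsafe TIMEOUT CHECK_CMD [ARGS...]
# Returns 0 if the check succeeded in time, 1 if we gave up.
wait_or_failsafe() {
    timeout=$1
    shift
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        if "$@"; then
            return 0        # what we were waiting for has happened
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1                # timed out; the caller should go on anyway
}

# Hypothetical usage ('network_is_up' is not a real command):
#   if ! wait_or_failsafe 30 network_is_up; then
#       echo "network not up after 30 seconds, continuing boot anyway"
#   fi
```

The important property is the second return path: no matter what the check does, the function comes back after a bounded time, so whatever is waiting on it can never hang forever.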
In the abstract this is a sensible idea. In the concrete, failsafe.conf
has a number of problems:
- the timeout is hardcoded, which means that it's guaranteed to
be too long for some people and probably not long enough for
others.
- it doesn't produce any useful messages when it has to delay,
and if you're not using Plymouth it's totally silent. Servers
typically don't run Plymouth.
- Upstart as a whole has a very inflexible view of what 'static
networking is up' means. It apparently requires that every 'auto'
interface listed in /etc/network/interfaces both exist and have
link signal (have a cable plugged in and be connected to something);
see eg this bug and this bug. You don't get to say 'proceed even
without link signal' or 'this interface is optional' or the like.
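If you want a rough idea in advance of which interfaces could trip over that last point, something like the following sketch can help. It is a diagnostic under assumptions, not a reimplementation of what Upstart or ifupdown actually check: the function names and the SYSFS_NET override are made up for this example, it only looks at plain 'auto' lines (it does not follow 'source' directives), and it treats the Linux sysfs carrier file as the meaning of 'has link signal'.

```shell
#!/bin/sh
# Sketch only: print the interfaces named on 'auto' lines of an
# interfaces(5)-style file. 'source' directives are not followed.
list_auto_ifaces() {
    awk '$1 == "auto" { for (i = 2; i <= NF; i++) print $i }' "$1"
}

# Report each 'auto' interface that is missing or has no link signal.
# SYSFS_NET is a made-up override so this can be tried on a fake tree.
report_unready_ifaces() {
    file=${1:-/etc/network/interfaces}
    sysfs=${SYSFS_NET:-/sys/class/net}
    for iface in $(list_auto_ifaces "$file"); do
        [ "$iface" = "lo" ] && continue     # loopback has no cable
        if [ ! -d "$sysfs/$iface" ]; then
            echo "$iface: does not exist"
        elif [ "$(cat "$sysfs/$iface/carrier" 2>/dev/null)" != "1" ]; then
            # carrier is only readable while the interface is up, so an
            # administratively down interface is reported here as well.
            echo "$iface: no link signal"
        fi
    done
}
```

Running report_unready_ifaces with no arguments on a live system checks the real /etc/network/interfaces; no output means every 'auto' interface it found exists and reports carrier.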
For Ubuntu versions that use Upstart, you can fix this by changing
/etc/init/failsafe.conf to shorten the timeouts and print out
actual messages (anything you output with eg echo will wind up
on the console). We're in the process of doing this locally; I
opted to print out a rather verbose message for my usual reasons.
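As a rough sketch of the kind of change I mean, here is what the shell part of a shortened, noisier failsafe wait could look like. This is not our actual local version and not Ubuntu's stock script: the 30 second default, the message wording, and the FAILSAFE_WAIT and FAILSAFE_EMIT variables are all assumptions for illustration. The event name failsafe-boot is what I believe the stock job emits, so check your own failsafe.conf before relying on it. The sketch also assumes that Upstart itself stops the job if networking does come up, so the function only has to handle the unhappy path.

```shell
#!/bin/sh
# Sketch of a shortened, noisier failsafe wait. All values are assumptions.
FAILSAFE_WAIT=${FAILSAFE_WAIT:-30}      # seconds to wait (hypothetical)
# Command that tells Upstart to go on; overridable for dry runs.
FAILSAFE_EMIT=${FAILSAFE_EMIT:-"initctl emit --no-wait"}

failsafe_wait() {
    # Plain echo output is what ends up on the console.
    echo "failsafe: waiting up to ${FAILSAFE_WAIT} seconds for static networking."
    echo "failsafe: if this stalls, check /etc/network/interfaces for mistakes."
    sleep "$FAILSAFE_WAIT"
    echo "failsafe: networking is still not up; continuing boot without it."
    # Unquoted on purpose so the command and its options split into words.
    $FAILSAFE_EMIT failsafe-boot
}
```

For a dry run that touches nothing, you can set FAILSAFE_WAIT=0 and FAILSAFE_EMIT="echo would emit" before calling failsafe_wait; it then prints the three messages followed by 'would emit failsafe-boot'.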
Of course, all of this is going to be inapplicable in the upcoming
Ubuntu 16.04, since Ubuntu switched from Upstart to systemd as of
15.04 (cf).
However Ubuntu has put something similar to failsafe.conf
into their systemd setup and thus I expect that we'll wind up making
similar modifications to it in some way.
(A true native systemd setup has a completely different and generally more granular way of handling failures to bring up networking, but I don't expect Ubuntu to make that big of a change any time soon.)