Why your physical servers running Ubuntu 22.04 LTS can boot very slowly

April 26, 2022

If you install Ubuntu 22.04's server edition onto a server that has one or more network ports that you aren't using, it's quite likely that you'll get to see an unexpected two-minute pause during system boot. In some configurations this is a total stall, with neither local nor remote logins possible. This behavior didn't happen in 20.04, although some of the underlying issues were there, and unfortunately it's rather hard to automatically work around.

The direct source of the stall is our old friend systemd-networkd-wait-online, which in 22.04 waits up to 120 seconds (two minutes) for all of your network links to be "configured". More specifically, it waits until all links that systemd-networkd knows about are configured. Unfortunately, interfaces that are listed as having DHCP enabled on them only satisfy s-n-w-o if the system actually gets a DHCP address from the network, which is where the rest of the problem starts coming in.
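Since s-n-w-o won't tell you what it's waiting for, one rough way to guess is to ask networkctl which links systemd-networkd manages but hasn't finished configuring. What follows is a minimal sketch, not a description of s-n-w-o's actual logic. It assumes that 'networkctl list --no-legend' prints the columns IDX, LINK, TYPE, OPERATIONAL and SETUP in that order, and it treats any setup state other than "configured" or "unmanaged" as a suspect. The function names are made up for illustration.

```python
import subprocess


def parse_networkctl_list(text):
    """Parse 'networkctl list --no-legend' output into
    (link, operational_state, setup_state) tuples.

    Assumes the columns are: IDX LINK TYPE OPERATIONAL SETUP.
    """
    links = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and anything that isn't a link row.
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        links.append((fields[1], fields[3], fields[4]))
    return links


def unconfigured_links(links):
    """Names of links that networkd manages but has not finished
    configuring. This is only an approximation of what
    systemd-networkd-wait-online might be waiting on."""
    return [name for name, _oper, setup in links
            if setup not in ("configured", "unmanaged")]


def report_blocking_links():
    """Run networkctl and print the links that look unconfigured."""
    out = subprocess.run(
        ["networkctl", "list", "--no-legend", "--no-pager"],
        capture_output=True, text=True, check=True,
    ).stdout
    for name in unconfigured_links(parse_networkctl_list(out)):
        print(name)
```

Calling report_blocking_links() on an affected machine during the stall (if you can get a shell at all) or shortly after boot should list the unused ports, assuming the guess about what s-n-w-o cares about is close to right.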

(In 20.04, I don't believe this happened if you had some stray interfaces still set to DHCP, and it was relatively easy to wind up with such interfaces.)

The Ubuntu 22.04 server installer, subiquity, automatically performs DHCP on every interface it finds on your server. Regardless of whether or not it gets any DHCP answers, or even if the interface is disconnected, it carries over this 'try DHCP' state to the installed system unless you manually change it, interface by interface. In theory subiquity will let you turn this attempted DHCP off. In practice, this doesn't work in 22.04 (although it did in 20.04). With every interface set to do DHCP in the installed system, any unused and disconnected interfaces will cause the systemd-networkd-wait-online two-minute timeout, as it waits for DHCP answers on them all.

This is a significant issue for people with physical servers because it's fairly routine for physical servers to have extra interfaces. Modern Dell 1U servers come with at least two, for example, and most of our servers are only using one. Do you have a server with 1G onboard but you need 10G so you put in an add-on card? Now you have two unused 1G ports that are open to this issue.

(Of course in theory you can avoid this issue by carefully going through all unused interfaces on every server install and doing the several steps to explicitly disable them. Since this requires fallible humans to not ever fail, you can guess what I think of it in practice.)

The somewhat obvious apparent workaround is to run a sed over your system's /etc/netplan/00-installer-config.yaml to turn 'dhcp4: true' into 'dhcp4: false'. Unfortunately this does not actually work. At boot time, any interface mentioned in your netplan configuration will become an interface known to systemd-networkd, and then systemd-networkd-wait-online will wind up waiting for it, even if there is no way it can get a configuration because it's not doing DHCP and has no IP address set.

Instead, you must either delete all inactive interfaces from your netplan configuration or, equivalently, write a completely new version of your netplan configuration that only mentions the active interfaces. Since as far as I know there are no command line tools to manipulate netplan files to delete interfaces and so on, the second approach may be easier to automate in a script. Remember that you're going to have to embed this script into the install image and arrange to run it at install time, unless you enjoy waiting two extra minutes for the system to boot the first time.
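The 'write a completely new netplan configuration' approach could look something like the following. This is a minimal sketch under several assumptions of mine, not a tested install-time script: it assumes that 'active' means a physical interface that currently reports link carrier in /sys/class/net, that the installer has left the interfaces up so that carrier can be read, and that every active interface should simply do DHCP (if you use static addresses, the rendering would have to be different). The function names are hypothetical.

```python
from pathlib import Path


def links_with_carrier(sysfs_net="/sys/class/net"):
    """Return the names of physical interfaces that currently have carrier.

    Assumption: 'has carrier right now' is a good enough stand-in for
    'is an interface we actually use'.
    """
    active = []
    for iface in sorted(Path(sysfs_net).iterdir()):
        # Physical NICs have a 'device' entry; lo, bridges and so on do not.
        if not (iface / "device").exists():
            continue
        try:
            carrier = (iface / "carrier").read_text().strip()
        except OSError:
            # Reading 'carrier' fails if the interface is administratively
            # down (or the file is missing); treat that as inactive.
            continue
        if carrier == "1":
            active.append(iface.name)
    return active


def render_netplan(interfaces):
    """Render a minimal netplan YAML document mentioning only the given
    interfaces, each set to DHCP (an assumption; adapt for static setups)."""
    if not interfaces:
        # Refuse to produce a config with no interfaces at all.
        raise ValueError("no active interfaces found")
    lines = ["network:", "  version: 2", "  ethernets:"]
    for name in interfaces:
        lines.append(f"    {name}:")
        lines.append("      dhcp4: true")
    return "\n".join(lines) + "\n"
```

A script run at install time would then write render_netplan(links_with_carrier()) over /etc/netplan/00-installer-config.yaml in the installed system. Generating the file from scratch sidesteps having to parse and edit the installer's YAML, at the cost of throwing away anything else the installer put there.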

This issue is probably much less acute for virtual servers, because my impression is that virtual servers are usually only configured with the network interfaces that they're actually going to use. Physical servers are not so convenient.

(Even if the network interfaces can be disabled in the BIOS, that requires a trip through the BIOS, and it makes life harder for people who reuse the physical hardware later.)

As far as I can tell from a number of attempts, there is also no way to fix this by modifying systemd-networkd-wait-online command line parameters. If I so much as touch these, things seem to explode, generally with s-n-w-o finishing much too fast, before the network is actually configured. Sometimes fiddling seems to trigger mysterious failures and timeouts starting other programs. Unfortunately s-n-w-o has no verbosity or debugging options; it's a silent black box, with no way of extracting what it's decided to look at, what it thinks the state is at various points, and so on.

(This elaborates on some tweets of mine.)

PS: Even in a 22.04 install without this issue, it can take over ten seconds for systemd-networkd-wait-online to decide that the network is actually online, for a configuration with a single, statically configured (virtual) network. I really don't know what it's doing there.

Written on 26 April 2022.

Last modified: Tue Apr 26 22:59:50 2022