2019-08-16
A gotcha with Fedora 30's switch of Grub to BootLoaderSpec based configuration
I upgraded my office workstation from Fedora
29 to Fedora 30 yesterday. In the past, such upgrades have been problem
free, but this time around things went fairly badly, with the
first and largest problem being that after the upgrade, booting any
kernel gave me a brief burst of kernel messages, then a blank screen
and after a few minutes a return to the BIOS and Grub main menu.
To get my desktop to boot at all, I had to add 'nomodeset' to the
kernel command line; among other consequences, this made my desktop
a single display machine instead of a dual display one.
(It was remarkably disorienting to have my screen mirrored across both displays. I kept trying to change to the 'other' display and having things not work.)
The short version of the root cause is that my grub.cfg was rebuilt
using outdated kernel command line arguments that came from
/etc/default/grub, instead of the current command line arguments
that had previously been used in my original grub.cfg. Because of
how the Fedora 30 grub.cfg is implemented, these wrong command line
arguments were then remarkably sticky and it wasn't clear how to
change them.
In Fedora 29 and earlier, your grub.cfg is probably being maintained
through grubby, Fedora's program for this. When grubby adds a menu
entry for a new kernel, it more or less copies the kernel command
line arguments from your current one. While there is a
GRUB_CMDLINE_LINUX setting in /etc/default/grub, its contents are
ignored until and unless you rebuild your grub.cfg from scratch, and
there's nothing that tries to update it from what your current
kernels in your current grub.cfg are actually using. This means that
your /etc/default/grub version can wind up being very different from
what you're currently using and actually need to make your kernels
work.
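One way to see how far /etc/default/grub has drifted is to compare its GRUB_CMDLINE_LINUX against the running kernel's /proc/cmdline. What follows is a minimal sketch, not anything Fedora ships. The function names default_cmdline and stale_args are made up for illustration, and it assumes the defaults file can safely be sourced as a shell fragment (which is how grub2-mkconfig reads it).

```shell
#!/bin/sh
# Print GRUB_CMDLINE_LINUX from a grub defaults file. The file is sourced
# in a subshell so nothing leaks into the calling shell. Pass a path that
# contains a slash, otherwise '.' will search $PATH for it.
default_cmdline() (
    . "$1"
    printf '%s\n' "$GRUB_CMDLINE_LINUX"
)

# Print, one per line, every argument in $1 that does not appear in $2.
# Both are space-separated kernel command lines.
stale_args() (
    set -f    # no glob expansion while splitting $1 into words
    for a in $1; do
        case " $2 " in
            *" $a "*) ;;                    # also in the second command line
            *) printf '%s\n' "$a" ;;
        esac
    done
)

# Usage: show arguments that a grub.cfg regeneration would bring back
# even though the running kernel was booted without them:
#   stale_args "$(default_cmdline /etc/default/grub)" "$(cat /proc/cmdline)"
```

Anything this prints is an argument you removed from your real kernel command lines at some point but that is still lurking in /etc/default/grub.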
One of the things that usually happens by default when you upgrade
to Fedora 30 is that Fedora switches how grub.cfg is created and
updated from the old way of doing it itself via grubby to using a
Boot Loader Specification (BLS) based scheme; you can read about
this switch in the Fedora wiki. This switch regenerates your
grub.cfg using a shell script called (in Fedora)
grub2-switch-to-blscfg, and this shell script of course uses
/etc/default/grub's GRUB_CMDLINE_LINUX as the source of the kernel
arguments.
(This is controlled by whether GRUB_ENABLE_BLSCFG is set to true or
false in your /etc/default/grub. If it's not set at all,
grub2-switch-to-blscfg adds a 'GRUB_ENABLE_BLSCFG=true' setting to
/etc/default/grub for you, and of course goes on to regenerate your
grub.cfg. grub2-switch-to-blscfg itself is run from the Fedora 30
grub2-tools RPM posttrans scriptlet if GRUB_ENABLE_BLSCFG is not
already set to something in your /etc/default/grub.)
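If you want to know ahead of time which way an upgrade will go, you can look at the setting before you start. This is a small sketch with a made-up helper name (blscfg_setting). It only handles a plain 'GRUB_ENABLE_BLSCFG=value' line, optionally quoted, which is an assumption about how the file is written.

```shell
#!/bin/sh
# Print the value of GRUB_ENABLE_BLSCFG from a grub defaults file, or
# 'unset' if there is no such line. If the setting appears more than
# once, the last one wins, as it would when the file is sourced.
blscfg_setting() {
    value=$(sed -n 's/^GRUB_ENABLE_BLSCFG=//p' "$1" | tail -n 1 | tr -d "\"'")
    printf '%s\n' "${value:-unset}"
}

# Usage:
#   blscfg_setting /etc/default/grub
```

Going by the scriptlet behavior described above, 'unset' is the case where the grub2-tools upgrade will run grub2-switch-to-blscfg for you.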
A regenerated grub.cfg has a default_kernelopts setting, and that
looks like it should be what you want to change. However, it is not.
The real kernel command line for normal BLS entries is actually in
the Grub2 $kernelopts environment variable, which is loaded from the
grubenv file, normally /boot/grub2/grubenv (which may be a symlink
to /boot/efi/EFI/fedora/grubenv, even if you're not actually using
EFI boot). The best way to change this is to use
'grub2-editenv - list' and 'grub2-editenv - set kernelopts="..."'.
I assume that default_kernelopts is magically used by the blscfg
Grub2 module if $kernelopts is unset, and possibly gets written back
to grubenv by Grub2 in that case.
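Since $kernelopts is one long string, removing a single bad argument means reading the current value, editing it, and writing the whole thing back. Here is a hedged sketch of that round trip. The helper names current_kernelopts and remove_karg are my own invention. The only grub2-editenv invocations used are the 'list' and 'set' forms mentioned above.

```shell
#!/bin/sh
# Read 'grub2-editenv - list' output on stdin and print the value of
# kernelopts (everything after the first '=' on that line).
current_kernelopts() {
    sed -n 's/^kernelopts=//p'
}

# Print the command line in $2 with every exact occurrence of the
# argument $1 removed. Runs in a subshell so 'set -f' stays local.
remove_karg() (
    set -f    # no glob expansion while splitting $2 into words
    drop=$1
    out=
    for arg in $2; do
        [ "$arg" = "$drop" ] && continue
        out="${out:+$out }$arg"
    done
    printf '%s\n' "$out"
)

# Usage (as root), dropping amdgpu.dpm=0 from the saved command line:
#   old=$(grub2-editenv - list | current_kernelopts)
#   new=$(remove_karg amdgpu.dpm=0 "$old")
#   echo "$new"                                # look before you write
#   grub2-editenv - set "kernelopts=$new"
```

Printing the new value before setting it is deliberate. A mangled kernelopts is exactly the sort of thing that sends you back to editing command lines by hand in the Grub menu.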
(You can check that your kernels are using $kernelopts by inspecting
an entry in /boot/loader/entries and seeing that it has
'options $kernelopts' instead of anything else. You can manually
change that for a specific entry if you want to.)
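Checking entries by eye gets tedious once you have a few kernels installed, so here is a rough sketch that does it for you. The function names entry_options and uses_kernelopts are hypothetical, and I'm assuming the entries are the usual *.conf files with a single 'options' line each.

```shell
#!/bin/sh
# Print whatever follows 'options' in a BLS entry file.
entry_options() {
    sed -n 's/^options[[:space:]]*//p' "$1"
}

# Succeed if the entry's options line is the literal string $kernelopts,
# optionally followed by more arguments after a space.
uses_kernelopts() {
    case $(entry_options "$1") in
        '$kernelopts'|'$kernelopts '*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage: report entries that carry their own hard-coded command line.
#   for f in /boot/loader/entries/*.conf; do
#       uses_kernelopts "$f" || echo "$f: $(entry_options "$f")"
#   done
```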
This is going to make it more interesting (by which I mean annoying)
if and when I need to change my standard kernel options. I think
I'm going to have to change all of /etc/default/grub, the kernelopts
in grubenv, and the default_kernelopts in grub.cfg, just to be sure.
If I was happy with the auto-generated grub.cfg, I could just change
/etc/default/grub and force a regeneration, but I'm not and I have
not yet worked out how to make its handling of the video modes and
the menus agree with what I want (which is a basic text experience).
(While I was initially tempted to leave my system as a non-BLS
system, I changed my mind because of long term issues. Fedora will
probably drop support for grubby based setups sooner or later,
so I might as well get on the BLS train now.)
To give credit where it's due, one (lucky) reason that I was able to eventually work out all of this is that I'd already heard about problems with the BLS transition in Fedora 30 in things like "Fedora 30: When grub2-mkconfig Doesn’t Work" and "My experiences upgrading to Fedora 30". Without that initial awareness of the existence of the BLS transition in Fedora 30 (and the problems it caused people), I might have been flailing around for even longer than I was.
PS: As a result of all of this, I've discovered that you no longer
need to specify the root device in the kernel command line arguments.
I assume the necessary information for that is in the dracut-built
initramfs. As far as the blank screen and kernel panics go, I suspect
that the cause is either or both of 'amdgpu.dpm=0' and 'logo.nologo',
which were still present in the /etc/default/grub arguments but
which I'd long since removed from my actual kernel command lines.
(I could conduct more experiments to try to find out which kernel argument is the fatal one, but my interest in more reboots is rather low.)
Update, August 21st: I needed to reboot my machine to apply a Fedora
kernel update, so I did some experiments and the fatal kernel command
line argument is amdgpu.dpm=0, which I needed when the machine was
new but had turned off since then.
Systemd and waiting until network interfaces or addresses are configured
One of the things that systemd is very down on is the idea of running services after 'the network is up', whatever that means; the systemd people have an entire web page on the subject. This is all well and good in theory, but in practice there are plenty of situations where I need to only start certain things after either a named network interface is present or an IP address exists. For a concrete example, you can't set up various pieces of policy based routing for an interface until the interface actually exists. If you're configuring this on boot in a systemd based system (especially one using networkd), you need some way to ensure the ordering. Similarly, sometimes you need to listen only on some specific IP addresses and the software you're using doesn't have Linux specific hacks to do that when the IP address doesn't exist yet.
(As a grumpy sysadmin, I actually don't like the behavior of binding to an IP address that doesn't exist, because it means that daemons will start and run even if the system will never have the IP address. I would much rather delay daemon startup until the IP address exists.)
Systemd does not have direct native support for any of this, of course. There's no way to directly say that you depend on an interface or an IP address, and in general the dependency structure has long been under-documented. The closest you can get to waiting until a named network interface exists is to specify an After= and perhaps a Wants= or a Requires= on the pseudo-unit for the network interface, 'sys-subsystem-net-devices-<iface>.device'. However, as I found out, the lack of a .device unit doesn't always mean that the interface doesn't exist.
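To make the .device approach concrete, here is a sketch of a shell function that prints a drop-in with the After= and Wants= lines for a given interface. The function name iface_dropin, the interface eth0, and the unit mydaemon.service are all hypothetical. It deliberately refuses interface names with characters (such as '-') that systemd would escape in the unit name, because getting that escaping right is a job for systemd-escape, not for a sketch like this.

```shell
#!/bin/sh
# Print a [Unit] drop-in that orders a service after (and pulls in) the
# .device pseudo-unit for network interface $1. Only handles interface
# names that need no escaping in a systemd unit name.
iface_dropin() {
    case $1 in
        ''|*[!A-Za-z0-9_.:]*)
            echo "iface_dropin: '$1' needs systemd-escape" >&2
            return 1 ;;
    esac
    unit="sys-subsystem-net-devices-$1.device"
    printf '[Unit]\nWants=%s\nAfter=%s\n' "$unit" "$unit"
}

# Usage (as root), for a hypothetical mydaemon.service and eth0:
#   mkdir -p /etc/systemd/system/mydaemon.service.d
#   iface_dropin eth0 > /etc/systemd/system/mydaemon.service.d/wait-iface.conf
#   systemctl daemon-reload
```

Swapping Wants= for Requires= turns this into a hard dependency. Whether you want that depends on how you feel about the daemon not starting at all when the interface never shows up.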
You might think that in order to wait for an IP address to exist, you could specify an After= for the .device unit it's created in and by. However, this has historically had issues for me; under at least some versions of systemd, the .device unit would be created before the IP address was configured. In my particular situation, what worked at the time was to wait for a VLAN interface .device that was on top of the real interface that had the IP address (and yes, I mix tagged VLANs with an untagged network). By the time the VLAN .device existed, the IP address had relatively reliably been set up.
If you're using systemd-networkd and care about network interfaces, the easiest approach is probably to rely on systemd-networkd-wait-online.service; how it works and what it waits for is probably about as good as you can get. For IP addresses, as far as I know there's no native thing that specifically waits until some or all of your static IP addresses are present. Waiting for systemd-networkd-wait-online is probably going to be good enough for most circumstances, but if I needed better I would probably write a shell script (and a .service unit for it) that simply waited until the IP addresses I needed were present.
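The 'wait until the IP addresses exist' script could look something like the following. This is a minimal sketch under several assumptions: it polls 'ip -o addr show' once a second, it matches addresses exactly (without the prefix length), and it doesn't care which interface an address is on or whether an IPv6 address is still tentative. The function names has_addr and wait_for_addrs and the example addresses are made up.

```shell
#!/bin/sh
# Read 'ip -o addr show' output on stdin and succeed if address $1
# (without a /prefix) is configured on any interface.
has_addr() {
    awk -v want="$1" '
        $3 == "inet" || $3 == "inet6" {
            split($4, a, "/")
            if (a[1] == want) found = 1
        }
        END { exit !found }'
}

# wait_for_addrs TRIES ADDR...
# Check each address up to TRIES times, sleeping one second between
# checks. Fail as soon as one address runs out of tries.
wait_for_addrs() {
    tries=$1; shift
    for addr in "$@"; do
        n=0
        until ip -o addr show | has_addr "$addr"; do
            n=$((n + 1))
            [ "$n" -ge "$tries" ] && return 1
            sleep 1
        done
    done
}

# Usage: give each address up to 30 checks, then run the real daemon.
#   wait_for_addrs 30 192.0.2.10 2001:db8::10 && exec /usr/sbin/mydaemon
```

In practice I would run this from its own oneshot .service unit that the real daemon's unit is ordered after, so that a timeout shows up as a clear unit failure instead of as a daemon that mysteriously never started.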
(I continue to think that it's a real pity that you can't configure networkd .network files to have 'network up' and 'network down' scripts, especially since their stuff for routing and policy based routing is really very verbose.)
PS: One of the unfortunate effects of the under-documented dependency structure and the lack of clarity of what to wait on is a certain amount of what I will call 'superstitious dependencies', things that you've put into your systemd units without fully understanding whether or not you needed them, and why (often also without fully documenting them). This is fine most of the time, but then one day an unnecessary dependency fails to start or perhaps exist and then you're unhappy. That's part of why I would like explicit and reliable ways to do all of this.