A new and exciting failure mode for Linux UEFI booting
My work laptop boots with UEFI (and Secure Boot) instead of the traditional MBR BIOS booting, because that's what makes modern laptops (and modern versions of Windows) happy. Since it only has a single disk anyway, some of the drawbacks of UEFI booting don't apply to it. However, today I got to discover a new and exciting failure mode of UEFI booting (at least in the Fedora configuration), which is a damaged UEFI system partition FAT32 filesystem. Unfortunately both identifying this problem and fixing it are much harder than you would like, partly because GRUB 2 seems to omit reporting error messages when things go wrong loading grub.cfg.
What happened is that I powered on my laptop as normal this morning,
and when I looked back at it a bit later it was sitting there with
just a 'grub>' prompt. Some flailing around with picking an
alternate UEFI boot entry in the Dell BIOS established that my
Windows install could boot. Some poking around in the GRUB 2 shell
established that GRUB could see everything that I expected, but it
wasn't loading grub.cfg from the UEFI system partition, although
nothing seemed to complain (including when I manually used the
'configfile' command to try to load it). Eventually I used GRUB's
'cat' command to just dump the grub.cfg, even though trying to
load it was producing no errors, and at that point GRUB printed
part of the file and stopped with an error about FAT32 problems.
(I don't remember the exact message at this point.)
Recovery from this started with putting together a Fedora 29 live
USB stick (a more irritating process than it should be) and booting
from it. My first step was to run
fsck against the UEFI system
partition, in which I made a mistake; when it identified various
problems, including with
grubenv, I confidently
told it to go ahead and fix things without carefully reading its
proposed fixes. The FAT32 fsck promptly truncated grub.cfg to
0 size, losing all of the somewhat intact contents that I could
have used to boot the system with. Fixing that required setting up
a chroot environment with enough things mounted that grub2-mkconfig
could run but not so many that it would hang when run (apparently
having /sys present made this happen), rebooting with this somewhat
damaged grub.cfg, and then re-doing the
grub2-mkconfig to get a
new fully proper GRUB 2 config file.
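In hindsight, the safer order would have been to check without writing, copy everything off, and only then let fsck repair anything. The following is a rough sketch of that order, not what I actually ran. The device name /dev/sdX1, the backup directory, and the run() dry-run helper are all placeholders I made up for illustration. The '-n' option to fsck.vfat is the dosfstools no-write check mode.

```shell
#!/bin/sh
# Sketch only. ESP is a placeholder for the UEFI system partition;
# find the real device with lsblk or blkid before running anything.
ESP="${ESP:-/dev/sdX1}"
BACKUP="${BACKUP:-/tmp/esp-backup}"

# Hypothetical helper: print each command, and only execute it
# when DRY_RUN=0 is set explicitly.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

check_esp() {
    # -n: report problems but write nothing to the filesystem.
    run fsck.vfat -n "$ESP"
}

backup_esp() {
    # Mount read-only and copy everything off before any repair,
    # so a half-intact grub.cfg survives whatever fsck decides to do.
    run mkdir -p /mnt/esp "$BACKUP"
    run mount -o ro "$ESP" /mnt/esp
    run cp -a /mnt/esp/. "$BACKUP"/
    run umount /mnt/esp
}

check_esp
backup_esp
```

Run as is, this only prints the commands. With DRY_RUN=0 and a real device in ESP it would actually execute them (as root). Copying files off a damaged filesystem may itself fail partway, but a partial copy is still better than a zero-length file.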
(To recreate a proper grubenv, the magic incantation is
'grub2-editenv grubenv create'. GRUB 2 will complain on every boot if
you don't do this.)
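For reference, the chroot-based regeneration described above might look roughly like the following. This is a reconstruction under assumptions, not a transcript. The root device comes from my kernel command line, but BOOT_DEV and ESP are placeholders, the choice of which pseudo-filesystems to bind mount is my guess at 'enough things', and run() is a made-up dry-run helper. The /boot/efi/EFI/fedora location is where Fedora keeps these files on UEFI systems.

```shell
#!/bin/sh
# Sketch only. BOOT_DEV and ESP are placeholders; ROOT_DEV assumes
# the LVM volumes are already active in the live environment.
ROOT_DEV="${ROOT_DEV:-/dev/mapper/fedora-root}"
BOOT_DEV="${BOOT_DEV:-/dev/sdX2}"
ESP="${ESP:-/dev/sdX1}"
SYSROOT="${SYSROOT:-/mnt/sysroot}"
GRUB_DIR=/boot/efi/EFI/fedora

# Hypothetical helper: print each command, and only execute it
# when DRY_RUN=0 is set explicitly.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

mount_chroot() {
    run mkdir -p "$SYSROOT"
    run mount "$ROOT_DEV" "$SYSROOT"
    run mount "$BOOT_DEV" "$SYSROOT/boot"
    run mount "$ESP" "$SYSROOT/boot/efi"
    # Bind only /dev and /proc. /sys is deliberately left out, since
    # having it present apparently made grub2-mkconfig hang.
    run mount --bind /dev "$SYSROOT/dev"
    run mount --bind /proc "$SYSROOT/proc"
}

regenerate_grub() {
    # Recreate an empty environment block, then the config itself.
    run chroot "$SYSROOT" grub2-editenv "$GRUB_DIR/grubenv" create
    run chroot "$SYSROOT" grub2-mkconfig -o "$GRUB_DIR/grub.cfg"
}

mount_chroot
regenerate_grub
```

As with any dry-run wrapper, this prints the commands by default and only runs them with DRY_RUN=0. A grub.cfg generated this way may be incomplete, which is why it is worth running grub2-mkconfig again after booting the real system.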
As far as I can remember, I did nothing unusual with my laptop recently, although I did do a Fedora kernel upgrade (and reboot) and boot Windows to check for updates to it. There were no crashes, no abrupt or forced power-offs, no nothing that ought to have corrupted any filesystem, much less an infrequently touched UEFI system partition. But it did get corrupted. Sadly, in one sense this doesn't surprise me, because FAT32 has a reputation as a fragile file system, especially if different things update it, since various different OSes (and tools) have different FAT32 filesystem code.
(One of the strong recommendations for the FAT32 formatted memory cards used in digital cameras, for example, is that they should be formatted in the camera and only ever written to by the camera. Otherwise you risk the camera not coping with something your computer does to the filesystem or vice versa.)
Part of this issue is due to the choice to put grub.cfg into the UEFI system partition (which is not universal, see the comments on that entry). Grub.cfg is a frequently updated file, and the more often you modify a fragile filesystem the more chances you have for a problem. I don't think it's a coincidence that both grub.cfg and grubenv were damaged.
Sidebar: Why I didn't try to boot a kernel by hand from GRUB
I had two reasons for this. First, at the time I wasn't sure if
my root filesystem was intact either or if I had more widespread
issues than a problem on the UEFI system partition. Second, have
you looked at the command lines required to boot a modern kernel?
I can't possibly remember everything that goes on one or produce
it from scratch, and my laptop's
/proc/cmdline seems to be one
of the shorter ones. Specifically, it is:
BOOT_IMAGE=/vmlinuz-5.0.3-200.fc29.x86_64 root=/dev/mapper/fedora-root ro rd.lvm.lv=fedora/root rd.lvm.lv=fedora/swap rhgb quiet
Some of that I could probably leave out, and in this situation I
probably want to leave out 'rhgb quiet'. But
the rest clearly matters, and I didn't have another stock Fedora
system around for reference on what it should look like.
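One way to avoid having to remember all of this would be to save /proc/cmdline somewhere off the machine while things still work, and turn it into GRUB commands when needed. Here is a small sketch of that conversion. The function name is made up, it assumes BOOT_IMAGE is a bare path like mine (not prefixed with a GRUB device), and it assumes the usual Fedora convention of an initramfs-VERSION.img sitting next to vmlinuz-VERSION. Fedora's UEFI GRUB uses the linuxefi and initrdefi commands.

```shell
#!/bin/sh
# Sketch only: turn a saved /proc/cmdline into lines to type at a
# 'grub>' prompt. grub_lines_from_cmdline is a hypothetical name.
grub_lines_from_cmdline() {
    kernel=""
    args=""
    # Unquoted on purpose so the command line splits into words.
    for word in $1; do
        case "$word" in
            BOOT_IMAGE=*) kernel="${word#BOOT_IMAGE=}" ;;
            rhgb|quiet)   ;;  # dropped so boot messages stay visible
            *)            args="${args:+$args }$word" ;;
        esac
    done
    # Assumes the initramfs shares the kernel's version string.
    version="${kernel#/vmlinuz-}"
    echo "linuxefi $kernel $args"
    echo "initrdefi /initramfs-$version.img"
    echo "boot"
}

# Example use on a working system:
#   grub_lines_from_cmdline "$(cat /proc/cmdline)"
```

You would still have to point GRUB's root at the /boot partition by hand before typing these lines, which is not something the saved command line can tell you.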