A new and exciting failure mode for Linux UEFI booting

March 27, 2019

My work laptop boots with UEFI (and Secure Boot) instead of traditional MBR BIOS booting, because that's what makes modern laptops (and modern versions of Windows) happy. Since it only has a single disk anyway, some of the drawbacks of UEFI booting don't apply to it. However, today I got to discover a new and exciting failure mode of UEFI booting (at least in the Fedora configuration), which is a damaged FAT32 filesystem on the UEFI system partition. Unfortunately both identifying this problem and fixing it are much harder than you would like, partly because GRUB 2 seems to stay silent instead of reporting an error when something goes wrong while loading grub.cfg.

What happened is that I powered on my laptop as normal this morning, and when I looked back at it a bit later it was sitting there with just a 'grub>' prompt. Some flailing around with picking an alternate UEFI boot entry in the Dell BIOS established that my Windows install could boot. Some poking around in the GRUB 2 shell established that GRUB could see everything that I expected, but it wasn't loading grub.cfg from the UEFI system partition, although nothing seemed to complain (including when I manually used the 'configfile' command to try to load it). Eventually I used GRUB's 'cat' command to just dump grub.cfg, even though trying to load it was producing no errors, and at that point GRUB printed part of the file and stopped with an error about FAT32 problems.

(I don't remember the exact message at this point.)

Recovery from this started with putting together a Fedora 29 live USB stick (a more irritating process than it should be) and booting from it. My first step was to run fsck against the UEFI system partition, and here I made a mistake: when it identified various problems, including with grub.cfg and grubenv, I confidently told it to go ahead and fix things without carefully reading its proposed fixes. The FAT32 fsck promptly truncated grub.cfg to zero length, losing all of the somewhat intact contents that I could have used to boot the system. Fixing that required setting up a chroot environment with enough things mounted that grub2-mkconfig could run but not so many that it would hang (apparently having /sys present made this happen), rebooting with the resulting somewhat damaged grub.cfg, and then re-running grub2-mkconfig to get a new, fully proper GRUB 2 config file.
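In hindsight, a more cautious order of operations is to image the partition and do a read-only check before letting fsck change anything. The following is only a minimal sketch of that idea, not what was actually run here. The device name /dev/nvme0n1p1, the backup file name, and the 'run' wrapper with its DRY_RUN switch are all assumptions made for illustration; check 'lsblk -f' for the real UEFI system partition.

```shell
#!/bin/sh
# Sketch: back up and inspect a damaged UEFI system partition
# before repairing it. ESP_DEV and BACKUP are assumptions; adjust them.
ESP_DEV="${ESP_DEV:-/dev/nvme0n1p1}"
BACKUP="${BACKUP:-esp-backup.img}"

# Hypothetical helper: print commands instead of running them
# when DRY_RUN=1, so the plan can be reviewed first.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "+ $*"
    else
        "$@"
    fi
}

esp_safe_check() {
    # 1. Take a raw image of the partition, so that a bad repair
    #    can be undone or files can be recovered from the image later.
    run dd if="$ESP_DEV" of="$BACKUP" bs=1M
    # 2. Read-only pass: -n reports problems without writing anything.
    run fsck.vfat -n "$ESP_DEV"
}
```

Running 'DRY_RUN=1 esp_safe_check' prints the two commands without touching the disk. Once the read-only report has been read carefully, 'fsck.vfat -r' does an interactive repair that asks before each fix, and the image is still there if a fix turns out to be destructive.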

(To recreate a proper grubenv, the magic incantation is 'grub2-editenv grubenv create'. GRUB 2 will complain on every boot if you don't do this.)
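The chroot step and the grubenv incantation can be sketched together roughly as follows. This is a reconstruction under assumptions, not a record of the exact commands: the device names, the /mnt/sysroot mount point, and the choice of mounting only /dev and /proc are guesses for a default Fedora LVM layout with a separate /boot, and the 'run' wrapper with DRY_RUN is a hypothetical helper. The one detail taken from the experience above is that /sys is deliberately left unmounted. It also assumes the live environment has already activated the LVM volume group.

```shell
#!/bin/sh
# Sketch: regenerate grub.cfg and grubenv from a live USB via chroot.
# All device names and paths below are assumptions; adjust to 'lsblk -f'.
ROOT_DEV="${ROOT_DEV:-/dev/mapper/fedora-root}"
BOOT_DEV="${BOOT_DEV:-/dev/nvme0n1p2}"
ESP_DEV="${ESP_DEV:-/dev/nvme0n1p1}"
SYSROOT="${SYSROOT:-/mnt/sysroot}"
# Where Fedora keeps the UEFI GRUB 2 configuration inside the installed system.
GRUB_DIR=/boot/efi/EFI/fedora

# Hypothetical helper: print commands instead of running them when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "+ $*"
    else
        "$@"
    fi
}

rebuild_grub_cfg() {
    run mkdir -p "$SYSROOT"
    run mount "$ROOT_DEV" "$SYSROOT"
    run mount "$BOOT_DEV" "$SYSROOT/boot"
    run mount "$ESP_DEV" "$SYSROOT/boot/efi"
    # Only /dev and /proc are provided. /sys is skipped on purpose,
    # because having it present made grub2-mkconfig hang.
    run mount --bind /dev "$SYSROOT/dev"
    run mount -t proc proc "$SYSROOT/proc"
    # Regenerate the config, then recreate an empty environment block.
    run chroot "$SYSROOT" grub2-mkconfig -o "$GRUB_DIR/grub.cfg"
    run chroot "$SYSROOT" grub2-editenv "$GRUB_DIR/grubenv" create
}
```

'DRY_RUN=1 rebuild_grub_cfg' shows the plan without mounting anything. A grub.cfg generated this way may be incomplete, so it is worth re-running grub2-mkconfig once the system has booted normally.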

As far as I can remember, I did nothing unusual with my laptop recently, although I did do a Fedora kernel upgrade (and reboot) and boot Windows to check for updates to it. There were no crashes, no abrupt or forced power-offs, no nothing that ought to have corrupted any filesystem, much less an infrequently touched UEFI system partition. But it did get corrupted. Sadly, in one sense this doesn't surprise me, because FAT32 has a reputation as a fragile file system, especially if different things update it, since various different OSes (and tools) have different FAT32 filesystem code.

(One of the strong recommendations for the FAT32 formatted memory cards used in digital cameras, for example, is that they should be formatted in the camera and only ever written to by the camera. Otherwise you risk the camera not coping with something your computer does to the filesystem or vice versa.)

Part of this issue is due to the choice to put grub.cfg into the UEFI system partition (which is not universal; see the comments on that entry). The grub.cfg file is frequently updated, and the more often you modify a fragile filesystem the more chances you have for a problem. I don't think it's a coincidence that both grub.cfg and grubenv were damaged.

Sidebar: Why I didn't try to boot a kernel by hand from GRUB

I had two reasons for this. First, at the time I wasn't sure if my root filesystem was intact either or if I had more widespread issues than a problem on the UEFI system partition. Second, have you looked at the command lines required to boot a modern kernel? I can't possibly remember everything that goes on one or produce it from scratch, and my laptop's /proc/cmdline seems to be one of the shorter ones. Specifically, it is:

BOOT_IMAGE=/vmlinuz-5.0.3-200.fc29.x86_64 root=/dev/mapper/fedora-root ro rd.lvm.lv=fedora/root rd.lvm.lv=fedora/swap rhgb quiet

Some of that I could probably leave out, like BOOT_IMAGE, and in this situation I probably want to leave out 'rhgb quiet'. But the rest clearly matters, and I didn't have another stock Fedora system around for reference on what it should look like.
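One way to avoid having to remember all of this is to save /proc/cmdline somewhere while the system is healthy and trim it mechanically later. The sketch below does only that trimming and prints the lines one would type at the 'grub>' prompt. The function names are made up for illustration, and the use of 'linuxefi' and 'initrdefi' plus the /initramfs-VERSION.img naming are assumptions about Fedora's UEFI GRUB 2 build of this era; other GRUB builds use plain 'linux' and 'initrd'.

```shell
#!/bin/sh
# Sketch: turn a saved /proc/cmdline into lines to type at 'grub>'.

# Drop BOOT_IMAGE= (which GRUB adds itself) plus 'rhgb' and 'quiet',
# so that boot messages stay visible while debugging.
# The function body is a subshell, so its variables stay local.
trim_cmdline() (
    set -f          # no filename globbing while word-splitting $1
    out=""
    for word in $1; do
        case "$word" in
            BOOT_IMAGE=*|rhgb|quiet) ;;
            *) out="${out:+$out }$word" ;;
        esac
    done
    printf '%s\n' "$out"
)

# Print the GRUB commands for a kernel version and a saved command line.
# Paths are relative to the /boot filesystem, matching BOOT_IMAGE above.
grub_boot_lines() {
    printf '%s\n' "linuxefi /vmlinuz-$1 $(trim_cmdline "$2")"
    printf '%s\n' "initrdefi /initramfs-$1.img"
    printf '%s\n' "boot"
}
```

For example, 'grub_boot_lines 5.0.3-200.fc29.x86_64 "$(cat saved-cmdline)"' (where saved-cmdline is a hypothetical file holding a copy of /proc/cmdline) prints three lines. They only work at the prompt if GRUB's 'root' variable already points at the /boot filesystem.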

Written on 27 March 2019.
