Wandering Thoughts archives

2020-11-26

The better way to make an Ubuntu 20.04 ISO that will boot on UEFI systems

Yesterday I wrote about how I made a 20.04 ISO that booted on UEFI systems. It was a messy process with some peculiar things that I didn't understand and places where I had to deviate from Debian's excellent documentation on Repacking a Debian ISO. In response to my entry, Thomas Schmitt (the author of xorriso) got in touch with me and very generously helped me figure out what was really going on. The short version is that I was confused and my problems were due to some underlying issues. So now I have had some learning experiences and I have a better way to do this.

First, I've learned that you don't want to extract ISO images with 7z, however tempting and easy it seems. 7z has at least two issues with ISO images; it will quietly add the El Torito boot images to the extracted tree, in a new subdirectory called '[BOOT]', and it doesn't extract symlinks (and probably not other Rock Ridge attributes). The Ubuntu 20.04.1 amd64 live server image has some symlinks, although their presence isn't essential.

The two reliable ways I know of to extract the 20.04.1 ISO image are with bsdtar (part of the libarchive-tools package in Ubuntu) and with xorriso itself. Bsdtar is easier to use but you probably don't have it installed, while you need xorriso anyway and might as well use it for this once you know how. So to unpack the ISO into our scratch tree, you want:

xorriso -osirrox on -indev example.iso -extract / SCRATCH-TREE

(See the Debian wiki for a step you'll want to take afterward in order to be able to delete the tree. Substitute your actual ISO name for example.iso.)
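
For comparison, the bsdtar version is short (this is my own sketch, assuming the libarchive-tools package is installed):

mkdir SCRATCH-TREE
bsdtar -C SCRATCH-TREE -xpf example.iso

Either way, the extracted files keep the ISO's read-only permissions, so a 'chmod -R u+w SCRATCH-TREE' is what lets you modify the tree and delete it later; I believe this is the step the Debian wiki has in mind.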

As I discovered due to my conversation with Thomas Schmitt, it can be important to re-extract the tree any time you think something funny is going on. My second issue was that my tree's boot/grub/efi.img had been quietly altered by something in a way that removed its FAT signature and made UEFI systems refuse to recognize it (I suspect some of my experimentation with mkisofs did it, but I don't know for sure).
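
One quick way to check for this sort of quiet damage (a habit of mine, not something from the Debian instructions) is to ask file what it thinks of the image:

file SCRATCH-TREE/boot/grub/efi.img

A pristine efi.img should be reported as a DOS/MBR boot sector with a FAT filesystem; if nothing about FAT shows up in the output, re-extract the tree.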

In a re-extracted tree with a pristine boot/grub/efi.img, the tree's efi.img was valid as an El Torito EFI boot image (and the isolinux.bin is exactly what was used for the original 20.04.1 ISO's El Torito BIOS boot image). So the command to rebuild an ISO that is bootable both as UEFI and BIOS, both as a DVD image and on a USB stick, is:

xorriso -as mkisofs -r \
  -V 'Our Ubuntu 20.04 UEFI enabled' \
  -o cslab_ubuntu_20.04.iso \
  -isohybrid-mbr isohdpfx.bin \
  -J -joliet-long \
  -b isolinux/isolinux.bin -c isolinux/boot.cat \
  -boot-load-size 4 -boot-info-table -no-emul-boot \
  -eltorito-alt-boot -e boot/grub/efi.img -no-emul-boot \
  -isohybrid-gpt-basdat \
  SCRATCH-TREE

(The isohdpfx.bin file is generated following the instructions in the Debian wiki page. This entire command line is pretty much what the Debian wiki says to do.)
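
For the record, the Debian wiki's recipe for generating isohdpfx.bin is to copy the first 432 bytes of the original ISO, which contain the isohybrid MBR code:

dd if=example.iso bs=1 count=432 of=isohdpfx.bin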

If xorriso doesn't complain that some symlinks can't be represented in a Joliet file name tree, you haven't extracted the 20.04.1 ISO image exactly; something has dropped the symlinks that should be there.
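
A direct way to check this in the extracted tree (my own suggestion) is:

find SCRATCH-TREE -type l

This should list some symlinks; if it prints nothing, your extraction dropped them.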

If you're modifying the ISO image to provide auto-installer data, you need to change both isolinux/txt.cfg and boot/grub/grub.cfg. The necessary modifications are covered in setting up a 20.04 ISO image to auto-install a server (for isolinux) and then yesterday's entry (for GRUB). You may also want to add various additional files and pieces of data to the ISO, which can be done by dropping them into the unpacked tree.
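
As an illustration of that last bit, our installer data lives under /cslab/inst on the ISO (to match the 'ds=nocloud;s=/cdrom/cslab/inst/' kernel argument), so adding it is just copying files into the scratch tree before running xorriso. A sketch, with hypothetical local file names:

mkdir -p SCRATCH-TREE/cslab/inst
cp user-data meta-data SCRATCH-TREE/cslab/inst/

(The NoCloud data source wants both a user-data and a meta-data file, although meta-data can be empty.)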

(It's also apparently possible to update the version of the installer that's in the ISO image, per here, but the make-edge-iso.sh and inject-subiquity-snap.sh scripts it points to in the subiquity repo are what I would call not trivial, and so are beyond what I want monkeying around in our ISO trees. I've already done enough damage without realizing it in my first attempts. I'll just wait for 20.04.2.)

On the whole this has been a learning experience about the perils of not questioning my assumptions and not re-checking my work. I have the entire process of preparing the extracted ISO's scratch tree more or less automated, so at any time I could have deleted the existing scratch tree, re-extracted the ISO (even with 7z), and managed to build a working UEFI booting ISO with boot/grub/efi.img. But I just assumed that the tree was fine and hadn't been changed by anything, and I never questioned various oddities until later (including the '[BOOT]' subdirectory, which wasn't named like anything else on the ISO image).

Ubuntu2004ISOWithUEFI-2 written at 23:39:15; Add Comment

Making an Ubuntu 20.04 ISO that will boot on UEFI systems

As part of our overall install process, for years we've used customized Ubuntu server install images (ie, ISOs, often burned onto actual DVDs) that were set up with preseed files for the Debian installer and a few other things we wanted on our servers from the start. These ISOs have been built in the traditional way with mkisofs and so booted with isolinux. This was fine for a long time because pretty much all of our servers used traditional MBR BIOS booting, which is what ISOs use isolinux for. However, for reasons outside the scope of this entry, today we wanted to make our 20.04 ISO image also boot on systems using UEFI boot. This turned out to be more complicated than I expected.

(For basic background on this, see my earlier entry on setting up a 20.04 ISO image to auto-install a server.)

First, as my co-workers had already discovered long ago, Linux ISOs do UEFI booting using GRUB2, not isolinux, which means that you need to customize the grub.cfg file in order to add the special command line parameters to tell the installer about your 20.04 installer data. We provide the installer data in the ISO image, which means that our kernel command line arguments contain a ';'. In GRUB2, I discovered that this must be quoted:

menuentry "..." {
  [...]
  linux /casper/vmlinuz quiet "ds=nocloud;s=/cdrom/cslab/inst/" ---
  [...]
}

(I advise you to modify the title of the menu entries in the ISO's grub.cfg so that you know it's using your modified version. It's a useful reassurance.)

If you don't do this quoting, all the kernel (and the installer) see is a 'ds=nocloud' argument. Your installer data will be ignored (despite being on the ISO image) and you may get confused about what's wrong.

The way ISOs are made bootable is that they have at least one El Torito boot section (see also the OsDev Wiki). A conventional BIOS bootable ISO has one section; one that can also be booted through UEFI has a second one that is more intricate. You can examine various information about El Torito boot sections with dumpet, which is in the standard Ubuntu repositories.
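
For example, something like this lists the boot catalog entries and then writes the boot images out as separate files (hedged; check the dumpet manpage for the exact options in your version):

dumpet -i example.iso
dumpet -i example.iso -d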

In theory I believe mkisofs can be used to add a suitable extra ET boot section. In practice, everyone has switched to building ISO images with xorriso, for good reason. The easiest to follow guide on using xorriso for this is the Debian Wiki page on Repacking a Debian ISO, which not only has plenty of examples but goes the extra distance to explain what the many xorriso arguments mean and do (and why they matter). This is extremely useful since xorriso has a large and complicated manpage and other documentation.

Important update: The details of much of the rest of this entry turn out not to be right, because I had a corrupted ISO tree with altered files. For a better procedure and more details, see The better way to make an Ubuntu 20.04 ISO that will boot on UEFI systems. The broad overview of UEFI requiring a GRUB2 EFI image is accurate, though.

However, Ubuntu has a surprise for us (of course). UEFI bootable Linux ISOs need a GRUB2 EFI image that is embedded into the ISO. Many examples, including the Debian wiki page, get this image from a file in the ISO image called boot/grub/efi.img. The Ubuntu 20.04.1 ISO image has such a file, but it is not actually the correct file to use. If you build an ISO using this efi.img as the El Torito EFI boot image, it will fail on at least some UEFI systems. The file you actually want to use turns out to be '[BOOT]/2-Boot-NoEmul.img' in the ISO image.

(Although the 20.04.1 ISO image's isolinux/isolinux.bin works fine as the El Torito BIOS boot image, it also appears to not be what the original 20.04.1 ISO was built with. The authentic thing seems to be '[BOOT]/1-Boot-NoEmul.img'. I'm just thankful that Ubuntu put both in the ISO image, even if it sort of hid them.)

Update: These '[BOOT]' files aren't in the normal ISO image itself, but are added by 7z (likely from the El Torito boot sections) when it extracts the ISO image into a directory tree for me. The isolinux.bin difference is from a boot info table that contains the block offsets of isolinux.bin in the ISO. The efi.img differences are currently more mysterious.

The resulting xorriso command line I'm using right now is more or less:

xorriso -as mkisofs -r \
  -V 'Our Ubuntu 20.04 UEFI enabled' \
  -o cslab_ubuntu_20.04.iso \
  -isohybrid-mbr isohdpfx.bin \
  -J -joliet-long \
  -b isolinux/isolinux.bin -c isolinux/boot.cat \
  -boot-load-size 4 -boot-info-table -no-emul-boot \
  -eltorito-alt-boot -e '[BOOT]/2-Boot-NoEmul.img' -no-emul-boot \
  -isohybrid-gpt-basdat \
  SCRATCH-DIRECTORY

(assuming that SCRATCH-DIRECTORY is your unpacked and modified version of the 20.04.1 ISO image, and isohdpfx.bin is generated following the instructions in the Debian wiki page.)

The ISO created through this definitely boots in VMWare in both UEFI and BIOS mode (and installs afterward). I haven't tried it in UEFI mode on real hardware yet and probably won't for a while.

PS: If you use the Debian wiki's suggested xorriso command line to analyze the 20.04.1 ISO image, it will claim that the El Torito EFI boot image is 'boot/grub/efi.img'. This is definitely not the case, which you can verify by using dumpet to extract both of the actual boot images from the ISO and then cmp to see what they match up with.

Ubuntu2004ISOWithUEFI written at 00:56:13; Add Comment

2020-11-14

Linux servers can still wind up using SATA in legacy PATA mode

Over the course of yesterday and today, I've been turning over a series of rocks that led to a discovery:

One of the things that the BIOS on this machine (and others [of our servers]) is apparently doing is setting the SATA ports to legacy IDE/ata_piix mode instead of AHCI mode. I wonder how many driver & hardware features we're missing because of that.

(The 'ata_piix' kernel module is the driver for legacy mode, while 'ahci' module is the driver for AHCI SATA. If you see boot time messages from ata_piix, you should be at least nervous.)

Modern SATA host controllers have two different modes: AHCI, which supports all of the features of SATA, and legacy Parallel ATA emulation (aka IDE mode), where your SATA controller pretends to be an old IDE controller. In the way of modern hardware, how your host controller presents itself is chosen by the BIOS, not your operating system (or at least not Linux). Most modern BIOSes probably default to AHCI mode, which is what you want, but apparently some of our machines either default to legacy PATA or got set that way at some point.
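
One direct way to see which driver a controller wound up with (my usual habit, not the only way) is to ask lspci:

lspci -k | grep -A 3 -E 'SATA|IDE'

In AHCI mode the controller shows up as a SATA (AHCI) controller with 'Kernel driver in use: ahci'; in legacy mode it appears as an IDE interface using ata_piix.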

The simplest way to see if you've wound up in this situation is to use lsblk to see what it reports as the 'TRAN' field (the transport type); it will be 'sata' for drives behind controllers in AHCI mode, and 'ata' for legacy PATA support. On one affected machine, we see:

; lsblk -o NAME,HCTL,TRAN,MODEL --nodeps /dev/sd?
NAME HCTL       TRAN MODEL
sda  0:0:0:0    ata  WDC WD5000AAKX-0

Meanwhile, on a machine that's not affected by this, we see:

; lsblk -o NAME,HCTL,TRAN,MODEL --nodeps /dev/sd?
NAME HCTL       TRAN   MODEL
sda  0:0:0:0    sata   WDC WD5003ABYX-1
sdb  1:0:0:0    sata   ST500NM0011

It's otherwise very easy to not notice that your system is running in PATA mode instead of AHCI (at least until you attempt to hot-swap a failed drive; only AHCI supports that). I'm not sure what features and performance you miss out on in legacy PATA mode, but one of them is apparently Native Command Queueing. I suspect that there also are differences in error recovery if a drive has bad sectors or other problems, at least if you have three or four drives so that the system has to present two drives as being on the same ATA channel.
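
If you want to check the NCQ part of this specifically (my suggestion, not from anything above), the kernel exposes the negotiated queue depth per drive:

cat /sys/block/sda/device/queue_depth

A value of 1 means no NCQ, which is what you get behind ata_piix; with AHCI and an NCQ-capable drive you'll typically see 31 or so.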

Based on our recent experience, my strong belief is now that your system BIOS is much more likely to play around with the order of hard drives if your SATA controller is in legacy mode. A SATA controller in AHCI mode is hopefully presenting an honest view of what drive is cabled to what port; as we've found out, this is not necessarily the case in legacy mode, perhaps because the BIOS always has to establish some sort of mapping between SATA ports and alleged IDE channels.

(SATA ports can be wired up oddly and not as you expect for all sorts of reasons, but at least physical wiring stays put and is thus consistent over time. BIOSes can change their minds if they feel like it.)

(For more on AHCI, see also the Arch Wiki and the OSDev wiki.)

ServerSATAInATAMode written at 00:16:12; Add Comment

2020-11-13

If you use Exim on Ubuntu, you probably want to skip Ubuntu 20.04

The Exim MTA (Mail Transfer Agent, aka mailer) recently added a mandatory new security feature to 'taint' data taken directly from the outside world, with the goal of reducing the potential for future issues like CVE-2019-13917. Things that are tainted include not just obvious things like the contents of message headers, but also slightly less obvious things like the source and especially destination addresses of messages, both their domains and their local parts. There are many common uses of now-tainted data in many parts of delivering messages; for example, writing mail to '/var/mail/$local_part' involves use of tainted data (even if you've verified that the local address exists as a user). In order to still be usable, Exim supports a variety of methods to generate untainted versions of this tainted data.

Exim introduced tainting in Exim 4.93, released in December of 2019. Unfortunately this version's support for tainting is flawed, and one part of the problem is that a significant number of methods of de-tainting data don't work. It's probably possible to craft an Exim 4.93 configuration that works properly with tainted data, but it is going to be a very ugly and artificial configuration. Exim 4.94 improves the situation significantly, but even then apparently you should use it with additional fixes.

Ubuntu 20.04 ships a somewhat patched version of Exim 4.93, but it has significant de-tainting flaws and limitations which mean that you don't want to use it in its current state. As is normal and traditional, there's essentially no prospect that Ubuntu will update to Exim 4.94+ over the lifetime of Ubuntu 20.04; what we have today in 20.04 is what we get. As a result, if you use Exim on Ubuntu, I think that you should skip 20.04. Run your Exim machines on 18.04 LTS until 22.04 LTS comes out with a hopefully much better version of Exim.

If you absolutely must run Ubuntu 20.04 with some version of Exim, I don't recommend building your own from upstream sources because that has inherent problems. The Debian source packages for 4.94 (from testing and unstable) appear to rebuild and work fine on Ubuntu 20.04, so I'd suggest starting from them. Possibly you could even use the Debian binary packages, although I haven't tried that and would be somewhat wary.
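
The rough shape of that rebuild is something like this (a sketch, assuming you have deb-src lines available for 'apt-get build-dep' and the devscripts package for dget; the actual .dsc version is whatever Debian currently has):

sudo apt-get build-dep exim4
dget -x https://deb.debian.org/debian/pool/main/e/exim4/exim4_<version>.dsc
cd exim4-<version>
dpkg-buildpackage -us -uc -b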

(It's possible that someone will put together a PPA for the Debian packages rebuilt on Ubuntu 20.04. It won't be me, as we're skipping 20.04 for our Exim machines. It's also possible that someone will get the Exim 4.94 package from Ubuntu 20.10 included in the 20.04 Ubuntu Backports. Anyone can make the request, after all (but it won't be us).)

Ubuntu2004EximSkip written at 00:16:59; Add Comment

2020-11-07

Turning on console blanking on a Linux machine when logged in remotely

When my office workstation is running my X session, I lock the screen if it's idle, which blanks the display. However, if the machine winds up idle in text mode, the current Linux kernel defaults keep the display unblanked. Normally this never happens, because I log in and start X immediately after I reboot the machine and then leave it sitting there. However, ongoing world and local events have me working from home and remotely rebooting my office workstation for kernel upgrades and even Fedora version upgrades. When my office workstation reboots, it winds up in the text console and then just sits there, unblanked and displaying boot messages and a text mode login prompt. Even when I come into the office and use it, I now log out afterward (because I know I'm going to remotely reboot it later).

(Before I started writing this entry I thought I had set my machine to deliberately never blank the text mode display, for reasons covered here, but this turned out not to be the case; it's just the current kernel default.)

Leaving modern LCD panels active with more or less static text being displayed is probably not harmful (although I'm not certain). Still, I feel happier if the machine's LCD panels are actually blanked out in text mode. Fortunately you can do this while logged in remotely, although it is slightly tricky.

As I mentioned yesterday, the kernel's console blanking timeout is reported in /sys/module/kernel/parameters/consoleblank. Unfortunately this sysfs parameter is read-only, and you can't just change the blanking time by writing to it (which would be the most convenient way). Instead you have to use the setterm program, but there are two tricks because of how it works.

If you just log in remotely and run, say, 'setterm -blank 5', you will get an error message:

# setterm -blank 5
setterm: terminal xterm does not support --blank

The problem is that setterm works not by making kernel calls, but by writing out a character string that will make the kernel's console driver change things appropriately. This means that it needs to be writing to the console and also it needs to be told the correct terminal type so that it can generate the correct escape sequences. To do this we need to run:

TERM=linux setterm -blank 5 >/dev/tty1

The terminal type 'linux' is the type of text consoles. The other type for this is apparently 'con', according to a check that is hard-coded in setterm.c's init_terminal().

(And I lied a bit up there. Setterm actually hard codes the escape sequence for setting the blanking time, so the only thing it uses $TERM for is to decide if it's willing to generate the escape sequence or if it will print an error. See set_blanking() for the escape code generation.)

The process for seeing if blanking is on (or forcing blanking and unblanking) is a bit different, because here setterm actually makes Linux specific ioctl() calls but it does them on its standard input, not its standard output. So we have to do:

TERM=linux setterm -blank </dev/tty1

This will print 0 or 1 depending on if the console isn't currently blanked or is currently blanked. I believe you can substitute any console tty for /dev/tty1 here.
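
If you want this to happen automatically on every boot instead of by hand, one approach (a sketch of my own, not anything official) is a small oneshot systemd unit that uses the same TERM and redirection trick:

[Unit]
Description=Blank the text console after five idle minutes
After=getty@tty1.service

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'TERM=linux setterm -blank 5 >/dev/tty1'

[Install]
WantedBy=multi-user.target

(The other option is the consoleblank= kernel command line parameter, which is what the sysfs parameter reflects.)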

ConsoleBlankingRemotely written at 21:58:50; Add Comment

2020-11-06

Console blanking now defaults to off on Linux (and has for a while)

For a long time, if you left a Linux machine sitting idle at a text console, for example on a server, the kernel would blank the display after a while. Years ago I wrote an entry about how you wanted to turn this off on your Linux servers, where at the time the best way to do this was a kernel parameter. For reasons beyond the scope of this entry, I recently noticed that we were not setting this kernel parameter on our Ubuntu 18.04 servers yet I knew that they weren't blanking their consoles.

(Until I looked at their /proc/cmdline, I thought we had just set 'consoleblank=0' as part of their standard kernel command line parameters.)

It turns out that the kernel's default behavior here changed back in 2017, ultimately due to this Ubuntu bug report. That bug led to this kernel change (which has a nice commit message explaining everything), which took it from an explicit ten minutes to implicitly being disabled (a C global variable without an explicit initializer is zero). Based on some poking at the git logs, it appears that this was introduced in 4.12, which means that it's in Ubuntu 18.04's kernel but not 16.04's.

(You can tell what the current state of this timeout is on any given machine by looking at /sys/module/kernel/parameters/consoleblank. It's 0 if this is disabled, and otherwise the number of seconds before the text console blanks.)
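
For example, on a machine with the current default of no blanking:

; cat /sys/module/kernel/parameters/consoleblank
0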

We have remaining Ubuntu 16.04 machines but they're all going away within a few months (one way or another), so it's not worth fixing their console blanking situation now that I've actually noticed it. Working from home due to ongoing events makes that a simpler choice, since if a machine locks up we're not going to go down to the machine room to plug in a monitor and look at its console; we're just going to remotely power cycle it as the first step.

(Our default kernel parameters tend to have an extremely long lifetime. We're still automatically setting a kernel parameter to deal with a problem we ran into in Ubuntu 12.04. At this point I have no idea if that old problem still happens on current kernels, but we might as well leave it there just in case.)

ConsoleBlankingDefaultsOff written at 22:50:49; Add Comment

2020-11-04

You shouldn't use the Linux dump program any more (on extN filesystems)

When I upgraded my office workstation to Fedora 32, one of the things that happened is that Amanda backups of its root filesystem stopped working. The specific complaint from Amanda was a report of:

no size line match in /usr/lib64/amanda/rundump (xfsdump) output

This happened because of Fedora bug 1830320, adequately summarized as "ext4 filesystem dumped with xfsdump instead of dump". The cause of this is that Fedora 32's Amanda RPMs are built without the venerable dump program and so do not try to use it. Instead, if you tell Amanda to back up a filesystem using the abstract program "DUMP", Amanda always uses xfsdump regardless of what the filesystem type is, and naturally xfsdump fails on extN filesystems.

I have historically used various versions of the Unix *dump family of programs because I felt that a filesystem specific tool was generally going to do the best job of fully backing up your filesystem, complete with whatever peculiar things it had (starting with holes in your files). ZFS has no zfsdump (although I wish that it did), so most of my workstation's filesystems are backed up with tar, but my root filesystem is an extN one and I used dump. Well, I used to use dump.

At first I was irritated with Fedora packaging and planned to say grumpy things about it. But then I read more, and discovered that this Amanda change is actually a good idea, because using Linux dump isn't a good idea any more. The full story is in Fedora bug 1884602, but the short version is that dump hasn't been updated to properly handle modern versions of extN filesystems and won't be, because it's unmaintained. To quote the bug:

Looking at the code it is very much outdated and will not support current ext4 features, in some cases leading to corrupted files without dump/restore even noticing any problems.

Fedora is currently planning to keep the restore program around so that you can restore any dump archives you have, which I fully support (especially since the Linux restore is actually pretty good at supporting various old dump formats from other systems, which can be handy).

I have some reflexes around using 'dump | restore' pipelines to copy extN filesystems around (something I've written about before), which I now need to change. Probably tar is better than rsync for this particular purpose.
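
The tar version of such a copy is roughly this (a sketch with hypothetical mount points; these are GNU tar options, and you may want to adjust the ACL and xattr handling to what you actually use):

cd /oldfs && tar -cf - --one-file-system --sparse --acls --xattrs --numeric-owner . | \
  (cd /newfs && tar -xpf - --acls --xattrs)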

(I'll miss dump a bit, but a backup program that can silently produce corrupted backups is not a feature.)

PS: dump is a completely different thing than dumpe2fs; the former makes backups and the latter tells you details about your extN filesystem. Dumpe2fs is part of e2fsprogs and naturally remains under active development as part of extN development.

ExtNDumpDeprecated written at 00:30:29; Add Comment

2020-11-03

Fixing blank Cinnamon sessions in VMWare virtual machines (on Fedora)

I periodically install and maintain versions of Fedora under VMWare (on my Fedora office machine). When I do this, I invariably opt to have them run Cinnamon, because out of the stock desktop environments, Cinnamon is what I've preferred for a long time. For a while now I've been having an ongoing problem with this, which is that my Cinnamon sessions don't work, although every other type of session does (Gnome, classic Gnome, and random other window managers that Fedora offers as additional options).

(Some last minute experimentation just now suggests that it may also happen with Fedora's MATE desktop.)

The specific problem is that when I log in under Cinnamon, all I get in the VMWare GUI is a black screen (apparently a blank one, based on some clues). This blank screen happens in the VMWare GUI itself and in a VNC session connected to the virtual machine's console. If I arrange to take a picture of the X root window, it has the Cinnamon desktop rendered on it, so Cinnamon appears to be working and running and so on; it's just that something is happening so that the results are not displayed. This happens in most VMs, but not in all of them; specifically, my oldest Fedora VM started out as Fedora 26 and it doesn't seem to have the problem.

(I keep this old VM around because it's my test dummy for a VM that's close to my real desktop, so I can use it to test Fedora upgrades in a VM with a ZFS pool and so on. The other Fedoras are generally from-scratch installs of the current Fedora, which I upgrade once as a test and then discard in favour of a fresh install of the new Fedora.)

I will cut to the chase: this appears to be due to using the VMWare Xorg video driver, which in Fedora is the xorg-x11-drv-vmware package. If I remove this package, Cinnamon (and MATE) work, with the Xorg server falling back to some other driver (I think it's the 'fb' driver, although it's hard for me to parse through the X server logs). In theory this gives me unaccelerated graphics; in practice I can't tell the difference, especially these days since it's all going over a 'ssh -X' connection anyway.
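
(For concreteness, the removal is just:

sudo dnf remove xorg-x11-drv-vmware

and then the X server has to restart, for instance by rebooting the VM, before the change takes effect.)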

I have no idea why this is happening; I've looked for error messages and other signs of problems and haven't found anything. Since this involves VMWare, I haven't bothered to report it to anyone (for anything involving VMWare on Linux, you get to keep all of the pieces if it breaks). I'm apparently not the only person having this sort of issue with Cinnamon on VMWare, because I found this solution on the Internet in some forum thread that I forgot to save, can't find again, and so regrettably can't credit.

PS: For my future reference, the Xorg logs are in .local/share/xorg.

CinnamonInVMWareFix written at 00:35:34; Add Comment

