2024-11-04
A rough equivalent to "return to last power state" for libvirt virtual machines
Physical machines can generally be set in their BIOS so that if power is lost and then comes back, the machine returns to its previous state (either powered on or powered off). The actual mechanics of this are complicated (also), but the idealized version is easily understood and convenient. These days I have a revolving collection of libvirt based virtual machines running on a virtualization host that I periodically reboot due to things like kernel updates, and for a while I have quietly wished for some sort of similar libvirt setting for its virtual machines.
It turns out that this setting exists, sort of, in the form of the libvirt-guests systemd service. If enabled, it can be set to restart all guests that were running when the system was shut down, regardless of whether or not they're set to auto-start on boot (none of my VMs are). This is a global setting that applies to all virtual machines that were running at the time the system went down, not one that can be applied to only some VMs, but for my purposes this is sufficient; it makes it less of a hassle to reboot the virtual machine host.
Linux being Linux, life is not quite this simple in practice, as is illustrated by comparing my Ubuntu VM host machine with my Fedora desktops. On Ubuntu, libvirt-guests.service defaults to enabled, it is configured through /etc/default/libvirt-guests (the Debian standard), and it defaults to not automatically restarting virtual machines. On my Fedora desktops, libvirt-guests.service is not enabled by default, it is configured through /etc/sysconfig/libvirt-guests (as in the official documentation), and it defaults to automatically restarting virtual machines. Another difference is that Ubuntu ships a /etc/default/libvirt-guests with commented-out default values, while Fedora has no /etc/sysconfig/libvirt-guests, so you have to read the script to see what the defaults are (on Fedora, this is /usr/libexec/libvirt-guests.sh, on Ubuntu /usr/lib/libvirt/libvirt-guests.sh).
I've changed my Ubuntu VM host machine so that it will automatically restart previously running virtual machines on reboot, because generally I leave things running intentionally there. I haven't touched my Fedora machines so far because by and large I don't have any regularly running VMs, so if a VM is still running when I go to reboot the machine, it's most likely because I forgot I had it up and hadn't gotten around to shutting it off.
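For concreteness, the settings involved on my Ubuntu VM host look roughly like this (the values are illustrative; the commented-out defaults in the file itself document the full set of options):

```shell
# /etc/default/libvirt-guests (Debian/Ubuntu; Fedora would use
# /etc/sysconfig/libvirt-guests)
# Restart guests that were running when the host went down:
ON_BOOT=start
# Cleanly shut guests down (rather than suspending them) at host shutdown:
ON_SHUTDOWN=shutdown
# How long to wait for each guest to shut down cleanly:
SHUTDOWN_TIMEOUT=300
```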
(My pre-libvirt virtualization software was much too heavy-weight for me to leave a VM running without noticing, but libvirt VMs have a sufficiently low impact on my desktop experience that I can and have left them running without realizing it.)
2024-10-31
Pam_unix and your system's supported password algorithms
The Linux login passwords that wind up in /etc/shadow can be encrypted (well, hashed) with a variety of algorithms, which you can find listed (and sort of documented) in places like Debian's crypt(5) manual page. Generally the choice of which algorithm is used to hash (new) passwords (for example, when people change them) is determined by an option to the pam_unix PAM module.
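As a concrete illustration of the crypt(5) side of this, the algorithm used for a given /etc/shadow entry can be read off from the hash's '$...$' prefix. A small sketch (the prefix list here is a subset taken from my reading of crypt(5); check your own system's manual page):

```python
# Map crypt(5) hash prefixes to algorithm names (a subset).
PREFIXES = {
    "$1$": "md5crypt",
    "$2b$": "bcrypt",
    "$5$": "sha256crypt",
    "$6$": "sha512crypt",
    "$7$": "scrypt",
    "$y$": "yescrypt",
}

def hash_algorithm(shadow_hash: str) -> str:
    """Identify the hashing algorithm of an /etc/shadow password field."""
    for prefix, name in PREFIXES.items():
        if shadow_hash.startswith(prefix):
            return name
    return "unknown (possibly traditional DES)"
```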
You might innocently think, as I did, that all of the algorithms your system supports will be supported by pam_unix, or more exactly will all be available for new passwords (ie, what you or your distribution control with an option to pam_unix). It turns out that this is not the case some of the time (or if it is actually the case, the pam_unix manual page can be inaccurate). This is surprising because pam_unix is the thing that handles hashed passwords (both validating them and changing them), and you'd think its handling of them would be symmetric.
As I found out today, this isn't necessarily so. As documented in the Ubuntu 20.04 crypt(5) manual page, 20.04 supports yescrypt in crypt(3) (sadly Ubuntu's manual page URL doesn't seem to work). This means that the Ubuntu 20.04 pam_unix can (or at least should) accept yescrypt hashed passwords. However, the Ubuntu 20.04 pam_unix(8) manual page doesn't list yescrypt as one of the available options for hashing new passwords. If you look only at the 20.04 pam_unix manual page, you might (incorrectly) assume that a 20.04 system can't deal with yescrypt based passwords at all.
At one level, this makes sense once you know that pam_unix and crypt(3) come from different packages and handle different parts of the work of checking existing Unix passwords and hashing new ones. Roughly speaking, pam_unix can delegate checking passwords to crypt(3) without having to care how they're hashed, but to hash a new password with a specific algorithm it has to know about the algorithm, have a specific PAM option added for it, and call some functions in the right way. It's quite possible for crypt(3) to get ahead of pam_unix for a new password hashing algorithm, like yescrypt.
(Since they're separate packages, pam_unix may not want to implement this for a new algorithm until a crypt(3) that supports it is at least released, and then pam_unix itself will need a new release. And I don't know if linux-pam can detect whether or not yescrypt is supported by crypt(3) at build time (or at runtime).)
PS: If you have an environment with a shared set of accounts and passwords (whether via LDAP or your own custom mechanism) and a mixture of Ubuntu versions (maybe also with other Linux distribution versions), you may want to be careful about using new password hashing schemes, even once they're supported by pam_unix on your main systems. The older some of your Linuxes are, the more you'll want to check their crypt(3) and crypt(5) manual pages carefully.
2024-10-27
Linux's /dev/disk/by-id unfortunately often puts the transport in the name
Filippo Valsorda ran into an issue that involved, in part, the naming of USB disk drives. To quote the relevant bit:
I can't quite get my head around the zfs import/export concept.
When I replace a drive I like to first resilver the new one as a USB drive, then swap it in. This changes the device name (even using by-id).
[...]
My first reaction was that something funny must be going on. My second reaction was to look at an actual /dev/disk/by-id with a USB disk, at which point I got a sinking feeling that I should have already recognized from a long time ago. If you look at your /dev/disk/by-id, you will mostly see names that start with things like 'ata-', 'scsi-OATA-', 'scsi-1ATA', and maybe 'usb-' (and perhaps 'nvme-', but that's a somewhat different kettle of fish). All of these names have the problem that they burn the transport (how you talk to the disk) into the /dev/disk/by-id, which is supposed to be a stable identifier for the disk as a standalone thing.
As Filippo Valsorda's case demonstrates, the problem is that some disks can move between transports. When this happens, the theoretically stable name of the disk changes; what was 'usb-' is now likely 'ata-' or vice versa, and in some cases other transformations may happen. Your attempt to use a stable name has failed and you will likely have problems.
Experimentally, there seem to be some /dev/disk/by-id names that are more stable. Some but not all of our disks have 'wwn-' names (one USB attached disk I can look at doesn't). Our Ubuntu based systems have 'scsi-<hex digits>' and 'scsi-SATA-<disk id>' names, but one of my Fedora systems with SATA drives has only the 'scsi-<hex>' names and the other one has neither. One system we have a USB disk on has no names for the disk other than 'usb-' ones. It seems clear that it's challenging at best to give general advice about how a random Linux user should pick truly stable /dev/disk/by-id names, especially if you have USB drives in the picture.
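If you want to see how much your own by-id names are tied to transports, you can group them by their leading prefix. A quick sketch (pure string handling; the sample names below are made up, and on a real system you'd feed it os.listdir('/dev/disk/by-id')):

```python
def group_by_transport(names):
    """Group /dev/disk/by-id style names by their transport prefix."""
    groups = {}
    for name in names:
        transport = name.split("-", 1)[0]  # 'ata', 'usb', 'scsi', 'wwn', ...
        groups.setdefault(transport, []).append(name)
    return groups

# Hypothetical example names:
names = [
    "ata-WDC_WD40EZRZ-00ABC_WD-WCC7K1234567",
    "usb-Seagate_Expansion_NA1234567-0:0",
    "wwn-0x5000c500a1b2c3d4",
]
```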
(See also Persistent block device naming in the Arch Wiki.)
This whole current situation seems less than ideal, to put it one way. It would be nice if disks (and partitions on them) had names that were as transport independent and usable as possible, especially since most disks have theoretically unique serial numbers and model names available (and if you're worried about cross-transport duplicates, you should already be at least as worried about duplicates within the same type of transport).
PS: You can find out what information udev knows about your disks with 'udevadm info --query=all --name=/dev/...' (from, via, by coincidence). The information for a SATA disk differs between my two Fedora machines (one of them has various SCSI_* and ID_SCSI* stuff and the other doesn't), but I can't see any obvious reason for this.
2024-10-25
Using pam_access to sometimes not use another PAM module
Suppose that you want to authenticate SSH logins to your Linux systems using some form of multi-factor authentication (MFA). The normal way to do this is to use 'password' authentication and then in the PAM stack for sshd, use both the regular PAM authentication module(s) of your system and an additional PAM module that requires your MFA (in another entry about this I used the module name pam_mfa). However, in your particular MFA environment it's been decided that you don't have to require MFA for logins from some of your other networks or systems, and you'd like to implement this.
Because your MFA happens through PAM and the details of this are opaque to OpenSSH's sshd, you can't directly implement skipping MFA through sshd configuration settings. If sshd winds up doing password based authentication at all, it will run your full PAM stack and that will challenge people for MFA. So you must implement sometimes skipping your MFA module in PAM itself. Fortunately there is a PAM module we can use for this, pam_access.
The usual way to use pam_access is to restrict or allow logins (possibly only some logins) based on things like the source address people are trying to log in from (in this, it's sort of a superset of the old tcpwrappers). How this works is configured through an access control file. We can (ab)use this basic matching in combination with the more advanced form of PAM controls to skip our PAM MFA module if pam_access matches something.
What we want looks like this:
auth  [success=1 default=ignore]  pam_access.so noaudit accessfile=/etc/security/access-nomfa.conf
auth  requisite  pam_mfa
Pam_access itself will 'succeed' as a PAM module if the result of processing our access-nomfa.conf file is positive. When this happens, we skip the next PAM module, which is our MFA module. If it 'fails', we ignore the result, and as part of ignoring the result we tell pam_access to not report failures.
Our access-nomfa.conf file will have things like:
# Everyone skips MFA for internal networks
+:ALL:192.168.0.0/16 127.0.0.1
# Ensure we fail otherwise.
-:ALL:ALL
We list the networks we want to allow password logins without MFA from, and then we have to force everything else to fail. (If you leave this off, everything passes, either explicitly or implicitly.)
As covered in the access.conf manual page, you can get quite sophisticated here. For example, you could have people who always had to use MFA, even from internal machines. If they were all in a group called 'mustmfa', you might start with:
-:(mustmfa):ALL
If you get at all creative with your access-nomfa.conf, I strongly suggest writing a lot of comments to explain everything. Your future self will thank you.
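Putting the pieces together, a fuller (hypothetical) access-nomfa.conf might look like the following; the group name and networks are made-up illustrations:

```
# People in 'mustmfa' always get challenged, even internally.
-:(mustmfa):ALL
# Everyone else skips MFA from our internal networks.
+:ALL:192.168.0.0/16 127.0.0.1
# Ensure everything else fails, and so gets MFA.
-:ALL:ALL
```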
Unfortunately but entirely reasonably, the information about the remote source of a login session doesn't pass through to later PAM authentication done by sudo and su commands that you do in the session. This means that you can't use pam_access to not give MFA challenges on su or sudo to people who are logged in from 'trusted' areas.
(As far as I can tell, the only information 'pam_access' gets about the 'origin' of a su is the TTY, which is generally not going to be useful. You can probably use this to not require MFA on su or sudo that are directly done from logins on the machine's physical console or serial console.)
2024-10-24
Having an emergency backup DNS resolver with systemd-resolved
At work we have a number of internal DNS resolvers, which you very much want to use to resolve DNS names if you're inside our networks for various reasons (including our split-horizon DNS setup). Purely internal DNS names aren't resolvable by the outside world at all, and some DNS names resolve differently. However, at the same time a lot of the host names that are very important to me are in our public DNS because they have public IPs (sort of for historical reasons), and so they can be properly resolved if you're using external DNS servers. This leaves me with a little bit of a paradox; on the one hand, my machines must resolve our DNS zones using our internal DNS servers, but on the other hand if our internal DNS servers aren't working for some reason (or my home machine can't reach them) it's very useful to still be able to resolve the DNS names of our servers, so I don't have to memorize their IP addresses.
A while back I switched to using systemd-resolved on my machines. Systemd-resolved has a number of interesting virtues, including that it has fast (and centralized) failover from one upstream DNS resolver to another. My systemd-resolved configuration is probably a bit unusual, in that I have a local resolver on my machines, so resolved's global DNS resolution goes to it and then I add a layer of (nominally) interface-specific DNS domain overrides that point to our internal DNS resolvers.
(This doesn't give me perfect DNS resolution, but it's more resilient and under my control than routing everything to our internal DNS resolvers, especially for my home machine.)
Somewhat recently, it occurred to me that I could deal with the problem of our internal DNS resolvers all being unavailable by adding '127.0.0.1' as an additional potential DNS server for my interface specific list of our domains. Obviously I put it at the end, where resolved won't normally use it. But with it there, if all of the other DNS servers are unavailable I can still try to resolve our public DNS names with my local DNS resolver, which will go out to the Internet to talk to various authoritative DNS servers for our zones.
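If you manage interfaces through systemd-networkd, the interface-specific version of this can be expressed in a .network file along these lines (the server IPs and domains here are placeholders, not our actual setup; the same thing can be done at runtime with 'resolvectl dns' and 'resolvectl domain'):

```
[Network]
# Interface-specific resolvers for our domains, most preferred first.
DNS=198.51.100.10
DNS=198.51.100.11
# Emergency backup: my local resolver, normally never used.
DNS=127.0.0.1
# Route queries for these domains to the servers above.
Domains=~example.org ~example.com
```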
The drawback with this emergency backup approach is that systemd-resolved will stick with whatever DNS server it's currently using unless that DNS server stops responding. So if resolved switches to 127.0.0.1 for our zones, it's going to keep using it even after the other DNS resolvers become available again. I'll have to notice that and manually fiddle with the interface specific DNS server list to remove 127.0.0.1, which would force resolved to switch to some other server.
(As far as I can tell, the current systemd-resolved correctly handles the situation where an interface says that '127.0.0.1' is the DNS resolver for it, and doesn't try to force queries to 127.0.0.1:53 to go out that interface. My early 2013 notes say that this sometimes didn't work, but I failed to write down the specific circumstances.)
2024-10-15
A surprise with /etc/cron.daily, run-parts, and files with '.' in their name
Linux distributions have a long standing general cron feature where there are /etc/cron.hourly, /etc/cron.daily, and /etc/cron.weekly directories and if you put scripts in there, they will get run hourly, daily, or weekly (at some time set by the distribution). The actual running is generally implemented by a program called 'run-parts'. Since this is a standard Linux distribution feature, of course there is a single implementation of run-parts and its behavior is standardized, right?
Since I'm asking the question, you already know the answer: there are at least two different implementations of run-parts, and their behavior differs in at least one significant way (as well as several other probably less important ones).
In Debian, Ubuntu, and other Debian-derived distributions (and also I think Arch Linux), run-parts is a C program that is part of debianutils. In Fedora, Red Hat Enterprise Linux, and derived RPM-based distributions, run-parts is a shell script that's part of the crontabs package, which is part of cronie-cron. One somewhat unimportant way that these two versions differ is that the RPM version ignores some extensions that come from RPM packaging fun (you can see the current full list in the shell script code), while the Debian version only skips the Debian equivalents with a non-default option (and actually documents the behavior in the manual page).
A much more important difference is that the Debian version ignores files with a '.' in their name (this can be changed with a command line switch, but /etc/cron.daily and so on are not processed with this switch). As a non-hypothetical example, if you have a /etc/cron.daily/backup.sh script, a Debian based system will ignore this while a RHEL or Fedora based system will happily run it. If you are migrating a server from RHEL to Ubuntu, this may come as an unpleasant surprise, partly since the Debian version doesn't complain about skipping files.
(Whether or not the restriction could be said to be clearly documented in the Debian manual page is a matter of taste. Debian does clearly state the allowed characters, but it does not point out that '.', a not uncommon character, is explicitly not accepted by default.)
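Debian's run-parts(8) documents the default accepted names as consisting entirely of ASCII letters, digits, underscores, and hyphens. A small sketch of that check (my reading of the manual page, not the actual debianutils code):

```python
import re

# Default Debian run-parts name filter, per run-parts(8).
_OK = re.compile(r"^[A-Za-z0-9_-]+$")

def debian_would_run(name: str) -> bool:
    """Would Debian's run-parts run this /etc/cron.daily file by default?"""
    return bool(_OK.match(name))
```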
2024-10-10
Linux software RAID and changing your system's hostname
Today, I changed the hostname of an old Linux system (for reasons) and rebooted it. To my surprise, the system did not come up afterward, but instead got stuck in systemd's emergency mode for a chain of reasons that boiled down to there being no '/dev/md0'. Changing the hostname back to its old value and rebooting the system again caused it to come up fine. After some diagnostic work, I believe I understand what happened and how to work around it if it affects us in the future.
One of the issues that Linux RAID auto-assembly faces is the question of what it should call the assembled array. People want their RAID array names to stay fixed (so /dev/md0 is always /dev/md0), and so the name is part of the RAID array's metadata, but at the same time you have the problem of what happens if you connect up two sets of disks that both want to be 'md0'. Part of the answer is mdadm.conf, which can give arrays names based on their UUID. If your mdadm.conf says 'ARRAY /dev/md10 ... UUID=<x>' and mdadm finds a matching array, then in theory it can be confident you want that one to be /dev/md10 and it should rename anything else that claims to be /dev/md10.
However, suppose that your array is not specified in mdadm.conf. In that case, another software RAID array feature kicks in, which is that arrays can have a 'home host'. If the array is on its home host, it will get the name it claims it has, such as '/dev/md0'. Otherwise, well, let me quote from the 'Auto-Assembly' section of the mdadm manual page:
[...] Arrays which do not obviously belong to this host are given names that are expected not to conflict with anything local, and are started "read-auto" so that nothing is written to any device until the array is written to. i.e. automatic resync etc is delayed.
As is covered in the documentation for the '--homehost' option in the mdadm manual page, on modern 1.x superblock formats the home host is embedded into the name of the RAID array. You can see this with 'mdadm --detail', which can report things like:
Name : ubuntu-server:0
Name : <host>:25  (local to host <host>)
Both of these have a 'home host'; in the first case the home host is 'ubuntu-server', and in the second case the home host is the current machine's hostname. Well, its 'hostname' as far as mdadm is concerned, which can be set in part through mdadm.conf's 'HOMEHOST' directive. Let me repeat that: mdadm by default identifies home hosts by their hostname, not by any more stable identifier.
So if you change a machine's hostname and you have arrays not in your mdadm.conf with home hosts, their /dev/mdN device names will get changed when you reboot. This is what happened to me, as we hadn't added the array to the machine's mdadm.conf.
(Contrary to some ways to read the mdadm manual page, arrays are not renamed if they're in mdadm.conf. Otherwise we'd have noticed this a long time ago on our Ubuntu servers, where all of the arrays created in the installer have the home host of 'ubuntu-server', which is obviously not any machine's actual hostname.)
Setting the home host value to the machine's current hostname when an array is created is the mdadm default behavior, although you can turn this off with the right mdadm.conf HOMEHOST setting. You can also tell mdadm to consider all arrays to be on their home host, regardless of the home host embedded into their names.
(The latter is 'HOMEHOST <ignore>', the former by itself is 'HOMEHOST <none>', and it's currently valid to combine them both as 'HOMEHOST <ignore> <none>', although this isn't quite documented in the manual page.)
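In mdadm.conf terms, the belt-and-suspenders version for a machine where you never want renaming looks something like this (the UUID is a placeholder):

```
# /etc/mdadm/mdadm.conf (Debian/Ubuntu; Fedora uses /etc/mdadm.conf)
# Don't embed a hostname in newly created arrays, and treat all
# arrays as being on their home host:
HOMEHOST <ignore> <none>
# Explicitly naming arrays by UUID also avoids renaming:
ARRAY /dev/md0 UUID=01234567:89abcdef:01234567:89abcdef
```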
PS: Some uses of software RAID arrays won't care about their names. For example, if they're used for filesystems, and your /etc/fstab specifies the device of the filesystem using 'UUID=' or with '/dev/disk/by-id/md-uuid-...' (which seems to be common on Ubuntu).
PPS: For 1.x superblocks, the array name as a whole can only be 32 characters long, which obviously limits how long of a home host name you can have, especially since you need a ':' in there as well and an array number or the like. If you create a RAID array on a system with a too long hostname, the name of the resulting array will not be in the '<host>:<name>' format that creates an array with a home host; instead, mdadm will set the name of the RAID to the base name (either whatever name you specified, or the N of the 'mdN' device you told it to use).
(It turns out that I managed to do this by accident on my home desktop, which has a long fully qualified name, by creating an array with the name 'ssd root'. The combination turns out to be 33 characters long, so the RAID array just got the name 'ssd root' instead of '<host>:ssd root'.)
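My understanding of the name-length behavior can be sketched like this (a guess at the logic, not mdadm's actual code):

```python
def effective_array_name(hostname: str, base_name: str) -> str:
    """Return the array name I believe mdadm stores in a 1.x superblock."""
    candidate = f"{hostname}:{base_name}"
    # A 1.x superblock only has 32 bytes for the name; if the
    # '<host>:<name>' form doesn't fit, the bare name is used and
    # the array effectively has no home host.
    return candidate if len(candidate) <= 32 else base_name
```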
2024-09-30
Resetting the backoff restart delay for a systemd service
Suppose, not hypothetically, that your Linux machine is your DSL PPPoE gateway, and you run the PPPoE software through a simple script to invoke pppd that's run as a systemd .service unit. Pppd itself will exit if the link fails for some reason, but generally you want to automatically try to establish it again. One way to do this (the simple way) is to set the systemd unit to 'Restart=always', with a restart delay.
Things like pppd generally benefit from a certain amount of backoff in their restart attempts, rather than restarting either slowly or rapidly all of the time. If your PPP(oE) link just dropped out briefly because of a hiccup, you want it back right away, not in five or ten minutes, but if there's a significant problem with the link, retrying every second doesn't help (and it may trigger things in your service provider's systems). Systemd supports this sort of backoff if you set 'RestartSteps' and 'RestartMaxDelaySec' to appropriate values. So you could wind up with, for example:

Restart=always
RestartSec=1s
RestartSteps=10
RestartMaxDelaySec=10m
This works fine in general, but there is a problem lurking. Suppose that one day you have a long outage in your service but it comes back, and then a few stable days later you have a brief service blip. To your surprise, your PPPoE session is not immediately restarted the way you expect. What's happened is that systemd doesn't reset its backoff timing just because your service has been up for a while.
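My understanding from the systemd.service documentation is that the delay grows from RestartSec up to RestartMaxDelaySec over RestartSteps restarts and then stays there; something like this sketch (the exact interpolation systemd uses may differ):

```python
def restart_delay(nrestarts: int, restart_sec: float = 1.0,
                  steps: int = 10, max_delay: float = 600.0) -> float:
    """Approximate the delay (seconds) before restart number nrestarts."""
    if nrestarts >= steps:
        return max_delay
    # Grow geometrically from restart_sec to max_delay over `steps` steps.
    return restart_sec * (max_delay / restart_sec) ** (nrestarts / steps)
```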
To see the current state of your unit's backoff, you want to look at its properties, specifically 'NRestarts' and especially 'RestartUSecNext', which is the delay systemd will put on for the next restart. You see these with 'systemctl show <unit>', or perhaps 'systemctl show -p NRestarts,RestartUSecNext <unit>'. To reset your unit's dynamic backoff time, you run 'systemctl reset-failed <unit>'; this is the same thing you may need to do if you restart a unit too fast and the start stalls.
(I don't know if manually restarting your service with 'systemctl restart <unit>' bumps up the restart count and the backoff time, the way it can cause you to run into (re)start limits.)
At the moment, simply doing 'systemctl reset-failed' doesn't seem to be enough to immediately re-activate a unit that is slumbering in a long restart delay. So the full scale, completely reliable version is probably 'systemctl stop <unit>; systemctl reset-failed <unit>; systemctl start <unit>'. I don't know how you see that a unit is currently in a 'RestartUSecNext' delay, or how much time is left on the delay (such a delay doesn't seem to be a 'job' that appears in 'systemctl list-jobs', and it's not a timer unit so it doesn't show up in 'systemctl list-timers').
If you feel like making your start script more complicated (and it runs as root), I believe that you could keep track of how long this invocation of the service has been running, and if it's long enough, run a 'systemctl reset-failed <unit>' before the script exits. This would (manually) reset the backoff counter if the service has been up for long enough, which is often what you really want.
(If systemd has a unit setting that will already do this, I was unable to spot it.)
2024-09-28
Options for adding IPv6 networking to your libvirt based virtual machines
Recently, my home ISP switched me from an IPv6 /64 allocation to a /56 allocation, which means that now I can have a bunch of proper /64s for different purposes. I promptly celebrated this by, in part, extending IPv6 to my libvirt based virtual machine, which is on a bridged internal virtual network (cf). Libvirt provides three different ways to provide (public) IPv6 to such virtual machines, all of which will require you to edit your network XML (either inside the virt-manager GUI or directly with command line tools). The three ways aren't exclusive; you can use two of them or even all three at the same time, in which case your VMs will have two or three public IPv6 addresses (at least).
(None of this applies if you're directly bridging your virtual machines onto some physical network. In that case, whatever the physical network has set up for IPv6 is what your VMs will get.)
First, in all cases you're probably going to want an IPv6 '<ip>' block that sets the IPv6 address for your host machine and implicitly specifies your /64. This is an active requirement for two of the options, and typically looks like this:
<ip family='ipv6' address='2001:19XX:0:1102::1' prefix='64'>
  [...]
</ip>
Here my desktop will have 2001:19XX:0:1102::1/64 as its address on the internal libvirt network.
The option that is probably the least hassle is to give static IPv6 addresses to your VMs. This is done with <host> elements inside a <dhcp> element (inside your IPv6 <ip>, which I'm not going to repeat):
<dhcp>
  <host name='hl-fedora-36' ip='2001:XXXX:0:1102::189'/>
</dhcp>
Unlike with IPv4, you can't identify VMs by their MAC address because, to quote the network XML documentation:
[...] The IPv6 'host' element differs slightly from that for IPv4: there is no 'mac' attribute since a MAC address has no defined meaning in IPv6. [...]
Instead you probably need to identify your virtual machines by their (DHCP) hostname. Libvirt has another option for this but it's not really well documented and your virtual machine may not be set up with the necessary bits to use it.
The second least hassle option is to provide a DHCP dynamic range of IPv6 addresses. In the current Fedora 40 libvirt, this has the undocumented limitation that the range can't include more than 65,535 IPv6 addresses, so you can't cover the entire /64. Instead you wind up with something like this:
<dhcp>
  <range start='2001:XXXX:0:1102::1000' end='2001:XXXX:0:1102::ffff'/>
</dhcp>
Famously, not everything in the world does DHCP6; some things only do SLAAC, and in general SLAAC will allocate random IPv6 IPs across your entire /64. Libvirt uses dnsmasq (also) to provide IP addresses to virtual machines, and dnsmasq can do SLAAC (see the dnsmasq manual page). However, libvirt currently provides no directly exposed controls to turn this on; instead, you need to use a special libvirt network XML namespace to directly set up the option in the dnsmasq configuration file that libvirt will generate.
What you need looks like:
<network xmlns:dnsmasq='http://libvirt.org/schemas/network/dnsmasq/1.0'>
  [...]
  <dnsmasq:options>
    <dnsmasq:option value='dhcp-range=2001:XXXX:0:1102::,slaac,64'/>
  </dnsmasq:options>
</network>
(The 'xmlns:dnsmasq=' bit is what you have to add to the normal <network> element.)
I believe that this may not require you to declare an IPv6 <ip> section at all, although I haven't tested that. In my environment I want both SLAAC and a static IPv6 address, and I'm happy to not have DHCP6 as such, since SLAAC will allocate a much wider and more varied range of IPv6 addresses.
(You can combine a dnsmasq SLAAC dhcp-range with a regular DHCP6 range, in which case SLAAC-capable IPv6 virtual machines will get an IP address from both, possibly along with a third static IPv6 address.)
PS: Remember to set firewall rules to restrict access to those public IPv6 addresses, unless you want your virtual machines fully exposed on IPv6 (when they're probably protected on IPv4 by virtue of being NAT'd).
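As one sketch of what such a restriction could look like (the /64 here is a documentation-prefix placeholder, and your firewall setup may be structured completely differently), an nftables fragment on the host might only let reply traffic through to the VM network:

```
# nftables sketch: only allow reply traffic in to the VM /64.
table ip6 vmguard {
    chain forward {
        type filter hook forward priority 0; policy accept;
        ip6 daddr 2001:db8:0:1102::/64 ct state established,related accept
        ip6 daddr 2001:db8:0:1102::/64 drop
    }
}
```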
2024-09-23
Mostly getting redundant UEFI boot disks on modern Ubuntu (especially 24.04)
When I wrote about how our primary goal for mirrored (system) disks is increased redundancy, including being able to reboot the system after the primary disk failed, vowhite asked in a comment if there was any trick to getting this working with UEFI. The answer is sort of, and it's mostly the same as you want to do with BIOS MBR booting.
In the Ubuntu installer, when you set up redundant system disks it's long been the case that you wanted to explicitly tell the installer to use the second disk as an additional boot device (in addition to setting up a software RAID mirror of the root filesystem across both disks). In the BIOS MBR world, this installed GRUB bootblocks on the disk; in the UEFI world, this causes the installer to set up an extra EFI System Partition (ESP) on the second drive and populate it with the same sort of things as the ESP on the first drive.
(The 'first' and the 'second' drive are not necessarily what you think they are, since the Ubuntu installer doesn't always present drives to you in their enumeration order.)
I believe that this dates from Ubuntu 22.04, when Ubuntu seems to have added support for multi-disk UEFI. Ubuntu will mount one of these ESPs (the one it considers the 'first') on /boot/efi, and as part of multi-disk UEFI support it will also arrange to update the other ESP. You can see what other disk Ubuntu expects to find this ESP on by looking at the debconf selection 'grub-efi/install_devices'. For perfectly sensible reasons this will identify disks by their disk IDs (as found in /dev/disk/by-id), and it normally lists both ESPs.
All of this is great but it leaves you with two problems if the disk with your primary ESP fails. The first is the question of whether your system's BIOS will automatically boot off the second ESP. I believe that UEFI firmware will often do this, and you can specifically set this up with EFI boot entries through things like efibootmgr (also); possibly current Ubuntu installers do this for you automatically if it seems necessary.
The bigger problem is the /boot/efi mount. If the primary disk fails, a mounted /boot/efi will start having disk IO errors and then if the system reboots, Ubuntu will probably be unable to find and mount /boot/efi from the now gone or error-prone primary disk. If this is a significant concern, I think you need to make the /boot/efi mount 'nofail' in /etc/fstab (per fstab(5)). Energetic people might want to go further and make it either 'noauto' so that it's not even mounted normally, or perhaps mark it as a systemd automounted filesystem with 'x-systemd.automount' (per systemd.mount).
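As a concrete (made-up) illustration, the relevant /etc/fstab line might go from the installer's default to something like:

```
# /etc/fstab sketch; the UUID is a placeholder for your ESP's UUID.
UUID=ABCD-1234  /boot/efi  vfat  umask=0077,nofail  0  1
```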
(The disclaimer is that I don't know how Ubuntu will react if /boot/efi isn't mounted at all or is a systemd automount mountpoint. I think that GRUB updates will cope with having it not mounted at all.)
If any disk with an ESP on it fails and has to be replaced, you have to recreate a new ESP on that disk and then, I believe, run 'dpkg-reconfigure grub-efi-amd64', which will ask you to select the ESPs you want to be automatically updated. You may then need to manually run '/usr/lib/grub/grub-multi-install --target=x86_64-efi', which will populate the new ESP (or it may be automatically run through the reconfigure). I'm not sure about this because we haven't had any UEFI system disks fail yet.
(The ESP is a vfat formatted filesystem, which can be set up with mkfs.vfat, and has specific requirements for its GUIDs and so on, which you'll have to set up by hand in the partitioning tool of your choice or perhaps automatically by copying the partitioning of the surviving system disk to your new disk.)
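If I had to do this today, my starting sketch would be something like the following (untested, with made-up device names; double-check everything against your partitioning tool of choice before running any of it):

```shell
# Copy the surviving disk's (sda) GPT partitioning onto the new disk
# (sdb), then give the new disk its own unique GUIDs (sgdisk is from
# the gdisk package):
sgdisk -R /dev/sdb /dev/sda
sgdisk -G /dev/sdb
# Recreate the ESP filesystem on the new ESP partition:
mkfs.vfat -F 32 /dev/sdb1
# Re-select which ESPs get updated, then (if needed) repopulate them:
dpkg-reconfigure grub-efi-amd64
/usr/lib/grub/grub-multi-install --target=x86_64-efi
```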
If it was the primary disk that failed, you will probably want to update /etc/fstab to get /boot/efi from a place that still exists (probably with 'nofail' and perhaps with 'noauto'). This might be somewhat easy to overlook if the primary disk fails without the system rebooting, at which point you'd get an unpleasant surprise on the next system reboot.
The general difference between UEFI and BIOS MBR booting for this is that in BIOS MBR booting, there's no /boot/efi to cause problems and running 'grub-install' against your replacement disk is a lot easier than creating and setting up the ESP. As I found out, a properly set up BIOS MBR system also 'knows' in debconf what devices you have GRUB installed on, and you'll need to update this (probably with 'dpkg-reconfigure grub-pc') when you replace a system disk.
(We've been able to avoid this so far because in Ubuntu 20.04 and 22.04, 'grub-install' isn't run during GRUB package updates for BIOS MBR systems so no errors actually show up. If we install any 24.04 systems with BIOS MBR booting and they have system disk failures, we'll have to remember to deal with it.)
(See also my entry on multi-disk UEFI in Ubuntu 22.04, which goes deeper into some details. That entry was written before I knew that a 'grub-*/install_devices' setting of a software RAID array was actually an error on Ubuntu's part, although I'd still like GRUB's UEFI and BIOS MBR scripts to support it.)