Ubuntu 12.04 can't reliably boot with software RAID (and why)

July 20, 2012

Recently one of my co-workers discovered, diagnosed, and worked around a significant issue with software RAID on Ubuntu 12.04. I'm writing it up here partly to get it all straight in my head and partly so we can help out anyone else with the same problem. The quick summary of the situation comes from my tweet:

Ubuntu 12.04 will not reliably boot a system with software RAID arrays due to races in the initramfs scripts.

(As you might guess, I am not happy.)

If you set up Ubuntu 12.04 with one or more software RAID arrays for things other than the root filesystem, you will almost certainly find that some of the time when you reboot your system it will come up with one or more software RAID arrays in a degraded state, with one or more component devices not added to the array. If you have set bootdegraded=true as one of your boot options (eg on the kernel command line), your system will boot fully (and you can hot-add the omitted devices back to their arrays); if you haven't, the initramfs will pause briefly to ask you if you want to continue booting anyways, time out on the question, and drop you into an initramfs shell.
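(As an illustration, and with the caveat that the device names here are invented: you can set the boot option persistently by adding bootdegraded=true to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub and then running 'update-grub', and after a degraded boot you can re-add a missing component with something like:

    # see which arrays came up degraded
    cat /proc/mdstat
    # hot-add the omitted component back into its array
    # (/dev/md0 and /dev/sdb1 are made-up names; use your own)
    mdadm /dev/md0 --add /dev/sdb1

)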

This can happen whether or not your root filesystem is on a software RAID array (although it doesn't happen to the root array itself, only to other arrays), and even if the software RAID arrays are not configured or used by your system in any way (not listed in /etc/mdadm/mdadm.conf, not used in /etc/fstab, and so on); simply having software RAID arrays on a disk attached to your system at boot time is enough to trigger the problem. It doesn't require disks that are slow to respond to the kernel, either; we've reproduced this in VMware, where the disks aren't even physical and respond to kernel probes basically instantly.
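(If you want to see what software RAID arrays mdadm can find on your attached disks, whether or not they appear in mdadm.conf, a quick check is:

    # scan all block devices for RAID superblocks and report the arrays
    mdadm --examine --scan

)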

Now let's talk about how this happens.

Like other modern systems Ubuntu 12.04 handles device discovery with udev, even during early boot in the initramfs. Part of udev's device discovery is the assembly of RAID arrays from components. What this means is that software RAID assembly is asynchronous; the initramfs starts the udev daemon, the daemon ends up with a list of events to process, and as it works through them the software RAID arrays start to appear. In the meantime the rest of the initramfs boot process continues on and in short order sets itself up to mount the root filesystem. As part of preparing to mount the root filesystem, the initramfs code then checks to see if all visible arrays are fully assembled and healthy without waiting for udev to have processed all pending events. You know, the events that can include incrementally assembling those arrays.
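(The udev rule that drives this lives in 85-mdadm.rules; a simplified sketch of what it looks like, not a verbatim copy, is:

    # whenever a block device with RAID metadata appears, feed it to mdadm
    SUBSYSTEM=="block", ACTION=="add|change", ENV{ID_FS_TYPE}=="linux_raid_member", \
        RUN+="/sbin/mdadm --incremental $env{DEVNAME}"

Since each component device arrives as its own udev event, an array only becomes fully assembled once udev has processed the events for all of its components.)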

This is a race. If udev wins the race and fully assembles all visible software RAID arrays before the rest of the initramfs checks them, you win and your system boots. If udev loses the race, you lose; the check for degraded software RAID arrays will see some partially assembled arrays and throw up its hands.

Our brute force solution is to modify the check for degraded software RAID arrays to explicitly wait for the udev event queue to drain by running 'udevadm settle'. This appears to work so far but we haven't extensively tested it; it's possible that there's still a race present but it's now small enough that we haven't managed to hit it yet.

This is unquestionably an Ubuntu bug and I hope that it will be fixed in some future update.

Sidebar: our specific fix

(For the benefit of anyone with this problem who's doing Internet searches.)

Change /usr/share/initramfs-tools/scripts/mdadm-functions as follows:

 degraded_arrays()
 {
+	# wait for udev to finish its pending events (and thus finish
+	# incrementally assembling arrays) before checking array state
+	udevadm settle
 	mdadm --misc --scan --detail --test >/dev/null 2>&1
 	return $((! $?))
 }

Then rebuild your current initramfs by running 'update-initramfs -u'.

Since I suspect that mdadm-functions is not considered a configuration file, you may want to put a dpkg hold on the Ubuntu mdadm package so that an automatic upgrade doesn't wipe out your change.
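(One way to do this, if you manage package holds through dpkg itself:

    echo "mdadm hold" | dpkg --set-selections
    # verify that the hold took effect
    dpkg --get-selections mdadm

)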

(This may not be the best and most Ubuntu-correct solution. It's just what we've done and tested right now.)

Sidebar: where the bits of this are on 12.04

  • /lib/udev/rules.d/85-mdadm.rules: the udev rule to incrementally assemble software RAID arrays as components become available.

Various parts of the initramfs boot process are found (on a running system) in /usr/share/initramfs-tools/scripts:

  • init-top/udev: the scriptlet that starts udev.

  • local-premount/mdadm: the scriptlet that checks for all arrays being good; however, it just runs some functions from the next bit. (All of local-premount is run by the local scriptlet, which is run by the initramfs /init if the system is booting from a local disk.)

  • mdadm-functions: the code that does all the work of checking and 'handling' incomplete software RAID arrays.
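(To double-check what actually wound up in your initramfs after a rebuild, initramfs-tools provides lsinitramfs; something like the following should list the mdadm-related bits, although the exact image name depends on your kernel version:

    lsinitramfs /boot/initrd.img-$(uname -r) | grep -i mdadm

)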

Looking at this, I suspect that a better solution is to stick our own script in local-premount, arranged to run before the mdadm script, and have it run 'udevadm settle'; a sketch of what such a scriptlet might look like follows. That would avoid changing any package-supplied scripts.
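A minimal sketch of such a scriptlet, with the caveats that the name is invented and that initramfs-tools only guarantees ordering between scriptlets through their declared PREREQs (so actually making it run before the mdadm scriptlet takes extra arranging):

    #!/bin/sh
    # hypothetical /usr/share/initramfs-tools/scripts/local-premount/aa-udev-settle
    PREREQ=""
    prereqs() { echo "$PREREQ"; }
    case "$1" in
        prereqs)
            prereqs
            exit 0
            ;;
    esac
    # wait for udev to finish processing pending events, which includes
    # incrementally assembling software RAID arrays
    udevadm settle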

(Testing has shown that creating a local-top/mdadm-settle scriptlet isn't good enough. It gets run, but too early. This probably means that modifying the degraded_arrays function is the most reliable solution since it happens the closest to the actual check, and we just get to live with modifying a package-supplied file and so on.)
