Ubuntu 12.04 can't reliably boot with software RAID (and why)

July 20, 2012

Recently one of my co-workers discovered, diagnosed, and worked around a significant issue with software RAID on Ubuntu 12.04. I'm writing it up here partly to get it all straight in my head and partly so we can help out anyone else with the same problem. The quick summary of the situation comes from my tweet:

Ubuntu 12.04 will not reliably boot a system with software RAID arrays due to races in the initramfs scripts.

(As you might guess, I am not happy.)

If you set up Ubuntu 12.04 with one or more software RAID arrays for things other than the root filesystem, you will almost certainly find that some of the time when you reboot, the system comes up with one or more software RAID arrays in a degraded state, with one or more component devices not added to the array. If you have set bootdegraded=true as one of your boot options (e.g. on the kernel command line), your system will boot fully (and you can hot-add the omitted devices back to the array); if you haven't, the initramfs will pause briefly to ask you if you want to continue booting anyway, time out on the question, and drop you into an initramfs shell.
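(To make that concrete, here is roughly what the recovery looks like; the device names /dev/md0 and /dev/sdb1 below are placeholders, not anything specific to our setup:)

# See which arrays came up degraded and which components are missing.
cat /proc/mdstat

# Hot-add the omitted component back into its array.
mdadm /dev/md0 --add /dev/sdb1

# To set bootdegraded=true persistently, add it to
# GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub and then regenerate
# the GRUB configuration:
update-grub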

This can happen whether or not your root filesystem is on a software RAID array (although it doesn't happen to the root array itself, only to other arrays), and even if the software RAID arrays are not configured or used in your system in any way (not listed in /etc/mdadm/mdadm.conf, not used in /etc/fstab, and so on); simply having software RAID arrays on a disk attached to your system at boot time is enough to trigger the problem. It doesn't even require disks that are slow to respond to the kernel; we've reproduced this in VMware, where the disks aren't even physical and respond to kernel probes essentially instantly.

Now let's talk about how this happens.

Like other modern systems, Ubuntu 12.04 handles device discovery with udev, even during early boot in the initramfs. Part of udev's device discovery is the assembly of RAID arrays from their components. What this means is that software RAID assembly is asynchronous: the initramfs starts the udev daemon, the daemon ends up with a list of events to process, and as it works through them the software RAID arrays start to appear. In the meantime the rest of the initramfs boot process continues on and in short order sets itself up to mount the root filesystem. As part of preparing to mount the root filesystem, the initramfs code checks whether all visible arrays are fully assembled and healthy, without waiting for udev to have processed all pending events. You know, the events that can include incrementally assembling those arrays.
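(Concretely, each such udev event winds up running mdadm in its incremental assembly mode, which is roughly equivalent to doing the following by hand for every RAID component as it appears; the device name here is a placeholder:)

# Feed one just-discovered component to mdadm; the array only becomes
# active once enough of its components have shown up this way.
mdadm --incremental /dev/sdb1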

This is a race. If udev wins the race and fully assembles all visible software RAID arrays before the rest of the initramfs checks them, you win and your system boots. If udev loses the race, you lose; the check for degraded software RAID arrays will see some partially assembled arrays and throw up its hands.

Our brute force solution is to modify the check for degraded software RAID arrays to explicitly wait for the udev event queue to drain by running 'udevadm settle'. This appears to work so far but we haven't extensively tested it; it's possible that there's still a race present but it's now small enough that we haven't managed to hit it yet.

This is unquestionably an Ubuntu bug and I hope that it will be fixed in some future update.

Sidebar: our fix in specific

(For the benefit of anyone with this problem who's doing Internet searches.)

Change /usr/share/initramfs-tools/scripts/mdadm-functions as follows:

 degraded_arrays()
 {
+	udevadm settle
 	mdadm --misc --scan --detail --test >/dev/null 2>&1
 	return $((! $?))
 }

Then rebuild your current initramfs by running 'update-initramfs -u'.
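(As an optional sanity check, our suggestion rather than anything required, you can list the rebuilt initramfs to confirm that the mdadm pieces made it in; lsinitramfs comes with initramfs-tools:)

# The rebuilt initramfs should include mdadm itself and the mdadm
# scriptlets; adjust the image name if you rebuilt a different kernel's.
lsinitramfs /boot/initrd.img-$(uname -r) | grep -i mdadm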

Since I suspect that mdadm-functions is not considered a configuration file, you may want to put a dpkg hold on the Ubuntu mdadm package so that an automatic upgrade doesn't wipe out your change.
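(One way to put a hold on is with dpkg directly; undo it later with 'mdadm install' in place of 'mdadm hold':)

# Hold the mdadm package so upgrades don't silently replace
# mdadm-functions and lose the local change.
echo "mdadm hold" | dpkg --set-selections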

(This may not be the best and most Ubuntu-correct solution. It's just what we've done and tested right now.)

Sidebar: where the bits of this are on 12.04

  • /lib/udev/rules.d/85-mdadm.rules: the udev rule to incrementally assemble software RAID arrays as components become available.

Various parts of the initramfs boot process are found (on a running system) in /usr/share/initramfs-tools/scripts:

  • init-top/udev: the scriptlet that starts udev.

  • local-premount/mdadm: the scriptlet that checks for all arrays being good; however, it just runs some functions from the next bit. (All of local-premount is run by the local scriptlet, which is run by the initramfs /init if the system is booting from a local disk.)

  • mdadm-functions: the code that does all the work of checking and 'handling' incomplete software RAID arrays.

Looking at this, I suspect that a better solution is to stick our own script in local-premount, arranged to run before the mdadm script, and have it run 'udevadm settle'. That would avoid changing any package-supplied scripts.
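(A minimal, untested sketch of what such a scriptlet might look like; the name and location are hypothetical, and since initramfs-tools orders scripts through their PREREQ declarations, actually arranging for it to run before local-premount/mdadm may take some extra care. You would also still need to rebuild the initramfs afterwards.)

#!/bin/sh
# Hypothetical /etc/initramfs-tools/scripts/local-premount/waitformd:
# wait for udev to finish processing queued events before anything
# checks the state of the software RAID arrays.
PREREQ=""
prereqs() { echo "$PREREQ"; }
case "$1" in
    prereqs)
        prereqs
        exit 0
        ;;
esac

udevadm settle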

(Testing has shown that creating a local-top/mdadm-settle scriptlet isn't good enough. It gets run, but too early. This probably means that modifying the degraded_arrays function is the most reliable solution since it happens the closest to the actual check, and we just get to live with modifying a package-supplied file and so on.)


Comments on this page:

From 173.48.253.198 at 2012-07-21 08:48:17:

Is there an Ubuntu bug for this?

From 76.102.104.56 at 2012-07-21 11:35:06:

Is this the same issue as https://bugs.launchpad.net/ubuntu/+source/udev/+bug/631795 ?

If so you may want to chime in there along with your workaround.

Benjamin Franz

By cks at 2012-07-21 23:43:27:

We haven't searched for an Ubuntu bug about this or filed one. Partly this is because we only just finished working through the whole thing on Friday (the day I wrote this entry).

I don't think this is the bug Benjamin Franz pointed to, partly because this problem isn't present in 10.04 and so far is specific to boot time. Having said that, it's entirely possible that udev's asynchronous nature is causing problems elsewhere as well.

From 150.101.192.193 at 2012-07-26 01:41:25:

I think your bug is this one: Bug 942106.

This was discovered in an early 12.04 alpha and a package just arrived in -proposed to fix it.

Software RAID is just not part of the systems Ubuntu cares about. We've been buying RAID cards since 2010.

By cks at 2012-07-26 10:28:43:

For future reference, that's bug 942106.

(I certainly hope that Ubuntu cares about software RAID. We are not interested in spending money on hardware RAID to get a worse experience than software RAID.)

By cks at 2012-07-26 10:31:22:

Whoops, forgot to put this in: thanks for finding the bug report and telling me about it. It's reassuring to know that this is actually a known issue that Ubuntu is working to fix.

(It's not that reassuring to know that Ubuntu has known about this since a 12.04 alpha and still hasn't fixed it, especially when it's a relatively obvious fix.)

From 108.162.128.107 at 2012-11-19 22:45:15:

Awesome article - thanks for posting this information. I tried adding the 'udevadm settle' per your suggestion, but unfortunately this did not solve the race condition for my RAID6 array in Ubuntu Server 12.04.1 LTS.

However, I did manage to prevent Ubuntu from auto-assembling my RAID6 array by doing the following:

sudo mv /lib/udev/rules.d/64-md-raid.rules ~
sudo update-initramfs -k all -u
sudo reboot

Note that depending on the version of Ubuntu Server you're running, the file may be called /lib/udev/rules.d/85-mdadm.rules, since some cleanup has been done (see Ubuntu Bug #1002357).

After hours of troubleshooting, my system boots up now without dropping to the initramfs shell, and I can assemble the array manually or in a startup script:

sudo mdadm --assemble /dev/md0 --scan --no-degraded

Hopefully someone out there finds this useful!

David Kennedy

From 74.108.57.124 at 2013-05-17 11:20:33:

Some of the issues may relate to this: http://neil.brown.name/blog/20120615073245

Hi Chris, thanks for your report. I think, however, that "udevadm settle" is not enough. It waits for the udev queue to be empty, but that, unfortunately, does not seem to mean that there's no more udev activity to come. In fact, we have an Ubuntu 12.04.2 system with updated mdadm scripts that first adds a couple of SATA drives, then needs several seconds to initialize a SAS controller. In the meantime, we always seem to end up with a degraded RAID. Here, the "settle" seems to see an empty queue somewhere halfway through loading the rest of the drivers :-(

So, probably "sleep 9999999; udevadm settle" is the safest way to go but even that isn't race condition free... ;-)

By cks at 2013-10-15 12:04:20:

Yes, you're right; waiting for all pending udev events to be processed is just a (mostly successful) heuristic. What you really want is some sort of 'all device probing and enumeration is finished' event; in the absence of that, all you can do is wait a while and hope that all is well.
