How Linux servers should boot with software RAID

February 26, 2013

One of the things that I like least about the modern Linux server environment is how badly software RAID interacts with the initramfs in modern distributions and how easy it is to detonate your systems as a result. Ubuntu is especially obnoxious, but it's not the only offender. When this stuff blows up in my face it makes me angry for all sorts of reasons; not just that things have blown up (usually because I failed to do a magic rain dance, sometimes just on their own) but also because the people who built the initramfs systems for this should have known better.

Linux booting with software RAID arrays should work like this:

  • the initramfs should not care about or even try to assemble any RAID array(s) apart from those necessary to start the root filesystem and maybe any other immediately crucial filesystems (/usr, perhaps). Yes, working out which RAID array is needed for the root filesystem might take a little bit of work. Tough; do it anyways when you build the initramfs.

    (If the root filesystem can't be found, it's sensible for the initramfs to ask you if it should assemble any other RAID arrays it can find to see if the root filesystem was lurking on one of them. But this is part of the initramfs recovery environment, not part of normal booting.)

  • provided that the RAID array(s) are intact and functioning, failure of the RAID arrays to exactly correspond to the expected configuration should not abort booting. To be blunt, if you add a third mirror to your root filesystem's RAID array your system should not then fail to boot because you didn't perform a magic rain dance.

    (Yes, this happened to me. It can happen to you.)

  • server oriented installs should default to booting with degraded but working RAID arrays for the root filesystem (or any other filesystem). Allow the sysadmin to change that if they want, ask about it in the installer if you want, but trust me, almost no sysadmin is RAIDing their filesystems so that the machine can fail to start if one disk dies. It's generally the exact opposite.

  • additional RAID arrays should be assembled only after the system has started to boot on the real root filesystem. Ideally they would be assembled only if they are listed in mdadm.conf or if they seem necessary to mount some /etc/fstab filesystem. If the system can't assemble RAID arrays that are necessary for /etc/fstab listed filesystems, well, this is no different than if some filesystem in /etc/fstab is unavailable for another reason; drop the system into a single-user shell or whatever else you do.

I think that it's legitimate for a system to treat failure to assemble an array listed in mdadm.conf as equivalent to failure to mount a filesystem. Other people may disagree. Providing a configuration option for this may be wise.

It may be that there are some system configurations where the initramfs building system absolutely can't work out what RAID array(s) are needed for the root filesystem. In this situation, and this situation only, the initramfs can try more RAID arrays than usual. But it should not have to in straightforward situations such as filesystem on RAID or filesystem on LVM on RAID; both are common and can be handled without problems.
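To make the "work it out when you build the initramfs" step concrete, here is a minimal sketch of one way it could be done. It is not what any distribution's initramfs builder actually does. It assumes the kernel exposes the device stack through `/sys/class/block/<dev>/slaves` and that md arrays can be recognized by an `md` subdirectory there. The function names (`read_slaves`, `is_md_array`, `arrays_under`) are made up for this illustration.

```python
import os

SYS_CLASS_BLOCK = "/sys/class/block"  # assumption: sysfs is mounted at /sys


def read_slaves(dev):
    """Return the devices that `dev` is built on top of (empty for plain disks)."""
    try:
        return sorted(os.listdir(os.path.join(SYS_CLASS_BLOCK, dev, "slaves")))
    except OSError:
        # Partitions have no 'slaves' directory at all.
        return []


def is_md_array(dev):
    """Assumption: md arrays are the devices with an 'md' subdirectory in sysfs."""
    return os.path.isdir(os.path.join(SYS_CLASS_BLOCK, dev, "md"))


def arrays_under(dev, slaves_of=read_slaves, is_md=is_md_array):
    """Collect every md array found anywhere in the device stack below `dev`."""
    found, stack, seen = [], [dev], set()
    while stack:
        cur = stack.pop()
        if cur in seen:
            continue
        seen.add(cur)
        if is_md(cur):
            found.append(cur)
        # Keep descending: an array can itself sit on top of other arrays.
        stack.extend(slaves_of(cur))
    return sorted(found)


# Demonstration on a fake "root on LVM on RAID" layout, so that it runs anywhere.
fake = {"dm-0": ["md1"], "md1": ["sda2", "sdb2"], "md0": ["sda1", "sdb1"]}
print(arrays_under("dm-0",
                   slaves_of=lambda d: fake.get(d, []),
                   is_md=lambda d: d.startswith("md")))  # prints ['md1']
```

In the fake layout only `md1` is reported for the root device. `md0` exists on the same disks but is not part of the root filesystem's stack, so under the scheme above the initramfs would have no business touching it.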

The following are somewhat less important wishlist items.

  • there should be an obvious, accessible setting for 'only touch RAID arrays that are listed in mdadm.conf, never touch ones that aren't'.

  • the system should never attempt to automatically assemble a damaged RAID array that is not listed in mdadm.conf, no matter what. When the system encounters an unknown RAID array its first directive should be 'do no harm'; it should only touch the RAID array if it can be sure that it is not damaging anything. An incomplete RAID array or one that needs resynchronization does not qualify.

  • the initramfs should not contain a copy of mdadm.conf. There are too many things that can go wrong if there is one, even if it's not supposed to be consulted. The only thing that the initramfs really needs to boot the system is the UUID(s) of the crucial RAID array(s), and it should contain this information directly.

(If some software absolutely has to have a non-empty mdadm.conf to work, the initramfs mdadm.conf should be specially created with the absolute minimum necessary information in it. Copying the normal system mdadm.conf is both lazy and dangerous.)
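As a rough illustration of what "the absolute minimum necessary information" could look like, the following sketch filters a full mdadm.conf down to just the `ARRAY` lines for the arrays the initramfs needs, identified by UUID. It assumes every `ARRAY` entry sits on a single line with an unabbreviated keyword (the form that `mdadm --detail --scan` prints) and ignores continuation lines. `minimal_mdadm_conf` and `needed_uuids` are hypothetical names, not part of any existing tool.

```python
def minimal_mdadm_conf(full_conf_text, needed_uuids):
    """Return only the ARRAY lines whose UUID= value is in `needed_uuids`.

    Everything else (DEVICE, MAILADDR, comments, other arrays) is dropped.
    Assumption: one ARRAY entry per line, keyword spelled out in full.
    """
    wanted = {u.lower() for u in needed_uuids}
    kept = []
    for line in full_conf_text.splitlines():
        words = line.split()
        if not words or words[0].upper() != "ARRAY":
            continue
        uuids = [w.split("=", 1)[1].lower()
                 for w in words[1:] if w.lower().startswith("uuid=")]
        if any(u in wanted for u in uuids):
            kept.append(" ".join(words))
    return "\n".join(kept) + ("\n" if kept else "")


# Made-up example configuration; the UUIDs are placeholders, not real arrays.
example = """\
DEVICE partitions
MAILADDR root
ARRAY /dev/md0 metadata=1.2 UUID=11111111:22222222:33333333:44444444
ARRAY /dev/md1 metadata=1.2 UUID=aaaaaaaa:bbbbbbbb:cccccccc:dddddddd
"""
print(minimal_mdadm_conf(example, ["aaaaaaaa:bbbbbbbb:cccccccc:dddddddd"]), end="")
```

The list of needed UUIDs would have to come from whatever worked out which arrays sit under the root filesystem. The sketch only shows that the filtering itself is trivial once that is known.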

Sidebar: where initramfses fail today

There are two broad failures and thus two failure modes.

One failure is that your initramfs likely silently includes a copy of mdadm.conf and generally uses the information there to assemble RAID arrays. If this initramfs copy of mdadm.conf no longer matches reality, bad things often happen. Often this mismatch doesn't have to be particularly major; in some systems, it can even be trivial. This is especially dangerous because major distributions don't put a big bold-type warning in mdadm.conf saying 'if you change this file, immediately rebuild your initramfs by doing ...'. (Not that this is good enough.)

The other failure is Ubuntu's failure, where the initramfs tries to assemble all RAID arrays that it can find on devices, whether or not they are listed in its mdadm.conf, and then if any of them fail to assemble properly the initramfs throws up its hands and stops. This is terrible in all sorts of ways that should have been obvious to the clever people who put all the pieces of this scheme together.

(Rebuilding the initramfs is not good enough partly because it isn't rebuilding just one initramfs; you need to rebuild the initramfs for all of the theoretically functioning kernels you have sitting around in /boot, assuming that you actually want to be able to use them.)
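To show what "all of the kernels in /boot" means in practice, here is a small sketch that lists the kernel versions present in /boot and builds one rebuild command per version. It assumes Debian/Ubuntu conventions: kernel images named `vmlinuz-<version>` and the `update-initramfs -u -k <version>` command (which also accepts `-k all`). The function names are invented for the example, and nothing is executed; the commands are only printed.

```python
import os


def installed_kernel_versions(boot_files):
    """Extract kernel versions from file names of the form 'vmlinuz-<version>'."""
    prefix = "vmlinuz-"
    versions = {name[len(prefix):] for name in boot_files
                if name.startswith(prefix) and len(name) > len(prefix)}
    return sorted(versions)


def rebuild_commands(versions):
    """One 'update-initramfs -u -k <version>' command line per kernel version."""
    return [["update-initramfs", "-u", "-k", v] for v in versions]


if os.path.isdir("/boot"):
    # Print only; actually running these needs root and a Debian-style system.
    for cmd in rebuild_commands(installed_kernel_versions(os.listdir("/boot"))):
        print(" ".join(cmd))
```

Other distributions name their kernel images and their rebuild tools differently, so the file name pattern and the command here would both need adjusting.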


Comments on this page:

From 129.102.5.21 at 2013-02-27 05:46:02:

While I understand and fully agree with your point of "don't touch RAID arrays unless and until you need them", your approach is a bit too restrictive for my taste as the "only things in mdadm.conf or fstab" part completely ignores virtual machines.

This is basically a variant of a bug in RHEL5 where networked LVM volumes (think iSCSI) are not activated unless at least one filesystem in fstab was marked as remote (with the _netdev flag). In my case that meant VMs hosted on LVM-over-iSCSI were not started at boot.

Of course here I mean "VMs needed explicit configuration to start" whereas in my opinion they should not need it any more than a filesystem; I guess my point is "fstab is not the only implicit source of info".

-- Arnaud Gomes

By cks at 2013-02-27 13:15:25:

My view is that any host-managed RAID arrays for VMs should be listed in the host's mdadm.conf. If they're actually guest-managed arrays, I definitely don't want the host trying to assemble them underneath the guests.

(One way of putting this is that I consider mdadm.conf to be the RAID array version of fstab. Your example shows that LVM probably needs some equivalent too.)

From 68.195.89.131 at 2013-02-27 13:30:06:

Has recovery of any software RAID array ever caused a problem for you because of too much CPU consumption?

By cks at 2013-02-27 15:18:37:

The problem with array resynchronization isn't that it puts a CPU or IO load on the system (although that's an issue), it's that it necessarily writes to at least one disk. If the apparent array configuration or the resynchronization is off, this can destroy good data on that disk (and this is not theoretical, it has happened to people).

Therefore the system should not start a resync operation on a RAID array it has simply discovered from on-disk data, because that data may be wrong or there may be things about the overall situation that the system doesn't know. Only if the on-disk data matches mdadm.conf is it reasonably safe to (re)start a resync.

Written on 26 February 2013.

Last modified: Tue Feb 26 23:48:07 2013