Linux software RAID mirrors, booting, mdadm.conf, and disk counts for non-fun

January 24, 2023

Linux software RAID mirrors have a count of the number of active disks that are in the array; this is what is set or changed by mdadm's --raid-devices argument. Your mdadm.conf may also list how many active disks an array is supposed to have, in the 'num-devices=' setting (aka a 'tag') for a particular array. The mdadm.conf manual page dryly describes this as "[a]s with level= this is mainly for compatibility with the output of mdadm --examine --scan", which historically and currently is not quite accurate, at least when booting (perhaps only under systemd).
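
(As a concrete illustration, here is roughly the sort of ARRAY line that older versions of 'mdadm --examine --scan' would produce and that you may still have sitting in your mdadm.conf; the UUID here is made up.)

    ARRAY /dev/md0 level=raid1 num-devices=2 UUID=0a1b2c3d:4e5f6071:8293a4b5:c6d7e8f9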

I'll give my current conclusion up front: if you're specifying num-devices= for any software RAID mirrors in your mdadm.conf, you should probably take the setting out. I can't absolutely guarantee that this is either harmless or an improvement, but the odds seem good.
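
(Taking it out is just a matter of editing the ARRAY line, and then regenerating your initramfs, which I'll get to. As a sketch, with the same made-up UUID as above:)

    # before
    ARRAY /dev/md0 level=raid1 num-devices=2 UUID=0a1b2c3d:4e5f6071:8293a4b5:c6d7e8f9
    # after
    ARRAY /dev/md0 UUID=0a1b2c3d:4e5f6071:8293a4b5:c6d7e8f9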

Updating the device count in software RAID mirrors is required when you add devices (for example, adding your new disks alongside your old disks) and recommended when you remove disks (for example, removing your old disks because you've decided the new ones are fine). If you don't increase the number of devices when you add extra disks, what you're really doing is adding spares. If you don't decrease the number of devices on removal, mdadm will send you error reports and generally complain that devices are missing. So let's assume that your software RAID mirror has a correct count.
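
(For the record, the mdadm operations involved look something like this; the array and device names are made up, so adjust to taste.)

    # add the new disk (it starts out as a spare), then raise the
    # active device count so it becomes a real mirror member
    mdadm --manage /dev/md0 --add /dev/sdc1
    mdadm --grow /dev/md0 --raid-devices=3

    # fail and remove an old disk, then lower the count again so
    # mdadm stops expecting three active devices
    mdadm --manage /dev/md0 --fail /dev/sda1 --remove /dev/sda1
    mdadm --grow /dev/md0 --raid-devices=2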

Let's suppose that you have num-devices set in mdadm.conf and that your root filesystem's mdadm.conf is the same as the version in your initramfs (an important qualification, because it's the version in the initramfs that counts during boot). Then there are several cases you may run into. The happy case is that the mdadm.conf disk count matches the actual array's disk count and all disks are visible and included in the live array. Congratulations, you're booting fine.
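
(Checking the initramfs copy is distribution dependent. On a dracut based system like Fedora, something like 'lsinitrd -f /etc/mdadm.conf' should print it; on Ubuntu you can at least see that it's there with lsinitramfs. Either way, after you change the real mdadm.conf you need to regenerate the initramfs.)

    # Fedora and other dracut based systems
    lsinitrd -f /etc/mdadm.conf
    dracut -f                   # regenerate the initramfs for the running kernel

    # Ubuntu and Debian with initramfs-tools
    lsinitramfs /boot/initrd.img-$(uname -r) | grep mdadm.conf
    update-initramfs -u         # regenerate the initramfs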

If the mdadm.conf num-devices is higher than the number of devices claimed by the software RAID array, and the extra disks you removed are either physically removed or have had their RAID superblocks zeroed, then your boot will probably stall and likely error out, or at least that's my recent experience. This is arguably reasonable, especially if num-devices is a genuinely optional parameter in mdadm.conf; you told the boot process this array should have four devices but now it has two, so something is wrong.
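
(One way to spot the mismatch ahead of time is to compare what mdadm.conf says against what the live array and the member superblocks think; both 'mdadm --detail' and 'mdadm --examine' report a 'Raid Devices' count. The device names here are made up.)

    # the live array's view of how many active devices it has
    mdadm --detail /dev/md0 | grep 'Raid Devices'
    # what an individual member's superblock records
    mdadm --examine /dev/sda1 | grep 'Raid Devices'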

If the mdadm.conf num-devices is higher than the number of devices claimed by the array but the extra disks you removed are present and didn't have their RAID superblock zeroed, havoc may ensue. It seems quite likely that your system will assemble the wrong disks into the software RAID array; perhaps it prefers the first disk you failed out and removed, because it still claims to be part of a RAID array that has the same number of disks as mdadm.conf says it should have.
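
(This is one reason it's a good idea to zero the superblock on a disk as you pull it out of a mirror, so that nothing can ever mistake it for a current member. Again, the device name is made up.)

    # only after the disk has been failed out of and removed from the array
    mdadm --zero-superblock /dev/sdd1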

(The RAID superblocks on devices have both a timestamp and an event count, so mdadm could in theory pick the superblocks with the highest event count and timestamp, especially if it can assemble an actual mirror out of them instead of only having one device out of four. But mdadm is what it is.)
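
(You can look at the timestamps and event counts yourself with 'mdadm --examine'; for 1.x superblocks they show up as 'Update Time' and 'Events'. Made-up device names once more:)

    mdadm --examine /dev/sda1 | grep -E 'Update Time|Events'
    mdadm --examine /dev/sdb1 | grep -E 'Update Time|Events'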

If the mdadm.conf num-devices is lower than the number of devices claimed by the software RAID array and all of the disks are present and in sync with each other, then your software RAID array will assemble without problems during boot. This seems to make num-devices a minimum for the number of disks your boot environment expects to see before it declares the RAID array healthy; if you provide extra disks, that's fine with mdadm. However, if you've removed some disks from the array and not zeroed their superblocks, in the past I've had the system assemble the RAID array with the wrong disk even though the RAID superblocks on the other disks agreed with mdadm.conf's num-devices. That may not happen today.

A modern system with all the disks in sync will boot with an mdadm.conf that doesn't have any num-devices settings. This is in fact the way that our Ubuntu 18.04, 20.04, and 22.04 servers set up their mdadm.conf for the root software RAID array, and it works for me on Fedora 36 for some recently created software RAID arrays (that aren't my root RAID array). However, I don't know how such a system reacts when you remove a disk from the RAID array but don't zero the disk's RAID superblock. On the whole I suspect that it won't be worse than what happens when num-devices is set.
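
(For what it's worth, the generated ARRAY lines on such systems look roughly like the following, with a made-up UUID and hostname; there's a metadata= and a name= tag but no num-devices=.)

    ARRAY /dev/md/0 metadata=1.2 UUID=0a1b2c3d:4e5f6071:8293a4b5:c6d7e8f9 name=ahost:0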
