Switching Linux software RAID disks around in (early) 2023

December 31, 2023

Back at the start of this year I moved my (software RAID) root filesystem on my home Fedora desktop from a mirrored pair of SATA SSDs to a pair of NVMe drives, and this time I kept notes (although I didn't necessarily follow them). For my future use, I'm going to write this up, complete with the steps that I should have done but didn't.

(In this switch, my new disks are nvme0n1p3 and nvme1n1p3, my old disks were sda3 and sdb3, and md10 was the official name of my root filesystem's software RAID mirror.)

As is my custom with such disk switches, I first turned my root filesystem's software RAID into a four-way mirror, using both the SATA SSDs and the NVMe drives. The process for this is to add the extra devices and then increase the number of devices in the RAID:

mdadm -a /dev/md10 /dev/nvme0n1p3
mdadm -a /dev/md10 /dev/nvme1n1p3
mdadm -G -n 4 /dev/md10

If you don't increase the number of devices, you've just added some spares. This is definitely not what I want; when I do this, I want the new drives to be in (full) use in parallel to the old ones, as a burn-in test. (Often an extended one, as it was this time.)

(If you want you can add one device at a time then let your system run that way for a bit, but I usually don't see any reason to go through extra steps.)
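
One way to confirm that the new devices went in as active members rather than spares is to look at the array's status line in /proc/mdstat, where a fully synced four-way mirror shows up as '[4/4] [UUUU]' and spares are marked with '(S)'. The helper below is only a sketch: the function name md_member_state is made up for illustration, and it assumes the usual /proc/mdstat layout in which the status line follows the array's own line.

```shell
# Print the "[raid-devices/active-devices] [UU..]" status for one md array.
# $1 = array name as it appears in /proc/mdstat (e.g. md10)
# $2 = optional file in /proc/mdstat format (defaults to /proc/mdstat)
md_member_state() {
    awk -v md="$1" '
        # A line like "md10 : active raid1 ..." starts a new array section.
        /^[^ ]+ : / { in_array = ($1 == md); next }
        # Within our array, pick out the "[n/m] [UU__]" status fragment.
        in_array && match($0, /\[[0-9]+\/[0-9]+\] \[[U_]+\]/) {
            print substr($0, RSTART, RLENGTH)
            exit
        }
    ' "${2:-/proc/mdstat}"
}
```

Running 'md_member_state md10' after the grow should show fewer than four 'U's until the resync onto the new devices finishes.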

In the past you needed to update /etc/mdadm.conf to have the new number of drives in your software RAID array and then rebuild your initramfs (to update its embedded copy of mdadm.conf), or you'd have boot failures. Currently this isn't (or wasn't) necessary on Fedora; things appear to accept software RAID arrays that have more member devices than mdadm.conf specifies, as I found out when there was an unplanned machine freeze and reboot before I did the initramfs update.

(Alternately you can take the count of devices out of your mdadm.conf entirely. Your initramfs will have to be rebuilt before this takes full effect, but you can perhaps wait for this to happen as part of your distribution's next kernel update.)
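
As a sketch of taking the device count out of mdadm.conf, the following filter drops any num-devices= setting from ARRAY lines and leaves everything else alone. The function name strip_num_devices and the temporary file name are hypothetical, and it assumes the setting is written as num-devices=N separated from its neighbours by whitespace.

```shell
# Read an mdadm.conf on stdin and write it to stdout with
# num-devices=N removed from every ARRAY line.
strip_num_devices() {
    sed -E '/^ARRAY[[:space:]]/ s/[[:space:]]+num-devices=[0-9]+//'
}

# Possible use (review the result before installing it):
#   strip_num_devices < /etc/mdadm.conf > /tmp/mdadm.conf.new
```

The edited file only matters at boot once the initramfs has been rebuilt with it.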

Once you've decided that your new drives are stable, you transition away from the old devices by marking them failed and then removing them:

mdadm --fail /dev/md10 /dev/sda3
mdadm --fail /dev/md10 /dev/sdb3
mdadm --remove /dev/md10 /dev/sda3
mdadm --remove /dev/md10 /dev/sdb3

('-r' is simply the short form of '--remove', so either spelling works.) After doing this there are two essential things you need to do, neither of which I actually did, to my eventual sorrow. First, you have to zero the RAID superblocks on the old devices (this has been an issue for a long time):

mdadm --zero-superblock /dev/sda3
mdadm --zero-superblock /dev/sdb3

If you don't zero the old superblocks, your system may well reboot with their old version of your root filesystem instead of the current one, and you'll have to immediately halt the system and physically pull the old drives (you might as well dust it out while you have it open, if this is a desktop). If you had other stuff on the old drives in addition to the old software RAID mirrors, well, you would be in some trouble.
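
Before rebooting, it may be worth checking whether the old partitions still look like RAID members. The sketch below asks blkid for each device's type and reports any that still identify as 'linux_raid_member'. The function names and the BLKID override variable are made up for illustration, it should be run as root so that blkid probes the device itself, and 'mdadm --examine' on each device is the more direct cross-check.

```shell
# Allow the blkid command to be overridden (handy for dry runs).
BLKID=${BLKID:-blkid}

# Succeeds if blkid reports the device as a software RAID member.
has_raid_signature() {
    [ "$("$BLKID" -o value -s TYPE "$1" 2>/dev/null)" = "linux_raid_member" ]
}

# Print one status line per device given on the command line.
report_old_members() {
    for dev in "$@"; do
        if has_raid_signature "$dev"; then
            echo "$dev: still has a RAID signature"
        else
            echo "$dev: no RAID signature found"
        fi
    done
}

# Possible use: report_old_members /dev/sda3 /dev/sdb3
```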

Once you've removed the old disks (and zeroed their superblocks), you then need to shrink the number of devices in the software RAID array back down to two devices (otherwise various things will complain about missing devices):

mdadm -G -n 2 /dev/md10

However, unlike the case of adding drives, after shrinking the number of devices in the array you have to update /etc/mdadm.conf to have the new device count and then rebuild your initramfs so that it includes your new mdadm.conf; on Fedora this is done with 'dracut --force'. Fedora's Dracut initramfs environment will accept a software RAID array with more devices than specified, but (perhaps reasonably) it will refuse to accept one with fewer devices. Alternately, you can completely remove num-devices= from your mdadm.conf, although you'll still need to rebuild your initramfs if you haven't done this already.

(I believe you get dropped into an emergency rescue shell and are left to fix things up yourself. I didn't keep notes on this process; interested parties are encouraged to experiment in a virtual machine.)
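
To catch a mismatch before it turns into a rescue shell, the configured device count can be compared with what the array itself reports. This is a minimal sketch with hypothetical function names: one helper pulls num-devices= out of an mdadm.conf for a given array, the other pulls the 'Raid Devices' count out of 'mdadm --detail' output. It assumes the array is named in the ARRAY line exactly as you pass it in.

```shell
# Print the num-devices= value for one array from an mdadm.conf on stdin.
# Prints nothing if the ARRAY line has no num-devices= setting.
# $1 = array device as written in the ARRAY line (e.g. /dev/md10)
conf_num_devices() {
    awk -v arr="$1" '
        $1 == "ARRAY" && $2 == arr {
            for (i = 3; i <= NF; i++)
                if ($i ~ /^num-devices=/) {
                    sub(/^num-devices=/, "", $i)
                    print $i
                }
            exit
        }
    '
}

# Print the "Raid Devices" count from `mdadm --detail` output on stdin.
detail_raid_devices() {
    awk -F: '/Raid Devices/ { gsub(/[ \t]/, "", $2); print $2; exit }'
}

# Possible use (as root):
#   conf_num_devices /dev/md10 < /etc/mdadm.conf
#   mdadm --detail /dev/md10 | detail_raid_devices
```

The copy of mdadm.conf that matters at boot is the one inside the initramfs. On Fedora I believe something like 'lsinitrd -f etc/mdadm.conf' will print it so that it can be fed to conf_num_devices as well, but treat that invocation as an assumption and check it against your version of dracut.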

When I moved away from the old SATA SSDs, I forgot to zero the old RAID superblocks and then (after fixing that) I discovered that I'd incorrectly assumed that Fedora's initramfs didn't care about any change in the number of drives. Hopefully I'll remember next time around, or at least re-read this entry, which is (or was) current as of my experiences in early to mid 2023 (things keep changing in this area of Linux).

As advice for my future self, what I should have done is written out a full checklist in advance and then ticked things off as I went through them. This would have made sure that I didn't forget important steps (like zeroing the old RAID superblocks), or let them slide with the excuse that they'd happen as a side effect of my next kernel update (because my system can always reboot by surprise before then).
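
As a sketch of such a checklist, the function below only prints the steps from this entry, in order, with the device names filled in; it doesn't run anything. The function name print_switch_checklist is made up, and the device counts of 4 and 2 are hardcoded for the two-old, two-new case described here.

```shell
# Print a tick-off checklist for moving a two-disk mirror to two new disks.
# $1 = md array, $2 and $3 = new devices, $4 and $5 = old devices
print_switch_checklist() {
    cat <<EOF
[ ] mdadm -a $1 $2
[ ] mdadm -a $1 $3
[ ] mdadm -G -n 4 $1
[ ] wait for the resync, then run as a four-way mirror for burn-in
[ ] mdadm --fail $1 $4
[ ] mdadm --fail $1 $5
[ ] mdadm --remove $1 $4
[ ] mdadm --remove $1 $5
[ ] mdadm --zero-superblock $4
[ ] mdadm --zero-superblock $5
[ ] mdadm -G -n 2 $1
[ ] update or remove num-devices= in /etc/mdadm.conf
[ ] dracut --force
EOF
}

# Possible use:
#   print_switch_checklist /dev/md10 /dev/nvme0n1p3 /dev/nvme1n1p3 /dev/sda3 /dev/sdb3
```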

(I've written entries about this in the past, 1, 2, 3, as well as shrinking a mirrored swap partition.)
