An important additional step when shifting software RAID mirrors around

April 5, 2014

After going through all of the steps from yesterday's entry to move my mirrors from one disk to another, I inadvertently discovered a vital additional step you need to take here. The additional step is:

  • After you've taken the old disk out of the mirror and shrunk the mirror (steps 4 and 5), either destroy the old disk's RAID superblock or physically remove the disk from your system. I believe that RAID superblocks can be destroyed with the following (where /dev/sdb7 is the old disk):
    mdadm --zero-superblock /dev/sdb7
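
A cautious way to run this step can be sketched as follows. This is a minimal sketch, not a tested procedure: the function names `is_active_member` and `safe_zero_superblock` and the `DRY_RUN` variable are hypothetical and made up for illustration. The only mdadm feature used is `--zero-superblock`. The guard simply refuses to touch a partition that /proc/mdstat still lists as a member of an array.

```shell
# is_active_member DEV [MDSTAT_FILE]
# Succeeds if DEV (e.g. /dev/sdb7) is listed as an array member in the
# given mdstat file. MDSTAT_FILE defaults to /proc/mdstat; the optional
# argument exists only so the check can be tried against a saved copy.
is_active_member() {
    dev=$(basename "$1")
    mdstat=${2:-/proc/mdstat}
    # Members appear as "sdb7[1]", preceded by whitespace.
    grep -Eq "(^|[[:space:]])${dev}\[[0-9]+\]" "$mdstat"
}

# safe_zero_superblock DEV [MDSTAT_FILE]
# Zero the RAID superblock on DEV unless it is still part of an array.
# With DRY_RUN set (a hypothetical knob), only print the command.
safe_zero_superblock() {
    if is_active_member "$1" "${2:-/proc/mdstat}"; then
        echo "refusing: $1 is still listed in an md array" >&2
        return 1
    fi
    ${DRY_RUN:+echo} mdadm --zero-superblock "$1"
}
```

Running `DRY_RUN=1 safe_zero_superblock /dev/sdb7` only prints the command that would be run. After a real run, `mdadm --examine /dev/sdb7` should no longer find a RAID superblock on the partition.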

Failure to do this may cause your system to malfunction either subtly or spectacularly on boot (malfunctioning spectacularly is best because that ensures you notice it). The culprit here is how a modern Linux system assembles RAID arrays on boot. Put simply, there is nothing that forces all of your RAID arrays to be assembled using your current mirrors instead of the obsolete mirrors on your old disk. Instead it seems to come down to which device is processed first. If a partition on your old disk is processed first, it wins the race and becomes the sole member of the RAID array (which may then fail to activate because it doesn't have the full device set). If you're lucky your system now refuses to boot; if you're unlucky, your system boots but with obsolete and unmirrored filesystems, and anything important written to them will cause you a great deal of heartburn as you try to sort out the resulting mess.
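
One way to notice the subtle version of this failure after a boot is to check whether any array has been assembled from a partition on the old disk. The following is a rough sketch under stated assumptions: `arrays_using_disk` is a hypothetical helper name, array names are assumed to start with `md`, and partitions are assumed to be named like `sdb7` or `nvme0n1p7`. It only reads /proc/mdstat and changes nothing.

```shell
# arrays_using_disk DISK [MDSTAT_FILE]
# Print the name of every md array that has a member partition on DISK
# (e.g. "sdb"). MDSTAT_FILE defaults to /proc/mdstat; the optional
# argument exists only so the check can be tried against a saved copy.
arrays_using_disk() {
    disk=$1
    mdstat=${2:-/proc/mdstat}
    awk -v disk="$disk" '
        # Array lines look like: md0 : active raid1 sdb7[1] sda7[0]
        $1 ~ /^md/ && $2 == ":" {
            for (i = 3; i <= NF; i++)
                if ($i ~ ("^" disk "p?[0-9]*\\[[0-9]+\\]")) {
                    print $1
                    break
                }
        }
    ' "$mdstat"
}
```

If `arrays_using_disk sdb` prints anything once the migration is supposedly finished, some array has picked up the old disk and needs attention before anything more is written to it.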

(Linux software RAID appears to be at least smart enough to know that your two current mirror devices and the old disk are not compatible and so doesn't glue them all together. I don't know what GRUB's software RAID code does here if your boot partition is on a software RAID mirror that has had this happen to it.)

This points out core architectural flaws in both the asynchronous assembly process and the approach of removing obsolete devices by failing them first. If mdadm had a 'remove active device' operation, it could at least somehow mark the removed device's superblock as 'do not use to auto-assemble array, this device has been explicitly removed'. If the assembly process were not asynchronous the way it is, it could see that some mirror devices were more recent than others and prefer them. But sadly, well, no.
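
The information needed to tell recent devices from stale ones is at least visible by hand: `mdadm --examine` reports an event count for each member, and a stale mirror will generally have a lower count than the current ones for the same array UUID. As a minimal sketch, the hypothetical helpers below (`summarize_examine` and `report_members` are invented names) pull those two fields out of the `--examine` output. The field labels matched here ("Array UUID" or "UUID", and "Events") are assumptions about the output format and may differ between metadata and mdadm versions.

```shell
# summarize_examine
# Read the output of `mdadm --examine DEV` on stdin and print
# "ARRAY-UUID EVENTS" on one line.
summarize_examine() {
    awk -F' : ' '
        /^ *(Array )?UUID :/ { uuid = $2 }     # skips "Device UUID"
        /^ *Events :/        { events = $2 }
        END { print uuid, events }
    '
}

# report_members DEV...
# Print "DEV ARRAY-UUID EVENTS" for each device given, so that stale
# members (same UUID, lower event count) stand out.
report_members() {
    for dev in "$@"; do
        printf '%s ' "$dev"
        mdadm --examine "$dev" | summarize_examine
    done
}
```

For example, `report_members /dev/sda7 /dev/sdb7 /dev/sdc7` would print one line per partition. This only reports what the superblocks say; it does nothing to change which device wins at assembly time.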

(In theory a not yet activated software RAID array could be revised to kick out the out of date device and replace it with the newer device (although there are policy issues involved). This can't be done at all once the array has been activated, or rather while the array is active.)

Written on 05 April 2014.


Last modified: Sat Apr 5 02:05:37 2014