Software RAID, udev, and failed disks

March 6, 2008

Suppose that you have a software RAID array. Suppose further that you have a disk or two fail spectacularly; they don't just have errors, they go offline completely.

Naturally, software RAID fails the disks out; you wind up with something in /proc/mdstat that looks like this:

md10 : active raid6 sdbd1[12] sdbc1[11] sdbb1[10] sdba1[9] sdaz1[13](F) sday1[7] sdax1[6] sdaw1[5] sdav1[14](F) sdau1[3] sdat1[2] sdas1[1] sdar1[0]

(Yes, this system does have a lot of disks. Part of it is that multipathed FibreChannel makes disks multiply like rabbits.)

So we want to remove the failed disks from the array (perhaps because we have pulled out their hot-swap drive sleds in order to swap new disks in):

# mdadm /dev/md10 -r /dev/sdav1
mdadm: cannot find /dev/sdav1: No such file or directory

This would be because udev removed the /dev nodes for the disks when they went offline, which is perfectly sensible behavior except that it presents us with a bit of a chicken-and-egg problem: we need the device node to remove the disk, but the device node is gone because the disk failed.
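
One possible way out (a sketch only, assuming that mdadm --detail still reports the Major and Minor numbers for the faulty slots; the '66 241' below is made up for illustration) is to recreate the missing device node by hand with mknod and then retry the removal:

# mdadm --detail /dev/md10
# mknod /dev/sdav1 b 66 241
# mdadm /dev/md10 -r /dev/sdav1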

(If this were a Fedora system with mdadm 2.6.2, I might be able to use the '-r failed' option, but this is a Red Hat Enterprise 5 system with mdadm 2.5.4, and I am out of luck. And if I wanted to remove just one of the two failed drives, I would still be out of luck even on Fedora.)
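
For comparison, on a system with a new enough mdadm the whole dance would presumably collapse to something like:

# mdadm /dev/md10 -r failed

which removes every failed device at once without needing their /dev names at all (and which is exactly why it doesn't help if you only want to remove one of the two).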

Reinserting the drives doesn't help, at least in this case, as the system sees them as entirely new drives and assigns them different sd-something names. (It does this even if they are literally the same disks, because you artificially induced this failure by pulling their drive sleds in the first place.)
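
As an aside, if you want to check whether a reinserted disk under its new name really is the old array member, one sketch (the sdbe1 name here is hypothetical) is to compare the UUID line in its RAID superblock against the array's own:

# mdadm --examine /dev/sdbe1
# mdadm --detail /dev/md10

If the UUIDs match, the disk was a member of this array, whatever the kernel is calling it now.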
