Software RAID, udev, and failed disks

March 6, 2008

Suppose that you have a software RAID array. Suppose further that you have a disk or two fail spectacularly; they don't just have errors, they go offline completely.

Naturally, software RAID fails the disks out; you wind up with something in /proc/mdstat that looks like this:

md10 : active raid6 sdbd1[12] sdbc1[11] sdbb1[10] sdba1[9] sdaz1[13](F) sday1[7] sdax1[6] sdaw1[5] sdav1[14](F) sdau1[3] sdat1[2] sdas1[1] sdar1[0]

(Yes, this system does have a lot of disks. Part of it is that multipathed FibreChannel makes disks multiply like rabbits.)

So we want to remove the failed disks from the array (perhaps because we have pulled out their hot-swap drive sleds in order to swap new disks in):

# mdadm /dev/md10 -r /dev/sdav1
mdadm: cannot find /dev/sdav1: No such file or directory

This would be because udev removed the /dev nodes for the disks when they went offline, which is perfectly sensible behavior except that it presents us with a bit of a chicken and egg problem.
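One possible workaround (a sketch, not something verified on this system) is to recreate the missing device node by hand so that mdadm has something to open. This assumes you can still recover the member's major and minor numbers, for instance from the md array's sysfs tree; the paths and numbers below are illustrative, not taken from this machine:

```shell
# The array's per-member sysfs directories may still record the failed
# device; if the 'block' symlink is intact, its 'dev' file holds MAJOR:MINOR.
cat /sys/block/md10/md/dev-sdav1/block/dev

# Recreate the block device node that udev removed (numbers illustrative),
# then retry removing it from the array.
mknod /dev/sdav1 b 66 241
mdadm /dev/md10 -r /dev/sdav1
```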

(If this was a Fedora system with mdadm 2.6.2 I might be able to use the '-r failed' option, but this is a Red Hat Enterprise 5 system with mdadm 2.5.4, and I am out of luck. And if I wanted to remove just one of the two failed drives, I would still be out of luck even on Fedora.)
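For reference, the newer mdadm forms would look something like this (sketched from memory of the mdadm 2.6.2 changes; check mdadm(8) on your own system):

```shell
# Remove every member currently marked faulty.
mdadm /dev/md10 -r failed

# 2.6.2 also added 'detached', for members whose device nodes have
# vanished -- which is exactly the situation here.
mdadm /dev/md10 -r detached
```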

Reinserting the drives doesn't help, at least in this case, as the system sees them as entirely new drives and assigns them different sd-something names. (It does this even when they are literally the same disks, for example if you artificially induced the failure by pulling the drive sleds in the first place.)


Comments on this page:

From 82.95.233.55 at 2008-03-07 02:49:16:

And how did this end? :-)

now you made me really curious ...

-- Natxo Asenjo

From 195.214.232.10 at 2008-03-07 05:03:29:

What about the '-f' option?

By cks at 2008-03-07 14:43:40:

Using -f doesn't help, presumably because mdadm genuinely can't proceed without finding the device.

This isn't a deadly problem (at least in the short term) because you can still hot-add the new names of the reinserted drives back to the array. Short of installing mdadm 2.6.2, the only way I know of to get the failed devices out of the array is to reboot the system (at which point all the devices reshuffle to their 'right' names).
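The hot-add itself is the ordinary mdadm add, using whatever new name the reinserted disk came back under (sdbe1 here is a made-up example name, not one from this system):

```shell
# Add the reinserted disk back into the array under its new name;
# the array then rebuilds onto it as a fresh member.
mdadm /dev/md10 -a /dev/sdbe1
```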

