Wandering Thoughts archives

2008-03-06

Software RAID, udev, and failed disks

Suppose that you have a software RAID array. Suppose further that a disk or two fail spectacularly; they don't just have errors, they go offline completely.

Naturally, software RAID fails the disks out; you wind up with something in /proc/mdstat that looks like this:

md10 : active raid6 sdbd1[12] sdbc1[11] sdbb1[10] sdba1[9] sdaz1[13](F) sday1[7] sdax1[6] sdaw1[5] sdav1[14](F) sdau1[3] sdat1[2] sdas1[1] sdar1[0]

(Yes, this system does have a lot of disks. Part of it is that multipathed FibreChannel makes disks multiply like rabbits.)
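With this many members, picking the failed ones out by eye gets tedious. Here is a minimal sketch in Python of pulling the `(F)`-flagged members out of an array line like the one above. The function name `failed_members` and the regular expression are my own illustration, and they assume the line has the same `name[slot](flag)` shape as this example; other kernels may format things differently.

```python
import re

# One member entry looks like "sdaz1[13](F)": device name, slot number in
# brackets, then zero or more single-letter flags such as "(F)" for faulty.
MEMBER_RE = re.compile(
    r"^(?P<dev>[^\[\s]+)\[(?P<slot>\d+)\](?P<flags>(?:\([A-Z]\))*)$"
)

def failed_members(mdstat_line):
    """Return (array_name, [failed device names]) for one mdstat array line."""
    name, _, rest = mdstat_line.partition(" : ")
    failed = []
    for token in rest.split():
        m = MEMBER_RE.match(token)
        # Words like "active" and "raid6" have no "[slot]" and are skipped.
        if m and "(F)" in m.group("flags"):
            failed.append(m.group("dev"))
    return name.strip(), failed

LINE = ("md10 : active raid6 sdbd1[12] sdbc1[11] sdbb1[10] sdba1[9] "
        "sdaz1[13](F) sday1[7] sdax1[6] sdaw1[5] sdav1[14](F) sdau1[3] "
        "sdat1[2] sdas1[1] sdar1[0]")

print(failed_members(LINE))
```

For the line above this reports sdaz1 and sdav1 as the failed members of md10.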

So we want to remove the failed disks from the array (perhaps because we have pulled out their hot-swap drive sleds in order to swap new disks in):

# mdadm /dev/md10 -r /dev/sdav1
mdadm: cannot find /dev/sdav1: No such file or directory

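This is because udev removed the /dev nodes for the disks when they went offline. That is perfectly sensible behavior, except that it leaves us with a chicken-and-egg problem: mdadm wants a device node to name the disk it should remove, and the node is gone precisely because the disk failed.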
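Before running mdadm, it can be worth checking which of the failed members still have a device node at all. A rough sketch, assuming you already have the member names from /proc/mdstat; the helper name `missing_nodes` and its `dev_dir` parameter are hypothetical, the latter there only so the check can be pointed at a scratch directory for testing.

```python
import os

def missing_nodes(devices, dev_dir="/dev"):
    """Return the device names that have no entry under dev_dir."""
    return [d for d in devices
            if not os.path.exists(os.path.join(dev_dir, d))]

if __name__ == "__main__":
    # The two failed members from the md10 example in this entry.
    for dev in missing_nodes(["sdaz1", "sdav1"]):
        print("no device node for", dev)
```

This only tells you that the node is gone; it does nothing to bring it back.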

(If this were a Fedora system with mdadm 2.6.2 I might be able to use the '-r failed' option, but this is a Red Hat Enterprise Linux 5 system with mdadm 2.5.4, so I am out of luck. And if I wanted to remove just one of the two failed drives, I would still be out of luck even on Fedora.)

Reinserting the drives doesn't help, at least in this case, as the system sees them as entirely new drives and assigns them a different sd-something name. (It does this even if they are literally the same disk, because you artificially induced this failure by pulling the drive sleds in the first place.)

linux/UdevWithFailedDisks written at 23:53:38

The difference between operations and system administration

Here is a thought that just crystallized for me:

In operations, crud rains down out of the sky and the sysadmins have to make it go and keep it going.

In system administration, at least you get to design and build the crud yourself.

(Corollaries about things you inherit from a previous sysadmin when you move into an existing environment are left as an exercise for the reader.)

sysadmin/OperationsVsSystemAdmin written at 22:09:41


