udev, and failed disks
Suppose that you have a software RAID array. Suppose further that you have a disk or two fail spectacularly; they don't just have errors, they go offline completely.
Naturally, software RAID fails the disks out; you wind up with something in
/proc/mdstat that looks like this:
md10 : active raid6 sdbd1 sdbc1 sdbb1 sdba1 sdaz1(F) sday1 sdax1 sdaw1 sdav1(F) sdau1 sdat1 sdas1 sdar1
(Yes, this system does have a lot of disks. Part of it is that multipathed FibreChannel makes disks multiply like rabbits.)
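With this many disks, it can help to pull the failed members out of /proc/mdstat mechanically rather than by eyeball. A small sketch, assuming the one-line member format shown above (where failed members are tagged with `(F)`); `md10` is the array from this example:

```shell
# Print the /dev paths of md10's failed members, one per line.
# Split the mdstat line into tokens, then keep only tokens of the
# form sdXX1(F), stripping the (F) marker.
grep '^md10' /proc/mdstat |
    tr ' ' '\n' |
    sed -n 's/^\(sd[a-z]*1\)(F)$/\/dev\/\1/p'
```

The output can then be fed to `mdadm -r` in a loop, which matters once you have more than a couple of failures to clean up.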
So we want to remove the failed disks from the array (perhaps because we have pulled out their hot-swap drive sleds in order to swap new disks in):
# mdadm /dev/md10 -r /dev/sdav1
mdadm: cannot find /dev/sdav1: No such file or directory
This would be because udev removed the /dev nodes for the disks when they went offline, which is perfectly sensible behavior except that it presents us with a bit of a chicken and egg problem.
(If this was a Fedora system with mdadm 2.6.2 I might be able to use the
'-r failed' option, but this is a Red Hat Enterprise Linux 5 system with
mdadm 2.5.4, and I am out of luck. And if I wanted to remove just one
of the two failed drives, I would still be out of luck even on Fedora.)
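One possible escape hatch, sketched below and untested here: the partition's major:minor numbers usually survive in sysfs even after udev drops the /dev node, so you can recreate the node by hand and give mdadm something to remove. This assumes the failed member was sdav1 and that its sysfs entry is still present; the numbers shown are hypothetical:

```shell
# Fish the major:minor pair for the vanished partition out of sysfs.
dev=$(cat /sys/block/sdav/sdav1/dev 2>/dev/null)   # e.g. "66:241"

# Recreate the block device node by hand, using shell parameter
# expansion to split "major:minor" apart, then remove the member
# from the array and clean up our hand-made node.
mknod /dev/sdav1 b "${dev%%:*}" "${dev##*:}"
mdadm /dev/md10 -r /dev/sdav1
rm -f /dev/sdav1
```

Whether this works depends on how thoroughly the kernel has forgotten about the dead disk; if the sysfs entry is gone too, you are back to the chicken and egg problem.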
Reinserting the drives doesn't help, at least in this case, as the system sees them as entirely new drives and assigns them a different sd-something name. (It does this even if they are literally the same disk, because you artificially induced this failure by pulling the drive sleds in the first place.)
The difference between operations and system administration
Here is a thought that just crystallized for me:
In operations, crud rains down out of the sky and the sysadmins have to make it go and keep it going.
In system administration, at least you get to design and build the crud yourself.
(Corollaries about things you inherit from a previous sysadmin when you move into an existing environment are left as an exercise for the reader.)