Recovering from a drive failure on Fedora 20 with LVM on software RAID
My office workstation runs on two mirrored disks. For various reasons
the mirroring is split; the root filesystem, swap, and /boot are
directly on software RAID while things like my home directory
filesystem are on LVM on top of software RAID. Today I had one of
those two disks fail when I rebooted after applying a kernel upgrade;
much to my surprise this caused the entire boot process to fail.
The direct cause of the boot failure was that none of the LVM-based
filesystems could be mounted. At first I thought that this was just
because LVM hadn't activated, so I tried things like 'pvscan'; much to
my surprise and alarm this reported that there were no physical volumes
visible at all. Eventually I noticed that the software RAID array that
LVM sits on top of was being reported as inactive instead of active and
that I couldn't read from the /dev entry for it.
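
A quick way to spot this state is to look for arrays that the kernel lists as inactive. The following is a minimal sketch that assumes the usual /proc/mdstat layout ('mdN : inactive ...'); the function name list_inactive_md is hypothetical, and the optional file argument exists only so it can be tried against saved output.

```shell
#!/bin/sh
# Print the names of md arrays reported as inactive.
# Assumption: lines look like "md17 : inactive sdb5[1](S)", as in /proc/mdstat.
# list_inactive_md is a hypothetical helper name, not part of mdadm.
list_inactive_md() {
    # Field 1 is the array name, field 2 is ":", field 3 is the state.
    awk '$2 == ":" && $3 == "inactive" { print $1 }' "${1:-/proc/mdstat}"
}

# Usage on a live system:
#   list_inactive_md
```

An array that shows up here has been assembled but not started, which is why nothing stacked on top of it (such as LVM) can see any data.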
The direct fix was to run 'mdadm --run /dev/md17'. This activated the
array (and then udev activated LVM and systemd noticed that devices were
available for the missing filesystems and mounted them). This was only
necessary once; after a reboot (with the failed disk still missing) the
array came up fine. I was led to this by the description of --run in the mdadm manpage:
Attempt to start the array even if fewer drives were given than were present last time the array was active. Normally if not all the expected drives are found and --scan is not used, then the array will be assembled but not started. With --run an attempt will be made to start it anyway.
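
If more than one array ends up in this state, the same 'mdadm --run' fix can be looped over everything that /proc/mdstat lists as inactive. This is only a sketch under assumptions: start_inactive_md and the MDADM override are hypothetical names used for illustration, each array's device node is assumed to be /dev/<name>, and forcing a degraded array to start is only sensible once you know why it is inactive.

```shell
#!/bin/sh
# Sketch: run 'mdadm --run' on every array that mdstat lists as inactive.
# Assumptions: the array named mdN in /proc/mdstat is reachable as /dev/mdN.
# start_inactive_md and MDADM are hypothetical names, not part of mdadm.
# Set MDADM='echo mdadm' to print the commands instead of running them.
start_inactive_md() {
    mdstat="${1:-/proc/mdstat}"
    for name in $(awk '$2 == ":" && $3 == "inactive" { print $1 }' "$mdstat"); do
        # MDADM is deliberately unquoted so 'echo mdadm' splits into two words.
        ${MDADM:-mdadm} --run "/dev/$name"
    done
}

# Dry run on a live system:
#   MDADM='echo mdadm'; start_inactive_md
```

Doing a dry run first shows exactly which arrays would be started before anything is changed.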
In theory this matched the situation; the last time the array was active
it had two drives and now it only had one. The mystery here is that the
exact same thing was true for the other mirrors (for /, swap, and
/boot) and yet they were activated anyways despite the missing drive.
My only theory for what happened is that something exists that forces
activation of mirrors that are seen as necessary for filesystems but
doesn't force activation of other mirrors. This something is clearly
magical and hidden and of course not working properly. Perhaps this
magic lives in mount (or the internal systemd equivalent); perhaps it
lives in systemd itself. It's pretty much impossible for me to tell.
(Of course since I have no idea what component is responsible I have no particularly good way to report this bug to Fedora. What am I supposed to report it against?)
(I'm writing this down partly because this may sometime happen to my home system (since it has roughly the same configuration) and if I didn't document my fix and had to reinvent it I would be very angry at myself.)