Recovering from a drive failure on Fedora 20 with LVM on software RAID

March 28, 2014

My office workstation runs on two mirrored disks. For various reasons the mirroring is split; the root filesystem, swap, and /boot are directly on software RAID while things like my home directory filesystem are on LVM on top of software RAID. Today I had one of those two disks fail when I rebooted after applying a kernel upgrade; much to my surprise this caused the entire boot process to fail.

The direct cause of the boot failure was that none of the LVM-based filesystems could be mounted. At first I thought that this was just because LVM hadn't activated, so I tried things like pvscan; much to my surprise and alarm this reported that there were no physical volumes visible at all. Eventually I noticed that the software RAID array that LVM sits on top of was being reported as inactive instead of active and that I couldn't read from the /dev entry for it.
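
For the record, here is roughly what that diagnosis looks like from the command line; /dev/md17 is the name of my LVM-bearing array and yours will differ:

	cat /proc/mdstat            # the array shows up as 'inactive'
	mdadm --detail /dev/md17    # reports the array's state and which disk is missing
	pvscan                      # finds no physical volumes while md17 is unreadable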

The direct fix was to run 'mdadm --run /dev/md17'. This activated the array (and then udev activated LVM and systemd noticed that devices were available for the missing filesystems and mounted them). This was only necessary once; after a reboot (with the failed disk still missing) the array came up fine. I was led to this by the description of --run in the mdadm manpage:

Attempt to start the array even if fewer drives were given than were present last time the array was active. Normally if not all the expected drives are found and --scan is not used, then the array will be assembled but not started. With --run an attempt will be made to start it anyway.

In theory this matched the situation; the last time the array was active it had two drives and now it had only one. The mystery here is that the exact same thing was true for the other mirrors (for /, swap, and /boot) and yet they were activated anyway despite the missing drive.

My only theory for what happened is that something exists that forces activation of mirrors that are seen as necessary for filesystems but doesn't force activation of other mirrors. This something is clearly magical and hidden and of course not working properly. Perhaps this magic lives in mount (or the internal systemd equivalent); perhaps it lives in systemd itself. It's pretty much impossible for me to tell.

(Of course since I have no idea what component is responsible I have no particularly good way to report this bug to Fedora. What am I supposed to report it against?)

(I'm writing this down partly because this may sometime happen to my home system (since it has roughly the same configuration) and if I didn't document my fix and had to reinvent it I would be very angry at myself.)
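
So, for my future self, the condensed recovery sequence (the LVM activation and the mounts cascaded automatically once the array started; if they don't, 'vgchange -ay' is the plausible manual follow-up):

	mdadm --run /dev/md17    # force-start the degraded array
	cat /proc/mdstat         # confirm it now shows as active (degraded)
	pvscan                   # the LVM physical volume should be visible again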


Comments on this page:

By psmears at 2014-09-10 11:41:24:

A lot of the magic to do with detecting and setting up RAID devices, LVM volumes etc. lives in the udev rules (if you ever feel the need to dig deeper...).

By cks at 2014-09-11 00:36:55:

I think that I took a look at the udev rules at the time but couldn't work out exactly where the 'incomplete' RAID arrays were or weren't activated. Based on a quick check of the Fedora 20 udev mdadm rules, I think that incomplete RAID arrays won't be activated by them (the rules use 'mdadm -I' without forcing things, I think).
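
(For anyone who wants to repeat that quick check, something like this should do it; on Fedora the stock rules live under /usr/lib/udev/rules.d/:)

	grep 'mdadm' /usr/lib/udev/rules.d/*.rules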

By psmears at 2014-09-18 10:01:39:

Fair enough. When I had problems with a RAID not coming up I eventually tracked it down to issues with the udev rules, but that was related to dmraid rather than mdadm RAID...
