2014-03-28
Recovering from a drive failure on Fedora 20 with LVM on software RAID
My office workstation runs on two mirrored disks. For various reasons the mirroring is split; the root filesystem, swap, and /boot are directly on software RAID while things like my home directory filesystem are on LVM on top of software RAID. Today I had one of those two disks fail when I rebooted after applying a kernel upgrade; much to my surprise this caused the entire boot process to fail.
The direct cause of the boot failure was that none of the LVM-based filesystems could be mounted. At first I thought that this was just because LVM hadn't activated, so I tried things like pvscan; much to my surprise and alarm this reported that there were no physical volumes visible at all. Eventually I noticed that the software RAID array that LVM sits on top of was being reported as inactive instead of active and that I couldn't read from the /dev entry for it.
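(For the record, the sort of checks involved here look roughly like this; the /dev/md17 array name is specific to my setup and will be different on other machines.)

    # see which software RAID arrays are active and which are inactive
    cat /proc/mdstat
    # look at the detailed state of the suspect array
    mdadm --detail /dev/md17
    # confirm that LVM currently sees no physical volumes at all
    pvscan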
The direct fix was to run 'mdadm --run /dev/md17'. This activated the array (and then udev activated LVM and systemd noticed that devices were available for the missing filesystems and mounted them). This was only necessary once; after a reboot (with the failed disk still missing) the array came up fine. I was led to this by the description of --run in the mdadm manpage:
    Attempt to start the array even if fewer drives were given than were present last time the array was active. Normally if not all the expected drives are found and --scan is not used, then the array will be assembled but not started. With --run an attempt will be made to start it anyway.
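(Put together, the recovery sequence is roughly the following sketch. The /dev/md17 name is specific to my machine, and on my system udev and systemd handled the LVM activation and mounting on their own once the array was running; the pvscan, vgchange, and mount steps are only there as a fallback in case that doesn't happen automatically.)

    # force the degraded array to start despite the missing disk
    mdadm --run /dev/md17
    # if LVM isn't activated automatically, do it by hand
    pvscan
    vgchange -ay
    # then mount whatever filesystems are still missing
    mount -a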
In theory the manpage's description matched my situation; the last time the array was active it had two drives and now it only had one. The mystery here is that the exact same thing was true for the other mirrors (for /, swap, and /boot) and yet they were activated anyway despite the missing drive.
My only theory for what happened is that something exists that forces activation of mirrors that are seen as necessary for filesystems but doesn't force activation of other mirrors. This something is clearly magical and hidden and of course not working properly. Perhaps this magic lives in mount (or the internal systemd equivalent); perhaps it lives in systemd itself. It's pretty much impossible for me to tell.
(Of course since I have no idea what component is responsible I have no particularly good way to report this bug to Fedora. What am I supposed to report it against?)
(I'm writing this down partly because this may sometime happen to my home system (since it has roughly the same configuration) and if I didn't document my fix and had to reinvent it I would be very angry at myself.)
How we wound up with an RFC 1918 IP address visible in our public DNS
This is kind of a war story.
The whole saga started with a modern, sophisticated, Internet-enabled projector, one that supports 'network projection' where you use software to feed it things to display instead of connecting to a VGA port or the like. This is quite handy because an increasing number of devices that people want to do presentations from, such as tablets, simply do not have a spare VGA port. This network projection requires special software and, as we found out, this software absolutely does not work if there is NATing in the way between your device and the projector. Unfortunately in our environment this is a real problem for wireless devices (such as those tablets) because there is no way off our wireless network without going through a NAT gateway of some sort.
(One of many reasons that this is required is that the wireless network uses RFC 1918 IP address space.)
If getting off the wireless network requires NAT and the software can't work with NAT, the conclusion is simple: we have to put the data projector on the wireless network (on what is amusingly called the 'wired wireless'). Wireless devices can talk to it, wired devices can talk to it by plugging into a little switch next to it, and everything is happy. But what about DNS? People would like to connect to the data projector by name, not just by IP address.
Like many places we have a 'split horizon' DNS setup, with internal DNS and public DNS. People using our VPN to authenticate on the wireless network and get access to internal services use the internal DNS servers, which are already full of RFC 1918 IP addresses for machines in our sandboxes. Unfortunately it's also possible to register wireless devices for what we call the 'airport experience', where we give devices external connectivity to the campus but no special access to our internal networks (as we feel that wireless MAC addresses aren't sufficient authentication for internal network access).
Devices using the airport experience can't use our internal DNS servers, partly because many of the IP addresses that the DNS servers would return can't be used outside our internal networks. Instead they get DNS from general campus recursive DNS servers, which of course use our public DNS data. Yet these devices still need to be able to look up the name for the data projector and get the wireless network's RFC 1918 IP address for it so they can talk to it directly with no NATing. The simplest, lowest overhead way to do this was to put the RFC 1918 wireless IP address for the data projector into our public DNS.
And that is why our public DNS now has a DNS record with an RFC 1918 IP address.
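(For illustration only: in a BIND-style zone file such a record is nothing special and would look something like the following, with a made-up hostname and address standing in for our real ones.)

    ; public zone data for a hypothetical example.org
    projector.example.org.    IN    A    172.16.10.25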
(I confessed to this today on Twitter so I decided that I might as well tell the story here.)
PS: people will probably suggest dnsmasq as a possible solution. It might be one, but we aren't already using it, so at a minimum it'd be much more work than adding a DNS entry to our public DNS.