Wandering Thoughts archives

2014-03-28

Recovering from a drive failure on Fedora 20 with LVM on software RAID

My office workstation runs on two mirrored disks. For various reasons the mirroring is split; the root filesystem, swap, and /boot are directly on software RAID while things like my home directory filesystem are on LVM on top of software RAID. Today I had one of those two disks fail when I rebooted after applying a kernel upgrade; much to my surprise this caused the entire boot process to fail.

The direct cause of the boot failure was that none of the LVM-based filesystems could be mounted. At first I thought that this was just because LVM hadn't activated, so I tried things like pvscan; much to my surprise and alarm this reported that there were no physical volumes visible at all. Eventually I noticed that the software RAID array that LVM sits on top of was being reported as inactive instead of active and that I couldn't read from its /dev entry.
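
For anyone who hits the same thing, the places to look are roughly these (a sketch; /dev/md17 is the array name in my setup):

    # LVM reports no physical volumes at all, because the only PV
    # lives on the inactive RAID array:
    pvscan

    # Check the software RAID arrays; the one that LVM sits on top of
    # shows up as 'inactive' while the other mirrors are 'active':
    cat /proc/mdstat
    mdadm --detail /dev/md17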

The direct fix was to run 'mdadm --run /dev/md17'. This activated the array (and then udev activated LVM and systemd noticed that devices were available for the missing filesystems and mounted them). This was only necessary once; after a reboot (with the failed disk still missing) the array came up fine. I was led to this by the description of --run in the mdadm manpage:

Attempt to start the array even if fewer drives were given than were present last time the array was active. Normally if not all the expected drives are found and --scan is not used, then the array will be assembled but not started. With --run an attempt will be made to start it anyway.
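
Concretely the recovery came down to something like this (a sketch; in my case udev and systemd handled the LVM activation and mounting on their own once the array was running):

    # Force the degraded array to start with fewer drives than it had
    # the last time it was active:
    mdadm --run /dev/md17

    # If LVM and the mounts don't follow automatically, by hand it
    # would be roughly:
    vgchange -ay        # activate all now-visible LVM volume groups
    mount -a            # mount whatever from /etc/fstab is still missing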

In theory the manpage's description matched my situation: the last time the array was active it had two drives and now it only had one. The mystery here is that the exact same thing was true for the other mirrors (for /, swap, and /boot) and yet they were activated anyway despite the missing drive.

My only theory for what happened is that something exists that forces activation of mirrors that are seen as necessary for filesystems but doesn't force activation of other mirrors. This something is clearly magical and hidden and of course not working properly. Perhaps this magic lives in mount (or the internal systemd equivalent); perhaps it lives in systemd itself. It's pretty much impossible for me to tell.

(Of course since I have no idea what component is responsible I have no particularly good way to report this bug to Fedora. What am I supposed to report it against?)

(I'm writing this down partly because this may someday happen to my home system, which has roughly the same configuration, and if I hadn't documented my fix and had to reinvent it I would be very angry at myself.)

linux/Fedora20LVMDriveRecovery written at 18:05:00;

How we wound up with an RFC 1918 IP address visible in our public DNS

This is kind of a war story.

The whole saga started with a modern, sophisticated, Internet-enabled projector, one that supports 'network projection' where you use software to feed it things to display instead of connecting to a VGA port or the like. This is quite handy because an increasing number of devices that people want to do presentations from, such as tablets, simply do not have a spare VGA port. This network projection requires special software and, as we found out, this software absolutely does not work if there is NAT'ing in the way between your device and the projector. Unfortunately, in our environment this is a real problem for wireless devices (such as those tablets) because there is no way off our wireless network without going through a NAT gateway of some sort.

(One of many reasons that the NAT gateway is required is that the wireless network uses RFC 1918 IP address space.)

If getting off the wireless network requires NAT and the software can't work with NAT, the conclusion is simple: we have to put the data projector on the wireless network (on what is amusingly called the 'wired wireless'). Wireless devices can talk to it, wired devices can talk to it by plugging into a little switch next to it, and everything is happy. But what about DNS? People would like to connect to the data projector by name, not just by IP address.

Like many places we have a 'split horizon' DNS setup, with internal DNS and public DNS. People using our VPN to authenticate on the wireless network and get access to internal services use the internal DNS servers, which are already full of RFC 1918 IP addresses for machines in our sandboxes. Unfortunately it's also possible to register wireless devices for what we call the 'airport experience', where we give devices external connectivity to the campus but no special access to our internal networks (as we feel that wireless MAC addresses aren't sufficient authentication for internal network access).
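
(As an illustration of what split horizon looks like in practice, one common way to do it on a single nameserver is BIND's views. I'm not claiming this is how our servers are actually set up, and the ACL, zone, and file names here are invented.)

    // named.conf sketch: internal clients see one copy of the zone,
    // everyone else sees the public copy
    acl "internal-nets" { 10.0.0.0/8; 172.16.0.0/12; 192.168.0.0/16; };

    view "internal" {
        match-clients { "internal-nets"; };
        zone "example.edu" { type master; file "example.edu.internal"; };
    };

    view "public" {
        match-clients { any; };
        zone "example.edu" { type master; file "example.edu.public"; };
    };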

Devices using the airport experience can't use our internal DNS servers, partly because many of the IP addresses that the DNS servers would return can't be used outside our internal networks. Instead they get DNS from general campus recursive DNS servers, which of course use our public DNS data. Yet these devices still need to be able to look up the name for the data projector and get the wireless network's RFC 1918 IP address for it so they can talk to it directly with no NATing. The simplest, lowest overhead way to do this was to put the RFC 1918 wireless IP address for the data projector into our public DNS.
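
The end result is just an ordinary A record in the public zone whose data happens to be an RFC 1918 address; something like this, where the name and address are invented for illustration:

    ; excerpt from the public zone file (name and address made up)
    projector.wireless.example.edu.    IN  A    172.30.1.10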

And that is why our public DNS now has a DNS record with an RFC 1918 IP address.

(I confessed to this today on Twitter so I decided that I might as well tell the story here.)

PS: People will probably suggest dnsmasq as a possible solution. It might be one, but we aren't already using it, so at a minimum it would be much more work than adding a DNS entry to our public DNS.

sysadmin/RFC1918IPinPublicDNS written at 01:47:19;

