Ubuntu does system disk mirroring right

November 6, 2011

When we started installing Ubuntu 10.04 systems with our standard mirrored system disk setup, we noticed that it asked us a new and (in my opinion) very stupid question: did we want the system to boot if only one of the two sides of the mirror was there? Of course we said 'yes', since that's part of why we're mirroring the system disk in the first place. Despite its silliness, this question was already an improvement over 8.04, where the installer defaulted to 'no' and you had to edit the GRUB settings by hand to change this.

What we didn't notice at the time was what else the installer was doing with mirrored system disks. To wit, Ubuntu now installs GRUB on the second drive, as well as on the first one.

This is an important thing to do, because it's what makes your system bootable even if you lose (or pull) your entire primary drive. In the past it was a step that we had to remember to do by hand (with an appropriately peculiar incantation), which means that it was sometimes forgotten, and so some of our systems have mirrored system disks but cannot actually reboot if they lose the primary drive. Now all of our Ubuntu 10.04 machines have this handled automatically for us, which is great and also exactly what a system should do if it detects that /boot is mirrored across multiple drives.

Before I started looking, I was going to confidently assert that this was new behavior in Ubuntu 10.04. However, it appears likely that it's also in Ubuntu 8.04 and we just didn't notice; I've checked a few of our 8.04 machines where I'm reasonably certain that we didn't install GRUB on the second disk by hand, and they have GRUB boot blocks.

(Similarly, my just-installed Fedora 15 home machine has a GRUB boot block on the second drive and I'm completely sure I didn't install it by hand, so it looks like Fedora 15 is also smart enough to do this.)
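Checking a drive for a GRUB boot block, as described above, can be roughly sketched as follows. This is a heuristic and not a definitive test: it assumes a traditional MBR-style boot sector, and it relies on GRUB's boot code containing the literal string 'GRUB' (one of its on-screen messages). The function names and the example device names are made up for illustration.

```python
import sys

SECTOR_SIZE = 512
BOOT_SIGNATURE = b"\x55\xaa"  # last two bytes of a valid MBR boot sector
BOOT_CODE_END = 440           # boot code area; disk signature and partition table follow


def looks_like_grub_mbr(sector: bytes) -> bool:
    """Heuristic: valid boot signature plus the string 'GRUB' in the boot code."""
    if len(sector) < SECTOR_SIZE:
        return False
    if sector[510:512] != BOOT_SIGNATURE:
        return False
    return b"GRUB" in sector[:BOOT_CODE_END]


def check_device(path: str) -> bool:
    """Read the first sector of a disk (normally needs root) and test it."""
    with open(path, "rb") as dev:
        return looks_like_grub_mbr(dev.read(SECTOR_SIZE))


if __name__ == "__main__":
    # Device names such as /dev/sda and /dev/sdb are examples only.
    for device in sys.argv[1:]:
        found = check_device(device)
        print(f"{device}: {'GRUB boot block' if found else 'no GRUB boot block found'}")
```

Run against both halves of the mirror, this shows whether the second drive has GRUB boot code at all. It does not prove that the system will actually boot from that drive; only pulling the primary drive and trying it does.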

On a side note, it's surprisingly hard to notice changes like this if you don't consciously check for them when you're working out your install procedures for a new distribution release. Our install procedure has always called for installing GRUB by hand on the second drive, so of course we carried that forward into the 8.04 and 10.04 install instructions. Even when this step was accidentally omitted on specific machine installs, we didn't find out, because we don't normally pull primary drives and do a test boot on the secondary drive. So it took a chain of circumstances for us to notice: first we wound up booting a system from the second drive in a situation where we didn't think we'd set up GRUB on that disk, and then we tested this by installing a test system, deliberately skipping the manual step, and trying to boot the system with just the second drive.

Sidebar: why the Ubuntu installer's question is stupid

The ostensible reason for having an option to not boot if you have a degraded mirror is because this risks data loss if you don't fix it. However, my personal feeling is that almost everyone who is choosing to mirror system disks is doing so in a situation where they would rather have the system continue to operate even with degraded mirrors; people who care that much about data loss are rare, and even then the Ubuntu question is an incomplete solution to the problem.

(Nothing stops the system if your mirror degrades while the system is running, and I think this is the far more likely case.)

I don't object to there being an option for this behavior, but I don't think this is worth a question during installation. If you find that 90% of your audience answers a question one way, stop asking the question and just let the 10% who need a different answer change it by hand afterwards.

(This suggestion is inapplicable for things that can't be changed afterwards, but this is not one of them.)


Comments on this page:

From 24.8.147.194 at 2011-11-06 09:27:53:

IMHO not booting with a degraded mirror is the right default option. Those likely to properly monitor a host for failed drives are the same group of people who are likely to change the default option to boot with a degraded mirror if need be.

Those likely to just take defaults are the same people who will not realize they are running a degraded mirror until they lose the 2nd drive.

From 173.206.85.190 at 2011-11-06 10:23:13:

But how are you going to rebuild the array if you can't boot? You could dig around for a rescue disk and figure out how to rebuild the array from there. However, it is far, far easier to replace the failed drive, boot the system, and let the system take care of the rebuild automatically.

IMHO the best option is to boot the system by default but add a boot-halting warning message stating that the array is degraded and asking the user if they want to continue the boot. Obviously the warning would need to be trivial to disable.

By cks at 2011-11-06 12:54:11:

The problem with not booting when the mirror is degraded is that it is a highly incomplete solution to the problem. If your goal is preventing data loss due to losing the second drive when you haven't noticed that you're running on only one drive and your method is preventing the system from booting with degraded mirrors, then this works only to the extent that you reboot machines shortly after they lose the first drive. If the machine does not reboot soon after you lose the first drive, you can spend long or very long amounts of time running in exactly the situation you want to prevent.

Right now, almost all of our Ubuntu machines have been up 68 days (and the reason they went down was a power shutdown). Clearly 'fail on reboot' as a way of forcing us to deal with failed drives would be completely ineffective here.

Ubuntu's solution may sound good but it is almost completely ineffective in practice unless Ubuntu forces a reboot the moment they detect a failed mirror. As far as I know they do not and I rather expect that if they did, people would hit the roof.

From 24.8.147.194 at 2011-11-06 14:45:21:

There is no need to run for a recovery disk with a system that won't boot due to a degraded array, just add "disablehooks=dmraid" to the kernel options.

I don't view not booting a degraded array as a full solution but a compromise to the more blunt action that some NAS appliances take of simply turning off after x days.

I think it targets the home user, who is more likely to reboot frequently and not monitor log files for notices of failed hardware.

Written on 06 November 2011.


Last modified: Sun Nov 6 01:06:14 2011