2013-02-26
How Linux servers should boot with software RAID
One of the things that I like least about the modern Linux server environment is how badly software RAID interacts with the initramfs in modern distributions and how easy it is to detonate your systems as a result. Ubuntu is especially obnoxious, but it's not the only offender. When this stuff blows up in my face it makes me angry for all sorts of reasons; not just that things have blown up (usually because I failed to do a magic rain dance, sometimes just on their own) but also because the people who built the initramfs systems for this should have known better.
Linux booting with software RAID arrays should work like this:
- the initramfs should not care about or even try to assemble
any RAID array(s) apart from those necessary to start the root
filesystem and maybe any other immediately crucial filesystems
(/usr, perhaps). Yes, working out which RAID array is needed for
the root filesystem might take a little bit of work. Tough; do it
anyways when you build the initramfs (there are sketches of both
steps below).
(If the root filesystem can't be found, it's sensible for the
initramfs to ask you if it should assemble any other RAID arrays it
can find to see if the root filesystem was lurking on one of them.
But this is part of the initramfs recovery environment, not part of
normal booting.)
- provided that the RAID array(s) are intact and functioning,
failure of the RAID arrays to exactly correspond to the expected
configuration should not abort booting. To be blunt, if you add
a third mirror to your root filesystem's RAID array your system
should not then fail to boot because you didn't perform a magic
rain dance.
(Yes, this happened to me. It can happen to you.)
- server oriented installs should default to booting with degraded
but working RAID arrays for the root filesystem (or any other
filesystem). Allow the sysadmin to change that if they want, ask
about it in the installer if you want, but trust me, almost no
sysadmin is RAIDing their filesystems so that the machine can
fail to start if one disk dies. It's generally the exact opposite.
- additional RAID arrays should be assembled only after the system
has started to boot on the real root filesystem. Ideally they
would be assembled only if they are listed in
mdadm.conf or if they seem necessary to mount some /etc/fstab
filesystem. If the system can't assemble RAID arrays that are
necessary for /etc/fstab-listed filesystems, well, this is no
different than if some filesystem in /etc/fstab is unavailable for
another reason; drop the system into a single-user shell or whatever
else you do.
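To make the first point concrete: if the initramfs knows the root
array's UUID, the entire assembly step can be a single command along
these lines. This is only a sketch; the UUID is a made-up stand-in
for one recorded when the initramfs was built:

    # Assemble just the root array, identified by UUID; scan all
    # partitions for members instead of trusting any mdadm.conf.
    # --run starts the array even if it is degraded but functional.
    mdadm --assemble /dev/md0 --config=partitions --run \
          --uuid=0a1b2c3d:4e5f6071:8293a4b5:c6d7e8f9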
I think that it's legitimate for a system to treat failure to assemble
an array listed in mdadm.conf as equivalent to failure to mount a
filesystem. Other people may disagree. Providing a configuration option
for this may be wise.
It may be that there are some system configurations where the initramfs building system absolutely can't work out what RAID array(s) are needed for the root filesystem. In this situation and this situation only the initramfs can try more RAID arrays than usual. But it should not have to in straightforward situations such as filesystem on RAID or filesystem on LVM on RAID; both are common and can be handled without problems.
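For instance, the initramfs build scripts can usually trace the
device stack with stock tools. A sketch, assuming util-linux's
findmnt and lsblk are available and using illustrative device names:

    # what device is the root filesystem actually on?
    rootdev=$(findmnt -n -o SOURCE /)
    # walk its dependencies downwards; any md arrays involved show
    # up with a raid* TYPE, even when LVM sits in between.
    lsblk --inverse --noheadings -o NAME,TYPE "$rootdev"
    # and once the md device is known, its UUID is one query away:
    mdadm --detail --brief /dev/md0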
The following are somewhat less important wishlist items.
- there should be an obvious, accessible setting for 'only touch
  RAID arrays that are listed in mdadm.conf, never touch ones that
  aren't'.
- the system should never attempt to automatically assemble a
  damaged RAID array that is not listed in mdadm.conf, no matter
  what. When the system encounters an unknown RAID array its first
  directive should be 'do no harm'; it should only touch the RAID
  array if it can be sure that it is not damaging anything. An
  incomplete RAID array or one that needs resynchronization does not
  qualify.
- the initramfs should not contain a copy of mdadm.conf. There are
  too many things that can go wrong if there is one, even if it's
  not supposed to be consulted. The only thing that the initramfs
  really needs to boot the system is the UUID(s) of the crucial RAID
  array(s), and it should contain this information directly.
(If some software absolutely has to have a non-empty mdadm.conf
to work, the initramfs mdadm.conf should be specially created
with the absolute minimum necessary information in it. Copying the
normal system mdadm.conf is both lazy and dangerous.)
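Such a minimal file is tiny; the whole thing can be something like
this (with a made-up UUID):

    # minimal initramfs mdadm.conf: scan all partitions, know about
    # exactly one array, the one under the root filesystem.
    DEVICE partitions
    ARRAY /dev/md0 UUID=0a1b2c3d:4e5f6071:8293a4b5:c6d7e8f9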
Sidebar: where initramfses fail today
There are two broad failures and thus two failure modes.
One failure is that your initramfs likely silently includes a copy
of mdadm.conf and generally uses the information there to assemble
RAID arrays. If this initramfs copy of mdadm.conf no longer matches
reality, bad things often happen. Often this mismatch doesn't have to
be particularly major; in some systems, it can even be trivial. This
is especially dangerous because major distributions don't put a big
bold-type warning in mdadm.conf saying 'if you change this file,
immediately rebuild your initramfs by doing ...'. (Not that this is good
enough.)
The other failure is Ubuntu's failure,
where the initramfs tries to assemble all RAID arrays that it can find
on devices, whether or not they are listed in its mdadm.conf, and then
if any of them fail to assemble properly the initramfs throws up its
hands and stops. This is terrible in all sorts of ways that should have
been obvious to the clever people who put all the pieces of this scheme
together.
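(In fairness, Ubuntu does expose a knob for the degraded-boot part
of this, although you have to know about it in advance. On 12.04-era
systems my understanding is that it looks like the following; check
your own release before trusting it:

    # /etc/initramfs-tools/conf.d/mdadm
    BOOT_DEGRADED=true

    # the setting gets baked into the initramfs, so rebuild it:
    update-initramfs -u

)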
(Rebuilding the initramfs is not good enough partly because it isn't
rebuilding just one initramfs; you need to rebuild the initramfs for
all of the theoretically functioning kernels you have sitting around in
/boot, assuming that you actually want to be able to use them.)
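On Debian and Ubuntu at least, rebuilding them all is a single
command; other distributions have their own equivalents:

    # rebuild the initramfs for every installed kernel, not just
    # the currently running one (Debian/Ubuntu specific).
    update-initramfs -u -k all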
2013-02-12
Some notes on Linux's ionice
I was all set to write an entry praising ionice as a perhaps
overlooked but rather handy little command, but then I decided to
actually measure things on Ubuntu 12.04 to make sure that I wasn't
fooling myself. Now you (and I) get a different set of notes.
In theory, ionice allows you
to prioritize a command's IO the way that nice(1) theoretically
prioritizes its CPU usage. This would be a handy way to allow, say, a
big but relatively important compile to grind away in the background
without getting in the way of your interactive use of the machine.
(Why yes, I do recompile Firefox from source every so often.)
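The intended usage is pleasantly simple. For example (class 2 is
'best effort' with priorities 0 through 7, class 3 is 'idle'; the
make invocation is just a stand-in):

    # run a build at the lowest best-effort IO priority
    ionice -c2 -n7 make -j4
    # or at idle IO priority, with CPU niceness thrown in too
    ionice -c3 nice -n19 make -j4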
In practice there are two flies in this ointment. The first
is that ionice only works with the CFQ disk scheduler. CFQ is the default for scheduling
actual physical disks, but small things like software RAID and
LVM do not have disk schedulers at all and as far as I can tell
ionice is completely ineffectual on them (for both read and write
IO). Unfortunately this renders ionice pointless on my workstation
(which is all LVM over software RAID).
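You can see who actually has an IO scheduler by poking around in
sysfs; on my machines it looks roughly like this (device names are
illustrative):

    $ cat /sys/block/sda/queue/scheduler
    noop deadline [cfq]
    $ cat /sys/block/md0/queue/scheduler
    none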
The next problem is that even when running directly on a disk, ionice
does nothing to de-prioritize asynchronous write IO. This is, well,
most of the write IO that most programs will do. Ionice may slow down
synchronous writes (I don't have a test program) and it definitely works
for reads, but that's it. This might still make a compile not eat your
machine (if you're not using LVM or software RAID) since it needs to
read things as well as write them, but it now really depends.
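For the read side, a crude test is enough to see the effect. The
device and offsets here are stand-ins, and you want a disk you can
afford to hammer:

    # saturate the disk with an idle-priority streaming read ...
    ionice -c3 dd if=/dev/sdX of=/dev/null bs=1M count=4096 &
    # ... and time a normal-priority read from a different region.
    # With CFQ doing its job the foreground read barely notices.
    time dd if=/dev/sdX of=/dev/null bs=1M count=1024 skip=8192
    wait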
What actually doing these tests has shown me is that any improvement
I thought I was getting from ionice on my workstation was me fooling
myself; because I thought ionice would make things nicer, I thought
it was (this is a well known effect, of course). In practice ionice is
not likely to be of much use to me until it works through (or over) LVM
and software RAID. Working on write IO too would be even better.
(My understanding is that write IO is a hard problem in the kernel, but that's another entry.)
Oh well. Not every nifty looking thing actually works out in practice.
2013-02-08
Linux's great advantage as a Unix is its willingness to make wrenching changes
Unix needs to grow but the problem with its growth is that the first attempts at anything in particular are almost certainly going to be not the right answer and not particularly Unixy (if you're unlucky you get crawling horrors created by people who don't actually care about Unix). In order to grow towards a truly Unix solution to any particular problem, you need to be willing to repeatedly make changes. Often changes will not be an incremental evolution of the previous change but instead someone looking at it, going 'this is not the right approach', and taking an entirely different and better approach.
If you allow yourself no changes, your Unix stagnates or locks itself up in a server ghetto or both. If you allow yourself one significant change but then either stop or only allow small incremental changes to the first change, you are almost certainly not going to wind up with a truly Unixy solution. The only way to really iterate towards a Unixy solution to problems is to be willing to repeatedly make wrenching changes, to throw away all of your previous work on a problem because it turns out to be the wrong approach and try again (with no more guarantee that you've really got the right Unixy answer than before, of course).
This is Linux's great advantage as a Unix: it is willing to repeatedly make wrenching changes, which means it has a chance of iterating to Unixy answers. Linux is willing to try things in the hopes that they will be a step towards the right answer, and then if they prove not to be (as they probably won't be), it'll throw them out and start over again when someone has something better. It's willing to do this even when the changes involve significant disruption.
There are downsides to this, of course. Repeated changes create churn and many of the changes are not entirely good ideas (and some are really bad, or just bad hacks). But it beats stagnation, which is the other actual alternative.
Unfortunately I don't think any other current Unix has anything like this real willingness to make changes (which implies being willing to make mistakes; if you insist that things be nearly perfect first, you're not really willing to make changes). In fact my perception is that this is one of the points of polarization in the modern Unix world.
(This is an elaboration on one of my tweets.)