2013-02-26
How Linux servers should boot with software RAID
One of the things that I like least about the modern Linux server environment is how badly software RAID interacts with the initramfs in modern distributions and how easy it is to detonate your systems as a result. Ubuntu is especially obnoxious, but it's not the only offender. When this stuff blows up in my face it makes me angry for all sorts of reasons; not just that things have blown up (usually because I failed to do a magic rain dance, sometimes just on their own) but also because the people who built the initramfs systems for this should have known better.
Linux booting with software RAID arrays should work like this:
- the initramfs should not care about or even try to assemble
any RAID array(s) apart from those necessary to start the root
filesystem and maybe any other immediately crucial filesystems
(/usr, perhaps). Yes, working out which RAID array is needed for
the root filesystem might take a little bit of work. Tough; do it
anyways when you build the initramfs (see the sketch after this
list).
(If the root filesystem can't be found, it's sensible for the
initramfs to ask you if it should assemble any other RAID arrays
it can find to see if the root filesystem is lurking on one of
them. But this is part of the initramfs recovery environment, not
part of normal booting.)
- provided that the RAID array(s) are intact and functioning,
failure of the RAID arrays to exactly correspond to the expected
configuration should not abort booting. To be blunt, if you add
a third mirror to your root filesystem's RAID array your system
should not then fail to boot because you didn't perform a magic
rain dance.
(Yes, this happened to me. It can happen to you.)
- server oriented installs should default to booting with degraded
but working RAID arrays for the root filesystem (or any other
filesystem). Allow the sysadmin to change that if they want, ask
about it in the installer if you want, but trust me, almost no
sysadmin is RAIDing their filesystems so that the machine can
fail to start if one disk dies. It's generally the exact opposite.
- additional RAID arrays should be assembled only after the system
has started to boot on the real root filesystem. Ideally they
would be assembled only if they are listed in mdadm.conf or if
they seem necessary to mount some /etc/fstab filesystem. If the
system can't assemble RAID arrays that are necessary for
/etc/fstab-listed filesystems, well, this is no different than if
some filesystem in /etc/fstab is unavailable for another reason;
drop the system into a single-user shell or whatever else you do.
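To make 'work out which RAID array is needed for the root filesystem' concrete, here is a minimal sketch (in Python; not any distribution's actual tooling, and the function names are made up for illustration) of how an initramfs builder could do this for the common 'filesystem on RAID' and 'filesystem on LVM on RAID' layouts: start from the device that / lives on and walk down the slaves/ chains in /sys until you reach md devices.

    import os

    def block_name(dev):
        # Map a st_dev device number (from os.stat) to its kernel
        # block device name, eg 'dm-0' or 'md0'.
        link = '/sys/dev/block/%d:%d' % (os.major(dev), os.minor(dev))
        return os.path.basename(os.path.realpath(link))

    def md_devices(name, seen=None):
        # Walk down through stacked devices (LVM/device-mapper and
        # so on) via /sys/block/<dev>/slaves until we hit md arrays.
        if seen is None:
            seen = set()
        if name in seen:
            return set()
        seen.add(name)
        if name.startswith('md'):
            return set([name])
        found = set()
        slavedir = '/sys/block/%s/slaves' % name
        if os.path.isdir(slavedir):
            for slave in os.listdir(slavedir):
                found |= md_devices(slave, seen)
        return found

    if __name__ == '__main__':
        rootdev = block_name(os.stat('/').st_dev)
        for md in sorted(md_devices(rootdev)):
            print(md)

On a root-on-LVM-on-RAID system this walks the dm-N device down to the underlying md array(s); on plain root-on-RAID it reports the md device directly. A real builder would also want to handle partitions of md arrays and would then turn the device names into UUIDs, for example with 'mdadm --detail --brief'.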
I think that it's legitimate for a system to treat failure to assemble
an array listed in mdadm.conf
as equivalent to failure to mount a
filesystem. Other people may disagree. Providing a configuration option
for this may be wise.
It may be that there are some system configurations where the initramfs building system absolutely can't work out what RAID array(s) are needed for the root filesystem. In this situation and this situation only the initramfs can try more RAID arrays than usual. But it should not have to in straightforward situations such as filesystem on RAID or filesystem on LVM on RAID; both are common and can be handled without problems.
The following are somewhat less important wishlist items.
- there should be an obvious, accessible setting for 'only touch
RAID arrays that are listed in mdadm.conf, never touch ones that
aren't'.
- the system should never attempt to automatically assemble a
damaged RAID array that is not listed in mdadm.conf, no matter
what. When the system encounters an unknown RAID array its first
directive should be 'do no harm'; it should only touch the RAID
array if it can be sure that it is not damaging anything. An
incomplete RAID array or one that needs resynchronization does not
qualify.
- the initramfs should not contain a copy of mdadm.conf. There are
too many things that can go wrong if there is one, even if it's
not supposed to be consulted. The only thing that the initramfs
really needs to boot the system is the UUID(s) of the crucial RAID
array(s), and it should contain this information directly.
(If some software absolutely has to have a non-empty mdadm.conf
to work, the initramfs mdadm.conf
should be specially created
with the absolute minimum necessary information in it. Copying the
normal system mdadm.conf
is both lazy and dangerous.)
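As a sketch of what 'the absolute minimum necessary information' could look like, here is one way such a file might be generated, assuming you already know which array(s) the root filesystem needs (say, from the earlier sketch); 'mdadm --detail --brief' prints an ARRAY line with the array's UUID, which is all the initramfs should need.

    import subprocess

    def write_minimal_mdadm_conf(md_names, path):
        # Write an initramfs-only mdadm.conf that lists nothing but
        # the arrays the root filesystem actually needs.
        with open(path, 'w') as conf:
            conf.write('# Generated for the initramfs; not a copy '
                       'of the system mdadm.conf.\n')
            conf.write('DEVICE partitions\n')
            for name in md_names:
                # eg 'ARRAY /dev/md0 metadata=1.2 UUID=...'
                line = subprocess.check_output(
                    ['mdadm', '--detail', '--brief', '/dev/%s' % name])
                conf.write(line.decode())

    # eg: write_minimal_mdadm_conf(['md0'], '/tmp/initramfs-mdadm.conf')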
Sidebar: where initramfses fail today
There are two broad failures and thus two failure modes.
One failure is that your initramfs likely silently includes a copy
of mdadm.conf
and generally uses the information there to assemble
RAID arrays. If this initramfs copy of mdadm.conf
no longer matches
reality, bad things often happen. Often this mismatch doesn't have to
be particularly major; in some systems, it can even be trivial. This
is especially dangerous because major distributions don't put a big
bold-type warning in mdadm.conf
saying 'if you change this file,
immediately rebuild your initramfs by doing ...'. (Not that this is good
enough.)
The other failure is Ubuntu's failure,
where the initramfs tries to assemble all RAID arrays that it can find
on devices, whether or not they are listed in its mdadm.conf
, and then
if any of them fail to assemble properly the initramfs throws up its
hands and stops. This is terrible in all sorts of ways that should have
been obvious to the clever people who put all the pieces of this scheme
together.
(Rebuilding the initramfs is not good enough partly because it isn't
rebuilding just one initramfs; you need to rebuild the initramfs for
all of the theoretically functioning kernels you have sitting around in
/boot
, assuming that you actually want to be able to use them.)
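For what it's worth, the 'all of the kernels' part is easy to script; here is a small sketch, assuming a Debian or Ubuntu style system where update-initramfs is what rebuilds initramfses (other distributions have their own tools, eg dracut):

    import glob, os, subprocess

    # Rebuild the initramfs for every installed kernel in /boot,
    # not just the currently running one.
    for kernel in sorted(glob.glob('/boot/vmlinuz-*')):
        version = os.path.basename(kernel)[len('vmlinuz-'):]
        subprocess.check_call(['update-initramfs', '-u', '-k', version])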
Thinking about how much Solaris 11 is worth to us
As a result of some feedback I've gotten on earlier entries I've wound up thinking about what I'll summarize as how much Solaris 11 is worth to us, ie what we might pay for it. To start with, is it worth anything at all?
My answer is 'yes, under the right circumstances' (one of those circumstances being that we get source code). Despite what I've said in the past about Illumos and FreeBSD, Solaris 11 is still in many ways the least risky option for us. It's not perfect but to put it one way it's the devil we know. I still have uncertainties about Oracle's actual commitment to it but then I have the same issues with Illumos.
So, how much would we pay for Solaris 11? Unfortunately I think the answer to that is 'not very much'. It's not zero (we've paid for Solaris before) but our actual budget is not very big and the direct benefits to using Solaris 11 are only moderate. My guess is that $100 a server a year would be acceptable (call it $1000 a year total), $200/server/year would be at best marginal, and more than that is really unlikely. It'd be very hard to argue that using Solaris 11 over a carefully validated FreeBSD configuration would be worth $2k/year.
(To put it one way, the larger the amount of money involved the more it looks like we (the sysadmins) are trying to just spend money instead of taking the time to do our job to carefully build a working environment. It would be one thing if the alternatives were clearly incapable and Solaris 11 was the only choice, but they're not and it isn't. Given the university attitude on staff time, we can't even argue that the savings in staff time are worth the expense.)
PS: the question of whether Oracle would give us either Solaris 11 source code or prices anywhere near this low is an entirely different matter. My personal expectation is that either issue would be met with the polite version of hysterical laughter, given that comparatively speaking we're an insignificant flyspeck.
Looking at whether (some) IP addresses persist in zen.spamhaus.org
After writing my entry on the shifting SBL I started to wonder how many IP addresses we reject for being SBL listed stop being SBL listed after a (moderate) while. I can't answer that directly, because we actually use the combined Zen Spamhaus list and we don't log the specific return codes, but I can answer a related question: how many Zen-listed IP addresses seem to stay in the Zen lists?
To check this, I pulled 10 days of records from January 18th through January 27th, extracted all of the distinct IPs that we found listed in zen.spamhaus.org, and re-queried Zen now to see how many of them are still there. Over that ten-day period we had 613 Zen-listed IP addresses; today, 534 of them are still listed in Zen. So a fairly decent number stay present for 30 days or more.
(Technically some of them could have disappeared and then reappeared.)
I also pulled specific return codes for all of those IP addresses, so I can now give you a breakdown of why those 534 addresses are still present:
- 420 of them are in Spamhaus-maintained PBL data. There's no single
really big source, but 46 of them are from Beltelecom in Belarus
(AS6697)
and 23 are from Chinanet (AS4134).
- 70 of them are in the XBL, specifically in the CBL.
- 56 are in the SBL. There's no really big source, but five IPs are
from 177.47.102.0/24 aka SBL136747, four are from
5.135.106.0/27 aka SBL173923, and two are
from 212.174.85.0/24 aka SBL107558.
(Two of those SBL listings are depressingly old, not that I am really surprised by long-term SBL listings by this point.)
- 47 of them are in ISP-maintained PBL data.
- 9 of them are in the SBL CSS, which is pretty impressive and depressing because SBL CSS listings expire fairly fast.
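For the curious, this breakdown comes from looking at the individual A records that zen.spamhaus.org returns for a listed IP. Here is a small sketch of that sort of lookup; it is not my actual script, and the code-to-sublist mapping is my reading of the Spamhaus documentation, so treat it as an assumption.

    import socket

    # Assumed mapping from Zen return codes to the sublists above;
    # anything unrecognized is passed through untranslated.
    ZEN_CODES = {
        '127.0.0.2':  'SBL',
        '127.0.0.3':  'SBL CSS',
        '127.0.0.4':  'XBL (CBL)',
        '127.0.0.10': 'PBL (ISP-maintained)',
        '127.0.0.11': 'PBL (Spamhaus-maintained)',
    }

    def zen_lookup(ip):
        # Reverse the octets and look up <reversed>.zen.spamhaus.org;
        # an NXDOMAIN answer means the IP is not listed at all.
        name = '.'.join(reversed(ip.split('.'))) + '.zen.spamhaus.org'
        try:
            addrs = socket.gethostbyname_ex(name)[2]
        except socket.gaierror:
            return []
        return [ZEN_CODES.get(a, a) for a in addrs]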
An equally interesting question is how many of those 79 now-unlisted IPs are listed in some other DNS blocklist. The answer turns out to be a fair number; 60 of them are still listed on at least one of the DNSBls in my program that checks IPs against a big collection of them. Many but not all of the hits are for b.barracudacentral.org (which is not a DNSBl that I consider to be really high quality; it seems to be more of a hair-trigger lister).
(I'm out of touch with what's considered a high-quality DNSBl versus lower-quality ones so I'm not going to offer further reporting or opinions.)