2013-11-29
How modern Linux software RAID arrays are assembled on boot (and otherwise)
Here is a question that is periodically relevant: just how does a modern Linux system assemble and bring up your software RAID arrays (and other software-defined things, for that matter)?
I've written about the history of this before so I'll summarize: in the very old days the
kernel did it all for you and in the not as old days it was done by a
script in your initial ramdisk that ran mdadm, often using an embedded
copy of your regular mdadm.conf.
The genesis of modern software RAID activation was udev and general
support for dynamically appearing devices, including 'hotplug' disk
devices (which was and is a good thing, to be clear). When disks
can appear over time, simply running mdadm once at some arbitrary
point is clearly not good enough. Instead the whole RAID assembly
system was changed so that every time a disk appears, udev arranges
to run mdadm in a special 'incremental' mode. As the manpage
describes it:
Add a single device to an appropriate array. If the addition of the device makes the array runnable, the array will be started. This provides a convenient interface to a hot-plug system.
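In concrete terms this is driven by a udev rule. What follows is only a simplified sketch of the kind of rule involved; the real rule file (often named something like 64-md-raid-assembly.rules) is longer and its details vary between distributions and mdadm versions.
    # Sketch of the sort of udev rule that triggers incremental RAID assembly.
    # Real rule files have more conditions and are distribution dependent.
    SUBSYSTEM=="block", ACTION=="add", ENV{ID_FS_TYPE}=="linux_raid_member", RUN+="/sbin/mdadm --incremental $devnode"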
A modern Linux system embeds a copy of udev (and the important
udev rules and various supporting bits) in the initramfs and
starts it early in the initramfs boot process. The kernel feeds this
copy of udev events corresponding to all of the hardware that has
been recognized so far and then udev starts kicking off more or
less all of its usual processing, including handling newly appeared
disk devices and thus incrementally assembling your software RAID
arrays. Hopefully this process fully completes before you need the
RAID arrays.
(I'm not sure when and how this incremental assembly process decides
that a RAID array is ready to be started, given that ideally you'd
want all of an array's devices to be present instead of just the
minimum number. Note that the intelligence for this is in mdadm,
not udev.)
The same general process is used to assemble and activate things like
LVM physical volumes and volume groups; as devices appear, udev runs
appropriate LVM commands to incrementally update the collection of known
physical volumes and so on and activate any that have become ready for
it. This implies that one physical disk finally appearing can cause a
cascade of subsequent events as the physical disk causes a software RAID
device to be assembled, the new RAID device is reported back to udev
and recognized as an LVM physical volume, and so on.
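As an illustration of the LVM side, on distributions that use lvmetad the udev rules wind up running 'pvscan' with autoactivation for each newly seen physical volume. Again this is only a sketch; the real rule file (often 69-dm-lvm-metad.rules) is considerably more involved and varies between distributions and LVM versions.
    # Sketch of an lvmetad-era LVM autoactivation rule; the real rule file
    # is longer and distribution dependent.
    SUBSYSTEM=="block", ACTION=="add|change", ENV{ID_FS_TYPE}=="LVM2_member", RUN+="/sbin/lvm pvscan --cache --activate ay --major $major --minor $minor"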
Where exactly the udev rules for all of this live varies from
distribution to distribution, so really you need to grep through
/lib/udev/rules.d (or /usr/lib/udev/rules.d) to find and read
everything that mentions mdadm. Then you can read your mdadm
and mdadm.conf manpages to see what sort of control (if any) you
can exert over this process.
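Something like this will usually turn up the relevant rule files:
    # Find udev rule files that mention mdadm; which directory exists varies.
    grep -rl mdadm /lib/udev/rules.d /usr/lib/udev/rules.d 2>/dev/null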
The drawback of this process is that there is no longer a clear chain of scripts or the like that you can read to follow (or predict) the various actions that get taken. Instead everything is event driven and thus much harder to trace (and much less obvious, and much more split up across many different files, and so on). A modern Linux system booting is a quite complicated asynchronous environment that is built from many separate little pieces. Generally it's not well documented.
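About the best you can do to trace things is to watch the events and udev's reactions as they happen, for example with udevadm (here /sys/block/sdb is just a stand-in for whatever device you're interested in):
    # Watch kernel and udev events plus their properties as they happen:
    udevadm monitor --kernel --udev --property
    # Dry-run udev's rule processing for one device to see what would fire:
    udevadm test /sys/block/sdb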
One corollary of all of this is that it is remarkably hard to have
a disk device appear and then be left alone. The moment the kernel
sends the 'new device' event to udev (either during boot or when
the system is running), udev will start kicking off all of its
usual processing and so on. udevadm can be used to turn off event
processing in general but that's a rather blunt hammer (and may
have bad consequences if other important events happen during this).
For that matter you probably don't want to totally turn off processing
of the disk device's events given that udev is also responsible for
creating the /dev entries for newly appearing disks.
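(If you do need to temporarily hold udev off, the blunt hammer looks something like this; note that it pauses processing for all events, not just the one disk you care about.)
    # Pause udev's execution of queued events, then resume it later.
    # This affects *all* events, not just the disk you care about.
    udevadm control --stop-exec-queue
    # ... do whatever you need to with the disk ...
    udevadm control --start-exec-queue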
2013-11-18
Why booting Linux from a ZFS root filesystem with GRUB can be hard
When I talked about the current weak areas of ZFS on Linux, one of them was using ZFS as your root filesystem (generally with GRUB). Today I want to talk a bit more about why this is a non-trivial thing and, by extension, why booting from any new filesystem type takes more work than you might think.
But first, let's be technical here because there can be two filesystems
involved: your actual root filesystem and the boot filesystem. The boot
filesystem is the filesystem with your kernel and initramfs; the root
filesystem is /. These days it's common to have both in the same
filesystem and for now I'll assume that's the setup we're dealing with.
In order to boot from ZFS we need several things to work. First, the
GRUB bootloader code itself needs to understand enough about ZFS to be
able to read things from a ZFS filesystem, most notably the GRUB menu
file, the kernel, and the initramfs. Code for doing this is available
(Solaris and Illumos have booted from ZFS using GRUB for some time) and
based on the current Fedora source RPM for GRUB 2, it appears to be
integrated into the main GRUB source. I believe that this also means
that GRUB 2 will normally autodetect that /boot is on a ZFS filesystem
and build support for this into the GRUB boot image that's written to
your disk by grub-install (aka grub2-install on some machines).
(However the GRUB ZFS code may not support the latest ZFS pool and filesystem features and I'm not certain what sort of pool vdevs it supports. Also, I hope it goes without saying that your ZFS root pool needs to be entirely on devices that the BIOS will see at boot time, because those are the devices that GRUB 2 can talk to.)
But that's only the start, because GRUB is out of the picture after the kernel and initramfs have been loaded and started. Once the kernel is running the initramfs needs to include ZFS and know how to find the ZFS root pool and root filesystem. This needs both kernel ZFS modules and code in the initramfs scripts (and a way of specifying what the root pool and root filesystem are). Very few distributions will natively include this support because very few are packaging or officially supporting ZFS on Linux today.
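(As an illustration, with the ZFS on Linux initramfs scripts the root pool and filesystem typically get specified on the kernel command line in a form something like the following. The exact parameter syntax depends on whose initramfs support you're using, and the pool and filesystem name here is a made-up example.)
    # One common form of kernel command line parameter for a ZFS root;
    # 'rpool/ROOT/linux' is a made-up example name and the exact syntax
    # depends on your initramfs scripts.
    root=ZFS=rpool/ROOT/linux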
If you have both of the above you can boot Linux with a ZFS root filesystem. The final step is to be able to build or rebuild your initramfs with the ZFS support (still) there. This needs whatever code in your distribution builds the initramfs and bootloader configuration to be ZFS aware; it needs to recognize that your root is a ZFS filesystem, work out the right ZFS pool and filesystem name, and embed all of this (plus general ZFS support) into the initramfs, the GRUB menu, and so on. Again, many distributions won't have this natively and will need replacement packages or patching or the like.
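(One way to check that the ZFS support actually made it into a rebuilt initramfs is to list its contents, for example:)
    # Dracut-based distributions (Fedora, RHEL, and so on):
    lsinitrd /boot/initramfs-$(uname -r).img | grep -i zfs
    # Debian and Ubuntu style initramfs-tools:
    lsinitramfs /boot/initrd.img-$(uname -r) | grep -i zfs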
(All of this ignores the distribution's installer. If you want to use ZFS from the start, you need an installer that supports ZFS and includes the ZFS modules and tools and so on. My impression is that this is rare.)
What changes if you separate out the boot filesystem from the root filesystem and make the boot filesystem not a ZFS filesystem is that your version of GRUB doesn't need to understand ZFS any more. Since ZFS support seems to be common in any recent version of GRUB, this may not get you very much.
(In a sense this is good news, since it's much easier to fiddle around with the contents of your initramfs than it is to add support for another filesystem to GRUB. Once you have ZFS support for your kernel all you need to do is work out how to get it into the initramfs too.)
PS: You can check if your version of GRUB (well, GRUB 2) has ZFS support
by checking to see if you have a zfs.mod GRUB module hanging around
somewhere in /boot/grub2 or /boot/grub or wherever your distribution
puts it.
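For example:
    # Look for GRUB 2's ZFS module in the usual places; your distribution
    # may put its GRUB modules somewhere else.
    find /boot/grub /boot/grub2 -name 'zfs*.mod' 2>/dev/null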
2013-11-17
The 10G Ethernet performance problem on Linux
It's clear that 10G Ethernet on Linux is not yet in the state that 1G Ethernet is, where you can simply assume that you'll get wire speed unless your hardware is terrible (but sometimes your hardware is terrible, or at least not great). Instead you need to tune things for best performance, and do so beyond the basics of MTU 9000 and large application buffers. There are any number of resources on the web that will tell you about this; for example, I've recently been reading this one [PDF].
(There's also this one from 2008 [PDF slides] that I've seen referred to in a number of places.)
The problem here is simple: that paper is from 2009. Things have changed since 2009; in fact, I've seen things change between kernel 3.11.6 and kernel 3.12 (and they changed significantly between Ubuntu 12.04's 3.2.0 kernel and 3.11.6). Much of the other 10G tuning advice on the web I've found is like this, either clearly old or undated but probably old. Since they're old, some but not all of their performance tuning advice is likely out of date and either not necessary, not applicable any more, or actively counterproductive. Given the changes I've seen just between 3.11.6 and 3.12, this is probably going to continue to be the case for a while more; even carefully researched tuning advice written today may not apply in a year.
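(To give a flavour of what this tuning advice looks like, it's usually sysctl settings for socket buffer sizes, TCP memory limits, and so on, along the lines of the following. The specific values here are purely illustrative and, per the rest of this entry, may be unnecessary or even counterproductive on a current kernel.)
    # Illustrative socket buffer / TCP memory sysctls of the kind that 10G
    # tuning guides suggest. These particular values are made up for
    # illustration and may be unnecessary or wrong on current kernels.
    net.core.rmem_max = 16777216
    net.core.wmem_max = 16777216
    net.ipv4.tcp_rmem = 4096 87380 16777216
    net.ipv4.tcp_wmem = 4096 65536 16777216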
(At least not to current kernels. If you research tuning advice for, say, a RHEL/CentOS 6 kernel it's likely to stay useful for years because RHEL kernels don't change much.)
This is the 10G Ethernet performance problem on Linux as I see it. Today and for the likely future, getting good performance out of 10G Ethernet on Linux is going to take you real work. It's not enough to read some resources and follow their advice because parts of the advice may be out of date; you're going to have to experiment, ideally under real life scenarios not just artificial bandwidth or latency tests.
(Artificial tests can at best verify that under ideal circumstances you can hit wire bandwidth or wire latency. But the tuning you need for them may be different than the tuning you need for your live production load.)
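(For what it's worth, the artificial tests in question are typically things like iperf for bandwidth and netperf's request/response mode for latency; 'server.example.com' below is just a stand-in for your test machine.)
    # Artificial bandwidth test: run 'iperf -s' on the server, then on the client:
    iperf -c server.example.com -t 30
    # Artificial latency test using netperf's TCP request/response mode:
    netperf -H server.example.com -t TCP_RR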