2014-04-16
Where I feel that btrfs went wrong
I recently finished reading this LWN series on btrfs, which was the most in-depth exposure to the details of using btrfs that I've had so far. While I'm sure that LWN intended the series to make people enthused about btrfs, I came away with a rather different reaction; I've wound up feeling that btrfs has made a significant misstep along the way that's resulted in a number of design mistakes. To explain why I feel this way I need to contrast it with ZFS.
Btrfs and ZFS are each both volume managers and filesystems merged together. One of the fundamental interface differences between them is that ZFS has decided that it is a volume manager first and a filesystem second, while btrfs has decided that it is a filesystem first and a volume manager second. This is what I see as btrfs's core mistake.
(Overall I've been left with the strong impression that btrfs basically considers volume management to be icky and tries to have as little to do with it as possible. If correct, this is a terrible mistake.)
Since it's a volume manager first, ZFS places volume management front
and center in operation. Before you do anything ZFS-related, you need
to create a ZFS volume (which ZFS calls a pool); only once this is done
do you really start dealing with ZFS filesystems. ZFS even puts the two
jobs in two different commands (zpool for pool management, zfs
for filesystem management). Because it's firmly made this split, ZFS is
free to have filesystem level things such as df present a logical,
filesystem based view of things like free space and device usage. If
you want the actual physical details you go to the volume management
commands.
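To make that split concrete, here's a rough sketch of the two-command workflow (the pool name and device names are invented for illustration):

# volume management: create a pool from physical devices
zpool create tank mirror /dev/sdc /dev/sdd
# filesystem management: create and tune filesystems within the pool
zfs create tank/home
zfs set quota=100G tank/home
# df now reports the filesystem's logical view, not the raw devices
df -h /tank/home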
Because btrfs puts the filesystem first it wedges volume creation in
as a side effect of filesystem creation, not a separate activity,
and then it carries a series of lies and uselessly physical details
through to filesystem level operations like df. Consider the
discussion of what df shows for a RAID1 btrfs filesystem here, which has both a lie (that the
filesystem uses only a single physical device) and a needlessly physical
view (of the physical block usage and space free on a RAID 1 mirror
pair). That btrfs refuses to expose itself as a first class volume
manager and pretends that you're dealing with real devices forces
it into utterly awkward things like mounting a multi-device btrfs
filesystem with 'mount /dev/adevice /mnt'.
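For contrast, here's roughly what the btrfs side looks like (device names invented; the exact df output will vary with your btrfs version):

# volume creation is a side effect of filesystem creation
mkfs.btrfs -d raid1 -m raid1 /dev/sdc /dev/sdd
# mounting a multi-device filesystem still goes through one 'real' device
mount /dev/sdc /mnt
# df reports a single device and physical-ish space numbers for the mirror
df -h /mnt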
I think that this also leads to the asinine design decision that subvolumes have magic flat numeric IDs instead of useful names. Something that's willing to admit it's a volume manager, such as LVM or ZFS, has a name for the volume and can then hang sub-names off that name in a sensible way, even if where those sub-objects appear in the filesystem hierarchy (and under what names) gets shuffled around. But btrfs has no name for the volume to start with and there you go (the filesystem-volume has a mount point, but that's a different thing).
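As a rough illustration of the naming difference (the names are made up and the subvolume ID 257 is invented):

# ZFS: the pool has a name and sub-objects hang off it
zfs snapshot tank/home@before-upgrade
zfs list -t snapshot
# btrfs: the volume itself is just a device; subvolumes are tracked by numeric ID
btrfs subvolume snapshot /mnt /mnt/snap-before-upgrade
btrfs subvolume list /mnt
mount -o subvolid=257 /dev/sdc /mnt2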
All of this really matters for how easily you can manage and keep track
of things. df on ZFS filesystems does not lie to me; it tells me where
the filesystem comes from (what pool and what object path within the
pool), how much logical space the filesystem is using (more or less),
and roughly how much more I can write to it. Since they have full names,
ZFS objects such as snapshots can be more or less self documenting if
you name them well. With an object hierarchy, ZFS has a natural way to
inherit various things from parent object to sub-objects. And so on.
Btrfs's 'I am not a volume manager' approach also leads it to drastically limit the physical shape of a btrfs RAID array in a way that is actually painfully limiting. In ZFS, a pool stripes its data over a number of vdevs and each vdev can be any RAID type with any number of devices. Because ZFS allows multi-way mirrors this creates a straightforward way to create a three-way or four-way RAID 10 array; you just make all of the vdevs be three or four way mirrors. You can also change the mirror count on the fly, which is handy for all sorts of operations. In btrfs, the shape 'raid10' is a top level property of the overall btrfs 'filesystem' and, well, that's all you get. There is no easy place to put in multi-way mirroring; because of btrfs's model of not being a volume manager it would require changes in any number of places.
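As a sketch of the ZFS side of this (device names invented for illustration):

# a 'RAID 10' pool striped over two three-way mirror vdevs
zpool create tank mirror /dev/sda /dev/sdb /dev/sdc mirror /dev/sdd /dev/sde /dev/sdf
# later, grow one mirror to four-way on the fly by attaching another device
zpool attach tank /dev/sda /dev/sdg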
(And while I'm here, that btrfs requires you to specify both your data and your metadata RAID levels is crazy and gives people a great way to accidentally blow their own foot off.)
As a side note, I believe that btrfs's lack of allocation guarantees in a raid10 setup makes it impossible to create a btrfs filesystem split evenly across two controllers that is guaranteed to survive the loss of one entire controller. In ZFS this is trivial because of the explicit structure of vdevs in the pool.
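For example (again with invented device names, and assuming sda/sdb sit on one controller and sdc/sdd on the other), each mirror vdev pairs one disk from each controller, so losing either controller leaves every mirror with one surviving side:

zpool create tank mirror /dev/sda /dev/sdc mirror /dev/sdb /dev/sdd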
PS: ZFS is too permissive in how you can assemble vdevs, because there is almost no point to a pool with, say, a mirror vdev plus a RAID-6 vdev. That configuration is all but guaranteed to be a mistake in some way.
2014-04-11
What sort of kernel command line arguments Fedora 20's dracut seems to want
Recently I upgraded the kernel on my Fedora 20 office workstation, rebooted the machine, and had it hang in early boot (the first two are routine, the last is not). Forcing a reboot back to the earlier kernel brought things back to life. After a bunch of investigation I discovered that this was not actually due to the new kernel, it was due to an earlier dracut update. So this is the first thing to learn: if a dracut update breaks something in the boot process, you'll probably only discover this the next time you upgrade the kernel and the (new) dracut builds a (new and not working) initramfs for it.
The second thing I discovered in the process of this is that the Fedora boot process will wait for a really long time for your root filesystem to appear before giving up, printing messages about it, and giving you an emergency shell, where by a really long time I mean 'many minutes' (I think at least five). It turned out that my boot process had not locked up but instead was sitting around waiting for my root filesystem to appear. Of course this wait was silent, with no warnings or status notes reported on the console, so I thought that things had hung. The reason the boot process couldn't find my root filesystem was that my root filesystem is on software RAID and the new dracut has stopped assembling such things for a bunch of people.
(Fedora apparently considers this new dracut state to be 'working as designed', based on bug reports I've skimmed.)
I don't know exactly what changed between the old dracut and the
new dracut, but what I do know is that the new dracut really wants
you to explicitly tell it what software RAID devices, LVM devices,
or other things to bring up on boot through arguments added to the
kernel command line. dracut.cmdline(7)
will tell you all about the many options, but the really useful thing
to know is that you can get dracut itself to tell you what it wants
via 'dracut --print-cmdline'.
For me on my machine, this prints out (and booting wants):
- three 'rd.md.uuid=<UUID>' settings for the software RAID arrays of my root filesystem, the swap partition, and /boot. I'm not sure why dracut includes /boot but I left it in. The kernel command line is already absurdly over-long on a modern Fedora machine, so whatever. (There are similar options for LVM volumes, LUKS, and so on.)
- a 'root=UUID=<UUID>' stanza to specify the UUID of the root filesystem. It's possible that my old 'root=/dev/mdXX' would have worked (the root's RAID array is assembled with the right name), but I didn't feel like finding out the hard way.
- 'rootflags=...' and 'rootfstype=ext4' for more information about mounting the root filesystem.
- 'resume=UUID=<UUID>', which points to my swap area. I omitted this in the kernel command line I set in grub.cfg because I never suspend my workstation. Nothing has exploded yet.
The simplest approach to fixing up your machine in a situation like
this is probably to just update grub.cfg to add everything dracut
wants to the new kernel's command line (removing any existing
conflicting options, eg an old root=/dev/XXX setting). I looked
into just what the arguments were and omitted one for no particularly
good reason.
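To give a rough idea of where this ends up (the kernel version and UUIDs below are placeholders, and your existing options will differ), the kernel line in grub.cfg winds up looking something like:

linux /vmlinuz-<version> ro root=UUID=<root-fs-uuid> rootfstype=ext4 rd.md.uuid=<md-uuid-1> rd.md.uuid=<md-uuid-2> rd.md.uuid=<md-uuid-3>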
(I won't say that Dracut is magic, because I'm sure it could all be read up on and decoded if I wanted to. I just think that doing so is not worth bothering with for most people. Modern Linux booting is functionally a black box, partly because it's so complex and partly because it almost always just works.)
2014-04-05
An important additional step when shifting software RAID mirrors around
After going through all of the steps from yesterday's entry to move my mirrors from one disk to another, I inadvertently discovered a vital additional step you need to take here. The additional step is:
- After you've taken the old disk out of the mirror and shrunk the
mirror (steps 4 and 5), either destroy the old disk's RAID
superblock or physically remove the disk from your system.
I believe that RAID superblocks can be destroyed with the following
(where /dev/sdb7 is the old disk):
mdadm --zero-superblock /dev/sdb7
Failure to do this may cause your system to malfunction either subtly or spectacularly on boot (malfunctioning spectacularly is best because that ensures you notice it). The culprit here is how a modern Linux system assembles RAID arrays on boot. Put simply, there is nothing that forces all of your RAID arrays to be assembled using your current mirrors instead of the obsolete mirrors on your old disk. Instead it seems to come down to which device is processed first. If a partition on your old disk is processed first, it wins the race and becomes the sole member of the RAID array (which may then fail to activate because it doesn't have the full device set). If you're lucky your system now refuses to boot; if you're unlucky, your system boots but with obsolete and unmirrored filesystems, and anything important written to them will cause you a great deal of heartburn as you try to sort out the resulting mess.
(Linux software RAID appears to be at least smart enough to know that your two current mirror devices and the old disk are not compatible and so doesn't glue them all together. I don't know what GRUB's software RAID code does here if your boot partition is on a software RAID mirror that has had this happen to it.)
This points out core architectural flaws in both the asynchronous
assembly process and the approach of removing obsolete devices by
failing them first. If mdadm had a 'remove active device' operation,
it could at least somehow mark the removed device's superblock as
'do not use to auto-assemble array, this device has been explicitly
removed'. If the assembly process was not asynchronous the way it is,
it could see that some mirror devices were more recent than others and
prefer them. But sadly, well, no.
(In theory a not yet activated software RAID array could be revised to kick out the out of date device and replace it with the newer device (although there are policy issues involved). This can't be done at all once the array has been activated, or rather while the array is active.)
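If you want to check by hand which copies are current before a reboot, comparing superblock event counts and update times is one way to do it (here /dev/sdb7 is the old disk's partition from above and /dev/sdc7 stands in for a current mirror member):

mdadm --examine /dev/sdb7 | grep -iE 'update time|events'
mdadm --examine /dev/sdc7 | grep -iE 'update time|events'
# the stale partition on the old disk will show a lower event count and
# an older update time than the partitions that are still in the array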
2014-04-03
Shifting a software RAID mirror from disk to disk in modern Linux
Suppose that you have a software RAID mirror and you want to migrate one side of the mirror from one disk to another to replace the old disk. The straightforward way is to remove the old disk, put in the new disk, and resync the mirror. However, this leaves you without a mirror at all for the duration of the resync, so if you can get all three disks online at once what you'd like to do is add the new disk as a third mirror and then remove the old disk later. Modern Linux makes this a little bit complicated.
The core complication is that your software RAID devices know how many active mirrors they are supposed to have. If you add a device beyond that, it becomes a hot spare instead of being an active mirror. To activate it as a mirror you must add it then grow the number of active devices in the mirror. Then to properly deactivate the old disk you need to do the reverse.
Here are the actual commands (for my future use if nothing else):
- Hot-add the new device:
  mdadm -a /dev/md17 /dev/sdd7
  If you look at /proc/mdstat afterwards you'll see it marked as a spare.
- 'Grow' the number of active devices in the mirror:
  mdadm -G -n 3 /dev/md17
- Wait for the mirror to resync. You may want to run the new disk in parallel with the old disk for a few days to make sure that all is well with it; this is fine. You may want to be wary about reboots during this time.
- Take the old disk out by first manually failing it and then actually removing it:
  mdadm --fail /dev/md17 /dev/sdb7
  mdadm -r /dev/md17 /dev/sdb7
- Finally, shrink the number of active devices in the mirror down to two again:
  mdadm -G -n 2 /dev/md17
You really do want to explicitly shrink the number of active devices
in the mirror. A mismatch between the number of actual devices and the
number of expected devices can have various undesirable consequences. If a significant amount of time has passed
between steps three and four, make sure that your mdadm.conf still has
the correct number of devices configured in it for all of the arrays
(ie, two).
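For reference, the sort of mdadm.conf line involved looks something like this (the UUID is a placeholder):

ARRAY /dev/md17 metadata=1.2 num-devices=2 UUID=<array-uuid>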
Unfortunately marking the old disk as failed will likely get you warning
email from mdadm's status monitoring about a failed device. This is
the drawback of mdadm not having a way to directly do 'remove an
active device' as a single action. I can understand why mdadm doesn't
have an operation for this, but it's still a bit annoying.
(Looking at this old entry makes it clear that I've run into the need to grow and shrink the number of active mirror devices before, but apparently I didn't consider it noteworthy at that point.)