Wandering Thoughts archives

2014-04-16

Where I feel that btrfs went wrong

I recently finished reading this LWN series on btrfs, which was the most in-depth exposure to the details of using btrfs that I've had so far. While I'm sure that LWN intended the series to make people enthused about btrfs, I came away with a rather different reaction; I've wound up feeling that btrfs has made a significant misstep along the way that's resulted in a number of design mistakes. To explain why I feel this way I need to contrast it with ZFS.

Btrfs and ZFS are each both volume managers and filesystems merged together. One of the fundamental interface differences between them is that ZFS has decided that it is a volume manager first and a filesystem second, while btrfs has decided that it is a filesystem first and a volume manager second. This is what I see as btrfs's core mistake.

(Overall I've been left with the strong impression that btrfs basically considers volume management to be icky and tries to have as little to do with it as possible. If correct, this is a terrible mistake.)

Since it's a volume manager first, ZFS places volume management front and center in operation. Before you do anything ZFS-related, you need to create a ZFS volume (which ZFS calls a pool); only once this is done do you really start dealing with ZFS filesystems. ZFS even puts the two jobs in two different commands (zpool for pool management, zfs for filesystem management). Because it's firmly made this split, ZFS is free to have filesystem level things such as df present a logical, filesystem based view of things like free space and device usage. If you want the actual physical details you go to the volume management commands.
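
To make this concrete, here is a minimal sketch of the ZFS workflow (the pool name 'tank' and the device names are invented for illustration):

    zpool create tank mirror sda sdb mirror sdc sdd
    zfs create tank/home
    df -h /tank/home     # a logical, filesystem-level view of space
    zpool status tank    # the physical devices live at the pool level

Note that df never has to mention sda through sdd at all; the physical details belong to zpool.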

Because btrfs puts the filesystem first it wedges volume creation in as a side effect of filesystem creation, not a separate activity, and then it carries a series of lies and uselessly physical details through to filesystem level operations like df. Consider the discussion of what df shows for a RAID1 btrfs filesystem here, which has both a lie (that the filesystem uses only a single physical device) and a needlessly physical view (of the physical block usage and space free on a RAID 1 mirror pair). That btrfs refuses to expose itself as a first class volume manager and pretends that you're dealing with real devices forces it into utterly awkward things like mounting a multi-device btrfs filesystem with 'mount /dev/adevice /mnt'.
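
For contrast, a sketch of the btrfs version of the same thing (device names again invented); volume creation is folded into mkfs and the mount names just one of the physical devices:

    mkfs.btrfs -m raid1 -d raid1 /dev/sdc /dev/sdd
    mount /dev/sdc /mnt         # one device stands in for the whole multi-device filesystem
    df -h /mnt                  # the misleading view discussed above
    btrfs filesystem df /mnt    # the closest thing to an honest breakdown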

I think that this also leads to the asinine design decision that subvolumes have magic flat numeric IDs instead of useful names. Something that's willing to admit it's a volume manager, such as LVM or ZFS, has a name for the volume and can then hang sub-names off that name in a sensible way, even if where those sub-objects appear in the filesystem hierarchy (and under what names) gets shuffled around. But btrfs has no name for the volume to start with and there you go (the filesystem-volume has a mount point, but that's a different thing).
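
As a small illustration (names invented), a btrfs subvolume is identified by a bare numeric ID while a ZFS dataset carries a full name of its own:

    btrfs subvolume create /mnt/postgres
    btrfs subvolume list /mnt    # prints something like 'ID 257 gen 10 top level 5 path postgres'

    zfs create tank/postgres
    zfs snapshot tank/postgres@before-upgrade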

All of this really matters for how easily you can manage and keep track of things. df on ZFS filesystems does not lie to me; it tells me where the filesystem comes from (what pool and what object path within the pool), how much logical space the filesystem is using (more or less), and roughly how much more I can write to it. Since they have full names, ZFS objects such as snapshots can be more or less self documenting if you name them well. With an object hierarchy, ZFS has a natural way to inherit various things from parent object to sub-objects. And so on.

Btrfs's 'I am not a volume manager' approach also leads it to restrict the physical shape of a btrfs RAID array in a way that is actually painfully limiting. In ZFS, a pool stripes its data over a number of vdevs and each vdev can be any RAID type with any number of devices. Because ZFS allows multi-way mirrors this creates a straightforward way to create a three-way or four-way RAID 10 array; you just make all of the vdevs be three or four way mirrors. You can also change the mirror count on the fly, which is handy for all sorts of operations. In btrfs, the shape 'raid10' is a top level property of the overall btrfs 'filesystem' and, well, that's all you get. There is no easy place to put in multi-way mirroring; because of btrfs's model of not being a volume manager it would require changes in any number of places.
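
A sketch of what this looks like on the ZFS side (pool and device names invented); a three-way mirror is just a vdev with three devices, and mirror sides can be attached or detached on a live pool:

    # stripe across two three-way mirrors, ie a 'three-way RAID 10'
    zpool create tank mirror sda sdb sdc mirror sdd sde sdf
    # later, grow one mirror to four-way on the fly...
    zpool attach tank sda sdg
    # ...and shrink it back down again
    zpool detach tank sdg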

(And while I'm here, that btrfs requires you to specify both your data and your metadata RAID levels is crazy and gives people a great way to accidentally blow their own foot off.)

As a side note, I believe that btrfs's lack of allocation guarantees in a raid10 setup makes it impossible to create a btrfs filesystem split evenly across two controllers that is guaranteed to survive the loss of one entire controller. In ZFS this is trivial because of the explicit structure of vdevs in the pool.
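
A sketch of the sort of explicit structure I mean, with invented device names where the c1* disks sit on one controller and the c2* disks on the other; every mirror pair spans the two controllers, so losing a controller costs you exactly one side of each mirror:

    zpool create tank mirror c1d0 c2d0 mirror c1d1 c2d1 mirror c1d2 c2d2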

PS: ZFS is too permissive in how you can assemble vdevs, because there is almost no point to a pool with, say, a mirror vdev plus a RAID-6 vdev. That configuration is all but guaranteed to be a mistake in some way.

BtrfsCoreMistake written at 01:27:57

2014-04-11

What sort of kernel command line arguments Fedora 20's dracut seems to want

Recently I upgraded the kernel on my Fedora 20 office workstation, rebooted the machine, and had it hang in early boot (the first two are routine, the last is not). Forcing a reboot back to the earlier kernel brought things back to life. After a bunch of investigation I discovered that this was not actually due to the new kernel but to an earlier dracut update. So this is the first thing to learn: if a dracut update breaks something in the boot process, you'll probably only discover this the next time you upgrade the kernel and the (new) dracut builds a (new and not working) initramfs for it.

The second thing I discovered in the process of this is that the Fedora boot process will wait for a really long time for your root filesystem to appear before giving up, printing messages about it, and giving you an emergency shell, where by a really long time I mean 'many minutes' (I think at least five). It turned out that my boot process had not locked up but instead was sitting around waiting for my root filesystem to appear. Of course this wait was silent, with no warnings or status notes reported on the console, so I thought that things had hung. The reason the boot process couldn't find my root filesystem was that my root filesystem is on software RAID and the new dracut has stopped assembling such things for a bunch of people.

(Fedora apparently considers this new dracut state to be 'working as designed', based on bug reports I've skimmed.)

I don't know exactly what changed between the old dracut and the new dracut, but what I do know is that the new dracut really wants you to explicitly tell it what software RAID devices, LVM devices, or other things to bring up on boot through arguments added to the kernel command line. dracut.cmdline(7) will tell you all about the many options, but the really useful thing to know is that you can get dracut itself to tell you what it wants via 'dracut --print-cmdline'.

For me on my machine, this prints out (and booting wants):

  • three rd.md.uuid=<UUID> settings for the software RAID arrays of my root filesystem, the swap partition, and /boot. I'm not sure why dracut includes /boot but I left it in. The kernel command line is already absurdly over-long on a modern Fedora machine, so whatever.

    (There are similar options for LVM volumes, LUKS, and so on.)

  • a 'root=UUID=<UUID>' stanza to specify the UUID of the root filesystem. It's possible that my old 'root=/dev/mdXX' would have worked (the root's RAID array is assembled with the right name), but I didn't feel like finding out the hard way.

  • rootflags=... and rootfstype=ext4 for more information about mounting the root filesystem.

  • resume=UUID=<UUID>, which points to my swap area. I omitted this in the kernel command line I set in grub.cfg because I never suspend my workstation. Nothing has exploded yet.
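
Putting that together, the output of 'dracut --print-cmdline' on a machine like mine looks roughly like this (the UUIDs here are invented placeholders, not real values):

    rd.md.uuid=aaaaaaaa:bbbbbbbb:cccccccc:dddddddd rd.md.uuid=... rd.md.uuid=...
    root=UUID=11111111-2222-3333-4444-555555555555 rootflags=rw,relatime rootfstype=ext4
    resume=UUID=55555555-6666-7777-8888-999999999999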

The simplest approach to fixing up your machine in a situation like this is probably to just update grub.cfg to add everything dracut wants to the new kernel's command line (removing any existing conflicting options, eg an old root=/dev/XXX setting). I looked into just what the arguments were and omitted one for no particularly good reason.

(I won't say that Dracut is magic, because I'm sure it could all be read up on and decoded if I wanted to. I just think that doing so is not worth bothering with for most people. Modern Linux booting is functionally a black box, partly because it's so complex and partly because it almost always just works.)

DracutNeededArguments written at 02:11:02

2014-04-05

An important additional step when shifting software RAID mirrors around

After going through all of the steps from yesterday's entry to move my mirrors from one disk to another, I inadvertently discovered a vital additional step you need to take here. The additional step is:

  • After you've taken the old disk out of the mirror and shrunk the mirror (steps 4 and 5), either destroy the old disk's RAID superblock or physically remove the disk from your system. I believe that RAID superblocks can be destroyed with the following (where /dev/sdb7 is the old disk):
    mdadm --zero-superblock /dev/sdb7
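
    If you want to double check your work, 'mdadm --examine' will tell you whether a partition still carries an md superblock (using the same old-disk partition as above); after zeroing it should report something like 'No md superblock detected':

    mdadm --examine /dev/sdb7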

Failure to do this may cause your system to malfunction either subtly or spectacularly on boot (malfunctioning spectacularly is best because it ensures you notice). The culprit here is how a modern Linux system assembles RAID arrays on boot. Put simply, there is nothing that forces all of your RAID arrays to be assembled using your current mirrors instead of the obsolete mirrors on your old disk. Instead it seems to come down to which device is processed first. If a partition on your old disk is processed first, it wins the race and becomes the sole member of the RAID array (which may then fail to activate because it doesn't have the full device set). If you're lucky your system now refuses to boot; if you're unlucky, your system boots but with obsolete and unmirrored filesystems, and anything important written to them will cause you a great deal of heartburn as you try to sort out the resulting mess.

(Linux software RAID appears to be at least smart enough to know that your two current mirror devices and the old disk are not compatible and so doesn't glue them all together. I don't know what GRUB's software RAID code does here if your boot partition is on a software RAID mirror that has had this happen to it.)

This points out core architectural flaws in both the asynchronous assembly process and the approach of removing obsolete devices by failing them first. If mdadm had a 'remove active device' operation, it could at least somehow mark the removed device's superblock as 'do not use to auto-assemble array, this device has been explicitly removed'. If the assembly process was not asynchronous the way it is, it could see that some mirror devices were more recent than others and prefer them. But sadly, well, no.

(In theory a not yet activated software RAID array could be revised to kick out the out of date device and replace it with the newer device (although there are policy issues involved). This can't be done at all once the array has been activated, or rather while the array is active.)

SoftwareRaidShiftingMirrorII written at 02:05:37

2014-04-03

Shifting a software RAID mirror from disk to disk in modern Linux

Suppose that you have a software RAID mirror and you want to migrate one side of the mirror from one disk to another to replace the old disk. The straightforward way is to remove the old disk, put in the new disk, and resync the mirror. However, this leaves you without a mirror at all for the duration of the resync, so if you can get all three disks online at once, what you'd like to do is add the new disk as a third mirror and then remove the old disk later. Modern Linux makes this a little bit complicated.

The core complication is that your software RAID devices know how many active mirrors they are supposed to have. If you add a device beyond that, it becomes a hot spare instead of being an active mirror. To activate it as a mirror you must add it and then grow the number of active devices in the mirror. Then to properly deactivate the old disk you need to do the reverse.

Here are the actual commands (for my future use if nothing else):

  1. Hot-add the new device:
    mdadm -a /dev/md17 /dev/sdd7

    If you look at /proc/mdstat afterwards you'll see it marked as a spare.

  2. 'Grow' the number of active devices in the mirror:
    mdadm -G -n 3 /dev/md17

  3. Wait for the mirror to resync. You may want to run the new disk in parallel with the old disk for a few days to make sure that all is well with it; this is fine. You may want to be wary about reboots during this time.

  4. Take the old disk out by first manually failing it and then actually removing it:
    mdadm --fail /dev/md17 /dev/sdb7
    mdadm -r /dev/md17 /dev/sdb7

  5. Finally, shrink the number of active devices in the mirror down to two again:
    mdadm -G -n 2 /dev/md17

You really do want to explicitly shrink the number of active devices in the mirror. A mismatch between the number of actual devices and the number of expected devices can have various undesirable consequences. If a significant amount of time passed between steps three and four, make sure that your mdadm.conf still has the correct number of devices configured in it for all of the arrays (ie, two).
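
As an illustration of what I mean, the relevant sort of mdadm.conf ARRAY line looks something like this (invented UUID); it's the num-devices= value that has to agree with reality, assuming your mdadm.conf sets it at all:

    ARRAY /dev/md17 metadata=1.2 num-devices=2 UUID=aaaaaaaa:bbbbbbbb:cccccccc:dddddddd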

Unfortunately marking the old disk as failed will likely get you warning email from mdadm's status monitoring about a failed device. This is the drawback of mdadm not having a way to directly do 'remove an active device' as a single action. I can understand why mdadm doesn't have an operation for this, but it's still a bit annoying.

(Looking at this old entry makes it clear that I've run into the need to grow and shrink the number of active mirror devices before, but apparently I didn't consider it noteworthy at that point.)

SoftwareRaidShiftingMirror written at 19:51:05

