Where I feel that btrfs went wrong

April 16, 2014

I recently finished reading this LWN series on btrfs, which was the most in-depth exposure to the details of using btrfs that I've had so far. While I'm sure that LWN intended the series to make people enthused about btrfs, I came away with a rather different reaction; I've wound up feeling that btrfs has made a significant misstep along its way that's resulted in a number of design mistakes. To explain why I feel this way I need to contrast it with ZFS.

Btrfs and ZFS are each a volume manager and a filesystem merged together. One of the fundamental interface differences between them is that ZFS has decided that it is a volume manager first and a filesystem second, while btrfs has decided that it is a filesystem first and a volume manager second. This is what I see as btrfs's core mistake.

(Overall I've been left with the strong impression that btrfs basically considers volume management to be icky and tries to have as little to do with it as possible. If correct, this is a terrible mistake.)

Since it's a volume manager first, ZFS places volume management front and center in operation. Before you do anything ZFS-related, you need to create a ZFS volume (which ZFS calls a pool); only once this is done do you really start dealing with ZFS filesystems. ZFS even puts the two jobs in two different commands (zpool for pool management, zfs for filesystem management). Because it's firmly made this split, ZFS is free to have filesystem level things such as df present a logical, filesystem based view of things like free space and device usage. If you want the actual physical details you go to the volume management commands.
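
As a rough sketch of how this split looks in practice (the pool name 'tank' and the device names here are just illustrative):

    # volume management and filesystem management are separate commands
    zpool create tank mirror /dev/sda /dev/sdb   # create the pool (the 'volume')
    zfs create tank/home                         # then create filesystems inside it
    zfs list                                     # logical, filesystem-level view
    zpool status tank                            # physical, device-level view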

Because btrfs puts the filesystem first it wedges volume creation in as a side effect of filesystem creation, not a separate activity, and then it carries a series of lies and uselessly physical details through to filesystem level operations like df. Consider the discussion of what df shows for a RAID1 btrfs filesystem here, which has both a lie (that the filesystem uses only a single physical device) and a needlessly physical view (of the physical block usage and space free on a RAID 1 mirror pair). That btrfs refuses to expose itself as a first class volume manager and pretends that you're dealing with real devices forces it into utterly awkward things like mounting a multi-device btrfs filesystem with 'mount /dev/adevice /mnt'.
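
Here is roughly what that looks like (again with made-up device names):

    # volume creation happens as a side effect of making the filesystem
    mkfs.btrfs -d raid1 -m raid1 /dev/sdc /dev/sdd
    # and you mount the multi-device filesystem through one of its member devices
    mount /dev/sdc /mnt
    # df will then claim the filesystem lives on /dev/sdc alone; the real
    # multi-device picture only shows up in btrfs's own tools
    btrfs filesystem show /mnt
    btrfs filesystem df /mnt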

I think that this also leads to the asinine design decision that subvolumes have magic flat numeric IDs instead of useful names. Something that's willing to admit it's a volume manager, such as LVM or ZFS, has a name for the volume and can then hang sub-names off that name in a sensible way, even if where those sub-objects appear in the filesystem hierarchy (and under what names) gets shuffled around. But btrfs has no name for the volume to start with and there you go (the filesystem-volume has a mount point, but that's a different thing).
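
To illustrate the difference (the subvolume ID in the mount option below is made up; you have to go look up the real one with 'btrfs subvolume list'):

    # btrfs: subvolumes hang off whatever mount point the filesystem has
    # and are identified internally by a bare numeric ID
    btrfs subvolume create /mnt/www
    btrfs subvolume list /mnt
    mount -o subvolid=257 /dev/sdc /srv/www    # '257' is illustrative

    # ZFS: every dataset has a full name rooted in the pool
    zfs create tank/www
    zfs list tank/www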

All of this really matters for how easily you can manage and keep track of things. df on ZFS filesystems does not lie to me; it tells me where the filesystem comes from (what pool and what object path within the pool), how much logical space the filesystem is using (more or less), and roughly how much more I can write to it. Since they have full names, ZFS objects such as snapshots can be more or less self documenting if you name them well. With an object hierarchy, ZFS has a natural way to inherit various things from parent object to sub-objects. And so on.
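
For example (the dataset and snapshot names here are whatever you choose; these are purely illustrative):

    # snapshots carry full, self-documenting names
    zfs snapshot tank/home@2014-04-16-before-upgrade
    zfs list -t snapshot

    # properties set on a parent are inherited by its children
    zfs set compression=on tank/home
    zfs get -r compression tank/home

    # df reports the dataset's name and its logical space usage
    df -h /tank/home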

Btrfs's 'I am not a volume manager' approach also restricts the physical shape of a btrfs RAID array in ways that are painfully limiting. In ZFS, a pool stripes its data over a number of vdevs and each vdev can be any RAID type with any number of devices. Because ZFS allows multi-way mirrors this creates a straightforward way to create a three-way or four-way RAID 10 array; you just make all of the vdevs three- or four-way mirrors. You can also change the mirror count on the fly, which is handy for all sorts of operations. In btrfs, the shape 'raid10' is a top level property of the overall btrfs 'filesystem' and, well, that's all you get. There is no easy place to put in multi-way mirroring; because of btrfs's model of not being a volume manager it would require changes in any number of places.
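
As a sketch (device names illustrative), here is the ZFS version of a three-way RAID 10 plus an on-the-fly change of mirror width, next to the btrfs version of 'raid10':

    # ZFS: a stripe of three-way mirror vdevs
    zpool create tank mirror /dev/sda /dev/sdb /dev/sdc \
                      mirror /dev/sdd /dev/sde /dev/sdf
    # widen one mirror to four-way, then shrink it again, all live
    zpool attach tank /dev/sda /dev/sdg
    zpool detach tank /dev/sdg

    # btrfs: 'raid10' is a property of the whole filesystem and that's it
    mkfs.btrfs -d raid10 -m raid10 /dev/sdc /dev/sdd /dev/sde /dev/sdf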

(And while I'm here, that btrfs requires you to specify both your data and your metadata RAID levels is crazy and gives people a great way to accidentally blow their own foot off.)
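
For instance, this perfectly legal invocation mirrors your metadata but leaves your data unreplicated, so losing either disk still loses data (device names illustrative):

    mkfs.btrfs -d single -m raid1 /dev/sdc /dev/sdd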

As a side note, I believe that btrfs's lack of allocation guarantees in a raid10 setup makes it impossible to create a btrfs filesystem split evenly across two controllers that is guaranteed to survive the loss of one entire controller. In ZFS this is trivial because of the explicit structure of vdevs in the pool.
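
In ZFS you simply write the controller split into the pool layout; with Solaris-style device names (illustrative), where c1 and c2 are the two controllers:

    # each mirror vdev pairs one disk from controller 1 with one from controller 2,
    # so losing a whole controller leaves every mirror with one intact side
    zpool create tank mirror c1t0d0 c2t0d0 \
                      mirror c1t1d0 c2t1d0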

PS: ZFS is too permissive in how you can assemble vdevs, because there is almost no point to a pool with, say, a mirror vdev plus a RAID-6 vdev. That configuration is all but guaranteed to be a mistake in some way.
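
For example (device names illustrative; if I remember right, zpool create complains about the mismatched replication level here and only goes ahead if you force it with -f):

    zpool create tank mirror /dev/sda /dev/sdb \
                      raidz2 /dev/sdc /dev/sdd /dev/sde /dev/sdf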
