Where I feel that btrfs went wrong

April 16, 2014

I recently finished reading this LWN series on btrfs, which was the most in-depth exposure to the details of using btrfs that I've had so far. While I'm sure that LWN intended the series to make people enthused about btrfs, I came away with a rather different reaction; I've wound up feeling that btrfs has made a significant misstep along its way, one that's resulted in a number of design mistakes. To explain why I feel this way I need to contrast it with ZFS.

Btrfs and ZFS are each a volume manager and a filesystem merged together. One of the fundamental interface differences between them is that ZFS has decided that it is a volume manager first and a filesystem second, while btrfs has decided that it is a filesystem first and a volume manager second. This is what I see as btrfs's core mistake.

(Overall I've been left with the strong impression that btrfs basically considers volume management to be icky and tries to have as little to do with it as possible. If correct, this is a terrible mistake.)

Since it's a volume manager first, ZFS places volume management front and center in operation. Before you do anything ZFS-related, you need to create a ZFS volume (which ZFS calls a pool); only once this is done do you really start dealing with ZFS filesystems. ZFS even puts the two jobs in two different commands (zpool for pool management, zfs for filesystem management). Because it's firmly made this split, ZFS is free to have filesystem level things such as df present a logical, filesystem based view of things like free space and device usage. If you want the actual physical details you go to the volume management commands.
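
To make the split concrete, here is roughly what it looks like from the command line; the pool, filesystem, and device names in this sketch are made up.

    # volume management: build a pool out of physical devices
    zpool create tank mirror /dev/sdb /dev/sdc
    zpool status tank        # the physical view: vdevs, devices, errors

    # filesystem management: everything here is in logical terms
    zfs create tank/data
    zfs list tank/data       # used space, available space, mountpoint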

Because btrfs puts the filesystem first it wedges volume creation in as a side effect of filesystem creation, not a separate activity, and then it carries a series of lies and uselessly physical details through to filesystem level operations like df. Consider the discussion of what df shows for a RAID1 btrfs filesystem here, which has both a lie (that the filesystem uses only a single physical device) and a needlessly physical view (of the physical block usage and space free on a RAID 1 mirror pair). That btrfs refuses to expose itself as a first class volume manager and pretends that you're dealing with real devices forces it into utterly awkward things like mounting a multi-device btrfs filesystem with 'mount /dev/adevice /mnt'.
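
As a sketch of what this looks like in practice (with made-up device names), you create the whole multi-device filesystem in one go and then mount it through any one of its member devices:

    # 'volume management' happens as a side effect of making the filesystem
    mkfs.btrfs -d raid1 -m raid1 /dev/sdb1 /dev/sdc1

    # mounting names a single member device; the rest are found for you
    mount /dev/sdb1 /mnt
    df -h /mnt               # space is reported in physical, not logical, terms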

I think that this also leads to the asinine design decision that subvolumes have magic flat numeric IDs instead of useful names. Something that's willing to admit it's a volume manager, such as LVM or ZFS, has a name for the volume and can then hang sub-names off that name in a sensible way, even if where those sub-objects appear in the filesystem hierarchy (and under what names) gets shuffled around. But btrfs has no name for the volume to start with and there you go (the filesystem-volume has a mount point, but that's a different thing).
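
(A quick illustration of the naming difference, again with made-up names and a hypothetical subvolume ID:)

    # ZFS: every object has a full name rooted at the pool
    zfs create -p tank/home/cks

    # btrfs: subvolumes are created somewhere under a mount point and are
    # tracked internally by numeric IDs
    btrfs subvolume create /mnt/home
    btrfs subvolume list /mnt
    mount -o subvolid=257 /dev/sdb1 /home    # '257' is whatever ID you got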

All of this really matters for how easily you can manage and keep track of things. df on ZFS filesystems does not lie to me; it tells me where the filesystem comes from (what pool and what object path within the pool), how much logical space the filesystem is using (more or less), and roughly how much more I can write to it. Since they have full names, ZFS objects such as snapshots can be more or less self documenting if you name them well. With an object hierarchy, ZFS has a natural way to inherit various things from parent object to sub-objects. And so on.
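
For instance (with made-up names), a snapshot's full name carries its own documentation and properties flow down the hierarchy:

    # the snapshot name records what it covers and why it was taken
    zfs snapshot tank/home/cks@2014-04-16-before-cleanup
    zfs list -t snapshot -r tank/home

    # a property set on the parent is inherited by everything under it
    zfs set compression=on tank/home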

Btrfs's 'I am not a volume manager' approach also drastically restricts the physical shape of a btrfs RAID array, in a way that is actually painfully limiting. In ZFS, a pool stripes its data over a number of vdevs and each vdev can be any RAID type with any number of devices. Because ZFS allows multi-way mirrors this creates a straightforward way to create a three-way or four-way RAID 10 array; you just make all of the vdevs be three- or four-way mirrors. You can also change the mirror count on the fly, which is handy for all sorts of operations. In btrfs, the shape 'raid10' is a top level property of the overall btrfs 'filesystem' and, well, that's all you get. There is no easy place to put in multi-way mirroring; because of btrfs's model of not being a volume manager it would require changes in any number of places.
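
Here is a made-up sketch of the difference in expressiveness:

    # ZFS: a pool striped over two three-way mirror vdevs, ie a three-way RAID 10
    zpool create tank \
        mirror /dev/sda /dev/sdb /dev/sdc \
        mirror /dev/sdd /dev/sde /dev/sdf
    # raise the mirror count of the vdev containing sda on the fly
    zpool attach tank /dev/sda /dev/sdg

    # btrfs: 'raid10' is a single filesystem-wide profile; there is nowhere
    # to ask for three copies
    mkfs.btrfs -d raid10 -m raid10 /dev/sdh1 /dev/sdi1 /dev/sdj1 /dev/sdk1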

(And while I'm here, that btrfs requires you to specify both your data and your metadata RAID levels is crazy and gives people a great way to accidentally blow their own foot off.)
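
(In other words you pick the two profiles separately at mkfs time, as in this sketch with made-up devices:)

    # data and metadata profiles are chosen independently; pick a weak one
    # for either and that part of the filesystem is unprotected
    mkfs.btrfs -d raid1 -m raid1 /dev/sdb1 /dev/sdc1
    mkfs.btrfs -d raid1 -m single /dev/sdb1 /dev/sdc1    # accepted, but a foot-gun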

As a side note, I believe that btrfs's lack of allocation guarantees in a raid10 setup makes it impossible to create a btrfs filesystem split evenly across two controllers that is guaranteed to survive the loss of one entire controller. In ZFS this is trivial because of the explicit structure of vdevs in the pool.
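
In ZFS terms it looks like this sketch, where the (made-up) device paths 'a' and 'b' stand for the two controllers:

    # each mirror vdev pairs one disk from controller a with one from
    # controller b, so losing a whole controller leaves every vdev with
    # one working side
    zpool create tank \
        mirror /dev/disk/by-path/pci-a-disk0 /dev/disk/by-path/pci-b-disk0 \
        mirror /dev/disk/by-path/pci-a-disk1 /dev/disk/by-path/pci-b-disk1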

PS: ZFS is too permissive in how you can assemble vdevs, because there is almost no point to a pool with, say, a mirror vdev plus a RAID-6 vdev. That configuration is all but guaranteed to be a mistake in some way.


Comments on this page:

By Ewen McNeill at 2014-04-16 06:28:32:

Reading your description of btrfs I'm struck by the thought that it's suffering from a combination of "leaky abstractions" (ie, insufficient encapsulation of "lower" level building blocks) and being designed in an environment where there was already a de facto volume manager (LVM) -- so the "volume management" was bolted on as a bit of a "this would make some things more flexible" afterthought. I too have the sense that btrfs is "a file system with some volume management stuff integrated for 'more flexibility'" rather than a volume manager/file system pair.

FWIW, based on listening to a bunch of talks at conferences by developers of Linux file systems, my conclusion was that XFS was receiving the most promising attention to detail/real world problems -- so I've gone back to that for several more recent things (combined with LVM). (Mostly for licensing reasons ZFS doesn't feel like a realistic option for most of the environments I deal with -- almost exclusively variants of Linux now -- so I've never considered it a sensible default choice. But the design has always seemed very well thought out -- right back to the original developers' blog posts about it.)

Ewen

By Etienne Dechamps at 2014-04-16 08:36:20:

"PS: ZFS is too permissive in how you can assemble vdevs, because there is almost no point of a pool with, say, a mirror vdev plus a RAID-6 vdev. That configuration is all but guaranteed to be a mistake in some way."

In recent versions of ZFS (such as in ZFS On Linux), the zpool command will prevent you from doing that and will require the use of the -f (force) flag for the command to go through.

By Ray V at 2014-04-16 11:41:44:

Despite being a long-time ZFS user and fan, I will say I'm excited about Facebook beginning broader use of btrfs. That's exactly the sort of real-world exposure it needs to work out remaining kinks and even some of the usability issues you describe (though I imagine the less-than-clean volume management stuff won't change anytime soon).

By Fulano de tal at 2014-04-16 12:51:46:

I think you are confusing "btrfs features that have not been implemented" with "btrfs core mistakes". Sure, btrfs subvolume management is not as advanced as ZFS's. But although slowly, btrfs is adding infrastructure that will eventually allow saner handling of subvolumes. In fact, the most recent release added this: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=63541927c8d11d2686778b1e8ec71c14b4fd53e4

Which will certainly allow doing many of the things you mention in your post in the future. Btrfs is quite flexible internally; there are many features that will eventually be implemented but aren't right now. Remember that the raid5/6 implementation isn't even complete today.

The priority of the btrfs development team seems to be making the "filesystem parts" reliable instead of focusing on extended device manager functionality. I think that makes a lot of sense - a reliable filesystem without advanced volume manager facilities is useful and can be used as the default FS in a Linux distro, while an advanced volume manager with unstable "filesystem functionality" is useless.

"One of the fundamental interface differences between them is that ZFS has decided that it is a volume manager first and a filesystem second, while btrfs has decided that it is a filesystem first and a volume manager second. This is what I see as btrfs's core mistake."

The volume management of btrfs came much later, and was originally not planned for at all. One of the Linux folks thought it a "rampant layering violation", and that 'accusation' caused Jeff Bonwick (co-creator of ZFS) to post this response explaining the logic of why Sun did things the way they did (SPA+DMU+ZPL):

By rektide at 2014-04-21 00:20:00:

And then there's HammerFS, where there is no concept of subvolumes: any directory from any time point can be turned into a snapshot. ZFS still requires planning ahead, requires thinking in terms of subvolumes, and while it eases the path of use, it's still a far cry from being able to version anything.

It may have been Bonwick, or another ZFS author, that blogged about the paper "End to end arguments in system design" some years back: http://web.mit.edu/saltzer/www/publications/endtoend/endtoend.pdf

This really made me reconsider the knee-jerk 'encapsulation!' argument whenever a situation like this comes along.

From 76.10.142.143 at 2014-04-25 08:16:54:

"It may have been Bonwick [….]"

Is this the article you were referring to?

By scineram@freemail.hu at 2017-12-14 06:02:33:

"forces it into utterly awkward things like mounting a multi-device btrfs filesystem with 'mount /dev/adevice /mnt'"

How does that even work, by the way? Is it some dummy device you have to specify, like mdraid? Or does just giving one device of many pull the others in automagically?

By cks at 2017-12-14 08:19:47:

According to LWN (in the Working with multiple devices article), you mount a multi-device btrfs filesystem using any component device and it magically finds all the other devices. So if you have a btrfs filesystem that uses sdb1 and sdc1, you can say 'mount /dev/sdb1 /mnt' or 'mount /dev/sdc1 /mnt'.

(I don't know how this works in a modern Linux world that prefers to mount things using UUIDs and other permanent identifiers, partly because I've never used btrfs myself.)

