Today on Linux, ZFS is your only real choice for an advanced filesystem

January 5, 2015

Yesterday I wrote about what I consider an advanced filesystem to be in general: at a minimum, a filesystem with checksums so you know when your data has been damaged, and ideally one with some ability to use redundancy to repair that damage. As far as I know, today on Linux there are only two filesystems that are advanced in this way: btrfs and ZFS, via ZFS on Linux.
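(As a toy illustration of the checksum idea with ordinary shell tools, nothing like ZFS's actual on-disk machinery: you record a checksum when data is written and verify it when the data is read back, so silent damage gets detected instead of quietly returned to you.)

```shell
# Record a checksum alongside the data when it's written.
echo "important data" > block0
sha256sum block0 > checksums

# Intact data verifies cleanly.
sha256sum -c checksums

# If the data silently decays on disk, the stored checksum no longer
# matches and the damage is detected instead of silently returned.
echo "importEnt data" > block0
sha256sum -c checksums || echo "corruption detected"
```

An advanced filesystem does this transparently on every read; with redundancy (mirrors or parity), it can then go one step further and rewrite the bad copy from a good one.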

(If you don't care about disk checksums, you have lots of choice among perfectly good filesystems. I would just run ext4 unless you have a good reason to know that eg XFS is a better choice in your particular environment; it's what I do and what most people do, so ext4 gets a lot of exercise and attention.)

In theory, you might choose either, and you might even default to btrfs as the in-kernel solution. In practice, I believe that you have only one real choice today and that choice is ZFS on Linux. This is not because ZFS might be better than btrfs on a technical level (although I believe it is); it is simply because people keep having problems with btrfs (the latest example I was exposed to was this one). Far too many things I read about btrfs wind up saying stuff like 'it's been stable for a few months since the last problem' or 'I had a problem recently but it wasn't too bad' or the like. Btrfs does not appear to be stable yet and it doesn't appear likely to be stable any time soon; everything I wrote in 2013 about why not to consider btrfs yet still applies.

Btrfs will hopefully someday be one of the filesystems of the future. But it is not the filesystem of today unless you feel very daring. If you want an advanced filesystem today on Linux, your only real option is ZFS on Linux.

Now, ZoL is not perfect. People do still report problems with it from time to time, including kernel memory issues, and you will want to test it in your environment to make sure it works okay. But from all the reports I've read there are plenty of people running it in production in various ways (in more demanding circumstances than mine) and it isn't blowing up in their faces.

In short, ZFS on Linux is something that you can reasonably consider today, and in practice things will probably work fine. I think that considering btrfs today is demonstrably relatively crazy.

(I'm aware that Facebook is using btrfs internally to some degree. Facebook also has Chris Mason working for them to find and fix their btrfs problems and likely a team that immediately packages those changes up into custom Facebook kernels. See also.)


Comments on this page:

FWIW, XFS added metadata checksums. Data blocks are not checksummed. The idea is that data blocks should be checksummed by applications that care enough to checksum their data, in whatever format makes sense for them. Of course, XFS relies on RAID for error correction, so it's far from a replacement for ZFS et al.

By Anon at 2015-01-05 17:21:44:

To follow up to the XFS comment, recent Ext4 also has the option for metadata only checksums - https://ext4.wiki.kernel.org/index.php/Ext4_Metadata_Checksums .

By cks at 2015-01-05 17:33:20:

I don't want to say too many nasty things because metadata checksums are better than no checksums, but the metadata is not what I really care about in the filesystem and it is also not the thing most likely to get damaged by random decay in most filesystems (since in most filesystems far more blocks are used for data than metadata). And XFS's apparent attitude that if you really care you'll do data checksums yourself makes me kind of angry; to put it very unkindly, it's such a smug Unix weenie answer.

(Of course one of the problems that all of these efforts have is that doing checksums right almost certainly requires aggressive changes to the format of on-disk data. Changing on disk filesystem format is not exactly popular.)

By Rich Freeman at 2015-01-08 06:41:05:

Having looked at zfs and btrfs and ceph at various points, I still think that btrfs is the most promising filesystem right now for individual Linux systems. I tend to agree with your statement that it represents the future more than the present - it is far from stable at the moment, but usable if you don't mind dealing with the occasional issue.

The biggest issue I see with zfs for smaller systems is that it lacks the ability to modify a vdev. If you have 40 drives in an array then adding/removing them in groups isn't a big deal, and the fine-grained manual control over how those disks are used is probably welcome. However, if you have 2 drives in an array being able to add 1 more and get maximum value from its additional space is quite useful, as is the fact that you can just toss 3 drives in a RAID1 without thinking and have btrfs get n/2 space utilization without any effort. It is also useful to be able to switch between raid levels on-the-fly.
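(For concreteness, the btrfs reshaping described above looks roughly like this; /dev/sdc and /mnt/data are hypothetical, and you should check btrfs-balance(8) for your version rather than trusting my memory of the flags.)

```sh
# Add a third drive to an existing two-drive btrfs filesystem
btrfs device add /dev/sdc /mnt/data

# Rebalance so existing data spreads across all three drives
btrfs balance start /mnt/data

# Or convert data and metadata to a different RAID profile on the fly
btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/data
```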

Obviously ceph isn't really designed with individual systems in mind, but the main thing that concerns me with it is that it lacks the checksumming that both zfs and btrfs provide. As far as I'm aware it does checksum data in transit, but not at rest. I'm not sure under what circumstances it will even discover a discrepancy (without an explicit scrub), but if one is discovered all it does is treat one copy as the official one by convention. I think this is a major step backwards.

For application data storage being non-POSIX is probably fine, but it isn't terribly practical for running your OS.

By cks at 2015-01-08 13:24:05:

ZFS is not perfect and feature-complete by any means, and being able to reshape your storage is probably the most requested missing feature. However I'm not sure that btrfs's ability to do that makes up for what I consider other questionable core design decisions.

(Of course in an ideal world we'd have something that combines the best attributes of both. Sadly this is not such a world and I'm not convinced that btrfs will ever pick up ZFS's good technical decisions any more than ZFS will pick up btrfs's appealing features. If I have to pick one set, I'm more inclined to ZFS's set than btrfs's.)

On XFS versus ext4 for a "vanilla" sensible default: I found when trying to host Mac OS X files that the default inode size for ext4 was not large enough to hold the quantity of extended attributes that OSX uses for some of its files (either stuff inside app bundles or things in their Photos app, I forget which). XFS appears to have dynamic allocation for the bit of metadata where xattrs live so doesn't suffer this problem. You can set the static size reserved for xattrs for extX at FS creation time. (All this off the top of my head.)
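(For anyone else who runs into this: the ext4 knob in question is the inode size, set at filesystem creation time, since ext4 keeps xattrs in the spare space inside each inode. A sketch from memory, with a hypothetical /dev/sdX1; see mke2fs(8) for the details on your version.)

```sh
# Create an ext4 filesystem with larger inodes (the default is 256
# bytes), leaving more in-inode room for extended attributes.
mkfs.ext4 -I 512 /dev/sdX1

# Check the inode size of an existing ext4 filesystem
tune2fs -l /dev/sdX1 | grep 'Inode size'
```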

Oh er PS thanks for continuing the 'state of ZFS versus BTRFS' theme. I for one really appreciate it.


