What makes a 'next generation' or 'advanced' modern filesystem, for me

January 4, 2015

Filesystems have been evolving in fits and starts for roughly as long as there have been filesystems, and I doubt that is going to stop any time soon. These days there are a number of directions that filesystems seem to be moving in, but I've come around to the view that one of them is of particular importance and is the defining characteristic of what I wind up calling 'modern', 'advanced', or 'next generation' filesystems.

By now, current filesystems have mostly solved the twin problems of performance and resilience in the face of crashes (although performance may need some re-solving in the face of SSDs, which change various calculations). Future filesystems will likely make incremental improvements, but I can't currently imagine anything drastically different.

Instead, the next generation frontier is in resilience to disk problems and improved recovery from them. At the heart of this are two things. First, there is a steadily increasing awareness that when you write something to disk (either HD or SSD), you are not absolutely guaranteed to either get it back intact or get an error. Oh, the disk drive and everything involved will try hard, but there are a lot of things that can go wrong, especially over long periods of time. Second, the rate at which these problems happen has not really been going down over time. Instead the number of problems we run into has actually been going up, because the most common models are based on a chance of error per so much data and the amount of data we store and use has kept going up and up.
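(To put rough numbers on the 'chance of error per so much data' model, here is a small Python sketch. The error rate and data sizes in it are made-up figures for illustration, not measurements of any real drive, and the function names are my own invention. It assumes each bit fails independently at a fixed rate, which is a simplification.)

```python
import math


def expected_errors(bytes_read: float, errors_per_bit: float) -> float:
    """Expected number of bad bits when reading bytes_read bytes,
    assuming a fixed, independent chance of error per bit."""
    return bytes_read * 8 * errors_per_bit


def prob_at_least_one_error(bytes_read: float, errors_per_bit: float) -> float:
    """Probability of hitting one or more bad bits under the same model.

    This is 1 - (1 - p)^n, computed via log1p/expm1 so that a very
    small p is not lost to floating point rounding."""
    n_bits = bytes_read * 8
    return -math.expm1(n_bits * math.log1p(-errors_per_bit))


if __name__ == "__main__":
    ASSUMED_RATE = 1e-14  # hypothetical errors per bit, for illustration only
    for terabytes in (1, 10, 100):
        size = terabytes * 1e12
        print(terabytes, "TB:",
              expected_errors(size, ASSUMED_RATE),
              prob_at_least_one_error(size, ASSUMED_RATE))
```

With a fixed per-bit rate, the expected number of errors scales linearly with how much data you read, which is the whole point: the rate can stay the same while the problems you actually see go up.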

The pragmatic result is that an increasing number of people are starting to worry about quiet data loss, feel that the possibility of it goes up over time, and want to have some way to deal with it and fix things. It doesn't help that we're collectively storing more and more important things on disks (hopefully with backups, yes yes) instead of other media.

The dominant way of meeting this need right now is checksums of everything on disk, combined with filesystems that are aware of what's really happening in volume management. The former creates resilience (at least you can notice that something has gone wrong) and the latter aids recovery from it (since disk redundancy is one source of intact copies of the corrupted data, and a good idea anyways since whole disks can die).
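(As a minimal sketch of how checksums and redundancy work together, here is a toy Python model. It is not how any real filesystem is implemented. The `MirroredStore` class, its in-memory 'disks', and the choice of SHA-256 are all assumptions made for illustration. The checksum lets a read notice that a copy is bad, and knowing about the mirror lets it find an intact copy and rewrite the bad one.)

```python
import hashlib


class MirroredStore:
    """Toy mirror that keeps a checksum for every block it stores."""

    def __init__(self, copies: int = 2):
        # Each "disk" is just a dict mapping block number -> bytes.
        self.disks = [dict() for _ in range(copies)]
        # Checksums are kept apart from the data they cover.
        self.checksums = {}

    def write(self, blockno: int, data: bytes) -> None:
        self.checksums[blockno] = hashlib.sha256(data).digest()
        for disk in self.disks:
            disk[blockno] = data

    def read(self, blockno: int) -> bytes:
        want = self.checksums[blockno]
        good = None
        bad = []
        # Verify every copy so we know which ones need repair.
        for i, disk in enumerate(self.disks):
            data = disk.get(blockno)
            if data is not None and hashlib.sha256(data).digest() == want:
                if good is None:
                    good = data
            else:
                bad.append(i)
        if good is None:
            # Resilience without recovery: we noticed, but cannot fix it.
            raise OSError(f"block {blockno}: no copy matches its checksum")
        # Recovery: overwrite damaged copies with the verified one.
        for i in bad:
            self.disks[i][blockno] = good
        return good
```

A plain mirror without checksums can tell that two copies differ but not which one is right. The checksum is what breaks the tie.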

(In this entry I'm talking only about local filesystems. There is a whole different evolutionary process going on in multi-node filesystems and multi-node object stores (which may or may not have a more or less POSIX filesystem layer on top of them). And I'm not even going to think about various sorts of distributed databases that hold increasingly large amounts of data for large operations.)

PS: Part of my bias here is that resilience is what I've come to personally care about. One reason for this is that other filesystem attributes are pragmatically good enough, with no obvious inefficiencies to fix or marvelous improvements to be had (except for performance through SSDs). Another reason is that storage is now big enough and cheap enough that it's perfectly reasonable to store extra data (sometimes a lot of extra data, eg disk mirrors) to help ensure that you can get your files back later.

Comments on this page:

By Colin at 2015-01-05 02:59:08:

Hmm, I generally think of a more advanced filesystem as one that will work with speed issues (ok mainly hdd) and try to minimise excessive SSD writes.

"since disk redundancy is one source of intact copies of the corrupted data, and a good idea anyways since whole disks can die"

--- AHhh. Disk redundancy is a little more tricky than this. Think about it: you have bitwise corruption, you encode to both disks, then read off the disks. If the data is corrupt at that point you may be reading corrupt data. It's why a lot of people HATE RAID 5 (and some controllers will do checksums to confirm that the data was written to the disk correctly). If you have 1x corrupt disk, then incorrect checksums will be created over time and you are likely to lose the array.

Also there is the old "OS hands data to storage medium, storage medium does its best effort but raises error" problem.

By Marc Gerges at 2015-01-08 04:29:05:

Reading your praise, I started reading up on ZFS. LVM on RAID is running here, in a little home media server setup, so the interest may be different from a datacenter setup.

I very much like how you can throw around filesystems, that'd be a huge advantage compared to my current setup. The idea of having the pool space available for all datasets, and that one can snapshot/clone/move stuff around easily for backups is very sweet.

What I am missing is the same kind of flexibility on the backend of it - why can I not just throw HDDs and SSDs at it, tell it that over the entire pool (or individual datasets) I want a given redundancy, and let it take care of how to set it up. The idea that I have to manage that by myself seems not much of an improvement compared to LVM on RAID...

Written on 04 January 2015.

Last modified: Sun Jan 4 02:35:15 2015