What modern filesystems need from volume management

April 18, 2014

One of the things said about modern filesystems like btrfs and ZFS is that their volume management functionality is a layering violation; this view holds that filesystems should stick to filesystem stuff and volume managers should stick to that. For the moment let's not open that can of worms and just talk about what (theoretical) modern filesystems need from an underlying volume management layer.

Arguably the crucial defining aspect of modern filesystems like ZFS and btrfs is a focus on resilience against disk problems. A modern filesystem no longer trusts disks not to have silent errors; instead it checksums everything so that it can at least detect data faults and it often tries to create some internal resilience by duplicating metadata or at least spreading it around (copy on write is also common, partly because it gives resilience a boost).

In order to make checksums useful for healing data instead of just simply detecting when it's been corrupted, a modern filesystem needs an additional operation from any underlying volume management layer. Since the filesystem can actually identify the correct block from a number of copies, it needs to be able to get all copies or variations of a set of data blocks from the underlying volume manager (and then be able to tell the volume manager which is the correct copy). In mirroring this is straightforward; in RAID 5 and RAID 6 it gets a little more complex. This 'all variants' operation will be used both during regular reads if a corrupt block is detected and during a full verification check where the filesystem will deliberately read every copy to check that they're all intact.

(I'm not sure what the right primitive operation here should be for RAID 5 and RAID 6. On RAID 5 you basically need the ability to try all possible reconstructions of a stripe in order to see which one generates the correct block checksum. Things get even more convoluted if the filesystem level block that you're checksumming spans multiple stripes.)

Modern filesystems generally also want some way of saying 'put A and B on different devices or redundancy clusters' in situations where they're dealing with stripes of things. This enables them to create multiple copies of (important) metadata on different devices for even more protection against read errors. This is not as crucial if the volume manager is already providing redundancy.

This level of volume manager support is a minimum level, as it still leaves a modern filesystem with the RAID-5+ rewrite hole and a potentially inefficient resynchronization process. But it gets you the really important stuff, namely redundancy that will actually help you against disk corruption.

Written on 18 April 2014.
« Partly getting around NFS's concurrent write problem
Cross-system NFS locking and unlocking is not necessarily fast »

Page tools: View Source, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Fri Apr 18 02:17:29 2014
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.