A bit on ZFS's coming raidz expansion and ZFS DVAs

June 20, 2021

The ZFS news of the time interval is Ars Technica's report of raidz expansion potentially being added (via). More details and information about how it works are in the links in Matthew Ahrens' pull request, which as of yet hasn't landed in the master development version. I've previously written about ZFS DVAs and their effects on growing ZFS pools, in which I said that how DVA offsets are defined was by itself a good reason as to why you couldn't expand raidz vdevs (in addition to potential inefficiency). You might wonder how Ahrens' raidz expansion interacts with ZFS DVAs here, so that it can actually work.

As a quick summary, ZFS DVAs (Data Virtual Addresses, the ZFS equivalent of a block number) contain the byte offset of where in the entire vdev your block of data is found. In mirror vdevs (and plain disks), this byte offset is from the start of each disk. In raidz vdevs, it's striped sequentially across all disks; it starts with a chunk of disk 0, goes to a chunk of disk 1, and so on. One of the implications of this is that if you just add a disk to a raidz vdev and do nothing else, all of your striped sequential byte offsets change and you can no longer read your data.

How Ahrens' expansion deals with this is that it reflows all of the data on all of the existing drives to the new, wider raidz vdev layout, moving sectors around as necessary. Some of this reflowed data will wind up on the new drive (starting with the second sector of the first drive), but most of the data will wind up in other places on the existing drives. Both the Ars Technica article and Ahrens' slides from the 2021 FreeBSD Developer Summit have diagrams of this. The slides also share the detail that this is optimized to only copy the live data. This reflowing has the vital property that it preserves all of the DVA byte offsets, since it moves all data sectors from their old locations to where they should be in the new vdev layout.

(Thus, this raidz expansion is done without the long sought and so far mythical 'block pointer rewriting' that would allow general ZFS reshaping, including removing vdevs without the current layer of indirection.)

This copying is performed sector by sector and is blind to ZFS block boundaries. This means that raidz expansion doesn't verify checksums during the process because it doesn't know where they are. Since this expansion writes over the old data locations on your existing drives, I would definitely want to scrub your pool beforehand and have backups (to the extent that it's possible), just in case you hit previously latent disk errors during the expansion. And of course you should scrub the pool immediately after the expansion finishes.

As Ahrens' covers in the slides, this reflowing also doesn't expand the old blocks to be the full new width of the raidz vdev. As a result, they (still) have a higher parity overhead than newly written blocks would. To eliminate this overhead you need to explicitly force ZFS to rewrite all of the data in some way (and obviously this is impossible if you have snapshots that you can't delete and recreate).

Written on 20 June 2021.
« The Unix background of Linux's 'file-max' and nr_open kernel limits on file descriptors
Some notes on building Firefox from source on Ubuntu »

Page tools: View Source, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Sun Jun 20 00:59:49 2021
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.