Sequential scrubs and resilvers are coming for (open-source) ZFS

November 25, 2017

Oracle has made a number of changes and improvements to Solaris ZFS since they took it closed source. Mostly I've been indifferent to their changes, but the one improvement I've long envied is their sequential resilvering and scrubbing, which apparently first appeared in Solaris 11.2 (per here and here). That ZFS scrubs and resilvers aren't sequential has long been a quiet pain point for a lot of people. Apparently it's especially bad for RAID-Z pools (perhaps because of the usual RAID-Z random read issue), but it's been an issue for us in the past with mirrors too (although we managed to speed that up).

Well, there's great news here for all open source ZFS implementations, including Illumos distributions, because an implementation of sequential scrubs and resilvers just landed in ZFS on Linux in this commit (apparently it'll be included in ZoL 0.8 whenever that's released). The ZFS on Linux work was done by Tom Caputi of Datto, building on work done by Saso Kiselkov of Nexenta. Saso Kiselkov's work was presented at the 2016 OpenZFS developer summit and got an OpenZFS wiki summary page; Tom Caputi presented at the 2017 summit. Both have slides (and talk videos) if you want more information on how this works.

(It appears that the Nexenta work may be 'NEX-6068', included in NexentaStor 5.0.3. I can't find a current public source tree for Nexenta, so I don't know anything more than that.)

For how it works, I'll just quote from the commit message:

This patch improves performance by splitting scrubs and resilvers into a metadata scanning phase and an IO issuing phase. The metadata scan reads through the structure of the pool and gathers an in-memory queue of I/Os, sorted by size and offset on disk. The issuing phase will then issue the scrub I/Os as sequentially as possible, greatly improving performance.
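
(As a rough illustration of why the two-phase split helps, here is a toy Python sketch. It is not the ZFS on Linux code and doesn't mirror its data structures; the function names, the fake disk size, and the 'seek distance' cost model are all assumptions made up for this example. It also sorts purely by offset, where the commit message says the real queue is sorted by size and offset.)

```python
import random


def seek_distance(offsets):
    """Total head movement if I/Os are issued in the given order (toy cost model)."""
    pos, total = 0, 0
    for off in offsets:
        total += abs(off - pos)
        pos = off
    return total


def scan_metadata(n_blocks, disk_size, seed=0):
    """Stand-in for the metadata scan: blocks are 'discovered' in logical
    order, but their on-disk offsets are scattered (hypothetical layout)."""
    rng = random.Random(seed)
    return [rng.randrange(disk_size) for _ in range(n_blocks)]


def issue_sorted(queue):
    """Stand-in for the issuing phase: issue the gathered I/Os in offset order."""
    return sorted(queue)


discovered = scan_metadata(n_blocks=10_000, disk_size=10**9)
naive = seek_distance(discovered)
sequential = seek_distance(issue_sorted(discovered))

print(f"discovery order: {naive:,} units of seek")
print(f"offset order:    {sequential:,} units of seek")
```

In this model the offset-ordered pass makes a single sweep across the disk, so its total seek distance is just the largest offset, while issuing I/Os in discovery order bounces the head back and forth for every block. Real disks and the real scan code are far more complicated, but this is the basic effect.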

My early experience with this in the current ZoL git tree has been very positive. I saw a single-vdev mirror pool on HDs with 293 GB used go from a scrub time of two hours and 25 minutes to one hour and 10 minutes, which is a bit better than twice as fast.
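
(For a back-of-the-envelope sense of what those scrub times mean, the arithmetic can be worked through in a few lines of Python. This assumes 'GB' means 10^9 bytes and simply divides space used by wall-clock time, so the rates are pool-level averages rather than what any individual disk was doing.)

```python
used_gb = 293                 # space used in the pool, from the numbers above
before_min = 2 * 60 + 25      # old scrub time: 2h25m = 145 minutes
after_min = 1 * 60 + 10       # new scrub time: 1h10m = 70 minutes


def rate_mb_per_sec(gb, minutes):
    """Average scrub rate in MB/s, assuming 1 GB = 1000 MB."""
    return gb * 1000 / (minutes * 60)


speedup = before_min / after_min
before_rate = rate_mb_per_sec(used_gb, before_min)
after_rate = rate_mb_per_sec(used_gb, after_min)

print(f"speedup: {speedup:.2f}x")
print(f"before:  {before_rate:.1f} MB/s")
print(f"after:   {after_rate:.1f} MB/s")
```

That works out to roughly a 2.07x speedup, going from somewhere around 34 MB/s to around 70 MB/s averaged over the whole scrub.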

Although this is very early days for this feature even in ZFS on Linux, I'd expect it to get pushed (or pulled) upstream later and thus go into Illumos. I have no idea when that might happen; it might be reasonable to wait until ZFS on Linux has included it in an actual release so that it sees some significant testing in the field. Or people could find this an interesting and important enough change that they actively work to bring it upstream, if only for testing there.

(At this point I haven't spotted any open issues about this in the Illumos issue tracker, but as mentioned I don't really expect that yet unless someone wants to get a head start.)

PS: Unlike Oracle's change for Solaris 11.2, which apparently needed a pool format change (Oracle version 35, according to Wikipedia), the ZFS on Linux implementation needs no new pool feature and so is fully backward compatible. I'd expect this to be true for any eventual Illumos version unless people find some hard problem that forces the addition of a new pool feature.

Comments on this page:

I believe OpenZFS has an outstanding policy of changes going in only after they've had at least a year of production testing, as an effort to try and weed out bugs.

Written on 25 November 2017.

Last modified: Sat Nov 25 00:08:24 2017