One thing I now really want in ZFS: faster scrubs (and resilvers)

August 7, 2015

One of the little problems of ZFS is that scrubs and resilvers are random IO. I've always known this, but for a long time it hasn't really been important to us; things ran fast enough. As you might guess from me writing an entry about it, this situation is now changing.

We do periodic scrubs of each of our pools for safety, as everyone should; this has historically been reasonably important. Because of their impact on user IO, we only want to do them on weekends (and we only do one at a time on each fileserver). For a long time this was fine, but recently a few of our largest pools have started taking longer and longer to scrub. There are now a couple of pools that basically take the entire weekend to scrub, and that's if we're lucky. In fact the latest scrub of one of these pools took three and a half days (and was only completed because Monday was a holiday and then no one complained on Tuesday).
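(As an illustration of the "one scrub at a time" policy, here is a minimal sketch in Python. It is not the tooling actually used here; the function names and the polling interval are made up for the example. It assumes that `zpool status` reports a running scrub with the phrase "scrub in progress", and it polls because `zpool scrub` returns as soon as the scrub has started.)

```python
import subprocess
import time


def scrub_in_progress(status_text):
    """Return True if `zpool status` output reports a running scrub.

    Assumption: a running scrub shows up as "scrub in progress" on the
    scan line of the status output.
    """
    return "scrub in progress" in status_text


def scrub_serially(pools, poll_seconds=300):
    """Scrub each pool in turn, starting the next only when one finishes.

    Hypothetical helper: `pools` is a list of pool names and
    `poll_seconds` is an arbitrary polling interval.
    """
    for pool in pools:
        # `zpool scrub` only starts the scrub; it does not wait for it.
        subprocess.run(["zpool", "scrub", pool], check=True)
        while True:
            status = subprocess.run(
                ["zpool", "status", pool],
                check=True,
                capture_output=True,
                text=True,
            ).stdout
            if not scrub_in_progress(status):
                break
            time.sleep(poll_seconds)
```

Something like `scrub_serially(["pool1", "pool2"])` (placeholder pool names) started at the beginning of the weekend would work through the pools one after another. With scrubs as slow as ours have become, the later pools simply might not get their turn before Monday.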

This pool is not hulkingly huge; it's 2.91 TB of allocated space spread across eight pairs of drives. But at that scrub rate it seems pretty clear that we're being killed by random IO; the aggregate scrub data rate was down in the range of a puny 10 Mbytes/sec. Yes, the pool did see activity over the weekend, but not that much activity.

(This pool seems to be an outlier in terms of scrub time. Another pool on the same fileserver with 2.53 TB used across seven pairs of drives took only 27 hours to be scrubbed during its last check.)
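(The arithmetic behind these rates is simple enough to sketch. This assumes that the "TB" figures are the binary units that `zpool` reports, that three and a half days is exactly 84 hours, and that the scrub reads the allocated space once; the function name is made up for the example.)

```python
def scrub_rate_mib_per_sec(allocated_tib, hours):
    """Average scrub rate in MiB/sec for a given allocated size and duration.

    Assumption: allocated_tib is in binary terabytes (TiB).
    """
    mib = allocated_tib * 1024 * 1024
    return mib / (hours * 3600)


# The slow pool: 2.91 TB across eight pairs, three and a half days.
slow = scrub_rate_mib_per_sec(2.91, 3.5 * 24)
# The other pool: 2.53 TB across seven pairs, 27 hours.
fast = scrub_rate_mib_per_sec(2.53, 27)

print(f"slow pool: {slow:.1f} MiB/s total, {slow / 8:.1f} MiB/s per pair")
print(f"other pool: {fast:.1f} MiB/s total, {fast / 7:.1f} MiB/s per pair")
# slow pool: 10.1 MiB/s total, 1.3 MiB/s per pair
# other pool: 27.3 MiB/s total, 3.9 MiB/s per pair
```

So the slow pool really did average about 10 Mbytes/sec, and even the "fast" pool managed under 30 Mbytes/sec, which is still far below what these drives could do with sequential reads.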

One of the ZFS improvements that came with Solaris 11 is sequential resilvering, which apparently significantly speeds up resilvering. It's not clear to me if this also speeds up scrubbing, but I'd optimistically hope so. Of course this is only in Solaris 11; I don't think anyone in the Illumos community is currently working on this, and I imagine it's a non-trivial change that would take a decent amount of development effort. Still, I can hope. Faster scrubs are not yet a killer feature for us (we have a few tricks left up our sleeves), but they would be a big improvement for us.

(Faster resilvers by themselves would also be useful, but we fortunately do far fewer resilvers than we do scrubs.)

Written on 07 August 2015.