A ZFS resilver can be almost as good as a scrub, but not quite

April 7, 2019

We do periodic scrubs of our pools, roughly every four weeks on a revolving schedule (we scrub only one pool per fileserver at a time, and only over the weekend, so we can't cover all of the pools on one of our HD-based fileservers in a single weekend). However, this weekend scrubbing doesn't happen if something more important is going on with the fileserver. Normally there isn't, but one of our iSCSI backends didn't come back up after our power outage this Thursday night. We have spare backends, so we added one to the affected fileserver and started the process of resilvering everything onto the new backend's disks to restore redundancy to all of our mirrored vdevs.
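At the ZFS level, restoring redundancy like this amounts to a series of device replacements, one per failed disk. A minimal sketch (the pool and device names here are hypothetical, not our actual ones):

```shell
# Replace each disk from the dead backend with one from the spare
# backend; ZFS starts resilvering onto the new device immediately.
zpool replace fs1-pool old-backend-disk0 new-backend-disk0
zpool replace fs1-pool old-backend-disk1 new-backend-disk1

# Watch the resilver's progress.
zpool status fs1-pool
```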

I've written before about the difference between scrubs and resilvers, which is that a resilver potentially reads and validates less than a scrub does. However, we have only two-way mirrors and we lost one side of all of them in the backend failure, so resilvering every mirror had to read all of the metadata and data on every remaining device in every pool. At first I thought this was fully equivalent to a scrub, and thus that we had effectively scrubbed all of the pools on that fileserver, putting us ahead of our scrub schedule instead of behind it. Then I realized that it isn't, because resilvering doesn't verify that the data newly written to the new devices is good.

ZFS doesn't have any explicit 'read after write' checks, although it will naturally do some amount of reads from your new devices just as part of balancing reads across the mirror. So although you know that everything on your old disks is good, you can't have full confidence that your new disks hold correct copies of everything. If something was corrupted on the way to the disk, or the disk has a bad spot that its electronics didn't catch, you won't know until the data is read back, and the only way to force that is with an explicit scrub.
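Forcing that read-back is just the usual scrub invocation (pool name hypothetical):

```shell
# Force a full read-back verification of every copy of all data and
# metadata, including what was just written to the new devices.
zpool scrub fs1-pool

# Check progress and see whether any checksum errors turned up.
zpool status -v fs1-pool
```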

For our purposes this is still reasonably good. We've at least checked half of every pool, so right now we definitely have one good copy of all of our data. But it's not quite the same as scrubbing the pools and we definitely don't want to reset all of the 'last scrubbed at time X' markers for the pools to right now.

(If you have three-way or four-way mirrors, as we have had in the past, a resilver doesn't even give you this, because it only needs to read each piece of data or metadata from one of your remaining N copies.)

Comments on this page:

By Michael at 2019-04-07 06:25:46:

One way around the lack of read-after-write verification that you're discussing is to have ZED initiate a scrub after a resilver finishes. I've done this, and it's worked well the few times it's been triggered, but of course it means that a scrub can start at a potentially inopportune time. If the media is marginal, in principle this can also lead to a resilver-scrub/resilver-scrub/... event chain, though in that case I'd want to offline and certainly replace the offending drive anyway. For me, the automation to keep the data known good (or at least known bad, as opposed to unknown bad) is a reasonable trade-off, but even scrubbing my large HD-based pool finishes in 18-24 hours depending on system I/O load. Once a scrub takes a few days or more, that certainly changes things.
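On ZFS on Linux, my understanding is that sufficiently recent versions ship this behaviour as a stock zedlet, enabled through a setting in zed.rc (the path may vary by distribution); a sketch of the configuration:

```shell
# /etc/zfs/zed.d/zed.rc
# Ask ZED to kick off a scrub whenever a resilver completes (handled
# by the stock resilver_finish-start-scrub.sh zedlet in recent
# ZFS on Linux / OpenZFS releases).
ZED_SCRUB_AFTER_RESILVER=1
```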
