Resilvering multiple disks at once in a ZFS pool adds no real extra overhead

June 12, 2017

Suppose, not entirely hypothetically, that you have a multi-disk, multi-vdev ZFS pool (in our case using mirrored vdevs) and you need to migrate this pool from one set of disks to another set of disks. If you care about doing relatively little damage to pool performance for the duration of the resilvering, are you better off replacing one disk at a time (with repeated resilvers of the pool), or doing a whole bunch of disk replacements at once in one big resilver?

As far as we can tell from both the Illumos ZFS source code and our experience, the answer is that replacing multiple disks at once in a single pool is basically free (apart from the extra IO implied by writing to multiple new disks at once). In particular, a ZFS resilver on mirrored vdevs appears to always read all of the metadata and data in the pool, regardless of how many replacement drives there are and where they are. This means that replacing (or resilvering) multiple drives at once doesn't add any extra read IO; you do the same amount of reads whether you're replacing one drive or ten.

(This is unlike conventional RAID10, where replacing multiple drives at once will probably add additional read IO, which will affect array performance.)

For metadata, this is not particularly surprising. Since metadata is stored all across the pool, metadata located in one vdev can easily point to things located on another one. Given this, you have to read all metadata in the pool in order to even find out what is on a disk that's being resilvered. In theory I think that ZFS could optimize handling leaf data to skip stuff that's known to be entirely on unaffected vdevs; in practice, I can't find any sign in the code that it attempts this optimization for resilvers, and there are rational reasons to skip it anyway.

(As things are now, after a successful resilver you know that the pool has no permanent damage anywhere. If ZFS optimized resilvers by skipping reading and thus checking data on theoretically unaffected vdevs, you wouldn't have this assurance; you'd have to resilver and then scrub to know for sure. Resilvers are somewhat different from scrubs, but they're close.)

This doesn't mean that replacing multiple disks at once won't have any impact on your overall system, because your system may have overall IO capacity limits that are affected by adding more writes. For example, our ZFS fileservers each have a total write bandwidth of 200 Mbytes/sec across all their ZFS pool disks (since we have two 1G iSCSI networks). At least in theory we could saturate this total limit with resilver write traffic alone, and certainly enough resilver write traffic might delay normal user write traffic (especially at times of high write volume). Of course this is what ZFS scrub and resilver tunables are about, so maybe you want to keep an eye on that.

(This also ignores any potential multi-tenancy issues, which definitely affect us at least some of the time.)

ZFS does optimize resilvering disks that were only temporarily offline, using what ZFS calls a dirty time log. The DTL can be used to optimize walking the ZFS metadata tree in much the same way as how ZFS bookmarks work.

Sidebar: How ZFS resilvers (seem to) repair data, especially in RAIDZ

If you look at the resilvering related code in vdev_mirror.c and especially vdev_raidz.c, how things actually get repaired seems pretty mysterious. Especially in the RAIDZ case, ZFS appears to just go off and issue a bunch of ZFS writes to everything without paying any attention to new disks versus old, existing disks. It turns out that the important magic seems to be in zio_vdev_io_start in zio.c, where ZFS quietly discards resilver write IOs to disks unless the target disk is known to require it. The best detailed explanation is in the code's comments:

 * If this is a repair I/O, and there's no self-healing involved --
 * that is, we're just resilvering what we expect to resilver --
 * then don't do the I/O unless zio's txg is actually in vd's DTL.
 * This prevents spurious resilvering with nested replication.
 * For example, given a mirror of mirrors, (A+B)+(C+D), if only
 * A is out of date, we'll read from C+D, then use the data to
 * resilver A+B -- but we don't actually want to resilver B, just A.
 * The top-level mirror has no way to know this, so instead we just
 * discard unnecessary repairs as we work our way down the vdev tree.
 * The same logic applies to any form of nested replication:
 * ditto + mirror, RAID-Z + replacing, etc.  This covers them all.

It appears that preparing and queueing all of these IOs that will then be discarded does involve some amount of CPU, memory allocation for internal data structures, and so on. Presumably this is not a performance issue in practice, especially if you assume that resilvers are uncommon in the first place.

(And certainly the code is better structured this way, especially in the RAIDZ case.)

Written on 12 June 2017.
« Why filing away mailing lists for a while has improved my life
The difference between ZFS scrubs and resilvers »

Page tools: View Source, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Mon Jun 12 22:29:35 2017
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.