Wandering Thoughts archives

2017-06-14

The difference between ZFS scrubs and resilvers

At one level, asking what the difference is between scrubs and resilvers sounds silly; resilvering is replacing disks, while scrubbing is checking disks. But if you look at the ZFS code in the kernel, things become much less clear, because scrubs and resilvers share a common set of code to do all their work and it's not at all easy to tell what actually happens differently between them. Since I dug through this section of the ZFS code just the other day, I want to write down what the differences are while I still remember them.

Both scrubs and resilvers traverse all of the metadata in the pool (in a non-sequential order), and both wind up reading all data. However, scrubs do this more thoroughly for mirrored vdevs: they read all sides of a mirror, while resilvers only read one copy of the data (well, one intact copy). On raidz vdevs there is no difference here, as both scrubs and resilvers read both the data blocks and the parity blocks. This implies that a scrub on mirrored vdevs does more IO and (importantly) more checking than a resilver does. After a resilver of mirrored vdevs, you know that you have at least one intact copy of every piece of the pool, while after an error-free scrub of mirrored vdevs, you know that all ZFS metadata and data on all disks is fully intact.
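
To make the read-side difference concrete, here's a much-simplified sketch in the style of the mirror vdev code. This is not the actual vdev_mirror.c logic, just my reading of what it amounts to, and the helper functions are invented for illustration:

/*
 * Illustrative sketch only: a scrub reads and checksums every side of a
 * mirror, while a resilver is satisfied by any one intact copy.  The
 * helpers issue_child_read() and pick_intact_child() are made up; the
 * real decision lives in the mirror vdev's read path.
 */
static void
mirror_read_sketch(zio_t *zio, int children)
{
        if (zio->io_flags & ZIO_FLAG_SCRUB) {
                /* Scrub: read every copy so each one gets verified. */
                for (int c = 0; c < children; c++)
                        issue_child_read(zio, c);
        } else {
                /* Resilver (or a normal read): one intact copy is enough. */
                issue_child_read(zio, pick_intact_child(zio));
        }
}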

For resilvers but not scrubs (at least normally), ZFS will sort of attempt to write everything back to the disks again, as I covered in the sidebar of the last entry. As far as I can tell from the code, ZFS always skips even trying to write things back to disks that are either known to have good data (for example, because they were read from) or that are believed to be good because their DTL (dirty time log) says that the data is clean on them (I believe that this case only really applies to mirrored vdevs). For scrubs, the only time ZFS submits writes to the disks is when it detects actual errors.
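
Put another way, my simplified summary of the write side comes down to roughly the following; this is an illustration rather than the real code paths, with the DTL check reduced to a single boolean:

/*
 * Illustrative summary only: a scrub writes back just what it found to
 * be actually damaged, while a resilver writes back everything except
 * to disks whose DTL says they're already clean.  The real filtering
 * happens in zio_vdev_io_start() (see the sidebar of the last entry).
 */
static boolean_t
should_write_back(boolean_t scrubbing, boolean_t dtl_says_dirty,
    boolean_t found_error)
{
        if (scrubbing)
                return (found_error);   /* scrubs only repair real errors */
        /* resilvers: only disks whose DTL marks them as needing the data */
        return (dtl_says_dirty);
}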

(Although I don't entirely understand how you get into this situation, it appears that a scrub of a pool with a 'replacing' disk behaves a lot like a resilver as far as that disk is concerned. As you would expect and want, the scrub doesn't try to read from the new disk and, like resilvers, it tries to write everything back to the disk.)

Since we only have mirrored vdevs on our fileservers, what really matters to us here is the difference in what gets read between scrubs and resilvers. On the one hand, resilvers put less read load on the disks, which is good for reducing their impact. On the other hand, resilvering isn't quite as thorough a check of the pool's total health as a scrub is.

PS: I'm not sure if either a scrub or a resilver reads the ZIL. Based on some code in zil.c, I suspect that it's checked only during scrubs, which would make sense. Alternatively, zil.c may be reusing ZIO_FLAG_SCRUB for non-scrub IO for some of its side effects, but that would be a bit weird.

ZFSResilversVsScrubs written at 00:44:26

2017-06-12

Resilvering multiple disks at once in a ZFS pool adds no real extra overhead

Suppose, not entirely hypothetically, that you have a multi-disk, multi-vdev ZFS pool (in our case using mirrored vdevs) and you need to migrate this pool from one set of disks to another set of disks. If you care about doing relatively little damage to pool performance for the duration of the resilvering, are you better off replacing one disk at a time (with repeated resilvers of the pool), or doing a whole bunch of disk replacements at once in one big resilver?

As far as we can tell from both the Illumos ZFS source code and our experience, the answer is that replacing multiple disks at once in a single pool is basically free (apart from the extra IO implied by writing to multiple new disks at once). In particular, a ZFS resilver on mirrored vdevs appears to always read all of the metadata and data in the pool, regardless of how many replacement drives there are and where they are. This means that replacing (or resilvering) multiple drives at once doesn't add any extra read IO; you do the same amount of reads whether you're replacing one drive or ten.

(This is unlike conventional RAID10, where replacing multiple drives at once will probably add additional read IO, which will affect array performance.)
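
As a crude way of seeing why the extra disks are (nearly) free, here's a back-of-the-envelope model with entirely made-up numbers; it's just arithmetic, not anything from the ZFS code:

#include <stdio.h>

/*
 * Back-of-the-envelope model: on mirrored vdevs a resilver reads
 * (roughly) all allocated data in the pool no matter how many disks are
 * being replaced, while the writes scale with the number of new disks.
 * All of the numbers here are invented for illustration.
 */
int
main(void)
{
        double pool_data_gb = 2000.0;   /* allocated data the scan walks */
        double per_new_disk_gb = 500.0; /* data that lands on each new disk */

        for (int new_disks = 1; new_disks <= 4; new_disks++)
                printf("replacing %d disk(s): ~%.0f GB read, ~%.0f GB written\n",
                    new_disks, pool_data_gb, per_new_disk_gb * new_disks);
        return (0);
}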

For metadata, this is not particularly surprising. Since metadata is stored all across the pool, metadata located in one vdev can easily point to things located on another one. Given this, you have to read all metadata in the pool in order to even find out what is on a disk that's being resilvered. In theory I think that ZFS could optimize handling leaf data to skip stuff that's known to be entirely on unaffected vdevs; in practice, I can't find any sign in the code that it attempts this optimization for resilvers, and there are rational reasons to skip it anyway.

(As things are now, after a successful resilver you know that the pool has no permanent damage anywhere. If ZFS optimized resilvers by skipping reading and thus checking data on theoretically unaffected vdevs, you wouldn't have this assurance; you'd have to resilver and then scrub to know for sure. Resilvers are somewhat different from scrubs, but they're close.)
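
For concreteness, the optimization that ZFS appears to skip would amount to something like the following per-block check; this is invented for illustration and is not something I can find in the ZFS code:

/*
 * Hypothetical leaf-data optimization (not in ZFS as far as I can see):
 * skip a block entirely if none of its copies live on the vdev that is
 * being resilvered.
 */
static boolean_t
block_touches_resilvering_vdev(const blkptr_t *bp, uint64_t resilver_vdev)
{
        for (int d = 0; d < BP_GET_NDVAS(bp); d++) {
                if (DVA_GET_VDEV(&bp->blk_dva[d]) == resilver_vdev)
                        return (B_TRUE);
        }
        return (B_FALSE);
}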

This doesn't mean that replacing multiple disks at once won't have any impact on your overall system, because your system may have overall IO capacity limits that are affected by adding more writes. For example, our ZFS fileservers each have a total write bandwidth of 200 Mbytes/sec across all their ZFS pool disks (since we have two 1G iSCSI networks). At least in theory we could saturate this total limit with resilver write traffic alone, and certainly enough resilver write traffic might delay normal user write traffic (especially at times of high write volume). Of course, this is what the ZFS scrub and resilver tunables are for, so you may want to keep an eye on them.
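
(To put hypothetical numbers on this: if each new disk will happily accept, say, 100 Mbytes/sec of resilver writes, then replacing just two disks at once could in theory soak up our entire 200 Mbytes/sec of write bandwidth all by itself.)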

(This also ignores any potential multi-tenancy issues, which definitely affect us at least some of the time.)

ZFS does optimize resilvering disks that were only temporarily offline, using what ZFS calls a dirty time log. The DTL can be used to optimize walking the ZFS metadata tree in much the same way that ZFS bookmarks work.
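
My mental model of this (again a simplified illustration, not the actual scan code) is that the DTL boils down to a range of 'dirty' transaction groups for the temporarily-offline disk, and the metadata walk can then skip any block born outside that range:

/*
 * Simplified illustration: dirty_min_txg and dirty_max_txg stand in for
 * the txg range derived from the vdev's DTL.  A block written outside
 * the window when the disk was absent can't be missing from it, so the
 * traversal can skip it.  This is not the real dsl_scan code.
 */
static boolean_t
block_needs_resilver(const blkptr_t *bp, uint64_t dirty_min_txg,
    uint64_t dirty_max_txg)
{
        uint64_t birth = BP_PHYSICAL_BIRTH(bp);

        return (birth >= dirty_min_txg && birth <= dirty_max_txg);
}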

Sidebar: How ZFS resilvers (seem to) repair data, especially in RAIDZ

If you look at the resilvering-related code in vdev_mirror.c and especially vdev_raidz.c, how things actually get repaired seems pretty mysterious. Especially in the RAIDZ case, ZFS appears to just go off and issue a bunch of writes to everything without paying any attention to new disks versus old, existing disks. It turns out that the important magic seems to be in zio_vdev_io_start in zio.c, where ZFS quietly discards resilver write IOs unless the target disk is known to need them. The best detailed explanation is in the code's comments:

/*
 * If this is a repair I/O, and there's no self-healing involved --
 * that is, we're just resilvering what we expect to resilver --
 * then don't do the I/O unless zio's txg is actually in vd's DTL.
 * This prevents spurious resilvering with nested replication.
 * For example, given a mirror of mirrors, (A+B)+(C+D), if only
 * A is out of date, we'll read from C+D, then use the data to
 * resilver A+B -- but we don't actually want to resilver B, just A.
 * The top-level mirror has no way to know this, so instead we just
 * discard unnecessary repairs as we work our way down the vdev tree.
 * The same logic applies to any form of nested replication:
 * ditto + mirror, RAID-Z + replacing, etc.  This covers them all.
 */
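
For reference, the check that follows this comment looks roughly like the following (this is from memory of the illumos code, so the details may differ in your ZFS version): a repair write that isn't a self-healing write simply gets bypassed unless its txg is actually in the target vdev's DTL.

if ((zio->io_flags & ZIO_FLAG_IO_REPAIR) &&
    !(zio->io_flags & ZIO_FLAG_SELF_HEAL) &&
    zio->io_txg != 0 &&	/* not a delegated i/o */
    !vdev_dtl_contains(zio->io_vd, DTL_PARTIAL, zio->io_txg, 1)) {
	/* quietly drop the unnecessary repair write */
	zio_vdev_io_bypass(zio);
	return (ZIO_PIPELINE_CONTINUE);
}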

It appears that preparing and queueing all of these IOs that will then be discarded does involve some amount of CPU, memory allocation for internal data structures, and so on. Presumably this is not a performance issue in practice, especially if you assume that resilvers are uncommon in the first place.

(And certainly the code is better structured this way, especially in the RAIDZ case.)

ZFSMultidiskResilversFree written at 22:29:35
