ZFS scrub rates, speeds, and how fast is fast

September 11, 2015

Here is a deceptively simple question: how do you know if your ZFS pool is scrubbing fast (or slow)? In fact, what does the speed of a scrub even mean?

The speed of a scrub is reported in the OmniOS 'zpool status' output as:

  scan: scrub in progress since Thu Sep 10 22:48:31 2015
    3.30G scanned out of 60.1G at 33.8M/s, 0h28m to go
    0 repaired, 5.49% done

This is reporting the scrub's progress through what 'zpool list' reports as ALLOC space. For mirrored vdevs, this is the amount of space used before mirroring overhead; for raidz vdevs, this is the total amount of disk space used including the parity blocks. The reported rate is the total cumulative rate, that is, it is simply the amount scanned divided by the time the scrub has taken so far. If you want the current scan rate, you need to look at the difference in the amount scanned between two 'zpool status' commands over some time interval (10 seconds makes for easy math, if the pool is scanning fast enough to change the 'scanned' figure).
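As a rough illustration of the 'difference between two samples' approach, here is a minimal Python sketch. The function names are made up for this example, it assumes the 'scanned out of' line looks like the one shown above, and it assumes the size suffixes are powers of 1024. Because 'zpool status' only prints about three significant digits, the result is coarse over short intervals.

```python
import re

# Size suffixes as printed on the 'scanned' line (assumed to be powers of 1024).
UNITS = {"B": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4, "P": 1024**5}

# Matches e.g. "3.30G scanned out of 60.1G at 33.8M/s, 0h28m to go"
SCANNED_RE = re.compile(r"([\d.]+)([BKMGTP])\s+scanned out of")


def scanned_bytes(status_text):
    """Return the 'scanned' figure from 'zpool status' output, in bytes."""
    match = SCANNED_RE.search(status_text)
    if match is None:
        raise ValueError("no 'scanned out of' line found")
    number, unit = match.groups()
    return float(number) * UNITS[unit]


def current_rate(first_status, second_status, interval_seconds):
    """Bytes per second scanned between two samples taken interval_seconds apart."""
    delta = scanned_bytes(second_status) - scanned_bytes(first_status)
    return delta / interval_seconds
```

You would feed it the text of two 'zpool status' runs taken, say, 10 seconds apart and divide the result by 1024*1024 to get MB/s.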

As a result, the scan rate means different things and has different maximum speeds on mirrored vdevs and on raidz vdevs. On mirrored vdevs, the scan speed is the logical scan speed; in the best case of entirely sequential IO it will top out at the sequential read speed of a single drive. The extra IO to read from all of the mirrors at once is handled below this level, so if you watch a mirrored vdev that is scrubbing at X MB/sec you'll see that all N of the drives are each reading at more or less X MB/sec. On raidz vdevs, the scan speed is the total physical scan speed of all the vdev's drives added together. If the vdev has N drives each of which can read at X MB/sec, the best case is a scan rate of N*X. If you watch a raidz vdev that is scrubbing at X MB/sec, each drive should be doing roughly X/N MB/sec of reads (at least for full-width raidz stripes).
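To make the mirror versus raidz arithmetic concrete, here is a small Python sketch for a single vdev. The function names and the example numbers are hypothetical; it assumes the scrub is the only IO on the vdev and, for raidz, full-width stripes.

```python
def per_disk_read_rate(reported_rate, vdev_type, disks_in_vdev):
    """Approximate read rate on each disk of one vdev, given the reported scrub rate."""
    if vdev_type == "mirror":
        # Every side of the mirror reads the same data at the logical rate.
        return reported_rate
    if vdev_type == "raidz":
        # The reported rate is the sum over all disks, so each disk sees a share.
        return reported_rate / disks_in_vdev
    raise ValueError("unknown vdev type: %r" % (vdev_type,))


def best_case_reported_rate(disk_rate, vdev_type, disks_in_vdev):
    """Highest scrub rate one vdev can report if each disk reads at disk_rate."""
    if vdev_type == "mirror":
        return disk_rate
    if vdev_type == "raidz":
        return disk_rate * disks_in_vdev
    raise ValueError("unknown vdev type: %r" % (vdev_type,))


# Hypothetical numbers, in MB/sec.
print(per_disk_read_rate(100, "mirror", 2))      # each mirror side reads ~100
print(per_disk_read_rate(300, "raidz", 6))       # each raidz disk reads ~50
print(best_case_reported_rate(150, "raidz", 6))  # six 150 MB/sec disks
```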

(All of this assumes that the scrub is the only thing going on in the pool. Any other IO adds to the read rates you'll see on the disks themselves. An additional complication is that scrubs normally attempt to prefetch things like the data blocks for directories; this IO is not accounted for in the scrub rate but it will be visible if you're watching the raw disks.)

In a multi-vdev pool, it's possible (but not certain) for a scrub to be reading from multiple vdevs at once. If it is, the reported scrub rate will be the sum of the (reported) rates that the scrub can achieve on each vdev. I'm not going to try to hold forth on the conditions when this is likely, because it depends on a lot of things as far as I can tell from the kernel code. I think it's more likely when you have single objects (files, directories, etc) whose blocks are spread across multiple vdevs.

If your IO system has total bandwidth limits across all disks, this will clamp your maximum scrub speed. For raidz vdevs, the visible scrub rate will be this total bandwidth limit; for mirror vdevs, it will be the limit divided by the number of disks in each mirror. For example, we have a 200 MByte/sec total read bandwidth limit (since our fileservers have two 1 Gbit iSCSI links) and we use two-way mirrored vdevs, so our maximum scrub rate is always going to be around 100 MBytes/sec.

This finally gives us an answer to how you know if your scrub is fast or slow. The fastest rate a raidz scrub can report is your total disk bandwidth across all disks and the fastest rate a mirror scrub can report is your single disk bandwidth times the number of vdevs. The closer you are to this (or to what you know is your system's overall disk bandwidth limit), the better. The further away you are, the worse off you are, either because your scrub has descended into random IO or because you're hitting tunable limits (or both at once for extra fun).
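Putting the ceilings and the bandwidth clamp together, a back-of-the-envelope check might look like the Python sketch below. The function name, the per-disk speed, and the vdev count are all hypothetical; only the 200 MB/sec limit with two-way mirrors comes from the example above. It also optimistically assumes the scrub manages to keep every vdev busy at once, which is not guaranteed.

```python
def scrub_ceiling(vdev_type, disk_rate, num_vdevs, disks_per_vdev,
                  bandwidth_limit=None):
    """Best-case reported scrub rate for a pool of identical vdevs.

    All rates are in the same unit (for example MB/sec). bandwidth_limit is
    the total read bandwidth across all disks, or None if there is no cap.
    """
    if vdev_type == "raidz":
        # Reported rate counts physical reads on every disk.
        ceiling = disk_rate * disks_per_vdev * num_vdevs
        cap = bandwidth_limit
    elif vdev_type == "mirror":
        # Reported rate is logical; each mirror side costs physical bandwidth.
        ceiling = disk_rate * num_vdevs
        cap = None if bandwidth_limit is None else bandwidth_limit / disks_per_vdev
    else:
        raise ValueError("unknown vdev type: %r" % (vdev_type,))
    return ceiling if cap is None else min(ceiling, cap)


# Two-way mirrors behind a 200 MB/sec limit; 150 MB/sec disks and
# four vdevs are made-up figures for illustration.
ceiling = scrub_ceiling("mirror", 150, 4, 2, bandwidth_limit=200)
observed = 33.8  # a reported rate, in MB/sec
print("ceiling %.0f MB/sec, running at %.0f%% of it"
      % (ceiling, 100 * observed / ceiling))
```

A scrub that sits well below the ceiling is the one worth investigating for random IO or tunable limits.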

(Much of this also applies to resilvers because scrubs and resilvers share most of the same code, but it gets kind of complicated and I haven't attempted to decode the resilver specific part of the kernel ZFS code.)

Sidebar: How scrubs issue IO (for more complexity)

Scrubs have two sorts of IO they do. For ZFS objects like directories and dnodes, the scrub actually needs to inspect the contents of the disk blocks so it tries to prefetch them and then (synchronously) reads the data through the regular ARC read paths. This IO is normal IO, does not get counted in the scrub progress report, and does not do things like check parity blocks or all mirrored copies. Then for all objects (including directories, dnodes, etc) the scrub issues a special scrub 'read everything asynchronously' read that does check parity, read all mirrors, and so on. It is this read that is counted in the 'amount scanned' stats and can be limited by various tunable parameters. Since this read is being done purely for its side effects, the scrub never waits for it and will issue as many as it can (up to various limits).

If a scrub is not running into any limits on how many of these scrub reads it can do, its ability to issue a flood of them is limited only by whether it has to wait for some disk IO in order to process another directory or dnode or whatever.

Written on 11 September 2015.

Last modified: Fri Sep 11 00:20:27 2015