2015-09-11
ZFS scrub rates, speeds, and how fast is fast
Here is a deceptively simple question: how do you know if your ZFS pool is scrubbing fast (or slow)? In fact, what does the speed of a scrub even mean?
The speed of a scrub is reported in the OmniOS 'zpool status' as:
scan: scrub in progress since Thu Sep 10 22:48:31 2015
3.30G scanned out of 60.1G at 33.8M/s, 0h28m to go
0 repaired, 5.49% done
This is reporting the scrub's progress through what 'zpool list'
reports as ALLOC space. For mirrored vdevs, this is the amount
of space used before mirroring overhead; for raidz vdevs, this is
the total amount of disk space used including the parity blocks.
The reported rate is the total cumulative rate, ie it is simply
the amount scanned divided by the time the scrub has taken so far.
If you want the current scan rate, you need to look at the difference
in the amount scanned between two 'zpool status' commands over
some time interval (10 seconds makes for easy math, if the pool is
scanning fast enough to change the 'scanned' figure).
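As a minimal sketch of that arithmetic in Python (the 'scanned' figures
here are made up, but are in line with the example output above):

# current scan rate from two 'zpool status' samples taken ten seconds apart
GiB = 1024 * 1024 * 1024
scanned_first = 3.30 * GiB     # 'scanned' figure from the first sample
scanned_second = 3.64 * GiB    # 'scanned' figure ten seconds later
interval = 10                  # seconds between the two samples

cur_rate = (scanned_second - scanned_first) / interval
print("current scan rate: %.1f MB/s" % (cur_rate / (1024 * 1024)))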
This means that the scan rate means different things and has different maximum speeds on mirrored vdevs and on raidz vdevs. On mirrored vdevs, the scan speed is the logical scan speed; in the best case of entirely sequential IO it will top out at the sequential read speed of a single drive. The extra IO to read from all of the mirrors at once is handled below this level, so if you watch a mirrored vdev that is scrubbing at X MB/sec you'll see that all N of the drives are each reading at more or less X MB/sec. On raidz vdevs, the scan speed is the total physical scan speed of all the vdev's drives added together. If the vdev has N drives, each of which can read at X MB/sec, the best case is a scan rate of N*X. If you watch a raidz vdev that is scrubbing at some rate Y MB/sec, each drive should be doing roughly Y/N MB/sec of reads (at least for full-width raidz stripes).
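To put the two cases side by side, here is a little Python sketch of the
best-case numbers (the per-drive speed and the number of drives are just
assumed figures):

# best-case scrub scan rates for a single vdev, assuming entirely
# sequential IO; drive_mbs and ndrives are assumptions.
drive_mbs = 150   # per-drive sequential read speed, MB/sec
ndrives = 4       # drives in the vdev

# mirror vdev: the reported (logical) rate tops out at one drive's
# speed, but every drive in the mirror does that much physical IO.
mirror_scan = drive_mbs
mirror_per_drive = drive_mbs

# raidz vdev: the reported rate is all drives added together, so each
# drive does roughly 1/N of the reported rate.
raidz_scan = ndrives * drive_mbs
raidz_per_drive = raidz_scan / ndrives

print("mirror: scan %d MB/s, each drive reads %d MB/s" % (mirror_scan, mirror_per_drive))
print("raidz:  scan %d MB/s, each drive reads %d MB/s" % (raidz_scan, raidz_per_drive))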
(All of this assumes that the scrub is the only thing going on in the pool. Any other IO adds to the read rates you'll see on the disks themselves. An additional complication is that scrubs normally attempt to prefetch things like the data blocks for directories; this IO is not accounted for in the scrub rate but it will be visible if you're watching the raw disks.)
In a multi-vdev pool, it's possible (but not certain) for a scrub to be reading from multiple vdevs at once. If it is, the reported scrub rate will be the sum of the (reported) rates that the scrub can achieve on each vdev. I'm not going to try to hold forth on the conditions when this is likely, because it depends on a lot of things as far as I can tell from the kernel code. I think it's more likely when you have single objects (files, directories, etc) whose blocks are spread across multiple vdevs.
If your IO system has total bandwidth limits across all disks, this will clamp your maximum scrub speed. For raidz vdevs, the maximum visible scrub rate will be this total bandwidth limit; for mirror vdevs, it will be the limit divided by how many ways your data is mirrored. For example, we have a 200 MByte/sec total read bandwidth limit (since our fileservers have two 1Gb iSCSI links) and we use two-way mirrored vdevs, so our maximum scrub rate is always going to be around 100 MBytes/sec.
This finally gives us an answer to how you know if your scrub is fast or slow. The fastest rate a raidz scrub can report is your total disk bandwidth across all disks, and the fastest rate a mirror scrub can report is your single disk bandwidth times the number of vdevs. The closer you are to this (or to what you know is your system's overall disk bandwidth limit), the better. The further away from this you are, the worse off you are, either because your scrub has descended into random IO or because you're hitting tunable limits (or both at once, for extra fun).
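If you want to make this concrete, the ceiling is easy enough to compute;
here is a rough Python sketch (the drive speed and vdev count are assumed
numbers, and only the 200 MByte/sec limit and two-way mirroring correspond
to our own setup):

# rough ceiling on the reported scrub rate for a whole pool, given the
# vdev layout, a per-drive speed, and any total bandwidth limit.
def scrub_ceiling(nvdevs, drives_per_vdev, drive_mbs, total_bw_mbs, kind):
    if kind == "mirror":
        # the reported rate is logical, so the bandwidth limit gets
        # divided by the number of copies that must be read.
        per_vdev = drive_mbs
        bw_cap = float(total_bw_mbs) / drives_per_vdev
    else:
        # raidz: the reported rate is the physical rate of all drives.
        per_vdev = drives_per_vdev * drive_mbs
        bw_cap = total_bw_mbs
    return min(nvdevs * per_vdev, bw_cap)

# two-way mirrors behind a 200 MB/sec total limit -> 100.0
print(scrub_ceiling(nvdevs=4, drives_per_vdev=2, drive_mbs=150,
                    total_bw_mbs=200, kind="mirror"))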
(Much of this also applies to resilvers because scrubs and resilvers share most of the same code, but it gets kind of complicated and I haven't attempted to decode the resilver specific part of the kernel ZFS code.)
Sidebar: How scrubs issue IO (for more complexity)
Scrubs have two sorts of IO they do. For ZFS objects like directories and dnodes, the scrub actually needs to inspect the contents of the disk blocks so it tries to prefetch them and then (synchronously) reads the data through the regular ARC read paths. This IO is normal IO, does not get counted in the scrub progress report, and does not do things like check parity blocks or all mirrored copies. Then for all objects (including directories, dnodes, etc) the scrub issues a special scrub 'read everything asynchronously' read that does check parity, read all mirrors, and so on. It is this read that is counted in the 'amount scanned' stats and can be limited by various tunable parameters. Since this read is being done purely for its side effects, the scrub never waits for it and will issue as many as it can (up to various limits).
If a scrub is not running into any limits on how many of these scrub reads it can do, its ability to issue a flood of them is limited only by whether it has to wait for some disk IO in order to process another directory or dnode or whatever.
2015-09-10
Changing kernel tunables can drastically speed up ZFS scrubs
We had another case where a pool scrub was
taking a very long time this past weekend and week; on a 3 TB pool,
'zpool status' was reporting ongoing scrub rates of under 3 MB/s.
This got us to go on some Internet searches for kernel tunables that
might be able to speed this up. The results proved to be extremely
interesting.
I will cut to the punchline: with one change we got the pool scrubbing
at roughly 100 Mbytes/second, which is the maximum scrub IO rate
a fileserver can maintain at the moment. Also, it turns out that when I blithely
asserted that our scrubs were being killed by
having to do random IO I was almost certainly dead wrong.
(One reason we were willing to try changing tunable parameters on a live production system was that this pool was scrubbing so disastrously slow that we were seriously worried about resilver times for it if it ever needed a disk replacement.)
The two good references we immediately found for tuning ZFS scrubs
and resilvers are this serverfault question and answer
and ZFS: Performance Tuning for Scrubs and Resilvers.
Rather than change all of their recommended parameters at once, I
opted to make one change at a time and observe the effects (just
in case a change caused the server to choke). The first change
I made was to set zfs_scrub_delay to 0; this immediately
accelerated the scrub rate to 100 Mbytes/sec.
Let's start with a quote from the code in dsl_scan.c:
int zfs_scrub_delay = 4;   /* number of ticks to delay scrub */
int zfs_scan_idle = 50;    /* idle window in clock ticks */
How these variables are used is that every time a ZFS scrub is about
to issue a read request, it checks to see if some normal read or
write IO has happened within zfs_scan_idle ticks. If it has,
it delays zfs_scrub_delay ticks before issuing the IO or doing
anything else. If your pool is sufficiently busy to hit this condition
more or less all of the time, ZFS scrubs will only be able to make at
most a relatively low number of reads a second; if HZ is how many
ticks there are in a second, the issue rate is at most HZ /
zfs_scrub_delay, which is HZ / 4 by default.
In standard OmniOS kernels, HZ is almost always 100; that is,
there are 100 ticks a second. If your regular pool users are churning
around enough to do one actual IO every half a second, your scrubs
are clamped to no more than 25 reads a second. If each read is for
a full 128 KB ZFS block, that's a scrub rate of about 3.2 MBytes/sec
at most (and there are other things that can reduce it, too).
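To make the arithmetic explicit, here is a small Python sketch of the
worst case as I understand it (assuming full 128 KB blocks, as above):

# worst-case scrub read rate when the pool is busy enough that every
# scrub read gets delayed by zfs_scrub_delay ticks.
HZ = 100                 # clock ticks per second on standard OmniOS
blocksize = 128 * 1024   # assume every read is a full 128 KB ZFS block

def clamped_rate(scrub_delay):
    reads_per_sec = float(HZ) / scrub_delay
    return reads_per_sec * blocksize

print("delay 4: %.1f MB/s" % (clamped_rate(4) / 1e6))  # roughly the 3.2 MBytes/sec above
print("delay 1: %.1f MB/s" % (clamped_rate(1) / 1e6))  # about 13 MB/s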
Setting zfs_scrub_delay to 0 eliminates this clamping of scrub
reads in the face of other IO; instead your scrub is on much more
equal footing with active user IO. Unfortunately you cannot set it
to any non-zero value lower than 1 tick, and 1 tick will clamp you
to 100 reads a second, which is probably not fast enough for many
people.
This does not eliminate slowness due to scrubs (and resilvers) potentially having to do a lot of random reads, so it will not necessarily eliminate all of your scrub speed problems. But if you have a pool that seems to scrub at a frustratingly variable speed, sometimes scrubbing in a day and sometimes taking all week, you are probably running into ZFS scrubs backing off in the face of other IO and it's worth exploring this tunable and the others in those links.
On the other tunables,
I believe that it's relatively harmless and even useful to tune up
zfs_scan_min_time_ms, zfs_resilver_min_time_ms, and
zfs_top_maxinflight. Certainly I saw no problems on our server
when I set zfs_scan_min_time_ms to 5000 and increased
zfs_top_maxinflight. However I can't say for sure that it's
useful, as our scrub rate had already hit its maximum rate from
just the zfs_scrub_delay change.
(And I'm still reading the current Illumos ZFS kernel code to try to understand what these additional tunables really do and mean.)
Sidebar: How to examine and set these tunable variables
To change kernel tunables like this, you need to use 'mdb -kw'
to enable writing to things. To see their value, I recommend
using '::print', eg:
> zfs_scrub_delay ::print -d
0t4
> zfs_scan_idle ::print -d
0t50
To set the value, you should use /W, not the /w that
ZFS: Performance Tuning for Scrubs and Resilvers
says. The w modifier is for 2-byte shorts, not 4-byte ints, and
all of these variables are 4-byte ints (as you can see with '::print
-t' and '::sizeof' if you want). A typical example is:
> zfs_scrub_delay/W0
zfs_scrub_delay:    0x4    =    0x0
The /W stuff accepts decimal numbers as '0tNNNN' (as '::print -d'
shows them, unsurprisingly), so you can do things like:
> zfs_scan_min_time_ms/W0t5000
zfs_scan_min_time_ms:    0x3e8    =    0x1388
(Using '/w' will work on the x86 because the x86 is a little-endian
architecture, but please don't get into the habit of doing that. My
personal view is that if you're going to be poking values into kernel
memory it's very much worth being careful about doing it right.)
2015-09-06
Optimizing finding unowned files on our ZFS fileservers
One of the things we do every weekend is look for files on our
fileservers that have wound up being owned by
people who don't exist (or, more commonly, who no longer exist). For a
long time this was done with the obvious approach using find, which
was basically this:
SFS=$(... generate FS list ...)
gfind -H $SFS -mount '(' -nogroup -o -nouser ')' -printf ...
The problem with this is that we have enough data in enough filesystems
that running a find over the entire thing can take a significant amount
of time. On our biggest fileserver, we've seen this take on the order of
ten hours, which either delays the start of our weekly pool scrubs or collides with them, slowing them down (and
they can already be slow enough). Recently I
realized that we can do much better than this by not checking most of
our filesystems.
The trick is to use ZFS's existing infrastructure for quotas. As part
of this ZFS maintains information on the amount of space used by every
user and every group on each filesystem, which the 'zfs userspace'
and 'zfs groupspace' commands will print out. As a side effect this
gives you a complete list of every UID and GID that uses space in the
filesystem, so all we have to do is scan the lists to see if there
are any unknown IDs in them. If all UIDs and GIDs using space on the
filesystem exist, we can completely skip running find on it; we know
our find won't find anything.
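Here is a rough Python sketch of the idea; it assumes that 'zfs
userspace' and 'zfs groupspace' print UIDs and GIDs they can't resolve
as bare numbers, so that looking them up in the password and group
databases fails:

# report any user or group names using space in a filesystem that don't
# exist in the password or group databases; if this comes back empty,
# there is no point in running find over the filesystem.
import subprocess, pwd, grp

def unknown_ids(fs):
    unknown = []
    for subcmd, lookup in (("userspace", pwd.getpwnam),
                           ("groupspace", grp.getgrnam)):
        out = subprocess.check_output(["zfs", subcmd, "-H", "-o", "name", fs])
        for name in out.decode().splitlines():
            name = name.strip()
            try:
                lookup(name)
            except KeyError:
                unknown.append((subcmd, name))
    return unknown

# usage: only run find on filesystems where unknown_ids(fs) is non-empty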
Since our filesystems don't normally have any unowned files on them, this turns into a massive win. In the usual case we won't scan any filesystems on a fileserver, and even if we do scan some we'll generally only scan a handful. It may even make this particular process fast enough so that we can just run it after deleting accounts, instead of waiting for the weekend.
By the way, the presence of unknown UIDs or GIDs in the output of
'zfs *space' doesn't mean that there definitely are files that a
find will pick up. The unowned files could be only in a snapshot,
or they could be deleted files that are being held open by various
things, including the NFS lock manager.