== Changing kernel tunables can drastically speed up ZFS scrubs

We had [[another case ZFSFasterScrubsDesire]] where a pool scrub was taking a very long time this past weekend and week; on a 3 TB pool, '_zpool status_' was reporting ongoing scrub rates of under 3 MB/s. This got us to go on some Internet searches for kernel tunables that might be able to speed this up. The results proved to be extremely interesting.

I will cut to the punchline: with one change we got the pool scrubbing at roughly 100 Mbytes/second, which is the maximum scrub IO rate [[a fileserver ZFSFileserverSetupII]] can maintain [[at the moment OmniOS10GIntelProblems]]. Also, it turns out that when I [[blithely asserted ZFSFasterScrubsDesire]] that our scrubs were being killed by having to do random IO, I was almost certainly dead wrong.

(One reason we were willing to try changing tunable parameters on a live production system was that this pool was scrubbing so disastrously slowly that we were seriously worried about resilver times for it if it ever needed a disk replacement.)

The two good references we immediately found for tuning ZFS scrubs and resilvers are [[this serverfault question and answer https://serverfault.com/questions/499739/tuning-zfs-scrubbing-141kb-s-running-for-15-days]] and [[ZFS: Performance Tuning for Scrubs and Resilvers http://broken.net/uncategorized/zfs-performance-tuning-for-scrubs-and-resilvers/]]. Rather than change all of their recommended parameters at once, I opted to make one change at a time and observe the effects (just in case a change caused the server to choke). The first change I made was to set ((zfs_scrub_delay)) to _0_; this immediately accelerated the scrub rate to 100 Mbytes/sec.

Let's start with a quote from [[the code in ((dsl_scan.c)) https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/fs/zfs/dsl_scan.c#L60]]:

.pn prewrap on

 int zfs_scrub_delay = 4;     /* number of ticks to delay scrub */
 int zfs_scan_idle = 50;      /* idle window in clock ticks */

How these variables are used is that every time a ZFS scrub is about to issue a read request, it checks to see if some normal read or write IO has happened within ((zfs_scan_idle)) ticks. If it has, it delays ((zfs_scrub_delay)) ticks before issuing the IO or doing anything else. If your pool is sufficiently busy to hit this condition more or less all of the time, ZFS scrubs will only be able to make at most a relatively low number of reads a second; since every scrub read then costs a ((zfs_scrub_delay)) tick delay, if _HZ_ is how many ticks there are in a second, the maximum issue rate is ((HZ / 4)) reads a second by default.

In standard OmniOS kernels, _HZ_ is almost always 100; that is, there are 100 ticks a second. If your regular pool users are churning around enough to do one actual IO every half a second, your scrubs are clamped to no more than 25 reads a second. If each read is for a full 128 KB ZFS block, that's a scrub rate of about 3.2 MBytes/sec at most (and there are other things that can reduce it, too).

Setting ((zfs_scrub_delay)) to 0 eliminates this clamping of scrub reads in the face of other IO; instead your scrub is on a much more equal footing with active user IO. Unfortunately you cannot set it to any non-zero value lower than 1 tick, and 1 tick will clamp you to 100 reads a second, which is probably not fast enough for many people.

This does not eliminate slowness due to scrubs (and resilvers) potentially having to do a lot of random reads, so it will not necessarily eliminate all of your scrub speed problems.
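To make the clamping arithmetic above concrete, here is a small standalone C sketch of it. This is not the kernel code; the _HZ_ of 100 and the 128 KB full block size are just the assumptions from above.

 #include <stdio.h>

 /*
  * A back-of-the-envelope version of the scrub clamping arithmetic.
  * HZ = 100 is the usual OmniOS clock tick rate and 128 KB is a
  * full-sized ZFS block; both are assumptions, not values read from
  * a live kernel.
  */
 #define HZ       100    /* clock ticks per second */
 #define BLOCK_KB 128    /* KB per full-sized ZFS block read */

 int
 main(void)
 {
     int delay;

     for (delay = 4; delay >= 1; delay--) {
         /* On a busy pool: at most one scrub read per 'delay' ticks. */
         int reads_per_sec = HZ / delay;
         double mb_per_sec = reads_per_sec * BLOCK_KB / 1000.0;
         printf("zfs_scrub_delay=%d: <= %d reads/sec, ~%.1f MBytes/sec\n",
             delay, reads_per_sec, mb_per_sec);
     }
     return (0);
 }

With the default delay of 4 ticks this works out to the 25 reads and roughly 3.2 MBytes a second mentioned above; even at a delay of 1 tick, a busy pool can only reach about 12.8 MBytes a second.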
But if you have a pool that seems to scrub at a frustratingly variable speed, sometimes scrubbing in a day and sometimes taking all week, you are probably running into ZFS scrubs backing off in the face of other IO and it's worth exploring this tunable and the others in [[those https://serverfault.com/questions/499739/tuning-zfs-scrubbing-141kb-s-running-for-15-days]] [[links http://broken.net/uncategorized/zfs-performance-tuning-for-scrubs-and-resilvers/]].

On the other tunables, I believe that it's relatively harmless and even useful to tune up ((zfs_scan_min_time_ms)), ((zfs_resilver_min_time_ms)), and ((zfs_top_maxinflight)). Certainly I saw no problems on our server when I set ((zfs_scan_min_time_ms)) to 5000 and increased ((zfs_top_maxinflight)). However I can't say for sure that it's useful, as our scrub rate had already hit its maximum from just the ((zfs_scrub_delay)) change.

(And I'm still reading the current Illumos ZFS kernel code to try to understand what these additional tunables really do and mean.)

=== Sidebar: How to examine and set these tunable variables

To change kernel tunables like this, you need to use '_mdb -kw_' to enable writing to things. To see their value, I recommend using '_::print_', eg:

> > zfs_scrub_delay ::print -d
> 4
> > zfs_scan_idle ::print -d
> 0t50

To set the value, you should use _/W_, *not* the _/w_ that [[ZFS: Performance Tuning for Scrubs and Resilvers http://broken.net/uncategorized/zfs-performance-tuning-for-scrubs-and-resilvers/]] says. The _w_ modifier is for 2-byte shorts, not 4-byte ints, and all of these variables are 4-byte ints (as you can see with '_::print -t_' and '_::sizeof_' if you want). A typical example is:

> > zfs_scrub_delay/W0
> zfs_scrub_delay: 0x4 = 0x0

The _/W_ stuff accepts decimal numbers as '0tNNNN' (as '_::print -d_' shows them, unsurprisingly), so you can do things like:

> > zfs_scan_min_time_ms/W0t5000
> zfs_scan_min_time_ms: 0x3e8 = 0x1388

(Using '_/w_' will work on the x86 because the x86 is a little-endian architecture, but please don't get into the habit of doing that. My personal view is that if you're going to be poking values into kernel memory, it's very much worth being careful about doing it right.)
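To illustrate why '_/w_' only happens to work on little-endian machines, here is a small standalone C sketch of what a 2-byte store into a 4-byte variable does to its value. This is just a demonstration of the byte layout involved, not anything mdb itself does; the old and new values are the ((zfs_scan_min_time_ms)) numbers from the example above.

 #include <stdio.h>
 #include <stdint.h>
 #include <string.h>

 /*
  * What writing a 2-byte quantity over the start of a 4-byte int does,
  * which is roughly the effect of using mdb's /w on a 4-byte kernel
  * variable.  On little-endian x86 the low-order bytes come first, so
  * you get the value you wanted; on a big-endian machine you would
  * clobber the high-order bytes instead and get a wildly wrong value.
  */
 int
 main(void)
 {
     int32_t tunable = 1000;   /* eg zfs_scan_min_time_ms's old value, 0x3e8 */
     uint16_t newval = 5000;   /* the value we want to poke in, 0x1388 */

     memcpy(&tunable, &newval, sizeof(newval));
     printf("after a 2-byte write, tunable = %d (0x%x)\n", tunable, tunable);
     return (0);
 }

On an x86 machine this prints 5000 (0x1388), as you wanted; on a big-endian machine the same 2-byte store would leave the variable as 0x138803e8, which is one more reason to use '_/W_' and get the size right in the first place.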