Even on SSDs, ongoing activity can slow down ZFS scrubs drastically
Back in the days of our OmniOS fileservers, which used HDs (spinning rust) across iSCSI, we wound up changing kernel tunables to speed up ZFS scrubs and saw a significant improvement. When we migrated to our current Linux fileservers with SSDs, I didn't bother including these tunables (or the Linux equivalent), because I expected that SSDs were fast enough that it didn't matter. Indeed, our SSD pools generally scrub like lightning.
Then, this weekend, a ZFS pool with 1.68 TB of space used took two days to scrub (48:15, to be precise). This is not something that happens normally; this size of pool usually scrubs much faster, on the order of a few hours. When I poked at it a bit none of the disks seemed unusually slow and there were no signs of other problems, it was just that the scrub was running slowly. However, looking at NFS client metrics in our metrics system suggested that there was continuous ongoing NFS activity to some of the filesystems in that pool.
Although I don't know for sure, this looks like a classical case of even a modest level of regular ZFS activity causing the ZFS scrub code to back off significantly on IO. Since this is on SSDs, this isn't really necessary (at least for us); we could almost certainly sustain both a more or less full speed scrub and our regular read IO (significant write IO might be another story, but that's because it has some potential performance effects on SSDs in general). However, with no tuning our current version of ZFS is sticking to conservative defaults.
In one sense, this isn't surprising, since it's how ZFS has traditionally reacted to IO during scrubs. In another sense, it is, because it's not something I expected to see affect us on SSDs; if I had expected to see it, I'd have carried forward our ZFS tunables to speed up scrubs.
(Now that I look at our logged data, it appears that ZFS scrubs on this pool have been slow for some time, although not 'two days' slow. They used to complete in a couple of hours, then suddenly jumped to over 24 hours. More investigation may be needed.)