Even on SSDs, ongoing activity can slow down ZFS scrubs drastically

August 26, 2020

Back in the days of our OmniOS fileservers, which used HDs (spinning rust) across iSCSI, we wound up changing kernel tunables to speed up ZFS scrubs and saw a significant improvement. When we migrated to our current Linux fileservers with SSDs, I didn't bother including these tunables (or the Linux equivalent), because I expected that SSDs were fast enough that it didn't matter. Indeed, our SSD pools generally scrub like lightning.

(Our Linux fileservers use a ZFS version before sequential scrubs (also). It's possible that sequential scrub support would change this story.)

Then, this weekend, a ZFS pool with 1.68 TB of space used took two days to scrub (48:15, to be precise). This is not something that happens normally; this size of pool usually scrubs much faster, on the order of a few hours. When I poked at it a bit none of the disks seemed unusually slow and there were no signs of other problems, it was just that the scrub was running slowly. However, looking at NFS client metrics in our metrics system suggested that there was continuous ongoing NFS activity to some of the filesystems in that pool.

Although I don't know for sure, this looks like a classical case of even a modest level of regular ZFS activity causing the ZFS scrub code to back off significantly on IO. Since this is on SSDs, this isn't really necessary (at least for us); we could almost certainly sustain both a more or less full speed scrub and our regular read IO (significant write IO might be another story, but that's because it has some potential performance effects on SSDs in general). However, with no tuning our current version of ZFS is sticking to conservative defaults.

In one sense, this isn't surprising, since it's how ZFS has traditionally reacted to IO during scrubs. In another sense, it is, because it's not something I expected to see affect us on SSDs; if I had expected to see it, I'd have carried forward our ZFS tunables to speed up scrubs.

(Now that I look at our logged data, it appears that ZFS scrubs on this pool have been slow for some time, although not 'two days' slow. They used to complete in a couple of hours, then suddenly jumped to over 24 hours. More investigation may be needed.)

Written on 26 August 2020.
« My home desktop is still locking up when it gets too cold (and what next)
Firefox 80 and my confusion over its hardware accelerated video on Linux »

Page tools: View Source, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Wed Aug 26 22:48:14 2020
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.