2020-02-10
Doing frequent ZFS scrubs lets you discover problems close to when they happened
Somewhat recently, the ZFS on Linux mailing list had a discussion of how frequently you should do ZFS scrubs, with a number of people suggesting that modern drives only really need relatively infrequent ones. While reading through the thread to catch up on the list, it struck me that there is a decent reason for scrubbing frequently despite this: if we assume that scrubs surface existing problems that were previously silent (instead of creating new ones), frequent scrubs lower the mean time before you detect such problems.
Lowering the mean time to detection has the same advantage here that it does in programming (with things like unit tests): it significantly narrows down when the underlying problem could have happened. If you scrub once a month and a scrub finds a problem, the problem could have happened at any time in the past month; if you scrub every week and find a problem, you know it happened in the past week. Relatedly, the sooner you detect a problem, the more likely you are to still have logs, traces, metrics, and other information that might let you look for anomalies and find a potential cause (beyond 'the drive had a glitch', because that's not always the real problem).
In a modern ZFS environment with sequential scrubs (or just SSDs), scrubs are generally fast and low impact (although it depends on your IO load), so the cost of doing them every week for all of your data is probably low. I try to scrub the pools on my personal machines every week, and I generally don't notice them running. Now that I'm thinking about scrubs this way, I'm going to try to be more consistent about doing them weekly.
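As an illustration of how little machinery this needs, here is a minimal Python sketch of the kind of weekly 'scrub everything' job I mean, intended to be run once a week from cron or a systemd timer. The pool names are made up, and note that 'zpool scrub' exits with an error if a scrub of that pool is already in progress.

    #!/usr/bin/env python3
    # A minimal sketch of a weekly 'scrub everything' job, meant to
    # be run once a week from cron or a systemd timer. The pool
    # names are hypothetical; substitute your own.
    import subprocess

    POOLS = ["tank", "media"]   # hypothetical pool names

    for pool in POOLS:
        # 'zpool scrub' starts the scrub in the background and
        # returns right away; it exits non-zero if a scrub of this
        # pool is already running, which check=True turns into an
        # exception.
        subprocess.run(["zpool", "scrub", pool], check=True)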
(Our fileservers scrub each pool once every four weeks on a rotating basis. We could shorten this interval, even down to once a week, but despite what I've written here, I suspect that we're not going to bother. We don't see checksum errors or other problems very often, and we probably aren't going to do a deep investigation of anything that turns up. If we can trace a problem to a disk IO error or correlate it with an obvious and alarming SMART metric, we're likely to replace the disk; otherwise, we're likely to clear the error and see if it comes back.)
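(Purely as an illustration, and not how our fileservers actually do it: a rotating schedule like this can be approximated by indexing your pools with the week number in a job that runs once a week. The pool names here are hypothetical.)

    #!/usr/bin/env python3
    # A sketch of a rotating scrub schedule: run this once a week,
    # and with four pools each pool gets scrubbed once every four
    # weeks. The pool names are hypothetical.
    import datetime
    import subprocess

    POOLS = ["fs1", "fs2", "fs3", "fs4"]   # hypothetical pool names

    # Index by the ISO week number so the rotation advances one
    # pool per week. (The rotation can hiccup at year boundaries,
    # but for a sketch that's fine.)
    week = datetime.date.today().isocalendar()[1]
    subprocess.run(["zpool", "scrub", POOLS[week % len(POOLS)]],
                   check=True)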