How we periodically scrub our ZFS pools
The problem with the simple approach to scrubbing your ZFS pools (just
'zpool scrub ...' every so often) is that ZFS pool scrubs put
enough of a load on our systems that we don't want to do them during
the week and we don't want to do more than one of them at once (well,
not more than one per fileserver). And we certainly don't want to have
to manage the whole process by hand. So recently I wrote a script to
automate the process.
The script's job is to scrub pools one by one during the weekend, if they haven't been scrubbed too recently and they're healthy. To tell if pools have been scrubbed recently, we keep a flag file in the root filesystem of the pool; the modification time of the file is when we kicked off the last scrub.
(As it happens, we don't use the root filesystem of our pools for anything and they're always mounted in a consistent place, so the flag file isn't disturbing anything and it's easy to find.)
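As a rough sketch of the flag file idea (not the actual script): the mount point in POOLROOT, the flag file name, and the 28-day threshold below are all hypothetical, since the post does not give them. The only ZFS command used is 'zpool scrub'.

```shell
#!/bin/sh
# Assumed, hypothetical settings; none of these names come from the post.
POOLROOT="${POOLROOT:-/pools}"   # where every pool's root filesystem is mounted
FLAGNAME=".last-scrub"           # flag file kept in each pool's root filesystem
MIN_AGE_DAYS=28                  # how long before a pool is due for another scrub

# Start a scrub of pool $1 and, if that worked, touch its flag file so the
# file's modification time records when the scrub was kicked off.
start_scrub() {
    zpool scrub "$1" && touch "$POOLROOT/$1/$FLAGNAME"
}

# Succeed if pool $1 has no flag file yet, or if its flag file was last
# touched more than MIN_AGE_DAYS days ago (as find's -mtime counts days).
due_for_scrub() {
    flag="$POOLROOT/$1/$FLAGNAME"
    if [ ! -e "$flag" ]; then
        return 0
    fi
    [ -n "$(find "$flag" -mtime +"$MIN_AGE_DAYS" -print)" ]
}
```

Because the timestamp lives in the file's modification time, nothing has to be parsed or stored elsewhere; 'touch' writes it and 'find' reads it.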
The script gets started from cron early on Saturday morning and then
runs in the background repeatedly starting a pool scrub and waiting for
it to finish. In the Unix tradition of clubbing problems with existing
programs, it uses 'find' on all of the flag files to find out which flag
files are old enough that their pools are candidates for scrubbing, and
'ls' to order them from oldest to newest so that it can find the
oldest healthy pool. Waiting for pool scrubs to finish is done the brute
force way; the script repeatedly runs 'zpool status' and waits until
there are no 'scrub:' lines that indicate ongoing scrubs or resilvers.
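The 'find' plus 'ls' selection could look something like the following sketch. The directory layout (POOLROOT/poolname/flag file), the names, and the age threshold are assumptions, and checking health with 'zpool list -H -o health' is one plausible way to do it rather than necessarily what the real script does.

```shell
#!/bin/sh
# Assumed, hypothetical settings; none of these names come from the post.
POOLROOT="${POOLROOT:-/pools}"
FLAGNAME=".last-scrub"
MIN_AGE_DAYS=28

# Print the names of pools whose flag files are old enough, one per line,
# oldest flag file first. Assumes pool paths contain no whitespace.
scrub_candidates() {
    # find keeps only the flag files older than the threshold.
    old=$(find "$POOLROOT"/*/"$FLAGNAME" -mtime +"$MIN_AGE_DAYS" -print 2>/dev/null)
    if [ -z "$old" ]; then
        return 0
    fi
    # ls -t sorts newest first, so -tr gives oldest first.
    # $old is deliberately unquoted so each path becomes its own argument.
    ls -tr $old | while read -r flag; do
        basename "$(dirname "$flag")"
    done
}

# Succeed if pool $1 reports ONLINE health.
pool_is_healthy() {
    [ "$(zpool list -H -o health "$1" 2>/dev/null)" = "ONLINE" ]
}

# Print the first candidate (oldest flag file) that is also healthy.
oldest_healthy_candidate() {
    for pool in $(scrub_candidates); do
        if pool_is_healthy "$pool"; then
            echo "$pool"
            return 0
        fi
    done
    return 1
}
```

Note that this version only sees pools that already have a flag file; a pool that has never been scrubbed would need its flag file created first.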
(Except not. Because I am paranoid, it works the other way around;
it throws away all 'scrub:' lines that it knows are good, and if
there's anything left it assumes that a pool is still scrubbing or
resilvering. This overcaution may cause us problems someday.)
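A sketch of this throw-away-the-known-good approach follows. The specific 'scrub:' messages treated as good are illustrative assumptions about what 'zpool status' prints, not a list taken from the real script, and newer ZFS versions label this line 'scan:' instead.

```shell
#!/bin/sh
# Read 'zpool status' output on stdin. Succeed only if every 'scrub:' line
# matches a pattern we know means "nothing is running"; any unrecognized
# 'scrub:' line is treated as an ongoing scrub or resilver.
all_scrubs_idle() {
    leftover=$(grep 'scrub:' |
        grep -v -e 'scrub: none requested' \
                -e 'scrub: scrub completed' \
                -e 'scrub: scrub stopped' \
                -e 'scrub: resilver completed' || true)
    [ -z "$leftover" ]
}

# Brute force wait: poll until nothing appears to be scrubbing.
# POLL_SECS is a hypothetical knob, defaulting to 60 seconds.
wait_for_scrubs() {
    until zpool status | all_scrubs_idle; do
        sleep "${POLL_SECS:-60}"
    done
}
```

The cost of this design is the one mentioned above: if 'zpool status' ever prints a harmless 'scrub:' message that is not on the list, the wait loop never finishes.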
The script exits when there are no pools left to scrub or if it is after its exit time, currently Monday at 1am. (This doesn't quite mean that pool scrubbing will stop at Monday at 1am; it means that no pool scrubs will start after that point. Our biggest pools scrub in six and a half hours currently, so even in the worst case, where a scrub starts just before 1am, it should finish around 7:30am and we should be done before 8am Monday.)
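One simple way to express the exit time check, assuming the script only ever starts on Saturday: look at the day of the week and the hour. The function name is hypothetical, and the real script may well compute its deadline differently.

```shell
#!/bin/sh
# $1 = day of week as printed by date +%u (1 = Monday ... 7 = Sunday)
# $2 = hour as printed by date +%H (00 to 23)
# Succeed if it is Monday 1am or later, so no new scrub should be started.
past_deadline() {
    case "$1" in
        6|7) return 1 ;;            # Saturday or Sunday: keep going
        1)   [ "$2" != "00" ] ;;    # Monday: only the 00:xx hour is still allowed
        *)   return 0 ;;            # Tuesday through Friday: well past it
    esac
}
```

It would be checked before starting each scrub, for example with `if past_deadline $(date '+%u %H'); then exit 0; fi`, which matches the behaviour described above: a running scrub is left alone, but no new one starts.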