How we periodically scrub our ZFS pools

May 5, 2009

The problem with the simple approach to scrubbing your ZFS pools (just do 'zpool scrub ...' every so often) is that ZFS pool scrubs put enough of a load on our systems that we don't want to do them during the week and we don't want to do more than one of them at once (well, not more than one per fileserver). And we certainly don't want to have to manage the whole process by hand. So recently I wrote a script to automate the process.

The script's job is to scrub pools one by one during the weekend, if they haven't been scrubbed too recently and they're healthy. To tell if pools have been scrubbed recently, we keep a flag file in the root filesystem of the pool; the modification time of the file is when we kicked off the last scrub.

(As it happens, we don't use the root filesystem of our pools for anything and they're always mounted in a consistent place, so the flag file isn't disturbing anything and it's easy to find.)
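The flag file idea can be sketched in a few lines. This is an illustration in Python, not the actual script; the 28-day threshold, the function names, and the choice to treat a missing flag file as "needs a scrub" are all assumptions that the post does not specify.

```python
import os
import time

# Hypothetical threshold: the post does not say how old "too recently" is.
MIN_AGE_DAYS = 28

def needs_scrub(flag_path, min_age_days=MIN_AGE_DAYS, now=None):
    """Return True if the pool's flag file is missing or old enough."""
    if now is None:
        now = time.time()
    try:
        mtime = os.stat(flag_path).st_mtime
    except FileNotFoundError:
        # Assumption: a pool with no flag file has never been scrubbed
        # by the script, so it counts as a candidate.
        return True
    return now - mtime >= min_age_days * 86400

def mark_scrub_started(flag_path):
    """Create the flag file if needed and set its mtime to the current time."""
    with open(flag_path, "a"):
        pass
    os.utime(flag_path, None)
```

Calling mark_scrub_started() just before kicking off the scrub matches the description above, where the modification time records when the last scrub was started rather than when it finished.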

The script gets started from cron early on Saturday morning and then runs in the background, repeatedly starting a pool scrub and waiting for it to finish. In the Unix tradition of clubbing problems with existing programs, it uses find on all of the flag files to find out which ones are old enough that their pools are candidates for scrubbing, and then ls to order them from oldest to newest so that it can find the oldest healthy pool. Waiting for pool scrubs to finish is done the brute force way; the script repeatedly runs 'zpool status' and waits until there are no 'scrub:' lines that indicate ongoing scrubs or resilvers.
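The selection step (what find and ls do between them) amounts to filtering flag files by age and sorting by modification time. Here is a rough Python equivalent; the mapping of pool names to flag file paths and the is_healthy callback are hypothetical stand-ins, since the post does not say how the script locates flag files or checks pool health.

```python
import os
import time

def candidates_oldest_first(flag_files, min_age_seconds, now=None):
    """Return pool names whose flag files are old enough, oldest first.

    flag_files maps pool name -> flag file path (a hypothetical layout).
    """
    if now is None:
        now = time.time()
    aged = []
    for pool, path in flag_files.items():
        mtime = os.stat(path).st_mtime
        if now - mtime >= min_age_seconds:
            aged.append((mtime, pool))
    aged.sort()  # smallest mtime first, i.e. least recently scrubbed
    return [pool for _mtime, pool in aged]

def pick_pool(flag_files, is_healthy, min_age_seconds, now=None):
    """Return the oldest candidate that is_healthy() accepts, or None."""
    for pool in candidates_oldest_first(flag_files, min_age_seconds, now):
        if is_healthy(pool):
            return pool
    return None
```

Passing the health check in as a function keeps the sketch independent of how health is actually determined; a None result corresponds to "no pools left to scrub".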

(Except not. Because I am paranoid, it works the other way around; it throws away all 'scrub:' lines that it knows are good, and if there's anything left it assumes that a pool is still scrubbing or resilvering. This overcaution may cause us problems someday.)
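The paranoid version of the check can be sketched as a whitelist filter over 'zpool status' output. The patterns below are assumptions about what "known good" lines look like; the exact wording of 'scrub:' lines varies between ZFS versions, and the post does not list the ones the real script accepts.

```python
import re

# Assumed "known good" 'scrub:' lines; adjust for the ZFS version in use.
KNOWN_IDLE = [
    re.compile(r"^scrub: none requested"),
    re.compile(r"^scrub: scrub completed\b"),
    re.compile(r"^scrub: resilver completed\b"),
]

def unexplained_scrub_lines(status_text):
    """Return every 'scrub:' line that is not on the known-good list."""
    leftover = []
    for line in status_text.splitlines():
        line = line.strip()
        if not line.startswith("scrub:"):
            continue
        if any(pattern.match(line) for pattern in KNOWN_IDLE):
            continue
        leftover.append(line)
    return leftover

def scrub_may_be_running(status_text):
    """True if anything is left over, i.e. we cannot prove the pools are idle."""
    return bool(unexplained_scrub_lines(status_text))
```

The tradeoff is visible in the code: a 'scrub:' line in a format nobody anticipated is treated as an ongoing scrub, so the caller would wait forever (or until the exit time) rather than start a second scrub by mistake.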

The script exits when there are no pools left to scrub or if it is after its exit time, currently Monday at 1am. (This doesn't quite mean that pool scrubbing will stop on Monday at 1am; it means that no pool scrubs will start after that point. Our biggest pools scrub in six and a half hours currently, so even in the worst case we should be done before 8am Monday.)
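The exit-time rule and the worst-case arithmetic can be checked with a short sketch. The function names are made up for illustration, and computing the deadline from the script's start time is an assumption; the post only says that the cutoff is Monday at 1am.

```python
from datetime import datetime, timedelta

def exit_deadline(started):
    """Return the first Monday 01:00 strictly after 'started'."""
    days_ahead = (0 - started.weekday()) % 7  # Monday is weekday 0
    deadline = (started + timedelta(days=days_ahead)).replace(
        hour=1, minute=0, second=0, microsecond=0)
    if deadline <= started:
        deadline += timedelta(days=7)
    return deadline

def may_start_scrub(now, deadline):
    """New scrubs may start only before the deadline; running ones continue."""
    return now < deadline

# Worst case: the biggest pool starts scrubbing just before the deadline.
started = datetime(2009, 5, 9, 3, 0)           # a Saturday morning
deadline = exit_deadline(started)              # Monday 2009-05-11 01:00
latest_start = deadline - timedelta(minutes=1)
worst_case_finish = latest_start + timedelta(hours=6, minutes=30)
```

With a scrub that starts at 12:59am and takes six and a half hours, worst_case_finish comes out at 7:29am on Monday, which is inside the 8am bound mentioned above.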

Last modified: Tue May 5 00:31:56 2009