2009-05-15
Why df on an NFS-mounted ZFS filesystem can give odd results
Suppose that you have a ZFS pool with various filesystems, and while
the filesystems have data in them, you're just using the pool as a
container; the top level pool filesystem has no data itself. In this
situation, a df of the pool on the Solaris host will show something
like:
  Filesystem             size   used  avail
  tank01                 300G    20K    40G
Translated, we can see that the pool itself is 300G, the pool's top level filesystem has nothing in it, and there's 40G of unused space left in the pool; the rest is taken up by sub-filesystems, snapshots, and so on.
However, if you NFS mount the pool itself on a client and do a df on the client, what you will see is rather different:
  Filesystem             size   used  avail
  /tank01                 40G    20K    40G
Suddenly your pool size has, well, disappeared.
Filesystems with quotas will show equally odd df results on NFS
clients. If the pool has enough space left that the filesystem's size is
limited by its quota, you will see the correct (quota-based) values for
everything. However, if the pool starts running out of overall space,
the (total) size of the (quota-limited) filesystems starts shrinking,
sometimes dramatically. All of this can be very alarming and upsetting
to users, especially if it leads them to think that they haven't got
space that they've paid for.
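To make the shrinkage concrete, here is a small Python sketch of the arithmetic (the underlying mechanism is explained below); the quota and pool numbers are invented for illustration:

  # Illustrative numbers, not from any real pool: a filesystem with a
  # 100G quota that has used 60G, under two different pool conditions.
  quota, fs_used = 100, 60                 # GB
  for pool_free in (200, 30):              # GB free in the whole pool
      # ZFS reports the smaller of the quota headroom and the pool's
      # remaining free space as the available space.
      avail = min(quota - fs_used, pool_free)
      # Over NFS the client sees size = used + avail (see below), so:
      size = fs_used + avail
      print(f"pool free {pool_free}G -> df size {size}G, avail {avail}G")
  # pool free 200G -> df size 100G, avail 40G  (quota-based; looks right)
  # pool free 30G  -> df size 90G, avail 30G   (the filesystem 'shrank')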
It turns out that all of this is because of a fundamental limit in
the NFS v3 protocol combined with a decision made by the ZFS code (or
perhaps the overall Solaris NFS server code). Filesystem information is
queried by the NFS v3 FSSTAT operation, but the structure it returns
only contains information about the total filesystem size and the
remaining available space; there is no explicit field for 'space used'.
(NFS v3 FSSTAT does draw a distinction between 'free space' and 'free
space that can be allocated', so it can handle various sorts of overhead
and reserved space.)
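For reference, these are the space-related fields of the FSSTAT reply (FSSTAT3resok in RFC 1813), sketched here as a Python dataclass; note the absence of any 'used' field:

  from dataclasses import dataclass

  @dataclass
  class Fsstat3SpaceInfo:
      """Space-related fields of RFC 1813's FSSTAT3resok (in bytes)."""
      tbytes: int  # total size of the filesystem
      fbytes: int  # free space
      abytes: int  # free space available to the (unprivileged) caller
      # There is no 'used' field; a client's df has to compute
      # used = size - free for itself.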
This creates a dilemma for ZFS: do you return accurate total size and
space available, leading to potentially completely inaccurate used
figures, or do you make the total size be the sum of the space used and
the space available, so clients show a correct used figure? As we can
see, Solaris has chosen the second option.
(Okay, there's a third option: you could return the correct total size and an available space figure that was total size minus the used space. I think this would be even crazier than the other options.)
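To see how the three options play out, here is a sketch using the numbers from the df output at the start (simplifying by assuming no reserved space, so free and available are the same); only the second option gives the client a sensible 'used' figure:

  def client_df(tbytes, abytes):
      # What an NFS client's df computes from the FSSTAT reply.
      return {"size": tbytes, "used": tbytes - abytes, "avail": abytes}

  # ZFS's real numbers from the example above, in GB (20K rounds to 0).
  total, used, avail = 300, 0, 40

  print(client_df(total, avail))         # option 1: used shows 260G (wrong)
  print(client_df(used + avail, avail))  # option 2: size shows 40G, used ok
  print(client_df(total, total - used))  # option 3: avail shows 300G (wrong)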
2009-05-05
How we periodically scrub our ZFS pools
The problem with the simple approach to scrubbing your ZFS pools (just
do 'zpool scrub ...' every so often) is that ZFS pool scrubs put
enough of a load on our systems that we don't want to do them during
the week and we don't want to do more than one of them at once (well,
not more than one per fileserver). And we certainly don't want to have
to manage the whole process by hand. So recently I wrote a script to
automate the process.
The script's job is to scrub pools one by one during the weekend, if they haven't been scrubbed too recently and they're healthy. To tell if pools have been scrubbed recently, we keep a flag file in the root filesystem of the pool; the modification time of the file is when we kicked off the last scrub.
(As it happens, we don't use the root filesystem of our pools for anything and they're always mounted in a consistent place, so the flag file isn't disturbing anything and it's easy to find.)
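A minimal Python sketch of the flag file convention; the flag file path and the 'too recently' threshold here are made up for illustration:

  import os, time

  FLAG = "/tank01/.scrub-stamp"      # hypothetical flag file location
  MIN_AGE = 28 * 24 * 3600           # assumed threshold: four weeks

  def scrub_is_due(flag=FLAG):
      """A pool is a scrub candidate once its flag file is old enough."""
      try:
          age = time.time() - os.path.getmtime(flag)
      except FileNotFoundError:
          return True                # never scrubbed: certainly due
      return age >= MIN_AGE

  def mark_scrub_started(flag=FLAG):
      """Reset the flag file's mtime to record when this scrub began."""
      with open(flag, "a"):
          os.utime(flag, None)       # the equivalent of 'touch'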
The script gets started from cron early on Saturday morning and then
runs in the background, repeatedly starting a pool scrub and waiting for
it to finish. In the Unix tradition of clubbing problems with existing
programs, it uses find on all of the flag files to find out which flag
files are old enough that their pools are candidates for scrubbing, and
then ls to order them from oldest to newest so that it can find the
oldest healthy pool. Waiting for pool scrubs to finish is done the brute
force way; the script repeatedly runs 'zpool status' and waits until
there are no 'scrub:' lines that indicate ongoing scrubs or resilvers.
(Except not. Because I am paranoid, it works the other way around;
it throws away all 'scrub:' lines that it knows are good, and if
there's anything left it assumes that a pool is still scrubbing or
resilvering. This overcaution
may cause us problems someday.)
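The actual script does this with zpool status and grep; here is a rough Python rendering of the paranoid inverted check. The exact 'scrub:' messages vary between Solaris versions, so the known-good set here is an assumption:

  import subprocess, time

  # 'scrub:' lines we know mean nothing is running. Anything else is
  # assumed to be an in-progress scrub or resilver (deliberate overcaution).
  KNOWN_IDLE = ("none requested", "scrub completed", "resilver completed")

  def something_is_scrubbing():
      out = subprocess.run(["zpool", "status"],
                           capture_output=True, text=True).stdout
      for line in out.splitlines():
          line = line.strip()
          if not line.startswith("scrub:"):
              continue
          if not any(good in line for good in KNOWN_IDLE):
              return True    # a 'scrub:' line we don't recognize as idle
      return False

  def wait_for_scrubs(poll_interval=300):
      while something_is_scrubbing():
          time.sleep(poll_interval)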
The script exits when there are no pools left to scrub or if it is after its exit time, currently Monday at 1am. (This doesn't quite mean that pool scrubbing will stop at Monday at 1am; it means that no pool scrubs will start after that point. Our biggest pools currently scrub in six and a half hours, so even in the worst case we should be done before 8am Monday.)
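Putting the pieces together, a sketch of the overall loop, reusing wait_for_scrubs() and mark_scrub_started() from the sketches above; the deadline handling matches the description (no new scrub starts after the exit time, but a running one is allowed to finish):

  import subprocess, time

  def pool_is_healthy(pool):
      # 'zpool list -H -o health' prints the pool's health, e.g. ONLINE.
      out = subprocess.run(["zpool", "list", "-H", "-o", "health", pool],
                           capture_output=True, text=True).stdout
      return out.strip() == "ONLINE"

  def scrub_pools(candidates, deadline):
      """Scrub candidate pools one at a time, oldest flag file first,
      until none are left or it is past the deadline (a Unix timestamp,
      e.g. Monday at 1am)."""
      for pool in candidates:
          if time.time() >= deadline:
              break                  # past the exit time: start nothing new
          if not pool_is_healthy(pool):
              continue               # skip unhealthy pools entirely
          mark_scrub_started(f"/{pool}/.scrub-stamp")  # hypothetical path
          subprocess.run(["zpool", "scrub", pool], check=True)
          wait_for_scrubs()          # from the sketch above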