Wandering Thoughts archives

2009-05-15

Why df on an NFS-mounted ZFS filesystem can give odd results

Suppose that you have a ZFS pool with various filesystems, and while the filesystems have data in them, you're just using the pool as a container; the top level pool filesystem has no data itself. In this situation, a df of the pool on the Solaris host will show something like:

Filesystem  size  used  avail
tank01      300G   20K    40G

Translated, we can see that the pool itself is 300G, the pool's top level filesystem has nothing in it, and there's 40G of unused space left in the pool; the rest is taken up by sub-filesystems, snapshots, and so on.
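(To make the arithmetic explicit, here's a quick sketch in Python using the example figures above; the numbers are just the illustrative ones from the df output, not anything real.)

GiB = 1024 ** 3

pool_size  = 300 * GiB   # the pool's total capacity ('size')
top_used   = 20 * 1024   # data in the top-level filesystem itself ('used')
pool_avail = 40 * GiB    # unallocated space left in the pool ('avail')

# Everything else is held by sub-filesystems, snapshots, and so on.
held_by_descendants = pool_size - top_used - pool_avail
print("held by descendants: %.1f GiB" % (held_by_descendants / GiB))   # ~260 GiB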

However, if you NFS mount the pool itself on a client and do a df on the client, what you will see is rather different:

Filesystem  size  used  avail
/tank01      40G   20K    40G

Suddenly your pool size has, well, disappeared.

Filesystems with quotas will show equally odd df results on NFS clients. If the pool has enough space left that the filesystem's size is limited by its quota, you will see the correct (quota-based) values for everything. However, if the pool starts running out of overall space the (total) size of the (quota-limited) filesystems starts shrinking, sometimes dramatically. All of this can be very alarming and upsetting to users, especially if it leads them to think that they haven't got space that they've paid for.
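To make the shrinking concrete, here is a hedged sketch of the arithmetic in Python; it anticipates the 'size is used plus available' explanation below, and the quota and pool figures are invented for illustration.

def reported_size(fs_used, fs_quota, pool_free):
    # What an NFS client will compute as the filesystem's size, assuming
    # the server reports size as 'used + available' (see below).
    avail = min(fs_quota - fs_used, pool_free)
    return fs_used + avail

G = 1024 ** 3
# A filesystem with a 100G quota and 30G of data in it:
print(reported_size(30*G, 100*G, 200*G) / G)  # plenty of pool space -> 100.0
print(reported_size(30*G, 100*G, 10*G) / G)   # pool nearly full     -> 40.0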

It turns out that all of this is because of a fundamental limit in the NFS v3 protocol combined with a decision made by the ZFS code (or perhaps the overall Solaris NFS server code). Filesystem information is queried by the NFS v3 FSSTAT operation, but the structure it returns only contains information about the total filesystem size and the remaining available space; there is no explicit field for 'space used'.

(NFS v3 FSSTAT does draw a distinction between 'free space' and 'free space that can be allocated', so it can handle various sorts of overhead and reserved space.)
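For reference, here is a rough Python transcription of the fields in the FSSTAT3 success reply from RFC 1813 (the post-op attributes and XDR details are elided):

from dataclasses import dataclass

@dataclass
class FSSTAT3resok:
    # Field names as in RFC 1813, section 3.3.18.
    tbytes: int    # total size of the filesystem, in bytes
    fbytes: int    # free space, in bytes
    abytes: int    # free space available to the (non-root) user, in bytes
    tfiles: int    # total file slots
    ffiles: int    # free file slots
    afiles: int    # free file slots available to the user
    invarsec: int  # seconds the filesystem is expected to stay unchanged
    # There is no 'space used' field; a df-style client has to compute it
    # as tbytes minus one of the free-space figures.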

This creates a dilemma for ZFS: do you return accurate total size and space available, leading to potentially completely inaccurate used figures, or do you make the total size be the sum of the space used and the space available, so clients show a correct used figure? As we can see, Solaris has chosen the second option.

(Okay, there's a third option: you could return the correct total size and an available space figure that was total size minus the used space. I think this would be even crazier than the other options.)
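Here is a hedged sketch of how the three choices look to a client, using the example figures from the df output above; the logic is my reconstruction of the tradeoff, not the actual Solaris code, and it treats the two free-space figures as equal for simplicity.

G = 1024 ** 3
pool_size, fs_used, pool_avail = 300 * G, 20 * 1024, 40 * G

def client_view(tbytes, abytes):
    # A df-style client derives 'used' from what FSSTAT hands it.
    return {"size": tbytes, "used": tbytes - abytes, "avail": abytes}

# Option 1: accurate size and avail; the client-computed 'used' comes out
# around 260G, wrongly charging the whole pool's consumption to this filesystem.
opt1 = client_view(pool_size, pool_avail)

# Option 2 (what Solaris does): size = used + avail, so 'used' is right but
# the apparent size shrinks to ~40G.
opt2 = client_view(fs_used + pool_avail, pool_avail)

# Option 3: accurate size with avail faked as size - used; now 'avail' claims
# roughly 300G that isn't actually there.
opt3 = client_view(pool_size, pool_size - fs_used)

for name, view in ("option 1", opt1), ("option 2", opt2), ("option 3", opt3):
    print(name, {k: round(v / G, 2) for k, v in view.items()})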

ZFSNFSOddDfExplained written at 01:23:40

2009-05-05

How we periodically scrub our ZFS pools

The problem with the simple approach to scrubbing your ZFS pools (just do 'zpool scrub ...' every so often) is that ZFS pool scrubs put enough of a load on our systems that we don't want to do them during the week and we don't want to do more than one of them at once (well, not more than one per fileserver). And we certainly don't want to have to manage the whole process by hand. So recently I wrote a script to automate the process.

The script's job is to scrub pools one by one during the weekend, if they haven't been scrubbed too recently and they're healthy. To tell if pools have been scrubbed recently, we keep a flag file in the root filesystem of the pool; the modification time of the file is when we kicked off the last scrub.

(As it happens, we don't use the root filesystem of our pools for anything and they're always mounted in a consistent place, so the flag file isn't disturbing anything and it's easy to find.)
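Here's a minimal Python sketch of the flag file idea; the flag file name and the two-week threshold are made up for illustration, and the real script uses find and ls rather than anything like this (as described below).

import os, time

FLAG = "/{pool}/scrubbed"    # hypothetical name and location for the flag file
MIN_AGE = 14 * 24 * 3600     # hypothetical: don't rescrub within two weeks

def due_for_scrub(pool):
    # The pool is a candidate if its flag file is old enough (or missing).
    flag = FLAG.format(pool=pool)
    try:
        mtime = os.stat(flag).st_mtime
    except FileNotFoundError:
        return True
    return time.time() - mtime > MIN_AGE

def mark_scrub_started(pool):
    # Reset the flag file's modification time to 'now' as a scrub kicks off.
    flag = FLAG.format(pool=pool)
    open(flag, "a").close()     # create it if it doesn't exist yet
    os.utime(flag, None)        # bump the mtime to the current time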

The script gets started from cron early on Saturday morning and then runs in the background, repeatedly starting a pool scrub and waiting for it to finish. In the Unix tradition of clubbing problems with existing programs, it uses find on the flag files to work out which ones are old enough that their pools are candidates for scrubbing, and then ls to order them from oldest to newest so that it can pick the oldest healthy pool. Waiting for pool scrubs to finish is done the brute force way; the script repeatedly runs 'zpool status' and waits until there are no 'scrub:' lines that indicate ongoing scrubs or resilvers.

(Except not. Because I am paranoid, it works the other way around; it throws away all 'scrub:' lines that it knows are good, and if there's anything left it assumes that a pool is still scrubbing or resilvering. This overcaution may cause us problems someday.)
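As a sketch, the waiting loop (with the paranoid inversion) might look something like this in Python; the 'known good' strings are guesses at the zpool status output of the time, not taken from the real script.

import subprocess, time

# 'scrub:' lines that we believe mean nothing is in progress; anything else
# is treated as an ongoing scrub or resilver.
KNOWN_IDLE = ("none requested", "completed", "stopped", "canceled")

def something_scrubbing():
    out = subprocess.run(["zpool", "status"],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        line = line.strip()
        if line.startswith("scrub:") and \
           not any(word in line for word in KNOWN_IDLE):
            return True     # an unrecognized 'scrub:' line -> assume busy
    return False

def wait_for_scrubs(poll_interval=300):
    # Brute force: poll 'zpool status' until no scrub or resilver remains.
    while something_scrubbing():
        time.sleep(poll_interval)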

The script exits when there are no pools left to scrub or when it is after its exit time, currently 1am on Monday. (This doesn't quite mean that pool scrubbing will stop at 1am on Monday; it means that no new pool scrubs will start after that point. Our biggest pools currently scrub in six and a half hours, so even in the worst case we should be done before 8am on Monday.)
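And a small sketch of computing the exit deadline (the Monday 1am after the Saturday the script starts); the real script's handling may well differ.

import datetime

def exit_deadline(start):
    # Given the start time (early Saturday morning), return the following
    # Monday at 1am, after which no new scrubs should be started.
    days_to_monday = (7 - start.weekday()) % 7     # Monday is weekday 0
    monday = (start + datetime.timedelta(days=days_to_monday)).date()
    return datetime.datetime.combine(monday, datetime.time(1, 0))

# For example, a start early on Saturday 2009-05-02:
print(exit_deadline(datetime.datetime(2009, 5, 2, 6, 0)))   # 2009-05-04 01:00:00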

ZFSPeriodicScrubbing written at 00:31:56
