Optimizing finding unowned files on our ZFS fileservers

September 6, 2015

One of the things we do every weekend is look for files on our fileservers that have wound up being owned by people who don't exist (or, more commonly, who no longer exist). For a long time this was done with the obvious approach using find, which was basically this:

SFS=$(... generate FS list ...)
gfind -H $SFS -mount '(' -nogroup -o -nouser ')' -printf ...

The problem with this is that we have enough data in enough filesystems that running a find over the entire thing can take a significant amount of time. On our biggest fileserver, we've seen this take on the order of ten hours, which either delays the start of our weekly pool scrubs or collides with them, slowing them down (and they can already be slow enough). Recently I realized that we can do much better than this by not checking most of our filesystems.

The trick is to use ZFS's existing infrastructure for quotas. As part of this, ZFS maintains information on the amount of space used by every user and every group on each filesystem, which the 'zfs userspace' and 'zfs groupspace' commands will print out. As a side effect this gives you a complete list of every UID and GID that uses space in the filesystem, so all we have to do is scan the lists to see if there are any unknown ones in them. If all UIDs and GIDs using space on the filesystem exist, we can completely skip running find on it; we know our find won't find anything.
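
As a rough illustration of the check, here is a minimal sketch in shell. It is not our actual script. The function names unknown_ids and needs_scan are made up for this example, and it assumes that 'zfs userspace' and 'zfs groupspace' accept -H (no header), -n (numeric IDs) and -o name, and that getent will look up a numeric UID or GID.

```shell
#!/bin/sh
# Sketch only: decide whether a ZFS filesystem needs the full find scan.
# unknown_ids and needs_scan are hypothetical helper names.

# Read one numeric ID per line on stdin and print those that are
# missing from the given getent database ("passwd" or "group").
unknown_ids() {
    db=$1
    while read -r id; do
        [ -n "$id" ] || continue
        getent "$db" "$id" >/dev/null 2>&1 || echo "$id"
    done
}

# Exit 0 if the filesystem has space charged to an unknown UID or GID,
# i.e. it is worth running find over it. If zfs itself fails we err on
# the side of scanning.
needs_scan() {
    fs=$1
    uids=$(zfs userspace -H -n -o name "$fs") || return 0
    gids=$(zfs groupspace -H -n -o name "$fs") || return 0
    bad=$(printf '%s\n' "$uids" | unknown_ids passwd
          printf '%s\n' "$gids" | unknown_ids group)
    [ -n "$bad" ]
}
```

With something like this, the list handed to find would be cut down to only those filesystems for which needs_scan succeeds, and find would not be run at all if the list came out empty. The sketch assumes it is given ZFS filesystem names, so if the existing list holds mount points it would need a translation step.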

Since our filesystems don't normally have any unowned files on them, this turns into a massive win. In the usual case we won't scan any filesystems on a fileserver, and even if we do scan some we'll generally only scan a handful. It may even make this particular process fast enough so that we can just run it after deleting accounts, instead of waiting for the weekend.

By the way, the presence of unknown UIDs or GIDs in the output of 'zfs *space' doesn't mean that there definitely are files that a find will pick up. The unowned files could be only in a snapshot, or they could be deleted files that are being held open by various things, including the NFS lock manager.


Comments on this page:

I'm surprised that you actually remove users instead of just disabling them. Care to expand on that?

By cks at 2015-09-07 00:36:52:

That's a sufficiently good question that I wound up writing an entry about the issue, WhyUserDeletion.

Written on 06 September 2015.


Last modified: Sun Sep 6 01:10:49 2015