An interesting case of NFS traffic (probably) holding a ZFS snapshot busy

August 22, 2016

We have a few filesystems on our fileservers that are considered sufficiently important that we take hourly snapshots during the working day. We use a simple naming and expiry scheme for these snapshots, where they're called <Day>-<Hour> (eg Tue-15) and the script simply deletes any old version before creating the new one. Both because it's the default and because it enables self-serve restores, we NFS-export the ZFS snapshots as well as the main filesystem. Recently that script threw up an error:

cannot destroy snapshot POOL/h/NNN@Mon-16: dataset is busy
cannot create snapshot 'POOL/h/NNN@Mon-16': dataset already exists
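
To make the failure mode concrete, here is a minimal Python sketch of the kind of rotation logic described above. It is not our actual script. The function names, the injectable `run` parameter, the zero-padded hour, and the choice to skip the create when the destroy fails are all assumptions made for illustration. It shows why the two errors come as a pair: if the old Mon-16 can't be destroyed, a blind create of the new Mon-16 is guaranteed to fail too.

```python
import subprocess
import sys
from datetime import datetime

# Fixed English day names, so the result does not depend on the locale.
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def snapshot_name(dataset: str, when: datetime) -> str:
    """Build a <Day>-<Hour> snapshot name such as POOL/h/NNN@Tue-15.

    Zero-padding the hour is an assumption of this sketch.
    """
    return f"{dataset}@{DAYS[when.weekday()]}-{when.hour:02d}"


def rotate(dataset: str, when: datetime, run=subprocess.run) -> bool:
    """Replace last week's snapshot of this name with a fresh one.

    `run` is injectable (hypothetical design choice) so the logic can be
    exercised without a real ZFS pool. Returns True if a new snapshot
    was created.
    """
    snap = snapshot_name(dataset, when)

    # 'zfs list' exits non-zero if the snapshot does not exist.
    old = run(["zfs", "list", "-H", "-t", "snapshot", "-o", "name", snap],
              capture_output=True, text=True)
    if old.returncode == 0:
        gone = run(["zfs", "destroy", snap], capture_output=True, text=True)
        if gone.returncode != 0:
            # For example 'dataset is busy'. Creating the snapshot now
            # would only fail with 'dataset already exists', so stop here.
            print(gone.stderr, file=sys.stderr, end="")
            return False

    made = run(["zfs", "snapshot", snap], capture_output=True, text=True)
    return made.returncode == 0
```

Calling `rotate("POOL/h/NNN", datetime.now())` from an hourly cron job would be the obvious way to use this. Stopping after a failed destroy means the old snapshot quietly stays in place until the next time that day and hour come around, which is what happened to us in practice.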

We believe that this ultimately happened because an hour or two beforehand, a runaway IMAP process was traversing its way through that ZFS snapshot via the NFS export. The runaway IMAP process had been terminated well before the snapshot rotation ran, but that might not have mattered; an NFS server doesn't know when an NFS client is done with the filehandles it has requested, so the server needs to guess and it may well guess conservatively (saying, for example, 'if I still have them in my server side cache, they're not old enough yet').

This was several weeks ago and the snapshot in question was quietly recycled a week later without any problems, so this did go away after a while. I can't even definitely say that past NFS activity in the snapshot was the problem; we haven't tried to reproduce it, and unfortunately as far as I know OmniOS lacks tools to give us visibility into this sort of thing (fuser reported nothing for the snapshot, for example, which is not surprising; there was no user-level activity on the fileserver that involved the snapshot).

This instance wasn't urgent and went away on its own. I'm not sure what we'd do if that weren't the case, because I don't know if there are any good ways of pushing the kernel to give up things like old(er) NFS filehandles and so on. Shutting down NFS service or rebooting the fileserver would probably do it, but both are rather drastic steps.

(It may be possible to write some DTrace to give us more information about why a dataset is still busy. Or, since DTrace is not always the answer to everything, possibly mdb can give us results too.)

Written on 22 August 2016.