An interesting case of NFS traffic (probably) holding a ZFS snapshot busy
We have a few filesystems on our fileservers
that are considered sufficiently important that we take hourly
snapshots during the working day. We use a simple naming and expiry
scheme for these snapshots, where they're called <Day>-<Hour> (eg
Tue-15) and the script simply deletes any old version before
creating the new one. Both because it's the default and because it
enables self-serve restores, we NFS-export the ZFS snapshots as
well as the main filesystem. Recently that script threw up an
error:
cannot destroy snapshot POOL/h/NNN@Mon-16: dataset is busy
cannot create snapshot 'POOL/h/NNN@Mon-16': dataset already exists
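To make the scheme concrete, here is a minimal sketch of the delete-then-create rotation in Python. This is not our actual script; the function names (snapshot_name, rotation_commands, rotate) are made up for illustration, and the only real commands it assumes are 'zfs destroy' and 'zfs snapshot'.

```python
import subprocess
from datetime import datetime


def snapshot_name(filesystem, when):
    """Build the <Day>-<Hour> snapshot name, eg POOL/h/NNN@Mon-16.

    Assumes the default C/English locale for the abbreviated day name.
    """
    return f"{filesystem}@{when.strftime('%a-%H')}"


def rotation_commands(filesystem, when):
    """Return the two commands for one rotation: destroy old, create new."""
    snap = snapshot_name(filesystem, when)
    return [["zfs", "destroy", snap], ["zfs", "snapshot", snap]]


def rotate(filesystem, when=None, run=subprocess.run):
    """Run the rotation and return a list of (command, exit status) pairs.

    Like the simple scheme described above, this does not stop if the
    destroy fails; it goes straight on to the snapshot step. The 'run'
    parameter exists only so the sketch can be exercised without ZFS.
    """
    when = when or datetime.now()
    results = []
    for cmd in rotation_commands(filesystem, when):
        completed = run(cmd, check=False)
        results.append((cmd, completed.returncode))
    return results
```

With this ordering, a destroy that fails with 'dataset is busy' leaves the old snapshot in place, so the following snapshot step would be expected to fail with 'dataset already exists', which is consistent with the pair of messages above.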
We believe that this ultimately happened because an hour or two beforehand, a runaway IMAP process was traversing its way through that ZFS snapshot via the NFS export. The runaway IMAP process had been terminated well before this, but that might not have been enough; an NFS server doesn't know when an NFS client is done with the filehandles it has requested, so the server needs to guess and it may well guess conservatively (saying, for example, 'if I still have them in my server side cache, they're not old enough yet').
This was several weeks ago and the snapshot in question was quietly
recycled a week later without any problems, so this did go away
after a while. I can't even definitely say that past NFS activity
in the snapshot was the problem; we haven't tried to reproduce it,
and unfortunately as far as I know OmniOS lacks tools to give us
visibility into this sort of thing (fuser reported nothing for
the snapshot, for example, which is not surprising; there was no
user-level activity on the fileserver that involved the snapshot).
This instance wasn't urgent and went away on its own. I'm not sure what we'd do if neither of these had been the case, because I don't know if there are any good ways of pushing the kernel to give up things like old(er) NFS filehandles and so on. Shutting down NFS service or rebooting the fileserver would probably do it, but both are rather drastic steps.
(It may be possible to write some DTrace to give us more information
about why a dataset is still busy. Or, since DTrace is not always the
answer to everything, possibly mdb can give us results too.)