Our ZFS fileservers aren't happy when you do NFS writes to a full filesystem
The ZFS pools on our fileservers all have overall pool quotas, ultimately because of how we sell storage to people, and we've historically had problems when a pool fills completely up to its quota limit and people keep writing to it. In the past, this has led to fileserver lockups. Today I got a reminder of something I think we've seen before, which is that we can also get problems when just a filesystem fills up to its individual quota limit even if the pool is still under its overall quota.
The symptoms are less severe, in that the fileserver in question
only gets fairly unresponsive to NFS (especially to the machine that
the writes were coming from) instead of locking up. This was somewhat
variable and may have primarily affected the particular filesystem
or perhaps the particular pool it's in, instead of all of the
filesystems and pools on the fileserver; I didn't attempt to gather
this data during the recent incident where I re-observed this, but
certainly some machines could still do things like issue
against the fileserver.
(This was of course our biggest fileserver.)
During the incident, the fileserver was generally receiving from the network at full line bandwidth; although I don't know for sure, I'm guessing that these were NFS writes. DTrace monitoring showed that it generally had several hundred outstanding NFS requests but wasn't actually doing much successful NFS IO (not surprising, if all of this traffic was writes that were getting rejected because the filesystem had hit its quota limits). Our fileservers used to get badly overloaded from too-fast NFS write IO in general, but that was fixed several years ago; still, this could be related.
Our DTrace stuff did report (very) long NFS operations, and that report eventually led me to the source of the writes and let me turn it off. When the writes stopped, the fileserver recovered almost immediately and became fully responsive, including to the NFS client machine that was most affected by this.
How relevant this is to current OmniOS CE and Illumos is an open question; we're still running the heavily unsupported OmniOS r151014, and not a completely up-to-date version of it. Nevertheless, I feel like writing it down. Perhaps now I'll remember to check for full filesystems the next time we have a mysterious fileserver problem.
(We will probably not attempt to investigate this at all on our current fileservers, since our next generation will not run any version of Illumos.)
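For the "check for full filesystems" step, a rough sketch of one possible check follows. It is not what actually runs on these fileservers. It assumes Python 3.7 or later is available and that `zfs list -H -p -o name,used,quota` reports an unset quota as 0 (the code also tolerates `none` and `-`). The function names and the 98% threshold are made up for illustration.

```python
import subprocess


def parse_zfs_list(text):
    """Parse tab-separated `zfs list -H -p -o name,used,quota` output.

    Returns a list of (name, used_bytes, quota_bytes) tuples.
    """
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        name, used, quota = line.split("\t")
        # Assumption: an unset quota shows up as 0 with -p; accept
        # "none" and "-" too in case the output differs.
        quota_bytes = 0 if quota in ("none", "-") else int(quota)
        rows.append((name, int(used), quota_bytes))
    return rows


def nearly_full(rows, threshold=0.98):
    """Return (name, used/quota) for datasets at or above the threshold.

    Datasets with no quota set are skipped.
    """
    return [
        (name, used / quota)
        for name, used, quota in rows
        if quota > 0 and used / quota >= threshold
    ]


def check_fileserver(threshold=0.98):
    """Run `zfs list` locally and report datasets that are close to quota."""
    out = subprocess.run(
        ["zfs", "list", "-H", "-p", "-t", "filesystem", "-o", "name,used,quota"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return nearly_full(parse_zfs_list(out), threshold)


if __name__ == "__main__":
    for name, fraction in check_fileserver():
        print(f"{name}: {fraction:.1%} of quota used")
```

Because a pool's top-level dataset is listed along with its filesystems, the same check would flag both a pool at its overall quota and an individual filesystem at its own quota, provided the pool quota is set on that top-level dataset.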