== Our ZFS fileservers aren't happy when you do NFS writes to a full filesystem

The ZFS pools on [[our fileservers ZFSFileserverSetupII]] all have overall pool quotas, ultimately because of [[how we sell storage to people ../sysadmin/HowWeSellStorage]], and we've historically had problems when a pool fills completely up to its quota limit and people keep writing to it. In the past, [[this has led to fileserver lockups ZFSNFSPoolQuotaProblem]]. Today I got a reminder of something I think we've seen before, which is that we can also get problems when just a filesystem fills up to its individual quota limit, even if the pool is still under its overall quota.

The symptoms are less severe, in that the fileserver in question only got fairly unresponsive to NFS (especially to the machine that the writes were coming from) instead of locking up. This was somewhat variable, and it may have primarily affected the particular filesystem, or perhaps the particular pool it's in, instead of all of the filesystems and pools on the fileserver; I didn't attempt to gather this data during the recent incident where I re-observed this, but certainly some machines could still do things like issue _df_s against the fileserver.

(This was of course [[our biggest fileserver ../sysadmin/FileserversDesignedTooBig]].)

During the incident, the fileserver was generally receiving from the network at full line bandwidth; although I don't know for sure, I'm guessing that this traffic was NFS writes. DTrace monitoring showed that the fileserver generally had several hundred outstanding NFS requests but wasn't actually doing much successful NFS IO (not surprising, if all of this traffic was writes that were getting rejected because the filesystem had hit its quota limit). Our fileservers used to get badly overloaded from too-fast NFS write IO in general, but [[that was fixed several years ago OmniOSNFSOverloadStatus]]; still, this could be related.
[[Our DTrace stuff ZFSDTraceScripts]] did report (very) long NFS operations, and that report eventually led me to the source of the writes and let me turn them off. When the writes stopped, the fileserver recovered almost immediately and became fully responsive, including to the NFS client machine that was most affected by this.

How relevant this is to current [[OmniOS CE https://omniosce.org/]] and Illumos is an open question; we're still running the heavily unsupported OmniOS r151014, and not a completely up to date version of it at that. Nevertheless, I feel like writing it down. Perhaps now I'll remember to check for full filesystems the next time we have a mysterious fileserver problem.

(We will probably not attempt to investigate this at all on our current fileservers, since [[our next generation of fileservers will not run any version of Illumos IllumosNoFutureHere]].)
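(As a sketch of what "check for full filesystems" might look like, here is a minimal and hypothetical Python helper that flags filesystems at or near their quota, given the output of 'zfs list -Hp -o name,used,quota'; the function name, threshold, and sample data are all my own illustrative inventions, not anything from our actual tooling.)

```python
# Sketch: flag ZFS filesystems at or near their quota, from the output of
#   zfs list -Hp -o name,used,quota
# (-H: no header line, -p: exact byte values; fields are tab-separated.
#  A quota of 0 means no quota is set on that filesystem.)

def near_quota(zfs_list_output, threshold=0.95):
    """Return (name, used, quota) tuples for filesystems whose usage
    is at or above the given fraction of their quota."""
    full = []
    for line in zfs_list_output.strip().splitlines():
        name, used, quota = line.split("\t")
        used, quota = int(used), int(quota)
        if quota > 0 and used >= threshold * quota:
            full.append((name, used, quota))
    return full

# Example with made-up 'zfs list' output (filesystem names are invented):
sample = "tank/h/281\t9980000000\t10000000000\ntank/h/282\t100\t10000000000\n"
print(near_quota(sample))  # -> [('tank/h/281', 9980000000, 10000000000)]
```

In practice you would feed this the real command's output (e.g. via subprocess) from a cron job or a monitoring check; the point is just that spotting a quota-full filesystem is cheap compared to diagnosing a mysteriously unresponsive fileserver after the fact.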