The status of our problems with overloaded OmniOS NFS servers
Back at the start of May, we narrowed down our production OmniOS problems to the fact that OmniOS NFS servers have problems with sustained 'too fast' write loads. Since then there have been two pieces of progress and today I feel like writing about them.
The first is that this was identified as a definite Illumos issue. It turns out that Nexenta stumbled over this and fixed it in their own tree. That commit has since been upstreamed to Illumos master (with an accompanying Illumos issue) and has made it into the repo for OmniOS r151014, although I believe it's not yet in a released update. After I emailed the OmniOS mailing list, OmniTI's Dan McDonald did the digging to find the Nexenta change and built us a kernel with it patched in. We were able to run that kernel in our test environment, where it passed with flying colors. This is clearly our long term solution to the problem.
(In case it's not obvious, Dan McDonald was super helpful to us here, which we're quite grateful for. Practically the moment I sent in my initial email, our problem was on the way to getting solved.)
In the short term, we found out that taking a fileserver from 64 GB of RAM to 128 GB of RAM left us unable to reproduce the problem, both in our test environment and on the production fileserver that was having problems. In addition, it appears to make our test fileserver significantly more responsive under heavy load. Currently the production fileserver is running without problems with 128 GB of RAM and 4096 NFS server threads (and an increase in kernel rpcmod parameters to go with it). It has definitely survived getting into memory use situations that we'd have expected to lock it up based on prior experience.
(At the moment we've only upgraded the one problem fileserver to 128 GB and left the others at 64 GB. The others get much less load due to some decisions we made during the migration from the old fileservers to our current ones.)
We still have some other issues with our OmniOS fileservers, but for now the important thing is that we have what seems to be a stable production fileserver environment. After all our problems getting here, that is a very big relief. We can live with 1G Ethernet instead of 10G; we can't live with fileservers that lock up under load.