Sometimes knowing causes does you no good (and sensible uses of time)
Yesterday, I covered our OmniOS fileserver problem with overload and mentioned that the core problem seems to be (kernel) memory exhaustion. Of course once we'd identified this I immediately started coming up with lots of theories about what might be eating up all the memory (and then not giving it back), along with potential ways to test these theories. This is what sysadmins do when we're confronted with problems, after all; we try to understand them. And it can be peculiarly fun and satisfying to run down the root cause of something.
(For example, one theory is 'NFS TCP socket receive buffers', which would explain why it seems to need a bunch of clients all active.)
Then I asked myself an uncomfortable question: was this going to actually help us? Specifically, was it particularly likely to get us any closer to having OmniOS NFS fileservers that did not lock up under surges of too-high load? The more I thought about that, the more gloomy I felt, because the cold hard answer is that knowing the root cause here is unlikely to do us any good.
Some issues are ultimately due to simple and easily fixed bugs, or turn out to have simple configuration changes that avoid them. It seems unlikely that either is the case here; instead it seems much more likely that the cause is a badly designed part of the Illumos NFS server code. Fixing a bad design is never a simple code change, and bad designs can rarely be avoided with configuration changes. Any fix is likely to be slow to appear and require significant work on someone's part.
This leads to the really uncomfortable realization that it is probably not worth spelunking this issue to explore and test any of these theories. Sure, it'd be nice to know the answer, but knowing the answer is not likely to get us much closer to a fix to a long-standing and deep issue. And what we need is that fix, not to know what the cause is, because ultimately we need fileservers that don't lock up every so often if things go a little bit wrong (because things go a little bit wrong on a regular basis).
This doesn't make me happy, because I like diagnosing problems and finding root causes (however much I gripe about it sometimes); it's neat and gives me a feeling of real accomplishment. But my job is not about feelings of accomplishment; it's about giving our users reliable fileservice, and it behooves me to spend my finite time on the things that are most likely to result in that. Right now that does not appear to involve diving into OmniOS kernel internals or coming up with clever ways to test theories.
(If we had a lot of money to throw at people, perhaps the solution would be 'root cause the problem then pay Illumos people to do the kernel development needed to fix it'. But we don't have anywhere near that kind of money.)