Our current generation fileservers have turned out to be too big
We have three production hard drive based NFS fileservers (and one special fileserver that uses SSDs). As things have turned out, usage is not balanced evenly across all three; one of them has by far the most space assigned and used on it and unsurprisingly is also the most active and busy fileserver.
(In retrospect, putting all of the pools for the general research group that most heavily uses our disk space on the same fileserver was perhaps not our best decision ever.)
It has been increasingly obvious to us for some time that this fileserver is simply too big. It hosts too much storage that is too actively used and as a result it's the server that most frequently encounters serious system issues and in general it frequently runs sufficiently close to its performance edge that a little extra load can push it over the edge. Even when everything is going well it's big enough to be unwieldy; scheduling anything involving it is hard, for example, because so many people use it.
(This fileserver also suffers the most from our multi-tenancy, since so many of its disks are used for so many active things.)
However this fileserver is not fundamentally configured any differently than the other two. It doesn't have less memory or more disks; it simply makes more use of them than the other two do. This means that all three of our fileservers are too big as designed. The only reason the other two aren't also too big in operation today is that not enough people have been interested in using them, so they don't have anywhere near as much space used and aren't as active in handling NFS traffic.
Now, how we designed our fileservers is not quite how they've wound up being operated, since they're running at 1G for all networking instead of 10G. It's possible that running at 10G would make this fileserver not too big, but I'm not particularly confident about that. The management issues would still be there, there would still be a large impact on a lot of people if (and when) the fileserver ran into problems, and I suspect that we'd run into limitations on disk IOPS and how much NFS fileservice a single machine can do even if we went to the extreme where all the disks were local disks instead of iSCSI. So I believe that in our environment, it's most likely that any fileserver with that much disk space is simply too big.
As a result of our experiences with this generation of fileservers, our next generation is all but certain to be significantly smaller, just so something like this can't possibly happen with them. This probably implies a number of other significant changes, but that's going to be another entry.