Our current generation fileservers have turned out to be too big
We have three production hard drive based NFS fileservers (and one special fileserver that uses SSDs). As things have turned out, usage is not balanced evenly across all three; one of them has by far the most space assigned and used on it and unsurprisingly is also the most active and busy fileserver.
(In retrospect, putting all of the pools for the general research group that most heavily uses our disk space on the same fileserver was perhaps not our best decision ever.)
It has been increasingly obvious to us for some time that this fileserver is simply too big. It hosts too much storage that is too actively used and as a result it's the server that most frequently encounters serious system issues and in general it frequently runs sufficiently close to its performance edge that a little extra load can push it over the edge. Even when everything is going well it's big enough to be unwieldy; scheduling anything involving it is hard, for example, because so many people use it.
(This fileserver also suffers the most from our multi-tenancy, since so many of its disks are used for so many active things.)
However this fileserver is not fundamentally configured any differently than the other two. It doesn't have less memory or more disks; it simply makes more use of them than the other two do. This means that all three of our fileservers are too big as designed. The only reason the other two aren't also too big in operation today is that not enough people have been interested in using them, so they don't have anywhere near as much space used and aren't as active in handling NFS traffic.
Now, how we designed our fileservers is not quite how they've wound up being operated, since they're running at 1G for all networking instead of 10G. It's possible that running at 10G would make this fileserver not too big, but I'm not particularly confident about that. The management issues would still be there, there would still be a large impact on a lot of people if (and when) the fileserver ran into problems, and I suspect that we'd run into limitations on disk IOPS and how much NFS fileservice a single machine can do even if we went to the extreme where all the disks were local disks instead of iSCSI. So I believe that in our environment, it's most likely that any fileserver with that much disk space is simply too big.
As a result of our experiences with this generation of fileservers, our next generation is all but certain to be significantly smaller, just so something like this can't possibly happen with them. This probably implies a number of other significant changes, but that's going to be another entry.
Why big Exim queues are a problem for us in practice
In light of my recent entry on how our mail system should probably be able to create backpressure, you might wonder why we even need to worry about 'too large' queue sizes in the first place. Exim generally performs quite well under load and doesn't have too many problems dealing with pretty large queues (provided that your machines have enough RAM and fast enough disks, since the queue lives on disk in multiple files). Even in our own mail system we've seen queues of a few thousand messages be processed quite fast and without any particular problem.
(In some ways this speed is a disadvantage. If you have an account compromise, Exim is often perfectly capable of spraying out large amounts of spam email much faster than you can catch and stop it.)
In general I think you always want to have some sort of maximum queue size, because a runaway client machine can submit messages (and have Exim accept them) at a frightening speed. Your MTA can't actually deliver such an explosion anywhere near as fast as the client can submit more messages, so sooner or later you will run into inherent limits like overly-large directories that slow down everything that touches them or queue runners that are spending far too long scanning through hundreds of thousands of messages looking for ones to retry.
(A runaway client at this level might seem absurd, but with scripts, crontab, and other mistakes you can have a client generate tens of complaint messages a second. Every second.)
In our environment in specific, the problem is local delivery, especially people who filter local delivery for some messages into their home directories. Our NFS fileservers can only do so many operations a second, total, and when you hit that limit everyone starts being delayed, not just the MTA (or the server the MTA is running on). If a runaway surge of email is all directed to a single spot or to a small number of spots, we've seen the resulting delivery volume push an already quite busy NFS fileserver into clear overload, which ripples out to many of our machines. This means that a surge of email doesn't just affect the target of the surge, or even our mail system in general; under the wrong circumstances, it can affect our entire environment.
(A surge of delivery to
/var/mail is more tolerable for various
reasons, and a surge of delivery to external addresses is pretty
close to 'we don't care unless the queue becomes absurdly large'.
Well, apart from the bit where it might be spam and high outgoing
volumes might get our outgoing email temporarily blacklisted in
Ironically this is another situation where Exim's great efficiency is working against us. If Exim was not as fast as it is, it would not be able to process so many deliveries in such a short amount of time and thus it would not be hitting our NFS fileservers as hard. A mailer that maxed out at only a few local deliveries a second would have much less impact here.