Why big Exim queues are a problem for us in practice
In light of my recent entry on how our mail system should probably be able to create backpressure, you might wonder why we even need to worry about 'too large' queue sizes in the first place. Exim generally performs quite well under load and doesn't have too many problems dealing with pretty large queues (provided that your machines have enough RAM and fast enough disks, since the queue lives on disk in multiple files). Even in our own mail system we've seen queues of a few thousand messages be processed quite fast and without any particular problem.
(In some ways this speed is a disadvantage. If you have an account compromise, Exim is often perfectly capable of spraying out large amounts of spam email much faster than you can catch and stop it.)
In general I think you always want to have some sort of maximum queue size, because a runaway client machine can submit messages (and have Exim accept them) at a frightening speed. Your MTA can't actually deliver such an explosion anywhere near as fast as the client can submit more messages, so sooner or later you will run into inherent limits like overly-large directories that slow down everything that touches them or queue runners that are spending far too long scanning through hundreds of thousands of messages looking for ones to retry.
(A runaway client at this level might seem absurd, but between scripts, crontab entries, and other mistakes you can have a client generate tens of complaint messages a second. Every second. At even 20 messages a second, that is 72,000 messages an hour.)
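As a rough illustration of what checking against a maximum queue size could look like, here is a minimal sketch in Python. It relies on the fact that Exim keeps each queued message as a pair of files in its spool `input` directory, with the header file's name ending in `-H`. The spool path, the threshold, and the function names are all assumptions made up for this example, not something from our actual setup; the spool location in particular varies between systems.

```python
import os

# Hypothetical values for illustration; adjust for your own system.
SPOOL_INPUT_DIR = "/var/spool/exim/input"  # assumed path, differs by distribution
MAX_QUEUED_MESSAGES = 5000                 # assumed threshold, pick your own


def count_queued_messages(input_dir):
    """Count queued messages by counting '-H' header files.

    os.walk() is used so that this also works when the spool is split
    into subdirectories. A missing or unreadable directory counts as 0.
    """
    count = 0
    for _root, _dirs, files in os.walk(input_dir):
        count += sum(1 for name in files if name.endswith("-H"))
    return count


def queue_too_large(input_dir, max_messages):
    """Return True if the queue holds more than max_messages messages."""
    return count_queued_messages(input_dir) > max_messages


if __name__ == "__main__":
    queued = count_queued_messages(SPOOL_INPUT_DIR)
    if queue_too_large(SPOOL_INPUT_DIR, MAX_QUEUED_MESSAGES):
        print(f"queue is too large: {queued} messages")
    else:
        print(f"queue is fine: {queued} messages")
```

This only detects an oversized queue; what to do about it (alerting, refusing new submissions, and so on) is a separate decision. If you would rather not walk the spool yourself, `exim -bpc` prints the number of messages in the queue.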
In our environment specifically, the problem is local delivery, especially people who filter local delivery for some messages into their home directories. Our NFS fileservers can only do so many operations a second, total, and when you hit that limit everyone starts being delayed, not just the MTA (or the server the MTA is running on). If a runaway surge of email is all directed to a single spot or to a small number of spots, we've seen the resulting delivery volume push an already quite busy NFS fileserver into clear overload, which ripples out to many of our machines. This means that a surge of email doesn't just affect the target of the surge, or even our mail system in general; under the wrong circumstances, it can affect our entire environment.
(A surge of delivery to /var/mail is more tolerable for various reasons, and a surge of delivery to external addresses is pretty close to 'we don't care unless the queue becomes absurdly large'. Well, apart from the bit where it might be spam and high outgoing volumes might get our outgoing email temporarily blacklisted in general.)
Ironically, this is another situation where Exim's great efficiency is working against us. If Exim were not as fast as it is, it would not be able to process so many deliveries in such a short amount of time and thus it would not be hitting our NFS fileservers as hard. A mailer that maxed out at only a few local deliveries a second would have much less impact here.