Why big Exim queues are a problem for us in practice

June 30, 2017

In light of my recent entry on how our mail system should probably be able to create backpressure, you might wonder why we even need to worry about 'too large' queue sizes in the first place. Exim generally performs quite well under load and doesn't have too many problems dealing with pretty large queues (provided that your machines have enough RAM and fast enough disks, since the queue lives on disk in multiple files). Even in our own mail system we've seen queues of a few thousand messages be processed quite fast and without any particular problem.

(In some ways this speed is a disadvantage. If you have an account compromise, Exim is often perfectly capable of spraying out large amounts of spam email much faster than you can catch and stop it.)

In general I think you always want to have some sort of maximum queue size, because a runaway client machine can submit messages (and have Exim accept them) at a frightening speed. Your MTA can't actually deliver such an explosion anywhere near as fast as the client can submit more messages, so sooner or later you will run into inherent limits, like overly large directories that slow down everything that touches them or queue runners that spend far too long scanning through hundreds of thousands of messages looking for ones to retry.

(A runaway client at this level might seem absurd, but with scripts, crontab, and other mistakes you can have a client generate tens of complaint messages a second. Every second.)
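(To put rough numbers on this, here is a minimal sketch of the arithmetic in Python. The function name and all of the rates and limits are made up for illustration; they are not measurements from our mail system and this is not anything Exim itself does.)

```python
def seconds_until_limit(submit_rate, deliver_rate, limit, start=0):
    """Estimate how long until a queue of `start` messages reaches `limit`.

    submit_rate and deliver_rate are in messages per second. Both are
    hypothetical inputs; real rates vary over time.
    Returns None if the queue never reaches the limit.
    """
    if start >= limit:
        # Already at or over the limit.
        return 0.0
    growth = submit_rate - deliver_rate
    if growth <= 0:
        # Delivery keeps up with submission, so the queue does not grow.
        return None
    return (limit - start) / growth


# Assumed numbers: a client submitting 30 messages a second against
# 5 deliveries a second, with trouble starting at 100,000 queued messages.
print(seconds_until_limit(30, 5, 100_000))
```

With these assumed numbers the queue grows by 25 messages a second, so it takes a bit over an hour to reach a hundred thousand messages. A real runaway client left alone overnight has far more time than that.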

In our environment specifically, the problem is local delivery, especially people who filter local delivery for some messages into their home directories. Our NFS fileservers can only do so many operations a second, total, and when you hit that limit everyone starts being delayed, not just the MTA (or the server the MTA is running on). If a runaway surge of email is all directed to a single spot or to a small number of spots, we've seen the resulting delivery volume push an already quite busy NFS fileserver into clear overload, which ripples out to many of our machines. This means that a surge of email doesn't just affect the target of the surge, or even our mail system in general; under the wrong circumstances, it can affect our entire environment.

(A surge of delivery to /var/mail is more tolerable for various reasons, and a surge of delivery to external addresses is pretty close to 'we don't care unless the queue becomes absurdly large'. Well, apart from the bit where it might be spam and high outgoing volumes might get our outgoing email temporarily blacklisted in general.)

Ironically, this is another situation where Exim's great efficiency is working against us. If Exim were not as fast as it is, it would not be able to process so many deliveries in such a short amount of time, and thus it would not be hitting our NFS fileservers as hard. A mailer that maxed out at only a few local deliveries a second would have much less impact here.
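(As an illustration of what 'maxing out at a few local deliveries a second' amounts to, here is a minimal token bucket sketch in Python. This is a generic rate-limiting technique, not something Exim is claimed to do this way; the class name, the method name, and the rate and burst values are all hypothetical.)

```python
import time


class TokenBucket:
    """Allow at most `rate` actions per second on average, with short
    bursts of up to `burst` actions. All parameters are illustrative."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = float(rate)
        self.burst = float(burst)
        self.clock = clock          # injectable so the logic can be tested
        self.tokens = float(burst)  # start with a full bucket
        self.last = clock()

    def try_acquire(self):
        """Return True if one delivery may proceed now, False otherwise."""
        now = self.clock()
        # Refill for the time elapsed since the last call, capped at burst.
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Assumed policy: about 3 local deliveries a second, bursts of up to 5.
bucket = TokenBucket(rate=3, burst=5)
allowed = sum(1 for _ in range(100) if bucket.try_acquire())
print(allowed, "of 100 back-to-back delivery attempts allowed immediately")
```

A delivery attempt that gets False back would be left on the queue and retried later, which trades a larger and slower queue for less load on whatever the deliveries are hitting.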

Comments on this page:

By Twirrim at 2017-07-02 17:24:33:

When I last used Exim at scale (for an ISP), large mail queues were indicative of one thing: Spammers had got an account and were using us to send spam. On and off that was a regular pain in the neck.

One of the companies ours had bought used to have a really atrocious standard for sign ups, and unsurprisingly they had issues keeping a mail platform running, and the company opted to offload it on to us!

I had a series of one-liners that I could use to filter down the mail queue and figure out which account was the source, but I didn't really have the skills at the time to do that in a proper ongoing fashion. Sometimes I wish I could go back to then with the skills I have now and completely shake up that whole platform.

I do recall that we used XFS at the time for the file system that the queues lived on, because XFS handled quick, short-lived files very well, faster than ext4.

Written on 30 June 2017.


Last modified: Fri Jun 30 00:47:24 2017