The external delivery delays we see on our central mail machine

October 15, 2018

These days, our central mail machine almost always has a queue of delayed email that it's trying to deliver to the outside world. Sometimes this is legitimate email that's being delayed because of some issue on the remote end, ranging from the remote end being down or unreachable to the destination user being over quota. But quite a lot of the time it's some variety of email that we didn't successfully recognize as spam. Either we're forwarding it to some place that does recognize it (but that hasn't outright rejected the email yet), or we've had a bounce (or an autoreply) and we're trying to deliver that message to the envelope sender address, except that the sender in question doesn't have anything there to accept (or reject) it. This is very common behavior, for all that I object strongly to it.

(Things that we recognize as spam either don't get forwarded at all or go out through a separate machine, where bounces are swallowed. At the moment users can autoreply to spam messages if they work at it, although we try to avoid it by default and we're going to do better at that in the near future.)

Our central mail machine has pretty generous timeouts for delivering external email. For regular email, most destinations get six days, and for bounces or autoreplies most destinations get three days. These durations are somewhat arbitrary, so today I found myself wondering how long our successful external deliveries took and what the longest delays for successful deliveries actually were. The results surprised me.

(By 'external deliveries' I mean deliveries to all other mail systems, both inside and outside the university. I suppose I will call these 'SMTP deliveries' now.)

Most of my analysis was done on roughly the last 30 full days of SMTP deliveries. Over this time, we did about 136,000 successful SMTP deliveries to other systems. Of these, only 31,000 took longer than one second to be delivered (measured from receiving the message to the remote end accepting the SMTP transaction). That's about 23% of our messages, but it's still impressive that more than 75% of the messages were sent onward within a second. A further 15,800 completed in two seconds, while only 5,780 took longer than ten seconds; of those, 3,120 were delivered in 60 seconds or less.
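As a quick check on those percentages, here is a minimal Python sketch that turns the rounded counts quoted in this entry into shares of the total. The names (`TOTAL`, `SLOWER_THAN`, `percent`) are made up for illustration, and the inputs are the rounded figures from the text, not the raw log data.

```python
# Rounded counts as quoted in this entry (not the raw log data).
TOTAL = 136_000
SLOWER_THAN = {
    "1 second": 31_000,
    "10 seconds": 5_780,
    "6 minutes": 368,
}

def percent(count, total=TOTAL):
    """Return count as a percentage of total."""
    return 100.0 * count / total

for threshold, count in SLOWER_THAN.items():
    slow = percent(count)
    print(f"slower than {threshold}: {slow:.2f}% (at or under: {100 - slow:.2f}%)")
```

With these inputs, a bit under 23% of deliveries took longer than a second and roughly 4% took longer than ten seconds.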

Our Exim queue runner starts every five minutes, which means that a message that fails its first delivery or is queued for various reasons will normally see its first re-delivery attempt within five minutes or so. Actually there's a little bit of extra time because the queue runner may have to try other messages before it gets to you, so let's round this up to six minutes. Only 368 successfully delivered messages took longer than six minutes to be delivered, which suggests that almost everything is delivered or re-delivered in the first queue run in which it's in the queue. At this point I'm going to summarize:

  • 63 messages delivered in between 6 minutes and 11 minutes.
  • 252 messages delivered in between 11 minutes and an hour.
  • 24 messages delivered in between one and two hours.
  • 29 messages delivered in over two hours, with the longest five being delivered in 2d 3h 22m, 2d 3h 14m, 2d 0h 11m, 1d 5h 20m, and 1d 2h 39m respectively. Those are the only five that took over a day to be delivered.
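A breakdown like the one above can be sketched in Python roughly as follows. This is not the script actually used for the numbers here. It assumes you have already pulled queue times out of the logs as strings in the `2d 3h 22m` style used in this entry, and it treats each bucket's upper bound as inclusive, which is a guess. The names `parse_duration`, `BUCKETS`, and `bucket` are hypothetical.

```python
import re
from collections import Counter
from datetime import timedelta

# Seconds per unit letter used in durations such as "2d 3h 22m".
UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}

def parse_duration(text):
    """Turn a string like '2d 3h 22m' (spaces optional) into a timedelta."""
    parts = re.findall(r"(\d+)\s*([dhms])", text)
    if not parts:
        raise ValueError(f"no duration found in {text!r}")
    return timedelta(seconds=sum(int(n) * UNIT_SECONDS[u] for n, u in parts))

# Upper bounds of the buckets in the list above, in ascending order.
# Whether each bound is inclusive is an assumption.
BUCKETS = [
    (timedelta(minutes=6), "6 minutes or less"),
    (timedelta(minutes=11), "6 to 11 minutes"),
    (timedelta(hours=1), "11 minutes to an hour"),
    (timedelta(hours=2), "one to two hours"),
]

def bucket(delay):
    """Return the label of the first bucket whose upper bound covers delay."""
    for upper, label in BUCKETS:
        if delay <= upper:
            return label
    return "over two hours"

# The five longest delays quoted above, used here as sample input.
longest = ["2d 3h 22m", "2d 3h 14m", "2d 0h 11m", "1d 5h 20m", "1d 2h 39m"]
counts = Counter(bucket(parse_duration(d)) for d in longest)
print(counts)
```

All five sample delays land in the "over two hours" bucket, as you'd expect. Feeding in every delivery's queue time instead would produce the full table.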

We have mail logs going back 400 days, and over that entire time only 45 messages were successfully delivered with longer queue times than our 2d 3h 22m champion from the past 30 days. On the other hand, our long timeouts are actually sort of working; 12 of those 45 messages took at least five days to be delivered. One lucky message was delivered in 6d 0h 2m, which means that it was undergoing one last delivery attempt before being expired.
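To make the 'one last delivery attempt' arithmetic concrete, here is a small Python sketch comparing a queue time against the timeouts described earlier. The `TIMEOUTS` table and `time_past_expiry` are illustrative names, not anything from our Exim configuration, and the six and three day figures only apply to 'most destinations'.

```python
from datetime import timedelta

# Timeouts as described in this entry; the dictionary keys are made up here.
TIMEOUTS = {"regular": timedelta(days=6), "bounce": timedelta(days=3)}

def time_past_expiry(queue_time, kind="regular"):
    """How far past (positive) or short of (negative) the timeout a delivery was."""
    return queue_time - TIMEOUTS[kind]

lucky = timedelta(days=6, hours=0, minutes=2)
print(time_past_expiry(lucky))  # 0:02:00
```

The lucky message was delivered two minutes after its nominal six-day timeout.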

Despite how little good our relatively long expiry times are doing for successfully delivered messages, we probably won't change them. They seem to do a little bit of good every so often, and our queues aren't particularly large even when we have messages camped out in them going nowhere. But if we did get a sudden growth of queued messages that were going nowhere, it's reassuring to know that we could probably cut down our expire times quite significantly without really losing anything.

Written on 15 October 2018.
