The external delivery delays we see on our central mail machine
These days, our central mail machine almost always has a queue of delayed email that it's trying to deliver to the outside world. Sometimes this is legitimate email and it's being delayed because of some issue on the remote end, ranging from the remote end being down or unreachable to the destination user being over quota. But quite a lot of time it is for some variety of email that we didn't successfully recognize as spam and are either forwarding to some place that does (but that hasn't outright rejected the email yet) or we've had a bounce (or an autoreply) and we're trying to deliver that message to the envelope sender address, except that the sender in question doesn't have anything there to accept (or reject) it (which is very common behavior, for all that I object strongly to it).
(Things that we recognize as spam either don't get forwarded at all or go out through a separate machine, where bounces are swallowed. At the moment users can autoreply to spam messages if they work at it, although we try to avoid it by default and we're going to do better at that in the near future.)
Our central mail machine has pretty generous timeouts for delivering external email. For regular email, most destinations get six days, and for bounces or autoreplies most destinations get three days. These durations are somewhat arbitrary, so today I found myself wondering how long our successful external deliveries took and what the longest delays for successful deliveries actually were. The results surprised me.
(By 'external deliveries' I mean deliveries to all other mail systems, both inside and outside the university. I suppose I will call these 'SMTP deliveries' now.)
Most of my analysis was done on the last roughly 30 full days of SMTP deliveries. Over this time, we did about 136,000 successful SMTP deliveries to other systems. Of these, only 31,000 took longer than one second to be delivered (from receiving the message to the remote end accepting the SMTP transaction). That's still about 23% of our messages, but it's still impressive that more than 75% of the messages were sent onward within a second. A further 15,800 completed in two seconds, while only 5,780 took longer than ten seconds; of those, 3,120 were delivered in 60 seconds or less.
Our Exim queue running time is every five minutes, which means that a message that fails its first delivery or is queued for various reasons will normally see its first re-delivery attempt within five minutes or so. Actually there's a little bit of extra time because the queue runner may have to try other messages before it gets to you, so let's round this up to six minutes. Only 368 successfully delivered messages took longer than six minutes to be delivered, which suggests that almost everything is delivered or re-delivered in the first queue run in which it's in the queue. At this point I'm going to summarize:
- 63 messages delivered in between 6 minutes and 11 minutes.
- 252 messages delivered in between 11 minutes and an hour.
- 24 messages delivered in between one and two hours.
- 29 messages delivered in over two hours, with the longest five being delivered in 2d 3h 22m, 2d 3h 14m, 2d 0h 11m, 1d 5h 20m, and 1d 2h 39m respectively. Those are the only five that took over a day to be delivered.
We have mail logs going back 400 days, and over that entire time only 45 messages were successfully delivered with longer queue times than our 2d 3h 22m champion from the past 30 days. On the other hand, our long timeouts are actually sort of working; 12 of those 45 messages took at least five days to be delivered. One lucky message was delivered in 6d 0h 2m, which means that it was undergoing one last delivery attempt before being expired.
Despite how little good our relatively long expiry times are doing for successfully delivered messages, we probably won't change them. They seem to do a little bit of good every so often, and our queues aren't particularly large even when we have messages camped out in them going nowhere. But if we did get a sudden growth of queued messages that were going nowhere, it's reassuring to know that we could probably cut down our expire times quite significantly without really losing anything.
Garbage collection and the underappreciated power of good enough
Today I read The success of Go heralds that of Rust (via). In it the author argues that Rust is poised to take over from Go because Rust is a more powerful language environment. As one of their arguments against Go, the author says:
All the remaining issues with Go stem from three design choices:
- It’s garbage collected, rather than having compile time defined lifetimes for all its resources. This harms performance, removes useful concepts (like move semantics and destructors) and makes compile-time error checking less powerful.
Fix those things, and Go essentially becomes the language of the future. [...]
I think the author is going to be disappointed about what happens next, because in my opinion they've missed something important about how people use programming languages and what matters to people.
To put it simply, good garbage collection is good enough in practice for almost everyone and it's a lot easier than the fully correct way of carefully implementing lifetimes for all resources. And Go has good garbage collection, because the Go people considered it a critical issue and spent a lot of resources on it. This means that Go's garbage collection is not an issue in practice for most people, regardless of what the author thinks, and thus that Rust's compile time defined lifetimes are not a compelling advantage for Rust that will draw many current (or future) Go users from Go to Rust.
There are people who need the guarantees and minimal memory usage that lifetimes for memory gives you, and there are programs that blow up under any particular garbage collection scheme. But most people do not have such needs or such programs. Most programs are perfectly fine with using some more memory than they strictly need and spending some more CPU time on garbage collection, and most people are fine with this because they have other priorities than making their programs run as fast and with as minimal resource use as possible.
(These people are not being 'lazy'. They are optimizing what matters to them in their particular environment, which may well be things like speed to deployment.)
(Good enough changes over time and over scale. What is good enough at small scale is not good enough at larger scale, but many times you never need to handle that larger scale.)
What matters to most people most of the time is not perfection, it is good enough. They may go beyond that, but you should not count on it, and especially you should not count on perfection pulling lots of people away from good enough. We have plenty of proof that it doesn't.