A learning experience: internal mail flow should never be allowed to bounce

November 15, 2012

The university runs a central email system for all undergraduates. Last week that system started bouncing incoming email, and in doing so it taught me an uncomfortable lesson that I now need to apply to our own mail environment.

You see, the university doesn't actually run this email system; almost all of it is outsourced to a third-party email provider. While the undergraduate email domain is MX'd to university machines, they're just a relay; they immediately shuffle incoming mail off to the outside provider, which stores it, provides access to it, and so on. The piece that broke down last week was the relaying step; the domain name the university relays to stopped resolving and so the relay machines started bouncing email with errors about 'unresolvable destination <blah>'.

The problem with bouncing email here is that this was not normal SMTP mail (where delivery failures are routine and a bounce is the expected way to report them). This was mail flowing between two internal components that happened to use SMTP as the transport protocol, and it was never supposed to fail. If some piece of your internal mail flow fails, it's an internal problem. Bouncing mail on these failures turns internal failures into external ones that your users and their correspondents get to see.

In short: failures of internal mail flows should never produce bounces, even if your internal mail flows are done by having regular mailers send messages back and forth via SMTP. If there is an internal failure, what you want to happen is for the messages involved to be preserved somehow (either frozen in place or moved out of the way). Then when the problem is resolved, you can revive the affected messages and have them continue on (just delayed).
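To make the 'freeze it, then revive it later' idea concrete, here is a minimal sketch in Python. It is not how any real mailer implements this; `InternalHop`, `Message`, and the `deliver` callable are all names I made up for illustration. The only assumption is that a failed handoff shows up as an `OSError`, which covers both DNS lookup failures and `smtplib` errors in Python 3.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    DELIVERED = "delivered"
    FROZEN = "frozen"


@dataclass
class Message:
    msg_id: str
    recipient: str


@dataclass
class InternalHop:
    """Hypothetical model of one internal SMTP handoff."""
    frozen: list = field(default_factory=list)

    def handle(self, msg, deliver):
        # 'deliver' is any callable that raises OSError on failure
        # (socket.gaierror and smtplib.SMTPException are both OSErrors).
        try:
            deliver(msg)
        except OSError:
            # Internal failure: there is deliberately no bounce path here.
            # The message is kept so it can be retried later.
            self.frozen.append(msg)
            return Action.FROZEN
        return Action.DELIVERED

    def thaw(self, deliver):
        """Retry every frozen message; ones that still fail are re-frozen."""
        pending, self.frozen = self.frozen, []
        return [self.handle(msg, deliver) for msg in pending]
```

The point of the sketch is what is missing: on this hop there is simply no code path that generates a bounce. Once the underlying problem is fixed, calling `thaw()` with a working `deliver` sends the held messages on their way, just delayed.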

This sounds obvious and you may all be nodding along sagely, but guess what our own mail system doesn't do? Our mail system has internal flows just like the central undergrad email system and all of them are susceptible to this problem. If something goes wrong in our internal mail flow, we too will bounce messages and lose email in the process.

(In addition, parts of our email system specify the next-hop flow destination by name instead of by IP address, so we are one DNS issue away from an explosion.)
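As a rough illustration of the DNS case, here is a sketch of a next-hop lookup that treats a resolution failure on an internal hop as 'hold the message' instead of 'bounce it'. The function name, the `internal` flag, and the policy strings are my own inventions, not any mailer's configuration; `socket.getaddrinfo` and `socket.gaierror` are the real standard library calls. Bouncing on any lookup failure for non-internal mail is a simplification made for contrast, since real mailers normally retry temporary DNS errors there too.

```python
import socket


def next_hop_policy(hostname, internal, resolver=socket.getaddrinfo):
    """Return (policy, addresses) for a next-hop name lookup.

    policy is "deliver", "defer" or "bounce". The resolver argument
    exists so the lookup can be replaced when experimenting.
    """
    try:
        infos = resolver(hostname, 25, type=socket.SOCK_STREAM)
    except socket.gaierror:
        # An internal hop that does not resolve is our problem, not the
        # sender's, so the message is held for a later retry.
        return ("defer" if internal else "bounce"), []
    # Each getaddrinfo() entry ends with a sockaddr whose first item is the IP.
    return "deliver", sorted({info[4][0] for info in infos})
```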

The embarrassing thing about this for me is that this should not be a new observation. We (and by that I mean 'I') have actually fumbled the internal flow of our mail system in the past, leading to a not insignificant amount of bounced email. But the stupidity of the whole 'a should-never-happen problem in the mail system internals causes user-visible bounces' situation did not strike me at the time, for whatever reason.

(I think it's partly because at the time I was thinking of my failure as a general mail system configuration mistake, and it's very hard to avoid significant failures there from causing bounces. Only now did I think about the specifics of a failure during an SMTP-based handoff and why this results in user-visible bounces.)

PS: to make it extremely explicit, I don't think that the people responsible for the central undergraduate email system are stupid for missing this and having email bounce on them. As I mentioned, I missed this too despite having it smack me in the face at one point. This could have been us and more or less was us in the past; that's why it's an uncomfortable lesson.

Written on 15 November 2012.

Last modified: Thu Nov 15 01:16:30 2012