2014-06-06
On the Internet, weirdness is generally uncommon
One of the things that my exposure to SMTP daemons and SMTP's oddities has shown me vividly is that, perhaps surprisingly, weirdness is uncommon on the practical Internet. Most clients and servers, perhaps almost all of them, do the usual, common thing. For example, SMTP may contain very dark corners but these corners are also dank and unused, so dank and unused that your MTA may never encounter them.
(I can't find any trace of route addresses in 90 days of our mail gateway's logs of incoming traffic. Senders present quoted local parts infrequently, but they appear to all be spam; we block them all and have never had any reports of problems.)
This practical conservatism is in my view essential for keeping the Internet humming along. The Internet has a certain amount of carefully written software that was programmed by people who had assiduously read all of the relevant standards, and then it has a lot more software that was slapped together by people with various amounts of ignorance. If people used the obscure corners very much, much of the latter software would explode spectacularly. Worse, the burden of implementing Internet software would go up a lot in practice because you could no longer get away with just handling the easy, common cases.
(I'm a pragmatist. An Internet with less software would almost certainly be a smaller Internet. A non-compliant SMTP sender is annoying, but it usually gets the job done for people who are using it.)
The corollary of this is that a lot of Internet software out there probably doesn't handle corner cases or unusual situations very well, either through conscious choice or just because the authors weren't aware of them. There are consequences here both for security and for pragmatic interoperability.
Of course every so often you will stumble over someone who is sending you something from the dark depths. That the Internet is very big means that very uncommon things do happen every so often just through the law of large numbers. I'm sure that somewhere out on the net there are systems exchanging email with route addresses and maybe someday one of them will email us.
(Another corollary is that sooner or later you will see unusual
errors, too. For example, we reject a certain amount of email from
senders who have accented characters in unquoted local parts of
MAIL FROM addresses. This clearly violates the RFCs but is not
surprising.)
2014-06-05
SMTP's crazy address formats didn't come from nowhere
Broadly speaking, SMTP addresses have two crazy things in them: route addresses and quoted local parts. Route addresses theoretically give you a way of specifying a chain of steps the message is supposed to take on its way to (or from) its eventual destination:
RCPT TO:<@a.ex.org,@barney:user@fred.dibney>
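To make the shape of a route address concrete, here is a minimal Python sketch that splits one into its hops and its final mailbox. The function name split_route_address is made up for illustration, and the sketch assumes it is handed only the text between the angle brackets. It does no validation of the hosts or the mailbox.

```python
def split_route_address(path):
    """Split '@hop1,@hop2:user@host' into (hops, mailbox).

    Hypothetical helper for illustration only. A path without a
    route prefix comes back as ([], path).
    """
    route, sep, mailbox = path.partition(":")
    # A real route always starts with '@'. This check also keeps us from
    # mis-splitting a quoted local part that happens to contain ':'.
    if not sep or not route.startswith("@"):
        return [], path
    hops = [hop.strip().lstrip("@") for hop in route.split(",")]
    return hops, mailbox


print(split_route_address("@a.ex.org,@barney:user@fred.dibney"))
# (['a.ex.org', 'barney'], 'user@fred.dibney')
```

The hops are listed in the order they appear in the address; what a given MTA does with them is a separate question.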
Quoted local parts allow you to use any random characters and character sequences in the local mailbox name:
MAIL FROM:<"abney <abdef> ...%barney"@example.org>
(As I grumbled about yesterday, quoted local parts drastically increase the complexity of parsing modern SMTP commands.)
Here is the thing: these two features of SMTP addresses did not come from nowhere. When the very first SMTP RFCs were written, these features were necessary. Really.
Quoted local mailbox names have an obvious rationale: they accommodate systems that have local logins (or mailbox names) that do not fit into the simple allowable format that you can use without quoting. The obvious big case that needs this is any local mailbox with a space in the name. Today we don't do that (we tend to use dots), but I'm sure there were systems on the original ARPANet where people had mailbox names of 'Jane Smith' (instead of the Jane.Smith that we'd insist on today). I believe that one of the reasons for this is that people did not want to require a conversion layer in mailers between the true mailbox names (with spaces and funny characters) and the external, RFC-approved mailbox names that could be used in email.
(I can see at least one sensible reason for this: the less software that had to be written to get a system hooked up to ARPANet SMTP, the more likely that it would be and thus that ARPANet SMTP would actually get widely used.)
Equally, route addresses make a lot of sense in an environment where many systems are not directly on the ARPANet and no one has yet built the whole infrastructure of forwarding MTAs, internal versus external mail remapping, and indirect addressing in the form of MX entries. After all, the early SMTP RFCs predate DNS. Here the SMTP RFC is providing a way to directly express multi-hop mail forwarding, something that was a reality on the early ARPANet.
(SMTP route addresses were not the only form this took, of course.
The '% hack' used to be very common, where 'a%b@c' implied that
c would actually send the message on to a@b. And there were
even more complicated fudges for more complex situations.)
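As a rough illustration of the '% hack', the following Python sketch computes the address that host c would forward 'a%b@c' to. The function name percent_hack_next is hypothetical. Treating the rightmost '%' as the one that becomes the new '@' when there are several is an assumption made here, not something the text above specifies.

```python
def percent_hack_next(address):
    """Return the address the current host would forward to, or None.

    Hypothetical helper: 'a%b@c' means that c sends the message on to a@b.
    With several '%' signs, this sketch assumes the rightmost one is
    turned into the new '@'.
    """
    local, sep, _domain = address.rpartition("@")
    if not sep or "%" not in local:
        return None  # no '@' at all, or nothing to unwrap
    user, _, next_host = local.rpartition("%")
    return f"{user}@{next_host}"


print(percent_hack_next("a%b@c"))   # a@b
print(percent_hack_next("user@c"))  # None
```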
Internet email and Internet email addresses are such a juggernaut today that it is easy to forget that once upon a time the world was smaller and SMTP mail was a scrappy upstart proposing a novel and unproven idea, one that had to interoperate with any number of existing systems if it wanted to have any chance of success.
(Note here that I'm talking exclusively of SMTP addresses, not the more complex soup that is how addresses appear in the headers of email messages.)
2014-06-04
Why I don't like SMTP command parameters
Modern versions of SMTP have added something called 'command
parameters'. These extend the MAIL FROM and RCPT TO commands
to add optional parameters to communicate, for example, the rough
size of a message that is about to be sent (that's RFC 1870). On the surface these appear
perfectly sensible and innocent:
MAIL FROM:<some@address.dom> SIZE=99999
That is, the parameters are tacked on as 'NAME=VALUE' pairs after
the address in the MAIL FROM or RCPT TO. Unfortunately this
innocent picture starts falling apart once you look at it closely
because RFC 5321 addresses
are crawling horrors of complexity.
From the example I gave you might think that parsing your MAIL FROM
line is simple; just look for the first space and everything after it
is parameters. Except that the local name of addresses can be quoted,
and when quoted it can contain spaces:
MAIL FROM:<"some person"@a.dom> SIZE=99999
Fine, you say, we'll look for '> '. Guess what quoted parts can
also contain?
MAIL FROM:<"some> person"@a.dom> SIZE=99999
Okay, you say, we'll look for the rightmost '> ' on the command line.
Surely that will do the trick?
MAIL FROM:<person@a.dom> SIZE=99999> BODY=8BITMIME
This is a MAIL FROM line with a perfectly valid address and then
a (maliciously) mangled SIZE parameter. You're probably going to
reject this client command, but are you going to reject it for the
right reason?
What the authors of RFC 5321 have created is a situation where you must do at least a basic parsing of the internal structure of the address just to find out where it ends. Especially in the face of potentially mangled input there is no simple way of determining the end of the address and the start of parameters, despite appearances. Yet the situation looks deceptively simple and a naive parser will work almost all of the time (even quoted local parts are rare, much less ones with wacky characters in them, and my final example is extremely perverse).
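A minimal Python sketch of the kind of 'basic parsing' this requires might look like the following. The function name split_mail_from is made up for illustration. It assumes the command starts with exactly 'MAIL FROM:<' (in any case) and that quoted local parts use backslash escapes. It only finds where the address ends and splits off the parameters; it does not validate the address or the parameter values.

```python
def split_mail_from(line):
    """Split 'MAIL FROM:<address> [NAME=VALUE ...]' into (address, params).

    Hypothetical helper for illustration; raises ValueError on anything
    it does not understand.
    """
    prefix = "MAIL FROM:<"
    if line[:len(prefix)].upper() != prefix:
        raise ValueError("not a MAIL FROM command")
    i = len(prefix)
    in_quote = False
    while i < len(line):
        ch = line[i]
        if in_quote:
            if ch == "\\":
                i += 1            # skip the escaped character
            elif ch == '"':
                in_quote = False
        elif ch == '"':
            in_quote = True
        elif ch == ">":
            break                 # first '>' outside quotes ends the address
        i += 1
    if i >= len(line):
        raise ValueError("unterminated address")
    address = line[len(prefix):i]
    rest = line[i + 1:]
    if rest and not rest.startswith(" "):
        raise ValueError("junk after address")
    params = {}
    for word in rest.split():
        name, _, value = word.partition("=")
        params[name.upper()] = value
    return address, params


print(split_mail_from('MAIL FROM:<"some> person"@a.dom> SIZE=99999'))
# ('"some> person"@a.dom', {'SIZE': '99999'})
print(split_mail_from('MAIL FROM:<person@a.dom> SIZE=99999> BODY=8BITMIME'))
# ('person@a.dom', {'SIZE': '99999>', 'BODY': '8BITMIME'})
```

In the second case the address comes out intact and the mangled SIZE value is handed back as is, so the caller can reject the command because SIZE is not a number instead of because the address looked wrong.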
I'm sure this was not exactly deliberate on the part of the RFC authors, because after all they're dealing with decades of complex history involving all sorts of baroque possible addressing. From its beginning SMTP was complicated by backwards compatibility requirements and could not, eg, dictate that local mailboxes had to fit into certain restrictions. I'm sure that the current RFC authors would have liked to throw all of this away and go for simple addresses with no quoted local parts and so on. They just couldn't get away with it.
There is a moral in here somewhere but right now I'm too grumpy to come up with one.
(For more background on the various SMTP extensions, see eg the Wikipedia entry.)
PS: note that a semi-naive algorithm may also misinterpret
'MAIL FROM:<a@b> SIZE=999>'. After all, it has a '>' right there as the
last character.