Wandering Thoughts archives

2014-06-06

On the Internet, weirdness is generally uncommon

One of the things that my exposure to SMTP daemons and SMTP's oddities has shown me vividly is that, perhaps surprisingly, weirdness is uncommon on the practical Internet. Most clients and servers (perhaps almost all of them) do the usual, common thing. For example, SMTP may contain very dark corners, but these corners are also dank and unused, so dank and unused that your MTA may never encounter them.

(I can't find any trace of route addresses in 90 days of our mail gateway's logs of incoming traffic. Senders do occasionally present quoted local parts, but these all appear to be spam; we block them all and have never had any reports of problems.)
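If you want to run the same sort of check on your own traffic, here is a minimal sketch. It assumes you have already pulled the envelope addresses out of your logs as plain strings without the surrounding '<' and '>'; the function names are made up for illustration, and the tests are deliberately crude (a leading '@' for a route address, a leading '"' for a quoted local part).

```python
from collections import Counter


def classify_address(addr):
    """Crudely classify an SMTP envelope address (given without '<' and '>')."""
    if addr.startswith("@"):
        return "route"    # source route such as @a.ex.org,@barney:user@fred.dibney
    if addr.startswith('"'):
        return "quoted"   # quoted local part such as "some person"@a.dom
    return "plain"


def tally(addresses):
    """Count how many addresses fall into each category."""
    return Counter(classify_address(a) for a in addresses)
```

Feeding it a day's worth of addresses gives a rough count of how much weirdness is actually arriving; it does not validate the addresses in any way.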

This practical conservatism is in my view essential for keeping the Internet humming along. The Internet has a certain amount of carefully written software that was programmed by people who had assiduously read all of the relevant standards, and then it has a lot more software that was slapped together by people with various amounts of ignorance. If people used the obscure corners very much, much of the latter software would explode spectacularly. Worse, the burden of implementing Internet software would go up a lot in practice because you could no longer get away with just handling the easy, common cases.

(I'm a pragmatist. An Internet with less software would almost certainly be a smaller Internet. A non-compliant SMTP sender is annoying, but it usually gets the job done for people who are using it.)

The corollary of this is that a lot of Internet software out there probably doesn't handle corner cases or unusual situations very well, either through conscious choice or just because the authors weren't aware of them. There are consequences here both for security and for pragmatic interoperability.

Of course every so often you will stumble over someone who is sending you something from the dark depths. That the Internet is very big means that very uncommon things do happen every so often just through the law of large numbers. I'm sure that somewhere out on the net there are systems exchanging email with route addresses and maybe someday one of them will email us.

(Another corollary is that sooner or later you will see unusual errors, too. For example, we reject a certain amount of email from senders who have accented characters in unquoted local parts of MAIL FROM addresses. This is flatly non-compliant with the RFCs, but it's not surprising.)

InternetUncommonWeirdness written at 00:49:31

2014-06-05

SMTP's crazy address formats didn't come from nowhere

Broadly speaking, SMTP addresses have two crazy things in them: route addresses and quoted local parts. Route addresses theoretically give you a way of specifying a chain of steps the message is supposed to take on its way to (or from) its eventual destination:

RCPT TO:<@a.ex.org,@barney:user@fred.dibney>
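To make the structure of that example concrete, here is a minimal sketch that pulls a route address apart into its list of hops and the final mailbox. The function name is invented for illustration, it takes the text between the '<' and '>', and it assumes that none of the hops are bracketed address literals (which could themselves contain ':').

```python
def split_route(path):
    """Split 'route:mailbox' into ([hops], mailbox).

    A path with no route part is returned as ([], path).
    """
    if not path.startswith("@"):
        return [], path
    # The route is a comma-separated list of @domain entries, ended by ':'.
    route, sep, mailbox = path.partition(":")
    if not sep or not mailbox:
        raise ValueError("route without a mailbox after ':'")
    hops = []
    for hop in route.split(","):
        if not hop.startswith("@") or len(hop) == 1:
            raise ValueError("bad route element: %r" % hop)
        hops.append(hop[1:])
    return hops, mailbox
```

For the RCPT TO above this gives the hops 'a.ex.org' and 'barney', in the order the message is supposed to visit them, and the mailbox 'user@fred.dibney'.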

Quoted local parts allow you to use any random characters and character sequences in the local mailbox name:

MAIL FROM:<"abney <abdef> ...%barney"@example.org>

(As I grumbled about yesterday, quoted local parts drastically increase the complexity of parsing modern SMTP commands.)

Here is the thing: these two features of SMTP addresses did not come from nowhere. When the very first SMTP RFCs were written, these features were necessary. Really.

Quoted local mailbox names have an obvious rationale: they accommodate systems that have local logins (or mailbox names) that do not fit into the simple format that you can use without quoting. The obvious big case that needs this is any local mailbox with a space in the name. Today we don't do that (we tend to use dots), but I'm sure there were systems on the original ARPANet where people had mailbox names of 'Jane Smith' (instead of the Jane.Smith that we'd insist on today). I believe that one of the reasons for allowing quoting is that people did not want to require a conversion layer in mailers between the true mailbox names (with spaces and funny characters) and the external, RFC-approved mailbox names that could be used in email.

(I can see at least one sensible reason for this: the less software that had to be written to get a system hooked up to ARPANet SMTP, the more likely it was that the system would be hooked up, and thus that ARPANet SMTP would actually get widely used.)

Equally, route addresses make a lot of sense in an environment where many systems are not directly on the ARPANet and no one has yet built the whole infrastructure of forwarding MTAs, internal versus external mail remapping, and indirect addressing in the form of MX entries. After all, the early SMTP RFCs predate DNS. Here the SMTP RFC is providing a way to directly express multi-hop mail forwarding, something that was a reality on the early ARPANet.

(SMTP route addresses were not the only form this took, of course. The '% hack' used to be very common, where 'a%b@c' implied that c would actually send the message on to a@b. And there were even more complicated fudges for more complex situations.)
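As a rough illustration of the '% hack', here is a sketch of the rewriting that host c would do before passing a message along. The function name is invented, and it assumes an unquoted local part and the usual convention that the rightmost '%' is the one that gets turned into an '@'.

```python
def percent_hack_rewrite(address):
    """Return the address the receiving host would forward to, or None.

    'a%b@c', as handled by host c, becomes 'a@b'.
    """
    local, at, _domain = address.rpartition("@")
    if not at:
        raise ValueError("address has no '@'")
    # The rightmost '%' in the local part marks the next destination host.
    user, pct, next_host = local.rpartition("%")
    if not pct or not user or not next_host:
        return None  # nothing to rewrite; deliver locally
    return user + "@" + next_host
```

With this, 'a%b@c' turns into 'a@b', and a longer chain such as 'a%b%c@d' is peeled off one hop at a time, first becoming 'a%b@c'.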

Internet email and Internet email addresses are such a juggernaut today that it is easy to forget that once upon a time the world was smaller and SMTP mail was a scrappy upstart proposing a novel and unproven idea, one that had to interoperate with any number of existing systems if it wanted to have any chance of success.

(Note here that I'm talking exclusively of SMTP addresses, not the more complex soup that is how addresses appear in the headers of email messages.)

SMTPAddressOrigins written at 01:58:32

2014-06-04

Why I don't like SMTP command parameters

Modern versions of SMTP have added something called 'command parameters'. These extend the MAIL FROM and RCPT TO commands to add optional parameters to communicate, for example, the rough size of a message that is about to be sent (that's RFC 1870). On the surface these appear perfectly sensible and innocent:

MAIL FROM:<some@address.dom> SIZE=99999

That is, the parameters are tacked on as 'NAME=VALUE' pairs after the address in the MAIL FROM or RCPT TO. Unfortunately this innocent picture starts falling apart once you look at it closely because RFC 5321 addresses are crawling horrors of complexity.

From the example I gave you might think that parsing your MAIL FROM line is simple; just look for the first space and everything after it is parameters. Except that the local part of an address can be quoted, and when quoted it can contain spaces:

MAIL FROM:<"some person"@a.dom> SIZE=99999

Fine, you say, we'll look for '> '. Guess what quoted parts can also contain?

MAIL FROM:<"some> person"@a.dom> SIZE=99999

Okay, you say, we'll look for the rightmost '> ' in the line. Surely that will do the trick?

MAIL FROM:<person@a.dom> SIZE=99999> BODY=8BITMIME

This is a MAIL FROM line with a perfectly valid address and then a (maliciously) mangled SIZE parameter. You're probably going to reject this client command, but are you going to reject it for the right reason?

What the authors of RFC 5321 have created is a situation where you must do at least a basic parsing of the internal structure of the address just to find out where it ends. Especially in the face of potentially mangled input there is no simple way of determining the end of the address and the start of parameters, despite appearances. Yet the situation looks deceptively simple and a naive parser will work almost all of the time (even quoted local parts are rare, much less ones with wacky characters in them, and my final example is extremely perverse).
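Here is a minimal sketch of what that 'basic parsing' might look like. It is not a full RFC 5321 address parser; it only tracks quoted strings and backslash escapes so that it can find the first unquoted '>', and it assumes that '>' can't legitimately appear unquoted anywhere else inside the address. The function name is made up, and the parameters are returned raw rather than validated.

```python
def split_mail_from(line):
    """Split a MAIL FROM command into (address, [raw parameter strings]).

    Raises ValueError if the end of the address cannot be found.
    """
    prefix = "MAIL FROM:"
    if line[:len(prefix)].upper() != prefix:
        raise ValueError("not a MAIL FROM command")
    rest = line[len(prefix):]
    if not rest.startswith("<"):
        raise ValueError("address must start with '<'")

    in_quotes = False
    i = 1
    while i < len(rest):
        ch = rest[i]
        if in_quotes:
            if ch == "\\":
                i += 1              # skip the escaped character
            elif ch == '"':
                in_quotes = False
        elif ch == '"':
            in_quotes = True
        elif ch == ">":
            break                   # first unquoted '>' ends the address
        i += 1
    else:
        raise ValueError("no closing '>' for the address")

    address = rest[1:i]
    tail = rest[i + 1:]
    if tail and not tail.startswith(" "):
        raise ValueError("junk immediately after the address")
    return address, tail.split()
```

On the perverse example this returns the address 'person@a.dom' and the raw parameters 'SIZE=99999>' and 'BODY=8BITMIME', which means that the code checking SIZE gets to reject the command for the right reason.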

I'm sure this was not exactly deliberate on the part of the RFC authors, because after all they're dealing with decades of complex history involving all sorts of baroque possible addressing. From its beginning SMTP was complicated by backwards compatibility requirements and could not, eg, dictate that local mailboxes had to fit into certain restrictions. I'm sure that current RFC authors would like to have thrown all of this away and gone for simple addresses with no quoted local parts and so on. They just couldn't get away with it.

There is a moral in here somewhere but right now I'm too grumpy to come up with one.

(For more background on the various SMTP extensions, see eg the Wikipedia entry.)

PS: note that a semi-naive algorithm may also misinterpret 'MAIL FROM:<a@b> SIZE=999>'. After all, it has a '>' right there as the last character.

SMTPParamParsingProblem written at 02:21:12
