Why I don't like SMTP command parameters

June 4, 2014

Modern versions of SMTP have added something called 'command parameters'. These extend the MAIL FROM and RCPT TO commands to add optional parameters to communicate, for example, the rough size of a message that is about to be sent (that's RFC 1870). On the surface these appear perfectly sensible and innocent:

MAIL FROM:<some@address.dom> SIZE=99999

That is, the parameters are tacked on as 'NAME=VALUE' pairs after the address in the MAIL FROM or RCPT TO. Unfortunately this innocent picture starts falling apart once you look at it closely because RFC 5321 addresses are crawling horrors of complexity.

From the example I gave you might think that parsing your MAIL FROM line is simple; just look for the first space and everything after it is parameters. Except that the local part of an address can be quoted, and when quoted it can contain spaces:

MAIL FROM:<"some person"@a.dom> SIZE=99999

Fine, you say, we'll look for '> '. Guess what quoted parts can also contain?

MAIL FROM:<"some> person"@a.dom> SIZE=99999

Okay, you say, we'll look for the rightmost '> ' in the line. Surely that will do the trick?

MAIL FROM:<person@a.dom> SIZE=99999> BODY=8BITMIME

This is a MAIL FROM line with a perfectly valid address and then a (maliciously) mangled SIZE parameter. You're probably going to reject this client command, but are you going to reject it for the right reason?
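
To make the failures concrete, here is a minimal Python sketch of the three naive strategies above, run against these example lines. The function names are made up for illustration, and each function is handed only the text that follows 'MAIL FROM:'.

```python
def split_first_space(rest):
    """Naive: everything after the first space is parameters."""
    addr, _, params = rest.partition(" ")
    return addr, params


def split_first_close(rest):
    """Less naive: the address ends at the first '> '."""
    i = rest.find("> ")
    if i < 0:
        return rest, ""
    return rest[:i + 1], rest[i + 2:]


def split_last_close(rest):
    """Still naive: the address ends at the rightmost '> '."""
    i = rest.rfind("> ")
    if i < 0:
        return rest, ""
    return rest[:i + 1], rest[i + 2:]


quoted_space = '<"some person"@a.dom> SIZE=99999'
quoted_angle = '<"some> person"@a.dom> SIZE=99999'
mangled_size = '<person@a.dom> SIZE=99999> BODY=8BITMIME'

# Each strategy cuts at least one of these lines in the wrong place.
print(split_first_space(quoted_space))   # breaks inside the quoted local part
print(split_first_close(quoted_angle))   # breaks inside the quoted local part
print(split_last_close(mangled_size))    # swallows SIZE=99999> into the "address"
```

Each function gets the plain 'SIZE=99999' example right, which is exactly why this sort of code survives in practice.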

What the authors of RFC 5321 have created is a situation where you must do at least a basic parsing of the internal structure of the address just to find out where it ends. Especially in the face of potentially mangled input there is no simple way of determining the end of the address and the start of parameters, despite appearances. Yet the situation looks deceptively simple and a naive parser will work almost all of the time (even quoted local parts are rare, much less ones with wacky characters in them, and my final example is extremely perverse).
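
As a rough illustration of what 'a basic parsing of the internal structure' means, here is a minimal Python sketch that walks the address character by character, tracking whether it is inside a quoted string and skipping backslash-escaped characters. The function name is hypothetical, it only takes the text after 'MAIL FROM:', and it makes no attempt to validate the address or the parameters against the full RFC 5321 grammar; it does just enough to find where the address ends.

```python
def split_path_and_params(rest):
    """Split the text after 'MAIL FROM:' into (path, [params]).

    Raises ValueError if no terminated '<...>' path can be found.
    This is only a sketch: it finds the end of the path, nothing more.
    """
    if not rest.startswith("<"):
        raise ValueError("path must start with '<'")
    in_quotes = False
    i = 1
    while i < len(rest):
        ch = rest[i]
        if in_quotes:
            if ch == "\\":
                i += 1              # skip the escaped character
            elif ch == '"':
                in_quotes = False   # closing quote
        elif ch == '"':
            in_quotes = True        # opening quote
        elif ch == ">":
            break                   # unquoted '>' ends the path
        i += 1
    else:
        # Ran off the end of the line without seeing an unquoted '>'.
        raise ValueError("unterminated path")
    path, tail = rest[:i + 1], rest[i + 1:]
    if tail and not tail.startswith(" "):
        raise ValueError("junk immediately after path")
    return path, tail.split()


print(split_path_and_params('<"some> person"@a.dom> SIZE=99999'))
print(split_path_and_params('<person@a.dom> SIZE=99999> BODY=8BITMIME'))
```

With the address boundary found this way, the mangled line yields 'SIZE=99999>' as a separate parameter, so the code that checks SIZE values can reject it for the right reason.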

I'm sure this was not exactly deliberate on the part of the RFC authors, because after all they're dealing with decades of complex history involving all sorts of baroque possible addressing. From its beginning SMTP was complicated by backwards compatibility requirements and could not, eg, dictate that local mailboxes had to fit into certain restrictions. I'm sure that current RFC authors would like to have thrown all of this away and gone for simple addresses with no quoted local parts and so on. They just couldn't get away with it.

There is a moral in here somewhere but right now I'm too grumpy to come up with one.

(For more background on the various SMTP extensions, see eg the Wikipedia entry.)

PS: note that a semi-naive algorithm may also misinterpret 'MAIL FROM:<a@b> SIZE=999>'. After all, it has a '>' right there as the last character.


Comments on this page:

By Ewen McNeill at 2014-06-04 06:51:03:

It seems to me that the problem is actually RFC 5321 (and all the earlier iterations) addresses. They're OMG complex and nearly impossible to parse "properly" because of all the syntaxes they allow to be embedded. About 90% of that complexity is (almost?) completely redundant now, because all the other mail systems being gatewayed to went away, and so really only [\w.\d_+-]+@[\w.\d_-]+ gets real world usage. (If only because there are so many "that's not a valid email address" web validations that actively discourage anything else...)

With almost no loss in modern real world functionality you could refuse to parse addresses with, eg, spaces or ">" in them, and treat anything after either of those as parameters, and anything before as the email address. (IIRC some MTAs already take that approach anyway, refusing anything else.)

I do agree, however, that the "let's just tack some parameters on the end" was an unfortunate design choice. Even if "soft fail" led the designers not to want to introduce new SMTP VERBs. (Despite there being a mechanism to detect if they'd be supported -- eg EHLO vs HELO.)

Ewen

By Chris Nehren at 2014-06-04 09:35:52:

No, the problem goes all the way back to RFC 724. That RFC is the source of the stench and the peril. See https://www.youtube.com/watch?v=JENdgiAPD6c for a humorous explanation of why things are terrible. Yes, it took place at a conference in Asia, but the speaker speaks fluent English and the slides are predominantly in English.

By cks at 2014-06-04 15:07:26:

The thing is that RFC 724 was inevitable at the time, because there really were significant systems on the ARPANET that needed the features it included. The big chance SMTP had to change that was with ESMTP and EHLO; they could have said that if a system EHLO'd it could no longer use quoted local parts or route addresses and be done with it. Old systems on the Internet that still required them would have stuck with HELO'ing.

At this point it's clear that the only way SMTP is going to be simplified is with a new protocol definition. Revisions of SMTP are not going to be able to throw away backwards compatibility. Unfortunately such a new protocol definition is unlikely to ever happen.

Written on 04 June 2014.

Last modified: Wed Jun 4 02:21:12 2014