We get a certain number of SMTP MAIL FROMs in UTF-8 with odd characters

June 22, 2019

On Twitter, I said:

There sure are a surprising number of places that are trying to send us SMTP MAIL with a MAIL FROM that contains the Unicode character U+FEFF (either 'zero width no-break space' or a byte order mark, apparently, although it's never at the start of the address).

I was looking at the logs on our external mail gateway machine because we use Exim and I was interested to see if we had been poked by anyone trying to exploit CVE-2019-10149. I didn't find anyone trying, but I did turn up these SMTP MAIL FROMs with U+FEFF.

A typical example from today is:

H=(luxuryclass.it) [] rejected MAIL <Antonio<U+FEFF>Smith@luxuryclass.it>

(The '<U+FEFF>' bit is me cutting and pasting from less; less shows Unicode codepoints this way.)

These and similar hijinks have been going on for some time. We have logs going back more than a year, and the earliest hit I can casually turn up is in late May of 2018:

H=(03216a51.newslatest.bid) [] rejected MAIL <NaturalHairCare<U+200B>@newslatest.bid>

(U+200B is a zero width space, so this feels like something similar to the use of U+FEFF.)

In October of 2018, we saw a few uses of U+200E 'left to right mark':

H=(0008ceef.livetofrez.us) [] rejected MAIL <TinnitusRelief<U+200E>@livetofrez.us>

Then at the start of November of 2018 we started seeing U+FEFF, which has taken over as the Unicode codepoint of choice to (ab)use:

H=(office365zakelijk.nl) [] rejected MAIL <Howard<U+FEFF>Smith@office365zakelijk.nl>

We have seen a flood of these since then; they're pervasive in our logs based purely on looking at things in less (someday I will work out how to grep for Unicode codepoints by codepoint value, but that day is not today).
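(For what it's worth, a minimal sketch of searching by codepoint value is below. It assumes the log file contains the raw UTF-8 bytes, and it matches on the encoded byte sequences instead of decoding the whole log. The function names and the set of codepoints are made up for illustration; the codepoints are just the three mentioned in this entry.)

```python
# Codepoints mentioned in this entry; the set is only illustrative.
SUSPECT_CODEPOINTS = (0x200B, 0x200E, 0xFEFF)


def suspect_patterns(codepoints=SUSPECT_CODEPOINTS):
    """Map each codepoint to its UTF-8 encoded byte sequence."""
    return {cp: chr(cp).encode("utf-8") for cp in codepoints}


def find_codepoints(line, patterns):
    """Return the sorted codepoints whose UTF-8 bytes occur in a raw log line."""
    return sorted(cp for cp, seq in patterns.items() if seq in line)


def scan(stream, patterns):
    """Yield (line number, codepoints, raw line) for each matching line.

    'stream' must be opened in binary mode so that nothing is decoded
    or altered before we look at the bytes.
    """
    for lineno, line in enumerate(stream, 1):
        hits = find_codepoints(line, patterns)
        if hits:
            yield lineno, hits, line


def report(path):
    """Print matching lines from the log file at 'path' (hypothetical helper)."""
    patterns = suspect_patterns()
    with open(path, "rb") as f:
        for lineno, hits, line in scan(f, patterns):
            names = ",".join("U+%04X" % cp for cp in hits)
            text = line.decode("utf-8", "replace").rstrip()
            print(f"{lineno}: {names}: {text}")
```

Calling report() with the path to a log file would print each matching line along with the codepoints found in it. Working on bytes means that a log with stray invalid sequences in it elsewhere doesn't make the scan fail.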

On a quick check, the most recent ones come from IP addresses that are listed in the SBL CSS, as well as any number of other DNS blocklists. I don't really care, since as long as they're helpful enough to put UTF-8 bytes into their MAIL FROM, we'll reject all of their email.

PS: I checked the raw bytes of some of the U+FEFF MAIL FROMs, and they really have the byte sequence 0xEF 0xBB 0xBF that is a true UTF-8 encoded U+FEFF. I'm relatively confident that Exim isn't doing any character mangling on the way through, either, so we're almost certainly seeing what was really on the wire.
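(The byte-level check can be sketched as follows. This is an illustration only, not what our mail gateway actually runs; the function names and the example.org address are made up. It decodes the raw bytes strictly, so malformed UTF-8 raises an error instead of being silently patched up, and it also shows the crude 'any byte with the high bit set' test.)

```python
# The UTF-8 encoding of U+FEFF, as seen in the raw MAIL FROM bytes.
FEFF_UTF8 = b"\xef\xbb\xbf"


def non_ascii_codepoints(raw):
    """List the non-ASCII codepoints in raw address bytes, as 'U+XXXX' strings.

    Decoding is strict, so bytes that are not well-formed UTF-8 raise
    UnicodeDecodeError instead of being replaced.
    """
    text = raw.decode("utf-8")
    return ["U+%04X" % ord(ch) for ch in text if ord(ch) > 0x7F]


def is_pure_ascii(raw):
    """True if no byte has the high bit set, i.e. nothing beyond 7-bit ASCII."""
    return all(b < 0x80 for b in raw)


# Made-up sample address with an embedded U+FEFF.
sample = b"Howard\xef\xbb\xbfSmith@example.org"
assert chr(0xFEFF).encode("utf-8") == FEFF_UTF8
assert FEFF_UTF8 in sample
print(non_ascii_codepoints(sample), is_pure_ascii(sample))
```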

Written on 22 June 2019.

Last modified: Sat Jun 22 00:43:14 2019