We get a certain number of SMTP MAIL FROMs in UTF-8 with odd characters

June 22, 2019

On Twitter, I said:

There sure are a surprising number of places that are trying to send us SMTP MAIL with a MAIL FROM that contains the Unicode character U+FEFF (either 'zero width no-break space' or a byte order mark, apparently, although it's never at the start of the address).

I was looking at the logs on our external mail gateway machine because we use Exim and I was interested to see if we had been poked by anyone trying to exploit CVE-2019-10149. I didn't find anyone trying, but I did turn up these SMTP MAIL FROMs with U+FEFF.

A typical example from today is:

H=(luxuryclass.it) [] rejected MAIL <Antonio<U+FEFF>Smith@luxuryclass.it>

(The '<U+FEFF>' bit is me cutting and pasting from less; less shows Unicode codepoints this way.)

These and similar hijinks have been going on for some time. We have logs going back more than a year, and the earliest hit I can casually turn up is in late May of 2018:

H=(03216a51.newslatest.bid) [] rejected MAIL <NaturalHairCare<U+200B>@newslatest.bid>

(U+200B is a zero width space, so this feels like something similar to the use of U+FEFF.)

In October of 2018, we saw a few uses of U+200E 'left to right mark':

H=(0008ceef.livetofrez.us) [] rejected MAIL <TinnitusRelief<U+200E>@livetofrez.us>

Then at the start of November of 2018 we started seeing U+FEFF, which has taken over as the Unicode codepoint of choice to (ab)use:

H=(office365zakelijk.nl) [] rejected MAIL <Howard<U+FEFF>Smith@office365zakelijk.nl>

We have seen a flood of these since then; they're pervasive in our logs based purely on looking at things in less (someday I will work out how to grep for Unicode codepoints by codepoint value, but that day is not today).
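For when that day comes, here is a minimal Python sketch of searching by codepoint property instead of by eye. It assumes the log text is UTF-8 and that the interesting characters are the invisible 'format' ones (Unicode general category Cf), which covers U+FEFF, U+200B, and U+200E. The function name and the sample log lines are made up for illustration.

```python
import unicodedata


def find_format_chars(line):
    """Return (position, 'U+XXXX') for each Unicode format (Cf) character in line."""
    return [
        (pos, f"U+{ord(ch):04X}")
        for pos, ch in enumerate(line)
        if unicodedata.category(ch) == "Cf"
    ]


# Hypothetical log lines modelled on the examples in this entry.
sample_lines = [
    "H=(example.org) [] rejected MAIL <Howard\ufeffSmith@example.org>",
    "H=(example.org) [] rejected MAIL <plain@example.org>",
]

for line in sample_lines:
    hits = find_format_chars(line)
    if hits:
        # Escape the line so the invisible characters show up in the output.
        print(hits, line.encode("unicode_escape").decode("ascii"))
```

To run this over a real log, one option is to open the file with encoding="utf-8" and errors="replace" and pass each line to find_format_chars(); matching on the category rather than a fixed list means it should also catch whatever invisible codepoint gets (ab)used next.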

On a quick check, the most recent ones come from IP addresses that are listed in the SBL CSS, as well as any number of other DNS blocklists. I don't really care, since as long as they're helpful enough to put UTF-8 bytes into their MAIL FROM, we'll reject all of their email.
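The rule itself is simple enough to sketch. This is an illustration in Python, not our actual Exim configuration; the function name is made up, and it assumes you have the address as the raw bytes from the MAIL FROM command.

```python
def has_non_ascii(mail_from: bytes) -> bool:
    """True if the raw MAIL FROM address contains any byte outside 7-bit ASCII."""
    return any(b > 0x7F for b in mail_from)


# U+FEFF arrives on the wire as the UTF-8 bytes EF BB BF.
print(has_non_ascii(b"Howard\xef\xbb\xbfSmith@example.org"))  # True
print(has_non_ascii(b"howard.smith@example.org"))             # False
```

Every byte of a multi-byte UTF-8 sequence has its high bit set, so a check like this trips on any non-ASCII character without needing to decode anything.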

PS: I checked the raw bytes of some of the U+FEFF MAIL FROMs, and they really have the byte sequence 0xEF 0xBB 0xBF that is a true UTF-8 encoded U+FEFF. I'm relatively confident that Exim isn't doing any character mangling on the way through, either, so we're almost certainly seeing what was really on the wire.
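The expected bytes are easy to cross-check. A small Python sketch (the helper name is made up) that prints the UTF-8 encoding of the three codepoints mentioned in this entry:

```python
# Codepoints seen in the rejected MAIL FROMs, with their Unicode names.
codepoints = {
    0xFEFF: "ZERO WIDTH NO-BREAK SPACE",
    0x200B: "ZERO WIDTH SPACE",
    0x200E: "LEFT-TO-RIGHT MARK",
}


def utf8_hex(cp: int) -> str:
    """Return the UTF-8 encoding of a codepoint as space-separated hex bytes."""
    return " ".join(f"0x{b:02X}" for b in chr(cp).encode("utf-8"))


for cp, name in codepoints.items():
    print(f"U+{cp:04X} {name}: {utf8_hex(cp)}")
# U+FEFF ZERO WIDTH NO-BREAK SPACE: 0xEF 0xBB 0xBF
# U+200B ZERO WIDTH SPACE: 0xE2 0x80 0x8B
# U+200E LEFT-TO-RIGHT MARK: 0xE2 0x80 0x8E
```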

Comments on this page:

My theory is that these email addresses appeared in some typeset digital document. Considering that U+FEFF appears between first and last name, or between the alias and the "@", the typesetting system probably inserted these characters to prevent the layout engine from breaking the email address across lines. A broken address would look bad and would be error-prone for readers, so this is sensible, especially for documents intended to be printed out.

Then spammers scan these documents/websites in digital form for email addresses to abuse. They pick up the typesetting markings and use them literally in the email address, after which they show up at your mail system in this mangled form.

From at 2023-07-28 14:43:00:

Using bash command substitution and echo's backslash escapes, you can pass Unicode characters to grep by codepoint; try out these commands:

(echo -e "test\ufeffing"; echo "no unicode in this line") > /tmp/test

grep "$(echo -en '\ufeff')" /tmp/test

You'll see it print the line with the U+FEFF in it.


Last modified: Sat Jun 22 00:43:14 2019