We get a certain amount of SMTP MAIL FROMs in UTF-8 with odd characters
On Twitter, I said:
There sure are a surprising number of places that are trying to send us SMTP MAIL with a MAIL FROM that contains the Unicode character U+FEFF (either 'zero width no-break space' or a byte order mark, apparently, although it's never at the start of the address).
I was looking at the logs on our external mail gateway machine
because we use Exim and I was interested to see if we had been poked
by anyone trying to exploit CVE-2019-10149.
I didn't find anyone trying, but I did turn up these SMTP MAIL FROMs.
A typical example from today is:
H=(luxuryclass.it) [18.104.22.168] rejected MAIL <Antonio<U+FEFF>Smith@luxuryclass.it>
(The '<U+FEFF>' bit is me cutting and pasting from less, which shows
Unicode codepoints this way.)
These and similar hijinks have been going on for some time. We have logs going back more than a year, and the earliest hit I can casually turn up is in late May of 2018:
H=(03216a51.newslatest.bid) [22.214.171.124] rejected MAIL <NaturalHairCare<U+200B>@newslatest.bid>
(U+200B is a zero width space, so this feels like something similar to the use of U+FEFF.)
In October of 2018, we saw a few uses of U+200E 'left to right mark':
H=(0008ceef.livetofrez.us) [126.96.36.199] rejected MAIL <TinnitusRelief<U+200E>@livetofrez.us>
Then at the start of November of 2018 we started seeing U+FEFF, which has taken over as the Unicode codepoint of choice to (ab)use:
H=(office365zakelijk.nl) [188.8.131.52] rejected MAIL <Howard<U+FEFF>Smith@office365zakelijk.nl>
We have seen a flood of these since then; they're pervasive in our logs,
based purely on looking at things in less (someday I will work out how
to grep for Unicode codepoints by codepoint value, but that day is not
today).
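
For searching by codepoint value without relying on less, here is a minimal sketch in Python rather than grep. The names find_codepoints, scan and SUSPECT are hypothetical, the set of codepoints is just the three seen in the log excerpts above, and it assumes the log file is readable as UTF-8.

```python
import unicodedata

# Codepoints seen in the log excerpts above; extend as needed.
SUSPECT = {0xFEFF, 0x200B, 0x200E}

def find_codepoints(line, wanted=SUSPECT):
    """Return (offset, 'U+XXXX', name) for each wanted codepoint in line."""
    hits = []
    for i, ch in enumerate(line):
        if ord(ch) in wanted:
            name = unicodedata.name(ch, "UNKNOWN")
            hits.append((i, f"U+{ord(ch):04X}", name))
    return hits

def scan(path):
    """Print the line number and hits for every matching line in a log file."""
    # Assumes a UTF-8 log; undecodable bytes are replaced instead of raising.
    with open(path, encoding="utf-8", errors="replace") as f:
        for lineno, line in enumerate(f, 1):
            hits = find_codepoints(line)
            if hits:
                print(lineno, hits)
```

Calling scan() on a log file path would list each line that contains one of the listed codepoints, along with where in the line it appears.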
On a quick check, the most recent ones come from IP addresses that are
listed in the SBL CSS, as well as any
number of other DNS blocklists. I don't really care, since as long as
they're helpful enough to put UTF-8 bytes into their MAIL FROM, we'll
reject all of their email.
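
The rule described here boils down to "any byte outside 7-bit ASCII in the address means rejection". As an illustration only (this is not Exim configuration, and is_ascii_only is a hypothetical helper), the test could be sketched as:

```python
def is_ascii_only(raw_addr: bytes) -> bool:
    """True if every byte of the address is 7-bit ASCII."""
    return all(b < 0x80 for b in raw_addr)

# Hypothetical raw address bytes, modelled on the logged examples.
print(is_ascii_only(b"Howard\xef\xbb\xbfSmith@office365zakelijk.nl"))
print(is_ascii_only(b"user@example.org"))
```

Any UTF-8 encoded codepoint above U+007F uses only bytes with the high bit set, so a single byte-level check catches all of the variants above.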
PS: I checked the raw bytes of some of the U+FEFF MAIL FROMs, and
they really have the byte sequence 0xEF 0xBB 0xBF that is a true
UTF-8 encoded U+FEFF. I'm relatively confident that Exim isn't doing
any character mangling on the way through, either, so we're almost
certainly seeing what was really on the wire.
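
The byte-level claim is easy to double check. The following sketch assumes a hypothetical raw address modelled on the logged example; less_style is a made-up helper that mimics the '<U+FEFF>' notation from the log excerpts.

```python
def less_style(raw: bytes) -> str:
    """Decode UTF-8 bytes and spell out non-ASCII codepoints as <U+XXXX>."""
    # Strict decoding: raises UnicodeDecodeError if the bytes are not valid UTF-8.
    text = raw.decode("utf-8")
    return "".join(
        ch if ord(ch) < 0x80 else f"<U+{ord(ch):04X}>" for ch in text
    )

# Hypothetical raw MAIL FROM address bytes, modelled on the logged example.
raw = b"Howard\xef\xbb\xbfSmith@office365zakelijk.nl"

# U+FEFF encodes to exactly the three bytes 0xEF 0xBB 0xBF in UTF-8.
print("\ufeff".encode("utf-8").hex())
print(less_style(raw))
```

Python's plain "utf-8" codec does not strip U+FEFF while decoding, so a U+FEFF in the middle of the address survives and shows up in the output.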
Written on 22 June 2019.