An email's Message-ID header isn't a good spam signal (in late 2022)

November 4, 2022

I recently wrote about maybe copying email anti-spam measures from large places like GMail, using the example of how GMail was rejecting various messages at SMTP time with a reported reason of 'messages missing a valid messageId header are not accepted'. This spurred me into investigating what sort of Message-ID values we saw (which can get complicated to evaluate).

The good news is that Exim already logs the Message-ID value for every message, in the 'id=' field that's part of its message reception logging. It was still more convenient to add my own logging that called out some specific aspects, but Exim's normal logging meant that I could already do some useful things with our historical data.
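To illustrate the sort of thing you can do with historical logs, here is a minimal Python sketch that pulls the 'id=' field out of reception lines and sorts messages into rough buckets. It is not the logging I actually added. The log line shape (default timestamp, Exim message ID, '<=', then fields), the classify() helper, and especially the 'plausible' pattern are all assumptions made for illustration; the pattern is a crude local@domain check, not what GMail or rspamd consider valid.

```python
import re

# Assumed shape of a reception line: date, time, Exim message ID, "<=", then fields.
RECEPTION = re.compile(r"^\S+ \S+ (?P<exim_id>\S+) <= (?P<rest>.*)$")

# The id= field; its value is taken to run up to the next whitespace.
# (Assumption: no other logged field, such as a subject, contains " id=".)
ID_FIELD = re.compile(r"(?:^|\s)id=(\S+)")

# A deliberately crude idea of "plausible": something@something, with
# optional angle brackets. This is a hypothetical check, not a real standard.
LOOKS_OK = re.compile(r"^<?[^\s<>@]+@[^\s<>@]+>?$")


def classify(line):
    """Return 'missing', 'questionable', or 'plausible' for a reception
    line, or None if the line is not a reception line at all."""
    m = RECEPTION.match(line)
    if m is None:
        return None
    idm = ID_FIELD.search(m.group("rest"))
    if idm is None:
        # No id= field is treated here as meaning no Message-ID header.
        return "missing"
    return "plausible" if LOOKS_OK.match(idm.group(1)) else "questionable"
```

Running classify() over every line of a log file and counting the results with collections.Counter gives a quick picture of how common each case is.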

The bad news is that it turns out that the Message-ID header isn't a strong signal about whether or not the email was spam, and as part of that GMail is not being entirely honest in their SMTP time rejection messages. In the time when we were doing detailed logging, I saw a reasonable amount of real, desirable email without a Message-ID header at all (including a message to me), and some amount of it with what looked like 'invalid' Message-ID values. There are clearly some real mail sending systems that just don't put in a Message-ID.

As for GMail, once I realized that Exim already had this information, I went back through our logs of email forwarded to GMail. It's true that all of the messages GMail rejected with this SMTP message had missing or questionable Message-ID values. But GMail has also accepted plenty of forwarded email from us that didn't have a Message-ID header. The lack of a Message-ID header by itself is clearly not enough to cause GMail to reject email, which isn't surprising given that some amount of email that people want to get will show up at GMail's door without a Message-ID.

(This GMail behavior does save us from any worries of needing to add our own Message-ID header to any non-spam email being forwarded to GMail.)

Thanks to Andy Balholm's comment on my previous entry, I also now know that rspamd defaults to giving missing Message-IDs moderate spam points and 'invalid' ones somewhat fewer. A missing Message-ID is MISSING_MID, +2.5 points, and an 'invalid' one is INVALID_MSGID, +1.7 points. You can find this in the rspamd source code in rules/regexp/headers.lua.

(I haven't dug deep enough to figure out what rspamd considers to be 'invalid' here. As I found out, it's complicated even if you try to simplify it.)
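As a rough illustration of how these two weights behave, here is a small Python sketch. It is not rspamd's code (rspamd's rules are written in Lua); message_id_score() is a hypothetical helper, and since I don't know what rspamd considers 'invalid', the validity check is left as a function the caller supplies. Only the two point values come from the rspamd defaults mentioned above.

```python
# Default weights as quoted above for rspamd's two symbols.
MISSING_MID = 2.5
INVALID_MSGID = 1.7


def message_id_score(message_id, is_valid):
    """Return the spam points a Message-ID contributes in this sketch.

    message_id is the header value, or None if the header is absent.
    is_valid is a caller-supplied check (an assumption here, because
    rspamd's real definition of 'invalid' is not covered in this entry).
    """
    if message_id is None or not message_id.strip():
        return MISSING_MID
    if not is_valid(message_id):
        return INVALID_MSGID
    return 0.0
```

In this sketch a message is scored as either missing or invalid, never both, so the Message-ID contributes at most 2.5 points on its own.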

Comments on this page:

By Slavko at 2022-11-05 08:55:08:

I used to reject messages without a Message-ID at the MTA level some time ago. Then I moved to rspamd's force action for that. And then one of my users didn't get a verification code from something of Alibaba's... Thus I learned that this is the wrong approach to spam filtering and removed that rule entirely.

Yes, a missing MID is a mark that something is wrong, but it can be only misconfiguration (incompetence).

In other words, not everything that Google does is worth following. To me it seems that they fight spam from the wrong end; otherwise I would not be rejecting stupid "You won X million of €/$/Ł" or "Confirm that this email is alive" messages from them (SPF/DKIM authed) on a daily basis...

Written on 04 November 2022.


Last modified: Fri Nov 4 22:16:00 2022