2022-11-04
An email's Message-ID header isn't a good spam signal (in late 2022)
I recently wrote about maybe copying email anti-spam measures from large places like GMail, using the example of how GMail was rejecting various messages at SMTP time with a reported reason of 'messages missing a valid messageId header are not accepted'. This spurred me into investigating what sort of Message-ID values we saw (which can get complicated to evaluate).
The good news is that Exim already logs the Message-ID value for every message, in the 'id=' field of its message reception log lines. It was still more convenient to add my own logging that called out some specific aspects, but Exim's normal logging meant that I could already do some useful things with our historical data.
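As a rough illustration of the sort of thing the existing logs make possible, here is a minimal Python sketch that tallies reception lines by whether they carry an 'id=' field. It assumes the usual Exim main log layout, where reception lines are flagged with '<=' and the Message-ID appears as a single whitespace-free 'id=' token. The function names and the "exactly one @" test are my own hypothetical simplifications, not anything Exim provides.

```python
import re
from collections import Counter

# Assumption: the Message-ID shows up as one whitespace-free 'id=...' token.
# A Message-ID containing spaces, or a logged subject that happens to contain
# ' id=', would confuse this pattern.
ID_FIELD = re.compile(r"\sid=(\S+)")


def classify_reception(line):
    """Return None for non-reception lines, else a rough category string."""
    # Exim marks message reception with the '<=' flag.
    if " <= " not in line:
        return None
    match = ID_FIELD.search(line)
    if match is None:
        return "missing"
    value = match.group(1)
    # Hypothetical, deliberately crude check: exactly one '@' with
    # something on both sides of it.
    local, at, domain = value.partition("@")
    if at and local and domain and "@" not in domain:
        return "plausible"
    return "questionable"


def summarize(lines):
    """Count reception lines per category across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        kind = classify_reception(line)
        if kind is not None:
            counts[kind] += 1
    return counts
```

Feeding it an open log file, as in summarize(open(path)) for whatever path your main log lives at, gives a quick count of how much accepted mail had no Message-ID at all.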
The bad news is that it turns out that the Message-ID header isn't a strong signal about whether or not an email is spam, and as part of that GMail is not being entirely honest in their SMTP time rejection messages. In the time when we were doing detailed logging, I saw a reasonable amount of real, desirable email without a Message-ID header at all (including a message to me), and some amount of it with what looked like 'invalid' Message-ID values. There are clearly some real mail sending systems that just don't put in a Message-ID.
As for GMail, once I realized that Exim already had this information, I went back through our logs of email forwarded to GMail. It's true that all of the messages GMail rejected with this SMTP message had missing or questionable Message-ID values. But GMail has also accepted plenty of forwarded email from us that didn't have a Message-ID header. The lack of a Message-ID header by itself is clearly not enough to cause GMail to reject email, which isn't surprising given that some amount of email that people want to get will show up at GMail's door without a Message-ID.
(This GMail behavior does save us from any worries of needing to add our own Message-ID header to any non-spam email being forwarded to GMail.)
Thanks to Andy Balholm's comment on my previous entry, I also now know that rspamd defaults to giving a missing Message-ID moderate spam points and an 'invalid' one somewhat fewer. A missing Message-ID is MISSING_MID, +2.5 points, and an 'invalid' one is INVALID_MSGID, +1.7 points. You can find this in the rspamd source code in rules/regexp/headers.lua.
(I haven't dug deep enough to figure out what rspamd considers to be 'invalid' here. As I found out, it's complicated even if you try to simplify it.)
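To make the shape of that scoring concrete, here is a small Python sketch that applies the two point values quoted above to a raw message. This is not rspamd's implementation. The symbol names and scores come from the text, but the 'valid' pattern is a hypothetical stand-in (angle brackets around something@something), and treating an empty header as missing is my own assumption.

```python
import re
from email import message_from_string

# Point values as quoted above for rspamd's defaults.
MISSING_MID = 2.5
INVALID_MSGID = 1.7

# Hypothetical stand-in for 'valid': <local@domain> with no spaces,
# extra angle brackets, or extra '@'. rspamd's real rule may well differ.
SIMPLE_MSGID = re.compile(r"^<[^<>@\s]+@[^<>@\s]+>$")


def message_id_score(raw_message):
    """Return (symbol, points) for the Message-ID of a raw RFC 5322 message."""
    msg = message_from_string(raw_message)
    value = msg.get("Message-ID")
    # Assumption: an empty or whitespace-only header counts as missing.
    if value is None or not value.strip():
        return ("MISSING_MID", MISSING_MID)
    if not SIMPLE_MSGID.match(value.strip()):
        return ("INVALID_MSGID", INVALID_MSGID)
    return (None, 0.0)
```

Since plenty of real email arrives with no Message-ID, points like these only make sense as one modest contribution to an overall score, not as a reason to reject on their own.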