An email's Message-ID header isn't a good spam signal (in late 2022)
I recently wrote about maybe copying email anti-spam measures from large places like GMail, using the example of how GMail was rejecting various messages at SMTP time with a reported reason of 'messages missing a valid messageId header are not accepted'. This spurred me into investigating what sort of Message-ID values we saw (which can get complicated to evaluate).
The good news is that Exim actually already logs the Message-ID value for every message in the 'id=' field logged as part of message reception logging. It was still more convenient to add my own logging that called out some specific aspects, but Exim's normal logging meant that I could already do some useful things with our historical data.
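As a rough illustration of the kind of check the existing logs allow, here is a minimal Python sketch that tallies reception lines with and without an 'id=' field. The sample log lines are made up. It also assumes that a reception line (' <= ') with no 'id=' field means the message arrived without a Message-ID header, which is worth confirming against your own Exim's behaviour.

```python
import re
from collections import Counter

# Matches the 'id=' field on an Exim reception ("<=") log line.
ID_FIELD = re.compile(r"\sid=(\S+)")

def tally_message_ids(log_lines):
    """Count reception lines with and without an id= field."""
    counts = Counter()
    for line in log_lines:
        if " <= " not in line:
            continue  # not a message reception line
        counts["with_id" if ID_FIELD.search(line) else "no_id"] += 1
    return counts

# Made-up sample lines in the general shape of Exim's main log.
sample = [
    "2022-11-01 10:00:00 1abcde-000001-AA <= a@example.org "
    "H=mx.example.org [192.0.2.1] P=esmtps S=2048 id=abc123@example.org",
    "2022-11-01 10:00:05 1abcde-000002-BB <= b@example.net "
    "H=mx.example.net [192.0.2.2] P=esmtp S=1024",
    "2022-11-01 10:00:09 1abcde-000001-AA => user@example.com "
    "R=router T=transport",
]
print(tally_message_ids(sample))
```

A real version would also want to look at the Message-ID value itself, which is where things get complicated.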
The bad news is that it turns out that the Message-ID header isn't a strong signal about whether or not the email was spam, and as part of that GMail is not being entirely honest in their SMTP time rejection messages. In the time when we were doing detailed logging, I saw a reasonable amount of real, desirable email without a Message-ID header at all (including a message to me), and some amount of it with what looked like 'invalid' Message-ID values. There are clearly some real mail sending systems that just don't put in a Message-ID.
As for GMail, once I realized that Exim already had this information, I went back through our logs of email forwarded to GMail. It's true that all of the messages GMail rejected with this SMTP message had missing or questionable Message-ID values. But GMail has also accepted plenty of forwarded email from us that didn't have a Message-ID header. The lack of a Message-ID header by itself is clearly not enough to cause GMail to reject email, which isn't surprising given that some amount of email that people want to get will show up at GMail's door without a Message-ID.
(This GMail behavior does save us from any worries of needing to add our own Message-ID header to any non-spam email being forwarded to GMail.)
Due to Andy Balholm's comment on my previous entry, I also now know that rspamd defaults to giving missing Message-IDs moderate spam points and 'invalid' ones somewhat fewer. A missing Message-ID is MISSING_MID, +2.5 points, and an 'invalid' one is INVALID_MSGID, +1.7 points. You can find this in the rspamd source code in rules/regexp/headers.lua.
(I haven't dug deep enough to figure out what rspamd considers to be 'invalid' here. As I found out, it's complicated even if you try to simplify it.)
(Maybe) copying email anti-spam measures from Google and company
For a while now, Google has been rejecting some messages we try to forward to GMail with a SMTP error message like this:
Messages missing a valid messageId header are not accepted.
You can have a number of reactions to this. One of them is to be grumpy that Google is rejecting email that's otherwise (probably) perfectly valid and perhaps not even spam. Well, let's be honest here; all competent modern mail system operators reject email at SMTP time for all sorts of peculiar reasons, so I can hardly pick on GMail for not liking messages without message IDs when we will reject your messages if they have an attachment type we don't like or ClamAV matches a signature.
Another reaction, one that I'm more and more leaning toward, is to consider making our email system reject external email at SMTP time for the same reason. Why? Because if GMail is doing it, a missing (or invalid) message ID is probably a good sign of spam. The people running GMail don't just roll out of bed one day, pick an RFC header requirement at random, and start rejecting email that violates it. Instead it seems very likely that they have a bunch of data that shows that rejecting email this way is a good idea.
(Of course we don't actually know if GMail is rejecting the email for this reason alone. There could be other signals involved that GMail isn't putting in the SMTP rejection message for various reasons.)
More broadly, I'm increasingly coming to think that major email providers have a lot more data on spam signs than we do, so we might as well take advantage of their work when possible. If they give us a relatively clear signal that they consider something a spam signature, maybe we should use that signal ourselves. At the very least it's probably worth investigating, for example to see how many messages have invalid or outright missing message IDs, and what happens to them.
(It's possible that rspamd can already recognize and log bad or missing message-ids, but if so I can't find it in the documentation on a casual search.)
An email phish attempt using attachment file type confusion
I don't get much spam email in general and I get even less that has malware payloads, so in one sense it's always interesting when one makes it through our various anti-spam measures and I get to actually look at a sample for myself. Today I received what looked like a malware attack using a PDF:
Subject: [...] has sent you a document(s)
File Name: Invoice-38937.pdf
File Size: 44 KB
Please find attached Invoice-38937 for your reference.
I was all ready to start cracking the PDF open with various tools to see what they could tell me, when I actually extracted the attachment and looked at the full filename and file type:
Content-Type: application/octet-stream; name="Invoice-38937.shtml"
The actual attachment was a HTML file that contained a single form that POST'ed off to a website, with a fixed 'Email address' field and a password field for you to fill in. The HTML design was set up to try to look plausible as a PDF that you had to enter a password to see, with a blurred, dark background image that looked sort of like a blurry invoice and an 'Adobe PDF / Sign in to view invoice payment' popup, a page title of 'Adobe ID', and so on.
(The form's POST target was a HTTP URL instead of a HTTPS one, but I think only Firefox warns you about that.)
At one level this is unexceptional and probably unsurprising. At another level, I find it interesting that this sort of attachment file type confusion actually works (or at least I assume it works enough for spammers to keep using it). It wouldn't work in the mail environment I use, where a completely visually different program is run to display a PDF than is run to display a HTML file, but in an 'all in one' environment where the mail client tries to display as much as it can itself (and where browsers display PDFs too), I can see how there might not be clear visible signs that you're not really looking at a PDF.
To me, this also points out a weakness in common mail environments. This file type confusion shouldn't really work; you shouldn't be able to pass off a HTML file as a PDF (although PDFs can contain plenty of dangerous things in their own right). You could also argue that a HTML file opened directly in a mail client shouldn't be allowed to submit any forms, but there are probably people who actually rely on this working for some internal process they do.
(Email attachment file type confusion is routinely exploited by malware to try to, for example, persuade you that an executable is a PDF so you'll click on it.)
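To make the idea concrete, here is a minimal Python sketch of one way a mail scanner might flag this specific pattern: an attachment whose real filename has an HTML-family extension and whose content includes a form. The extension list and the '<form' test are my own simplifications for illustration, not something any particular filter is known to do, and the sample message is constructed to resemble the one described above.

```python
from email.message import EmailMessage

# Assumption: extensions treated as "really HTML" for this sketch.
HTML_EXTENSIONS = (".html", ".htm", ".shtml", ".xhtml")

def suspicious_html_attachments(msg):
    """Return filenames of HTML-family attachments that contain a form."""
    found = []
    for part in msg.walk():
        name = part.get_filename()
        if not name or not name.lower().endswith(HTML_EXTENSIONS):
            continue
        payload = part.get_payload(decode=True) or b""
        if b"<form" in payload.lower():
            found.append(name)
    return found

# A made-up message shaped like the phish described above.
msg = EmailMessage()
msg["Subject"] = "[...] has sent you a document(s)"
msg.set_content("File Name: Invoice-38937.pdf\nFile Size: 44 KB\n")
msg.add_attachment(
    b"<html><form method='POST' action='http://example.invalid/'>"
    b"</form></html>",
    maintype="application",
    subtype="octet-stream",
    filename="Invoice-38937.shtml",
)
print(suspicious_html_attachments(msg))
```

Note that the declared Content-Type is no help here (it was application/octet-stream), so the sketch goes by the filename and the content.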
DKIM signature types (algorithms) that we see (as of July 2022)
A lot of email these days is signed with DKIM, partly because signing email with DKIM is increasingly mandatory in practice. But 'signed with DKIM' is a broad category because DKIM has more than one signing algorithm and on top of that is used with (public) keys of different lengths.
What signing algorithms DKIM supports in practice is a matter for some discussion. The initial DKIM RFCs, such as RFC 6376, support rsa-sha1 and rsa-sha256. RFC 8301 deprecates rsa-sha1 and says that it shouldn't be used (and that a message with only a rsa-sha1 DKIM signature should be considered to fail validation). RFC 8301 also says RSA keys must be at least 1024 bits long and should be at least 2048 bits; again, messages with too-small keys should be considered to fail validation. RFC 8463 defines Ed25519 based DKIM keys, but apparently very few big providers actually support them, which makes them relatively pointless and useless in practice. Probably the most broadly useful algorithm and key length is rsa-sha256 with 2048 bit keys.
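The RFC 8301 rules described above are simple enough to write down as a small classifier. This is a sketch only; the function name and the 'ok' / 'weak' / 'fail' / 'unknown' labels are my own invention, and anything other than the two RSA algorithms is deliberately left unclassified.

```python
def rfc8301_verdict(algorithm: str, key_bits: int) -> str:
    """Classify a DKIM signature by the RFC 8301 rules described above."""
    if algorithm == "rsa-sha1":
        return "fail"        # deprecated hash: treat as failing validation
    if algorithm == "rsa-sha256":
        if key_bits < 1024:
            return "fail"    # below the required minimum key size
        if key_bits < 2048:
            return "weak"    # allowed, but under the recommended size
        return "ok"
    return "unknown"         # e.g. Ed25519, not covered by this sketch

for alg, bits in [("rsa-sha256", 2048), ("rsa-sha256", 1024),
                  ("rsa-sha1", 1024), ("rsa-sha256", 768)]:
    print(alg, bits, rfc8301_verdict(alg, bits))
```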
Over the past ten full days, our central mail server has seen almost 85,000 DKIM signatures on 75,100 messages (a single message can have multiple DKIM signatures). Over the same time the machine received about 96,000 messages (7,000 of them internally generated by users and machines here). Signature algorithms break down as follows:
  44074  a=rsa-sha256 b=1024
  37865  a=rsa-sha256 b=2048
   7141  a=rsa-sha1   b=1024
    311  a=rsa-sha1   b=2048
     18  a=rsa-sha256 b=1016
      8  a=rsa-sha256 b=768
      5  a=rsa-sha256 b=1032
      4  a=rsa-sha256 b=4096
      3  a=rsa-sha1   b=4096
      1  a=rsa-sha256 b=3072
      1  a=rsa-sha1   b=768
      1  a=rsa-sha1   b=2056
If I look only at verified signatures, the numbers are a bit different:
  40270  a=rsa-sha256 b=1024
  32221  a=rsa-sha256 b=2048
   1880  a=rsa-sha1   b=1024
    205  a=rsa-sha1   b=2048
      5  a=rsa-sha256 b=768
      4  a=rsa-sha256 b=1032
      3  a=rsa-sha1   b=4096
      1  a=rsa-sha256 b=4096
      1  a=rsa-sha1   b=768
      1  a=rsa-sha1   b=2056
(Despite RFC 8301, Exim remains willing to verify DKIM signatures using either or both of rsa-sha1 and keys under 1024 bits.)
The largest shrinkage is in 1024-bit rsa-sha1. Since our central mail server sees messages after their subject line may have been marked as spam, some of this drop may be spammers using 1024-bit rsa-sha1. In general our external SMTP gateway sees significantly fewer 'headers probably modified' verification mismatches than our central mail server does. But even our external SMTP gateway sees about 4,400 'headers probably modified' mismatches over the same ten day period.
(And even on our central mail server about 74,600 DKIM signatures across about 62,200 email messages did verify. So a lot of our email does have good DKIM signatures.)
PS: It's a more or less deliberate design decision that if we think a message is spam, we break the DKIM signature by tagging the Subject with a marker. Us tagging the Subject predates any widespread use of DKIM and people here expect it, but when DKIM started to be a thing we (I) thought about it and decided that this was a feature.
Signing email with DKIM is becoming increasingly mandatory in practice
For our sins, we forward a certain amount of email to GMail (which is to say that it's sent to addresses here and then we send it onward to GMail). These days, GMail rejects a certain amount of that email at SMTP time with a message that some people will find very familiar:
550-5.7.26 This message does not have authentication information or fails to pass authentication checks (SPF or DKIM). [...]
(They helpfully include a link to their help section on "Make sure your messages are authenticated".)
As far as we can see from outside, there are two ways to pass this authentication requirement. First, the sending IP can be covered by actively positive SPF authorization, such as a '+a' clause. GMail actively ignores '~all', so I suspect that they also ignore '+all'. Second, you can DKIM sign your messages.
There are people who don't like email forwarding, but I can assure them that it definitely happens, possibly still a lot. Unless you want your email not to be accepted by GMail when forwarded, this means you need to DKIM sign it, because forwarded email won't pass SPF (and no, the world won't implement SRS).
GMail is not the only large email provider, but they are one of the influential ones. Where GMail goes today, others are likely to follow soon enough, if they haven't already. And even if other providers (or GMail) accept the message at SMTP time, they might use something similar to these requirements as part of deciding whether or not to file the new message away as spam.
I'm not really fond of the modern mail environment and how complex it's become. But it is what it is, so we get to live with it. If your mail system is capable of DKIM signing messages but you're not doing so yet, you should probably start. If your mailer can't DKIM sign messages, you probably need to look into fixing that in one way or another.
(We're lucky in that we're DKIM signing locally generated messages, and unlucky in that we do forward messages and so we're trying to figure out what we can do to help when the message isn't DKIM signed.)
Appendix: The uncertainty of SRS and GMail
SPF's usual answer to how it breaks forwarding messages is SRS. However, it's not clear that SRS or any other scheme of rewriting just the envelope sender will pass GMail's SMTP authentication checks, because GMail's help specifically says (with emphasis mine):
For SPF and DKIM to authenticate a message, the message From: header must match the sending domain. Messages must pass either the SPF or the DKIM check to be authenticated.
SRS and similar schemes normally rewrite the envelope sender but not the message From:, and so would not pass what GMail says is their check (whether it actually is, who knows). Effectively GMail is insisting on DMARC alignment even without DMARC in the picture.
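Here is a small Python sketch of why envelope rewriting doesn't help with a check of this kind. It compares the envelope sender's domain to the From: header's domain, which is my reading of what GMail's help text describes and not anything known about their real implementation. The rewritten address only imitates the general shape of an SRS address, and all of the domains are placeholders.

```python
from email.utils import parseaddr

def header_domain(from_header: str) -> str:
    """Domain part of the address in a From: header value."""
    return parseaddr(from_header)[1].rpartition("@")[2].lower()

def envelope_domain(sender: str) -> str:
    """Domain part of a bare envelope sender address."""
    return sender.rpartition("@")[2].lower()

def spf_domain_matches_from(envelope_sender: str, from_header: str) -> bool:
    """True if the domain SPF would check equals the From: domain."""
    return envelope_domain(envelope_sender) == header_domain(from_header)

from_header = "Some Person <person@origin.example>"
original_sender = "person@origin.example"
# Hypothetical SRS-style rewritten sender used by the forwarding system.
rewritten_sender = "SRS0=HHH=TT=origin.example=person@forwarder.example"

print(spf_domain_matches_from(original_sender, from_header))
print(spf_domain_matches_from(rewritten_sender, from_header))
```

With the rewritten sender, SPF itself can pass for forwarder.example, but the domain it passed for no longer matches the From: header.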
We need a way to scan Microsoft Office files for malware
For reasons beyond the scope of this entry, for the past couple of years I've been running a large commercial anti-spam system (and its malware recognition) side by side with what we could put together with ClamAV and some low-cost commercial ClamAV signature sources. Since the commercial anti-spam system is on the way out, one of the things I keep an eye on is what it detects as malware that ClamAV misses (and then I try to figure out if there's some message signature we can use to block it, like a .scr file inside a .7z attachment). More or less from the beginning and continuing on through the last time I mentioned this, one significant area where the commercial system is better is detecting bad stuff in Microsoft Office files.
(The commercial system has also picked up stuff in PDFs that ClamAV doesn't. In general it feels like it's better at finding bad stuff in complex and nested file formats, but I haven't looked at this closely.)
With the end of service life of the commercial software getting closer and closer, my feelings that we should actively try to do something about this are getting bigger and bigger. We unfortunately can't completely block Microsoft Office macros (some of our users do get legitimate email with them included), which I understand are one of the big vectors, but there are probably others. As far as I know, the only good open source tool for scanning Microsoft Office files is the oletools Python package, and conveniently we're already scanning email with a Python program.
Oletools has some support for identifying Microsoft Office files with 'bad stuff', but I believe it's partly in the form of a command line tool, mraptor, which has no API documentation for using it as a package. Now that I look more closely, there's also oleid and olevba. The command line tools don't look like they have an output format that's good for script usage, although I may not be looking closely enough at their options. If people have wrapped these up in canned tools to scan an attachment and give you an indicator of how bad it is, I can't find such tools in some Internet searches.
Right now one issue is the same one we had with attachment types, where we didn't know what sort of attachments our users got, both in legitimate email and in spam. Today we don't know what sorts of things are in the Microsoft Office files our users receive. How prevalent are macros, embedded OLE objects, macros with suspicious attributes, and so on? Since it seems unlikely we'll be able to get a Microsoft Office scanning tool (either open source or commercial) that gives us a carefully curated 'good' or 'bad' answer, we're going to have to work that out based on our usage patterns, and that means learning what the usage patterns are.
So probably the first thing I need to do is make our attachment scanning program more complicated by having it use oletools to analyze Microsoft Office files and record information about them, just as we record file extension information for files in archives.
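As a very rough idea of what 'recording information' could look like, here is a Python sketch that uses only the standard library, not oletools. It only handles the modern zip-based Office formats (.docx, .docm, .xlsm and so on), where VBA macros are stored in a 'vbaProject.bin' member and embedded objects live under an 'embeddings' directory; legacy OLE2 files such as .doc would need oletools or something like it. The function name and the fields it records are hypothetical.

```python
import io
import zipfile

def ooxml_features(data: bytes) -> dict:
    """Record simple facts about a zip-based (OOXML) Office file."""
    info = {"is_zip": False, "has_vba": False, "embedded_objects": 0}
    if not zipfile.is_zipfile(io.BytesIO(data)):
        return info  # legacy OLE2 file or something else entirely
    info["is_zip"] = True
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for name in zf.namelist():
            lower = name.lower()
            if lower.endswith("vbaproject.bin"):
                info["has_vba"] = True
            elif "/embeddings/" in lower and not lower.endswith("/"):
                info["embedded_objects"] += 1
    return info

# Build a fake macro-enabled document in memory for demonstration.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("[Content_Types].xml", "<Types/>")
    zf.writestr("word/document.xml", "<document/>")
    zf.writestr("word/vbaProject.bin", b"not real VBA")
    zf.writestr("word/embeddings/oleObject1.bin", b"not a real object")
print(ooxml_features(buf.getvalue()))
```

This says nothing about whether a macro is malicious. It only answers the 'how prevalent are macros and embedded objects' question, which is the first thing we need to know.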
(I would dearly love to be able to pay someone for this, but as far as I know there's nothing. Paying other people for malware detection is in my opinion better than trying to do it myself, partly because I'm never going to be a full time specialist at this and there's some chance that people we pay will be.)
Some things on strict and relaxed DKIM alignment in DMARC
To simplify, DMARC primarily works by verifying that messages have a DKIM signature that matches their From: domain. There are two modes for this matching. In 'strict DKIM identifier alignment', the From: domain and the DKIM domain must match exactly; if you send with a From: of news.example.com, only a DKIM signature from news.example.com will match (other DKIM signatures may be present but will be ignored by DMARC). In 'relaxed DKIM identifier alignment', which is the default, any DKIM signature from example.com will work; it could still be news.example.com, but it could also be 'example.com' or 'mta-group.example.com'.
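The two modes can be sketched in a few lines of Python. A real implementation finds the organizational domain using the Public Suffix List; the tiny PUBLIC_SUFFIXES set here is a stand-in I made up for illustration, and the function names are hypothetical.

```python
# Assumption: a tiny stand-in for the Public Suffix List.
PUBLIC_SUFFIXES = {"com", "edu", "org", "ca"}

def organizational_domain(domain: str) -> str:
    """Return the public suffix plus one label (simplified)."""
    labels = domain.lower().rstrip(".").split(".")
    for i in range(len(labels)):
        if ".".join(labels[i:]) in PUBLIC_SUFFIXES:
            return ".".join(labels[max(i - 1, 0):])
    return ".".join(labels)  # no known suffix: use the name as-is

def dkim_aligned(from_domain: str, dkim_domain: str,
                 strict: bool = False) -> bool:
    """Check DKIM identifier alignment in strict or relaxed mode."""
    if strict:
        return from_domain.lower() == dkim_domain.lower()
    return (organizational_domain(from_domain)
            == organizational_domain(dkim_domain))

for d in ["news.example.com", "example.com", "mta-group.example.com"]:
    print(d,
          "relaxed:", dkim_aligned("news.example.com", d),
          "strict:", dkim_aligned("news.example.com", d, strict=True))
```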
The advantage of relaxed alignment is that it makes operation of a central mail sending infrastructure easier (or more generally, mail sending infrastructure that's somewhat detached from the people using it). One group can run outgoing mail, sign everything as 'example.com', and the marketing department doesn't have to bug them for special configuration changes when they want to create 'news.example.com' and start using it (or at least, not as many). If another group sets up special mail-out infrastructure that the marketing department will use, nothing much has to change, since the new group can set up their own DKIM keys and start signing as 'bulk-mta.example.com'. DMARC will be happy all around.
The disadvantage of relaxed alignment is that anyone in your organization who runs their own mail server can send email that passes DMARC for anything in your organization, whether or not they're supposed to use that From: address. Perhaps the marketing department is only supposed to send email as From: news.example.com, but once they have a DKIM key, relaxed alignment will let them send as From: example.com, or support.example.com, or whatever. This also applies to any third party mail sending service that you've delegated DKIM keys to. If marketing has hired MailService to send email as 'newsblast.example.com' and has had you add CNAMEs to MailService's DKIM keys in that subdomain, MailService (or anyone who compromises them) can use those DKIM keys to send DMARC-validated email that is From: example.com itself, or From: 'security.example.com', and so on.
If you have an organization that is either small or quite centralized or both, relaxed alignment may make your job easier, especially if people create (and perhaps remove) a lot of From: domain and host names as projects come and go. The central mail people can just sign everything as 'example.com' and be done with it, without needing to keep track of what has DKIM selectors and what they are and so on. Relaxed alignment also makes it easier to transition from plain DKIM (where the DKIM domain mostly identifies the sending mail server) to DMARC, since all of your mail servers will be using a DKIM domain of <something>.example.com, and all of those pass DMARC for any From: in example.com.
Another way to put it is that relaxed alignment decouples DKIM keys and subdomains from DMARC validation as long as they're all within your organizational domain (such as 'example.com'). Your MTA people can have their own naming scheme for the choice of DKIM signing domains and DKIM keys, and your mail sending users can pick their From: addresses independently of that. You can readily have different outgoing MTAs that people pick between based on various circumstances, possibly including things like geographic or network location.
If you have a large, highly distributed organization with fairly autonomous units, such as a large university, relaxed alignment becomes somewhat alarming. Sub-groups will have their own email sending infrastructure with its own DKIM keys, and if they don't carefully restrict what From: addresses they allow and just sign more or less anything that passes through them, you've just given people with access to 'dept.example.edu' the ability to send DMARC valid email with a From: of 'firstname.lastname@example.org' or 'chair@deptB.example.edu'. You may not want that. This is the downside of that exact same decoupling of DKIM keys and DMARC validation that we had before.
Some versions of this may not even be malicious, just have undesirable consequences. The publicity group of dept.example.edu may have hired MailService to send out mail blasts that are normally from 'news.dept.example.edu' (and have DKIM keys set up for it), but now they want to send out a special blast using 'email@example.com'. This will pass DMARC with the DKIM CNAMEs that MailService and the publicity group already have, and if receivers object to it, it may contaminate the reputation of '@example.edu' generally. With strict alignment, you force the publicity group to slow down and talk to someone before they execute this clever idea.
(Whether or not MailService would flag or block this (with relaxed alignment) is an interesting question. After all, your own DMARC policies say that this is okay, and maybe your organizational policies are fine with it.)
Notes on using DKIM in a DMARC world
By itself, DKIM simply creates an attestation that some domain (or host) has touched an email message, in the form of a DKIM signature that names that domain (really a DNS name) in its 'd=' parameter. If you have an email server that handles (outgoing) email for a bunch of host and domain names, and you think of yourself as primarily one of them, say, 'cs.toronto.edu', then you can have your email server generate DKIM signatures using this primary domain regardless of which one of your assorted historical and current domains someone is using for their email today. You can even sign email that passes through you that is from other, outside domains to attest that it genuinely came through you, if you want.
(You may not want to sign other people's email for social reasons, since a DKIM signature may be seen as taking responsibility for it and you may be forwarding unwanted email, but DKIM itself considers this perfectly valid. Messages can and not infrequently do have multiple DKIM signatures for the various parties that are associated with the email or that have touched it in processing. I put together some statistics on this in late 2018 with a bit more in 2020.)
This doesn't work so well once you throw in DMARC. DMARC is specifically concerned with validating the domain in the From: header address, so it wants to find a DKIM signature with a 'd=' that matches that domain. Well, sort of. As covered in RFC 7489 section 3.1.1, it's possible to require only that the 'organizational domain' matches, not that there is an exact match. This is called 'relaxed DKIM identifier alignment' (as opposed to 'strict' mode), and I believe it's the default. If I'm understanding relaxed alignment correctly, then DMARC would accept a DKIM signature with 'd=cs.toronto.edu' for a From: subdomain of '<any>.toronto.edu', and I think even 'toronto.edu' itself.
(However, it wouldn't be accepted for a From: of cs.utoronto.ca, since the organizational domains differ.)
If you have multiple historical and current subdomains and domains that are used for outgoing email (as we do), the safest thing to do is to always DKIM sign for the specific subdomain used in the From: of the current message. You don't need to use different DKIM keys unless you want to (it will probably be simpler not to) and you can reuse the same DKIM selector name, but each (sub) domain will need a DNS record for the selector you're using. The simple approach is to make them all DNS CNAMEs to the selector record in your primary (sub)domain. This gives you advance protection against any need or desire of people to switch your DMARC over to strict DKIM identifier alignment.
(The implications of strict versus relaxed DKIM identifier alignment are something for another entry, but the more I think about it, the more I think we're going to wind up with strict alignment sooner or later.)
Because I looked it up, DMARC policies are checked in DNS on the specific subdomain in the From: and on the organizational domain (if they're different), but not on any intermediate subdomains. So if you have, for example, 'teach.cs.toronto.edu', its DMARC policies will be looked up on it and on toronto.edu, but not on cs.toronto.edu. This applies equally if the From: 'domain' is really a host name. If you send out email using lots of individual host names and you have to use strict DKIM identifier alignment, you're probably not going to enjoy it (unless all of the DNS provisioning and mailer configuration is automated).
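The lookup rule is easy to express as a small Python sketch. DMARC policy records live at '_dmarc.<domain>'; the function name is hypothetical, and the organizational domain is passed in as a parameter because working it out properly requires the Public Suffix List.

```python
def dmarc_lookup_names(from_domain: str, org_domain: str) -> list:
    """DNS names checked for a DMARC policy, in lookup order."""
    from_domain = from_domain.lower().rstrip(".")
    org_domain = org_domain.lower().rstrip(".")
    names = ["_dmarc." + from_domain]
    if org_domain != from_domain:
        # Only the organizational domain is tried as a fallback;
        # intermediate subdomains are never consulted.
        names.append("_dmarc." + org_domain)
    return names

print(dmarc_lookup_names("teach.cs.toronto.edu", "toronto.edu"))
print(dmarc_lookup_names("toronto.edu", "toronto.edu"))
```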
PS: We did start DKIM signing our email, using a single DKIM domain for everything because that's by far the simplest solution in Exim and DMARC wasn't on our minds until, basically, right now. Now that we're dealing with DMARC (for reasons beyond the scope of this entry), we're going to have to change our DKIM signing a bit so it looks at the From: domain and is more specific.
Understanding what a DKIM (spam) replay attack is
I recently read A breakdown of a DKIM replay attack (via), which introduced me to the idea of a DKIM (spam) replay attack. In a DKIM spam replay attack, an attacker arranges to somehow send one or more messages with spam content through your system, and then saves the full message, complete with your DKIM signature. Once they have this single copy, they can use other SMTP servers to (re)send it to all sorts of recipients, since in SMTP and in mailers in general, the recipients come from (unsigned) envelope information, not the (signed and thus unchangeable) message.
As Protonmail notes, the damage is made worse if the attacker can somehow persuade you to create a DKIM signature that doesn't cover headers such as Subject: or To:, for example by omitting them from the initial message they send. If the DKIM signature doesn't cover these headers for whatever reason, the attacker can add them after the fact and the message will still pass DKIM validation, and mail clients (and mail systems) will probably not flag that the message Subject and other things being shown to people are not actually signed. The attacker can also add an additional Subject: header (or other headers) to see if the recipient's overall mail system validates the DKIM signature with one but shows the other.
DKIM signatures can be made over missing headers, which can be used to 'seal' certain headers so that additional versions of them can't be added. When I experimented with our Exim setup, which uses default Exim DKIM parameters, it did sign missing To: headers, effectively sealing them, but it doesn't currently seal any headers against additions.
(Exim takes its default header list to sign from RFC 4871. That's been obsoleted by RFC 6376, but our Ubuntu 18.04 version of Exim is definitely using the RFC 4871 list, not the RFC 6376 list, since what it signs includes headers like Message-ID: and the MIME headers.)
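To illustrate what sealing means in terms of the signature's 'h=' list, here is a minimal Python sketch. It relies on the DKIM rule that naming a header more times than it appears in the message signs its absence, so a copy added later breaks the signature. The function and parameter names are hypothetical, and this is not how Exim builds its list.

```python
def header_list_to_sign(message_headers, wanted, seal=()):
    """Build the list of header names for a DKIM 'h=' tag.

    message_headers: header names present in the message.
    wanted: names to sign. Each present copy is listed once; a missing
            header is listed once, which signs (and seals) its absence.
    seal: names to list one extra time, so that a copy added after
          signing would invalidate the signature.
    """
    present = [h.lower() for h in message_headers]
    sealed = {s.lower() for s in seal}
    result = []
    for name in wanted:
        count = present.count(name.lower())
        result.extend([name] * max(count, 1))
        if count > 0 and name.lower() in sealed:
            result.append(name)  # one more than is actually present
    return result

# A message that was sent with no To: header at all.
headers = ["From", "Subject", "Date"]
names = header_list_to_sign(headers, ["From", "To", "Subject", "Date"],
                            seal=["Subject"])
print("h=" + ":".join(names))
```

Here the missing To: is sealed simply by being listed, while Subject: is only sealed against additions because it's listed twice.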
Finding out about DKIM replay attacks has made me consider what we might do about them. Right now I can't think of very much we could do (although I can think of a certain amount of clever ideas for bigger, more complex places with more infrastructure). However, perhaps we should have a second set of DKIM keys pre-configured into our DNS and ready to go live, so that we can switch at the drop of a hat if we ever have to (well, with a simple configuration file change).
(I think that rotating your DKIM keys regularly might help to some degree, but my assumption is that someone who manages to get you to DKIM sign a bad message is most likely going to start their mass sending activities almost immediately. If nothing else, the longer they wait the more out of place the message's (signed) Date: header will look.)
Sadly, my experience is that big commercial anti-malware detection is better
For reasons beyond the scope of this entry, for the past couple of years I've been running a large commercial anti-spam system (and its malware recognition) side by side with what we could put together with ClamAV and some low-cost commercial ClamAV signature sources. More or less from the beginning it's been clear to me that our commercial system was recognizing malware that ClamAV was not. Some of this was new things that we could add to our manual recognition and rejection, but at this point another significant source of missed ClamAV recognition is (still) malware in Microsoft Office files.
This is not really a result that I was hoping for. Our commercial anti-spam system has been on vendor life support for more than a year, so its recognition engine definitely isn't being updated for new capabilities and who knows how much its signature database is being updated. Despite that, it's still ahead of a well regarded open source malware detection system.
Some amount of bad email makes it through both ClamAV and our commercial anti-spam system and is then forwarded on to elsewhere by some of our users. These days, that elsewhere includes both Office365 and GMail. Trawling our logs suggests that both of these recognize and reject even more malware than we do, although this effect is somewhat entangled in them also recognizing more spam than we do.
This is not really surprising. Large providers of email and of anti-spam services have more resources for both improving their scanning engines and coming up with signatures and danger signs. They see more email (one way or another) and can build more sophisticated systems to analyze it in various ways. Greater volume with automated analysis and feedback systems can mean faster responses to new malware. It's not really surprising that the open source and small commercial firms can't match this.
(One suggestive thing is that our commercial anti-spam software provider is not getting out of the anti-spam business. Instead, it's moving to having only a cloud filtering option, where you run your incoming email through their cloud systems. This gives them far more aggregate visibility into potential malware and makes responding to it much faster. I suspect that they were pushed to this partly to match the malware filtering quality of the big providers like Google and Microsoft.)
PS: For Microsoft Office files specifically, it might be possible for us to build something using oletools, and we may have to try to, just to not let too much bad stuff through once we can no longer use the commercial anti-spam software.
(This is one unhappy aspect of how running your own email is increasingly an artisanal choice. It's possible that a lot of manual tuning and adjustment and software will get us to something close to the quality of big commercial providers, but it's unlikely to be easy.)