We need a way to scan Microsoft Office files for malware
For reasons beyond the scope of this entry, for the past couple of years I've been running a large commercial anti-spam system (and its malware recognition) side by side with what we could put together with ClamAV and some low-cost commercial ClamAV signature sources. Since the commercial anti-spam system is on the way out, one of the things I keep an eye on is what it detects as malware that ClamAV misses (and then I try to figure out if there's some message signature we can use to block it, like a .scr file inside a .7z attachment). More or less from the beginning and continuing on through the last time I mentioned this, one significant area where the commercial system is better is detecting bad stuff in Microsoft Office files.
(The commercial system has also picked up stuff in PDFs that ClamAV doesn't. In general it feels like it's better at finding bad stuff in complex and nested file formats, but I haven't looked at this closely.)
With the end of service life of the commercial software getting closer and closer, my feeling that we should actively try to do something about this is getting stronger and stronger. We unfortunately can't completely block Microsoft Office macros (some of our users do get legitimate email with them included), which I understand are one of the big vectors, but there are probably others. As far as I know, the only good open source tool for scanning Microsoft Office files is the oletools Python package, and conveniently we're already scanning email with a Python program.
Oletools has some support for identifying Microsoft Office files with 'bad stuff', but I believe it's partly in the form of a command line tool, mraptor, which has no API documentation for using it as a package. Now that I look more closely, there's also oleid and olevba. The command line tools don't look like they have an output format that's good for script usage, although I may not be looking closely enough at their options. If people have wrapped these up in canned tools to scan an attachment and give you an indicator of how bad it is, I can't find such tools in some Internet searches.
Right now one issue is the same one we had with attachment types, where we didn't know what sort of attachments our users got, both in legitimate email and in spam. Today we don't know what sorts of things are in the Microsoft Office files our users receive. How prevalent are macros, embedded OLE objects, macros with suspicious attributes, and so on? Since it seems unlikely we'll be able to get a Microsoft Office scanning tool (either open source or commercial) that gives us a carefully curated 'good' or 'bad' answer, we're going to have to work that out based on our usage patterns, and that means learning what the usage patterns are.
So probably the first thing I need to do is make our attachment scanning program more complicated by having it use oletools to analyze Microsoft Office files and record information about them, just as we record file extension information for files in archives.
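As a rough sketch of what that recording step might look like, the following uses the VBA_Parser class from oletools' olevba module to note whether an attachment has VBA macros and how many indicators of each type olevba's analysis reports. The function names and the record format are made up for illustration, and it assumes oletools is installed. It doesn't produce a 'good' or 'bad' verdict, just raw material for learning what our usage patterns are.

```python
from collections import Counter


def summarize_indicators(results):
    """Count olevba-style analysis results by indicator type.

    `results` is an iterable of (type, keyword, description) tuples,
    which is the shape VBA_Parser.analyze_macros() returns.
    """
    return Counter(kw_type for kw_type, _keyword, _description in results)


def scan_office_attachment(filename, data):
    """Return a small dict describing one attachment (hypothetical format)."""
    # Imported lazily so the summarizing helper works without oletools.
    from oletools.olevba import VBA_Parser

    # VBA_Parser raises an exception for files it can't parse as Office
    # documents; a real scanner would catch and record that too.
    parser = VBA_Parser(filename, data=data)
    try:
        has_macros = bool(parser.detect_vba_macros())
        indicators = Counter()
        if has_macros:
            indicators = summarize_indicators(parser.analyze_macros())
    finally:
        parser.close()
    return {
        "filename": filename,
        "has_macros": has_macros,
        "indicators": dict(indicators),
    }
```

Logging the resulting dict next to the existing file extension information would be enough to start seeing how common macros and 'Suspicious' or 'AutoExec' indicators are in legitimate email versus spam.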
(I would dearly love to be able to pay someone for this, but as far as I know there's nothing. Paying other people for malware detection is in my opinion better than trying to do it myself, partly because I'm never going to be a full time specialist at this and there's some chance that people we pay will be.)
Some things on strict and relaxed DKIM alignment in DMARC
To simplify, DMARC primarily works by verifying that messages have a DKIM signature that matches their From: domain. There are two modes for this matching. In 'strict DKIM identifier alignment', the From: domain and the DKIM domain must match exactly; if you send with a From: of news.example.com, only a DKIM signature from news.example.com will match (other DKIM signatures may be present but will be ignored by DMARC). In 'relaxed DKIM identifier alignment', which is the default, any DKIM signature from example.com will work; it could still be news.example.com, but it could also be 'example.com' or 'mta-group.example.com'.
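Here is a minimal sketch of the two matching modes. Finding the organizational domain properly requires the Public Suffix List; the tiny PUBLIC_SUFFIXES set below is a made-up stand-in so that the example is self-contained, and the function names are likewise only for illustration.

```python
# Hypothetical, tiny stand-in for the Public Suffix List.
PUBLIC_SUFFIXES = {"com", "edu", "org", "ca"}


def organizational_domain(domain):
    """Return the public suffix plus one label (simplified)."""
    labels = domain.lower().rstrip(".").split(".")
    # Try the longest candidate suffix first.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in PUBLIC_SUFFIXES:
            return ".".join(labels[max(i - 1, 0):])
    return ".".join(labels)


def dkim_aligned(from_domain, dkim_domain, strict=False):
    """Does a DKIM d= domain align with the From: domain?"""
    f = from_domain.lower().rstrip(".")
    d = dkim_domain.lower().rstrip(".")
    if strict:
        # Strict mode: the two domains must match exactly.
        return f == d
    # Relaxed mode: only the organizational domains must match.
    return organizational_domain(f) == organizational_domain(d)
```

With this, dkim_aligned("news.example.com", "mta-group.example.com") is true in relaxed mode and false with strict=True.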
The advantage of relaxed alignment is that it makes operation of a central mail sending infrastructure easier (or more generally, mail sending infrastructure that's somewhat detached from the people using it). One group can run outgoing mail, sign everything as 'example.com', and the marketing department doesn't have to bug them for special configuration changes when they want to create 'news.example.com' and start using it (or at least, not as many). If another group sets up special mail-out infrastructure that the marketing department will use, nothing much has to change, since the new group can set up their own DKIM keys and start signing as 'bulk-mta.example.com'. DMARC will be happy all around.
The disadvantage of relaxed alignment is that anyone in your organization who runs their own mail server can send email that passes DMARC for anything in your organization, whether or not they're supposed to use that From: address. Perhaps the marketing department is only supposed to send email as From: news.example.com, but once they have a DKIM key, relaxed alignment will let them send as From: example.com, or support.example.com, or whatever. This also applies to any third party mail sending service that you've delegated DKIM keys to. If marketing has hired MailService to send email as 'newsblast.example.com' and has had you add CNAMEs to MailService's DKIM keys in that subdomain, MailService (or anyone who compromises them) can use those DKIM keys to send DMARC-validated email that is From: example.com itself, or From: 'security.example.com', and so on.
If you have an organization that is either small or quite centralized or both, relaxed alignment may make your job easier, especially if people create (and perhaps remove) a lot of From: domain and host names as projects come and go. The central mail people can just sign everything as 'example.com' and be done with it, without needing to keep track of what has DKIM selectors and what they are and so on. Relaxed alignment also makes it easier to transition from plain DKIM (where the DKIM domain mostly identifies the sending mail server) to DMARC, since all of your mail servers will be using a DKIM domain of <something>.example.com, and all of those pass DMARC for any From: in example.com.
Another way to put it is that relaxed alignment decouples DKIM keys and subdomains from DMARC validation as long as they're all within your organizational domain (such as 'example.com'). Your MTA people can have their own naming scheme for the choice of DKIM signing domains and DKIM keys, and your mail sending users can pick their From: addresses independently of that. You can readily have different outgoing MTAs that people pick between based on various circumstances, possibly including things like geographic or network location.
If you have a large, highly distributed organization with fairly autonomous units, such as a large university, relaxed alignment becomes somewhat alarming. Sub-groups will have their own email sending infrastructure with its own DKIM keys, and if they don't carefully restrict what From: addresses they allow and just sign more or less anything that passes through them, you've just given people with access to 'dept.example.edu' the ability to send DMARC valid email with a From: of 'email@example.com' or 'chair@deptB.example.edu'. You may not want that. This is the downside of that exact same decoupling of DKIM keys and DMARC validation that we had before.
Some versions of this may not even be malicious, just have undesirable consequences. The publicity group of dept.example.edu may have hired MailService to send out mail blasts that are normally from 'news.dept.example.edu' (and have DKIM keys set up for it), but now they want to send out a special blast using 'firstname.lastname@example.org'. This will pass DMARC with the DKIM CNAMEs that MailService and the publicity group already have, and if receivers object to it, it may contaminate the reputation of '@example.edu' generally. With strict alignment, you force the publicity group to slow down and talk to someone before they execute this clever idea.
(Whether or not MailService would flag or block this (with relaxed alignment) is an interesting question. After all, your own DMARC policies say that this is okay, and maybe your organizational policies are fine with it.)
Notes on using DKIM in a DMARC world
By itself, DKIM simply creates an attestation that some domain (or host) has touched an email message, in the form of a DKIM signature that names that domain (really a DNS name) in its 'd=' parameter. If you have an email server that handles (outgoing) email for a bunch of host and domain names, and you think of yourself as primarily one of them, say, 'cs.toronto.edu', then you can have your email server generate DKIM signatures using this primary domain regardless of which one of your assorted historical and current domains someone is using for their email today. You can even sign email that passes through you that is from other, outside domains to attest that it genuinely came through you, if you want.
(You may not want to sign other people's email for social reasons, since a DKIM signature may be seen as taking responsibility for it and you may be forwarding unwanted email, but DKIM itself considers this perfectly valid. Messages can and not infrequently do have multiple DKIM signatures for the various parties that are associated with the email or that have touched it in processing. I put together some statistics on this in late 2018 with a bit more in 2020.)
This doesn't work so well once you throw in DMARC.
DMARC is specifically concerned with validating the domain in the From: header address, so it wants to find a DKIM signature with a 'd=' that matches that domain. Well, sort of. As covered in RFC 7489 section 3.1.1, it's possible to require only that the 'organizational domain' matches, not that there is an exact match. This is called 'relaxed DKIM identifier alignment' (as opposed to 'strict' mode), and I believe it's the default. If I'm understanding relaxed alignment correctly, then DMARC would accept a DKIM signature with 'd=cs.toronto.edu' for a From: subdomain of '<any>.toronto.edu', and I think even 'toronto.edu' itself.
(However, it wouldn't be accepted for a From: of cs.utoronto.ca, since the organizational domains differ.)
If you have multiple historical and current subdomains and domains that are used for outgoing email (as we do), the safest thing to do is to always DKIM sign for the specific subdomain used in the From: of the current message. You don't need to use different DKIM keys unless you want to (it will probably be simpler not to) and you can reuse the same DKIM selector name, but each (sub) domain will need a DNS record for the selector you're using. The simple approach is to make them all DNS CNAMEs to the selector record in your primary (sub)domain. This gives you advance protection against any need or desire of people to switch your DMARC over to strict DKIM identifier alignment.
(The implications of strict versus relaxed DKIM identifier alignment are something for another entry, but the more I think about it, the more I think we're going to wind up with strict alignment sooner or later.)
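To make the CNAME approach concrete, here is a small sketch that generates the zone file lines for it. DKIM public keys live at '<selector>._domainkey.<domain>'; the selector name and the domains in the usage note are hypothetical, as is the function itself.

```python
def selector_cnames(selector, primary_domain, other_domains):
    """Zone file lines pointing each domain's selector at the primary one."""
    target = f"{selector}._domainkey.{primary_domain}."
    return [
        f"{selector}._domainkey.{domain}. IN CNAME {target}"
        for domain in other_domains
    ]
```

For example, selector_cnames("sel1", "cs.example.edu", ["old.example.edu"]) gives the single line 'sel1._domainkey.old.example.edu. IN CNAME sel1._domainkey.cs.example.edu.'. The mailer then signs with d= set to the From: domain while still using one key and one selector name.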
Because I looked it up, DMARC policies are checked in DNS on the specific subdomain in the From: and on the organizational domain (if they're different), but not on any intermediate subdomains. So if you have, for example, 'teach.cs.toronto.edu', its DMARC policies will be looked up on it and on toronto.edu, but not on cs.toronto.edu. This applies equally if the From: 'domain' is really a host name. If you send out email using lots of individual host names and you have to use strict DKIM identifier alignment, you're probably not going to enjoy it (unless all of the DNS provisioning and mailer configuration is automated).
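As a small illustration of that lookup rule, this sketch lists the DNS names where a receiver would look for a DMARC record. The organizational domain is passed in explicitly here (working it out needs the Public Suffix List), and the function name is made up.

```python
def dmarc_lookup_names(from_domain, org_domain):
    """DNS names checked for a DMARC TXT record, in order.

    As I read RFC 7489, the organizational domain is only consulted if
    the From: domain itself has no DMARC record.
    """
    f = from_domain.lower().rstrip(".")
    o = org_domain.lower().rstrip(".")
    names = [f"_dmarc.{f}"]
    if o != f:
        names.append(f"_dmarc.{o}")
    # Intermediate subdomains between the two are never queried.
    return names
```

For 'teach.cs.toronto.edu' this gives '_dmarc.teach.cs.toronto.edu' and then '_dmarc.toronto.edu', with nothing for cs.toronto.edu.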
PS: We did start DKIM signing our email, using a single DKIM domain for everything because that's by far the simplest solution in Exim and DMARC wasn't on our minds until, basically, right now. Now that we're dealing with DMARC (for reasons beyond the scope of this entry), we're going to have to change our DKIM signing a bit so it looks at the From: domain and is more specific.
Understanding what a DKIM (spam) replay attack is
I recently read A breakdown of a DKIM replay attack (via), which introduced me to the idea of a DKIM (spam) replay attack. In a DKIM spam replay attack, an attacker arranges to somehow send one or more messages with spam content through your system, and then saves the full message, complete with your DKIM signature. Once they have this single copy, they can use other SMTP servers to (re)send it to all sorts of recipients, since in SMTP and in mailers in general, the recipients come from (unsigned) envelope information, not the (signed and thus unchangeable) message.
As Protonmail notes, the damage is made worse if the attacker can somehow persuade you to create a DKIM signature that doesn't cover headers such as Subject: and To:, for example by omitting them from the initial message they send. If the DKIM signature doesn't cover these headers for whatever reason, the attacker can add them after the fact and the message will still pass DKIM validation, and mail clients (and mail systems) will probably not flag that the message Subject and other things being shown to people are not actually signed. The attacker can also add an additional Subject: header (or other headers) to see if the recipient's overall mail system validates the DKIM signature with one but shows the other.
DKIM signatures can be made over missing headers, which can be used to 'seal' certain headers so that additional versions of them can't be added. When I experimented with our Exim setup, which uses default Exim DKIM parameters, it did sign missing To: headers, effectively sealing them, but it doesn't currently seal any headers against additions.
(Exim takes its default header list to sign from RFC 4871. That's been obsoleted by RFC 6376, but our Ubuntu 18.04 version of Exim is definitely using the RFC 4871 list, not the RFC 6376 list, since it signs headers like Message-ID: and the MIME headers.)
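One way to see what a given signature protects is to compare the header names listed in the signature's h= tag with the headers actually in the message. A header that is listed more times than it appears is sealed, because the signature also covers the 'missing' copy. The following is only a sketch with made-up function names; it looks at just the first DKIM-Signature header and doesn't verify anything cryptographically.

```python
from collections import Counter
from email import message_from_string


def signed_header_counts(dkim_signature_value):
    """Count how often each header name appears in the h= tag."""
    for tag in dkim_signature_value.split(";"):
        name, _, value = tag.partition("=")
        if name.strip() == "h":
            return Counter(
                h.strip().lower() for h in value.split(":") if h.strip()
            )
    return Counter()


def header_protection(raw_message, headers=("from", "to", "subject")):
    """Classify headers as 'unsigned', 'signed', or 'sealed'."""
    msg = message_from_string(raw_message)
    # Remove folding whitespace from the signature before parsing it.
    sig = "".join(str(msg.get("DKIM-Signature", "")).split())
    signed = signed_header_counts(sig)
    report = {}
    for name in headers:
        present = len(msg.get_all(name) or [])
        if signed[name] == 0:
            report[name] = "unsigned"   # can be added or changed freely
        elif signed[name] > present:
            report[name] = "sealed"     # extra copies break the signature
        else:
            report[name] = "signed"     # covered, but copies can be added
    return report
```

Anything reported as 'unsigned' or merely 'signed' is a header where an attacker replaying the message has some room to add their own version.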
Finding out about DKIM replay attacks has made me consider what we might do about them. Right now I can't think of very much we could do (although I can think of a certain amount of clever ideas for bigger, more complex places with more infrastructure). However, perhaps we should have a second set of DKIM keys pre-configured into our DNS and ready to go live, so that we can switch at the drop of a hat if we ever have to (well, with a simple configuration file change).
(I think that rotating your DKIM keys regularly might help to some degree, but my assumption is that someone who manages to get you to DKIM sign a bad message is most likely going to start their mass sending activities almost immediately. If nothing else, the longer they wait the more out of place the message's (signed) Date: header will look.)
Sadly, my experience is that big commercial anti-malware detection is better
For reasons beyond the scope of this entry, for the past couple of years I've been running a large commercial anti-spam system (and its malware recognition) side by side with what we could put together with ClamAV and some low-cost commercial ClamAV signature sources. More or less from the beginning it's been clear to me that our commercial system was recognizing malware that ClamAV was not. Some of this was new things that we could add to our manual recognition and rejection, but at this point another significant source of missed ClamAV recognition is (still) malware in Microsoft Office files.
This is not really a result that I was hoping for. Our commercial anti-spam system has been on vendor life support for more than a year, so its recognition engine definitely isn't being updated for new capabilities and who knows how much its signature database is being updated. Despite that, it's still ahead of a well regarded open source malware detection system.
Some amount of bad email makes it through both ClamAV and our commercial anti-spam system and is then forwarded on to elsewhere by some of our users. These days, that elsewhere includes both Office365 and GMail. Trawling our logs suggests that both of these recognize and reject even more malware than we do, although this effect is somewhat entangled in them also recognizing more spam than we do.
This is not really surprising. Large providers of email and of anti-spam services have more resources for both improving their scanning engines and coming up with signatures and danger signs. They see more email (one way or another) and can build more sophisticated systems to analyze it in various ways. Greater volume with automated analysis and feedback systems can mean faster responses to new malware. It's not really surprising that open source projects and small commercial firms can't match this.
(One suggestive thing is that our commercial anti-spam software provider is not getting out of the anti-spam business. Instead, it's moving to having only a cloud filtering option, where you run your incoming email through their cloud systems. This gives them far more aggregate visibility into potential malware and makes responding to it much faster. I suspect that they were pushed to this partly to match the malware filtering quality of the big providers like Google and Microsoft.)
PS: For Microsoft Office files specifically, it might be possible for us to build something using oletools, and we may have to try to, just to not let too much bad stuff through once we can no longer use the commercial anti-spam software.
(This is one unhappy aspect of how running your own email is increasingly an artisanal choice. It's possible that a lot of manual tuning and adjustment and software will get us to something close to the quality of big commercial providers, but it's unlikely to be easy.)
We're seeing increasingly targeted and dangerous phish spam attempts
In the old days, phish spam was generally pretty crude and generally easily recognized. A lot of it still is, but we're increasingly seeing some pretty sophisticated and targeted phish spam. Some of the latest phish spam we've seen uses essentially exact duplicates of university web pages and authentication dialogs, and has relatively convincing pitches in the email to get people to click on the links. To me, this is scary and goes well beyond assuming we can be phished, as I did in 2019. In 2019, I thought that an alert person might still have a reasonable chance. Now, I think that all that's between us and a significant scale compromise is that attackers aren't that committed yet (and whatever multi-factor authentication has propagated to our user population).
The university has it somewhat worse than companies do, in that our "internal" information really isn't. Since we have a large and varied user population and almost all of our internal services websites are public, there's very little information on how the university sends out email notifications about things and what our internal websites look like that couldn't be found by a dedicated attacker. With that information in hand, the attacker could put together a basically letter-perfect fake.
(There are some technical measures the university has adopted to try to make such fake emails more obvious, but the only real mitigation is multi-factor authentication, which itself has assorted limitations.)
In light of all of this, one of the things I wonder is how long people will continue using email to deliver high-sensitivity information. One thing that has to be attractive to the university is moving to delivering all notifications about things like payroll, benefits, vacation planning, and so on (basically anything that would actively prompt people to log in) over a communication method that simply doesn't allow outsiders to send messages in.
(This is especially the case because the university already has access to such a communication method and is encouraging staff and faculty to adopt it for general use. I'm not naming the service involved because it's provided by a large commercial organization that doesn't need free publicity.)
Some thoughts on new top-level domains being used for spam
Over on Twitter, I had a little exchange:
@thatcks: Another day, another new vanity TLD that I'm never accepting email from (because of spam, of course; the dominant use of vanity TLDs in email senders is for spam).
@MrDOS: This is a self-fulfilling prophecy, though: by denying legitimate mail from these TLDs, you're guaranteeing that no one will ever be able use these TLDs for legitimate mail.
@thatcks: When the spammers get there first, the well is poisoned. Un-poisoning the well is not my (or anyone's) problem; we just not want to be fed poisoned water.
On the one hand, I think that my reaction and final tweet are not wrong. Potential receivers of email are under no obligation to help senders get it delivered, and if something only or mostly sends you spam, well, you can sensibly block it and many people will. As a result, spammers can and do poison certain things, including new top level domains (mostly generic TLDs, but sometimes country ones as well).
(Although I can't find a link to it, I believe I once saw a summary of a study on how many new gTLD domains were canceled or removed almost immediately after creation. For many active gTLDs, a surprisingly large number of new domains went away very rapidly. The study didn't conclusively say they were used primarily for spam and other bad purposes, but that was the obvious speculation.)
On the other hand, this feels uncomfortably close to pushing email further toward a closed system in practice, where only large existing senders of email can get their email accepted and other people are frozen out. Setting up a broad based block of any sort (whether a gTLD or a large network (IP) area) makes it incrementally harder for people to send email from new, not well established hosts, and anecdotally that's already hard.
On the third hand, my personal email box is a much different thing than a large mail provider. Decisions made by Google, Microsoft, and so on about who they will accept email from (and what they will require from that email) have far bigger effects than my decisions do. It also feels like the central decisions of Google and so on are fundamentally different (and more dangerous) than the aggregated distributed decisions of a large number of people, even if they come to roughly the same end result.
I don't have any firm answers, especially universal ones, but I'm not likely to change my own personal blocks. Sorry, gTLDs and people using them, but not really. In the end I care more about my mailbox than anything else, because I've just become too tired of the state of modern email.
(I have mixed views on new TLDs in general, but that's somewhat separate from their use in email.)
Errors during SMTP conversations aren't trustworthy, illustrated
Recently we had a mail problem where we could not deliver email to a particular remote destination for a while. A major Australian ISP spent six days telling us:
421 4.7.25 Temporarily rejected. Reverse DNS for <our-IP> failed. IB108
(Based on Exim log messages, this happened during the initial SMTP connection, before we even EHLO'd.)
Then later the ISP was fine again, sadly after the person trying to send mail had their attempts time out and contacted us to see if we could do anything about it. The ISP was fine before this incident, and they've been fine ever since, and no other destination reported anything like this message to us.
We did not have malfunctioning nameservers or missing reverse DNS for six days. We did not, as far as we can tell, have DNS servers that the outside world had problems reaching for six days. I suppose it's possible that this large ISP had some internal problem that prevented their DNS servers from talking to our DNS servers for six days, but not so big that they noticed it and dealt with it right away. Alternately, perhaps this ISP was not being honest with us about why they decided not to accept connections from our outgoing email server. We can't tell.
(During the six day problem period, our user was able to reach their recipient on this ISP from some other places, both of which are big email heavyweights, so it was not an issue with the recipient or with the ISP's mail system in general.)
It's not really news or a new thing that the messages you get from other people's mail servers are not necessarily telling you the real reason that your messages aren't being accepted. Many of the major mail providers seem to do it; it's been a long time since I really believed GMail's SMTP time messages, for example. We have many cases where GMail will give temporary 4xx SMTP error codes for an email for a while with various claims in the SMTP error messages, then wind up accepting it. In other cases the 'temporary' 4xx error codes stick for as long as we want to keep retrying and we eventually time out the message.
(My personal lesson learned from this incident was that I should pay more attention to our queued email, then look into things that seemed odd. At the very least I might have been able to reproduce this outside of Exim, and test it from other IPs on the same subnet and elsewhere within the university.)
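As a sketch of what 'paying more attention' could look like, the following parses SMTP replies and flags destinations that have been giving us nothing but 4xx replies for longer than some threshold. The input format, a time-ordered list of (timestamp, destination, reply line) tuples, is invented for the example; pulling that out of real Exim logs is a separate job that isn't shown here.

```python
import re
from datetime import datetime, timedelta

# Reply code, optional enhanced status code, then free text.
REPLY_RE = re.compile(r"^(\d{3})[ -](?:(\d\.\d{1,3}\.\d{1,3}) )?(.*)$")


def parse_reply(line):
    """Split an SMTP reply line into (code, enhanced code, text)."""
    m = REPLY_RE.match(line.strip())
    if not m:
        return None
    return int(m.group(1)), m.group(2), m.group(3)


def long_running_deferrals(events, threshold=timedelta(days=1)):
    """Map destinations stuck on 4xx replies to how long they've been stuck."""
    first_seen = {}
    flagged = {}
    for when, dest, line in events:
        parsed = parse_reply(line)
        if parsed is None:
            continue
        if 400 <= parsed[0] < 500:
            start = first_seen.setdefault(dest, when)
            if when - start >= threshold:
                flagged[dest] = when - start
        else:
            # Any non-4xx reply ends the streak for this destination.
            first_seen.pop(dest, None)
            flagged.pop(dest, None)
    return flagged
```

This deliberately ignores what the error text claims, since the whole point is that the text can't be trusted; only how long the 4xx streak has lasted matters.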
There are limitations to what expendable addresses can help with
I'm a long time advocate of using expendable addresses for as many things as possible (and then making sure you can turn them off). However, yesterday's incident of junk email as a cover for worse also shows some of the limitations of using expendable addresses, because they wouldn't really have avoided this situation.
The first way they wouldn't have avoided the situation (of having a flood of junk email sent to someone to distract them) is that generally expendable addresses in all of their forms still funnel into your actual mailbox. Some people sort some expendable addresses into low-priority places, but you're unlikely to do this with the email address you use for things like notifications from your financial institutions. You usually want to see those right away, not have them hidden away.
The second way they wouldn't have avoided the situation is that if someone wants to unleash a flood of email onto you to distract you, it doesn't necessarily matter what exact email address they get their hands on. All they need is some email address that goes into some mailbox that you look at regularly. It would be better to get the actual email address you use with your financial institution, but for drowning a bit of signal in a lot of noise, often many email addresses will do about as well. It doesn't even have to go to the right mailbox, just one that will cause you to drown in the volume.
(Certainly this would be the case for me. I would have an easier time of sorting things later and perhaps not missing signal amidst noise with my extensive collection of expendable addresses, but in the heat of the moment, if you clog up my inbox it doesn't really matter how.)
The one part of this sort of flood that expendable addresses will help with is the longer term aftermath. One of the iron rules of email addresses is that once some people have their hands on some email address, they will never stop emailing it. After a flood, obviously a lot of people have some email address of yours and a certain percentage of them will keep emailing that address forever. If the address they have is an expendable address that you can turn off, you can at least make them go away.
Junk email as a cover for more nefarious things
This morning, we got a call (through a Point of Contact) that one of the people here was being absolutely flooded by incoming spam and junk email. It was a real flood, too; in total they received over 1,200 email messages that made it past our anti-spam defenses, most of them over about an hour and a half (I'll let you do the math on the messages per minute rate, and then think about trying to do anything about it in a mail client). This person wound up having to basically turn off receiving external email.
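For the impatient, here is a quick worked version of that arithmetic. It treats 1,200 messages and 90 minutes as round-number assumptions; the real count was higher and not all of it arrived in that window, so this is only a ballpark.

```python
messages = 1200   # "over 1,200", so this is a floor
minutes = 90      # "about an hour and a half"

per_minute = messages / minutes            # roughly 13.3 messages a minute
seconds_per_message = 60 * minutes / messages  # 4.5 seconds per message

print(f"{per_minute:.1f} messages/minute, one every {seconds_per_message:.1f}s")
```

That is a new message roughly every four and a half seconds, sustained for an hour and a half.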
Unfortunately, this wasn't the only thing going on in that person's life this morning, because they also discovered an unauthorized financial transaction (I don't know if they found it before or after the flood started, but I suspect before). The obvious theory is that this sudden, exceptional flood of junk email is not at all a coincidence, and was instead intended to cover up a transaction notification from the financial institution involved. To abuse a phrase, if you can't stop a tree from falling, perhaps you can obscure it by clear-cutting the entire forest around it.
We rejected some of the incoming email at SMTP DATA time, which causes Exim to log some message headers. Based on these rejections and also various of the sending addresses, some of the incoming email appears to have been 'congratulations on signing up for our mailing list', 'thank you for contacting us', and so on email that could be deliberately induced by a third party who wanted to flood someone's mailbox. Other messages seem to have been genuine spam, or very likely genuine spam.
(I am sure you will be shocked to hear that Sendgrid features high up in the list of sending sources, and also the list of sources blocked because of SBL listings.)
One of the unnerving things about this incident is that the attacker clearly was highly prepared. They had at least a thousand potential sources of junk and spam email identified and lined up, ready to trigger. And it's pretty clear that the triggering was automated. Since the sources of the junk email come from all over, it seems likely that the attacker wasn't exploiting a single piece of (web) software to stuff in addresses. They probably had an entire suite of attacks against various different 'contact us' and 'subscribe me' and so on forms ready to go.
(I have no theories for how the attacker got spammers to start emailing this address so fast. Maybe there is a market for 'hot email addresses, mail them now while they last' where the purchased addresses get used basically immediately.)