We need a way to scan Microsoft Office files for malware

March 28, 2022

For reasons beyond the scope of this entry, for the past couple of years I've been running a large commercial anti-spam system (and its malware recognition) side by side with what we could put together with ClamAV and some low-cost commercial ClamAV signature sources. Since the commercial anti-spam system is on the way out, one of the things I keep an eye on is what it detects as malware that ClamAV misses (and then I try to figure out if there's some message signature we can use to block it, like a .scr file inside a .7z attachment). More or less from the beginning and continuing on through the last time I mentioned this, one significant area where the commercial system is better is detecting bad stuff in Microsoft Office files.

(The commercial system has also picked up stuff in PDFs that ClamAV doesn't. In general it feels like it's better at finding bad stuff in complex and nested file formats, but I haven't looked at this closely.)

With the end of service life of the commercial software getting closer and closer, my feeling that we should actively try to do something about this is getting stronger and stronger. We unfortunately can't completely block Microsoft Office macros, which I understand are one of the big vectors, because some of our users do get legitimate email with them included. There are probably other vectors as well. As far as I know, the only good open source tool for scanning Microsoft Office files is the oletools Python package, and conveniently we're already scanning email with a Python program.

Oletools has some support for identifying Microsoft Office files with 'bad stuff', but I believe it's partly in the form of a command line tool, mraptor, which has no API documentation for using it as a package. Now that I look more closely, there's also oleid and olevba. The command line tools don't look like they have an output format that's good for script usage, although I may not be looking closely enough at their options. If people have wrapped these up in canned tools that scan an attachment and give you an indicator of how bad it is, I haven't been able to find them in some Internet searches.
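
(For what it's worth, olevba does come with a documented Python API in the form of its VBA_Parser class, which sidesteps the command line output format. What follows is only a minimal sketch of using it on the raw bytes of an attachment, not a tested scanner. The function names and the shape of the returned dictionary are made up for illustration, and it assumes that analyze_macros() returns (type, keyword, description) tuples as the oletools documentation describes.)

```python
from collections import Counter


def summarize_findings(findings):
    """Count olevba findings by their type.

    `findings` is assumed to be a list of (type, keyword, description)
    tuples, the shape that VBA_Parser.analyze_macros() is documented
    to return (types are strings such as 'AutoExec' or 'Suspicious').
    """
    return dict(Counter(kind for kind, _keyword, _description in findings))


def scan_office_attachment(data, filename="attachment"):
    """Return a small dict describing the VBA macros in one attachment.

    `data` is the raw bytes of the attachment. This function and its
    return format are hypothetical; only VBA_Parser and its methods
    come from oletools.
    """
    # Imported here so that the pure-Python helper above can be used
    # (and tested) on a machine without oletools installed.
    from oletools.olevba import VBA_Parser

    # VBA_Parser raises an exception for files it cannot parse, so a
    # real caller would want to catch that and record it as well.
    parser = VBA_Parser(filename, data=data)
    try:
        if not parser.detect_vba_macros():
            return {"has_macros": False, "findings": {}}
        findings = parser.analyze_macros()
        return {"has_macros": True, "findings": summarize_findings(findings)}
    finally:
        parser.close()
```

The result is deliberately a set of counts rather than a verdict, since deciding what counts as 'bad' is exactly the part that isn't known yet.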

Right now one issue is the same one we had with attachment types, where we didn't know what sort of attachments our users got, both in legitimate email and in spam. Today we don't know what sorts of things are in the Microsoft Office files our users receive. How prevalent are macros, embedded OLE objects, macros with suspicious attributes, and so on? Since it seems unlikely we'll be able to get a Microsoft Office scanning tool (either open source or commercial) that gives us a carefully curated 'good' or 'bad' answer, we're going to have to work that out based on our usage patterns, and that means learning what the usage patterns are.

So probably the first thing I need to do is make our attachment scanning program more complicated by having it use oletools to analyze Microsoft Office files and record information about them, just as we record file extension information for files in archives.
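
(As a rough illustration of what 'record information' could mean here, the following sketch turns per-file analysis results into a one-line log record and tallies them across many files. Everything in it is hypothetical: the record format, the function names, and the assumption that the analysis step yields a mapping from finding type to count for each file.)

```python
from collections import Counter


def format_office_record(filename, has_macros, finding_counts):
    """Build a one-line log record for a single Microsoft Office file.

    `finding_counts` maps a finding type (for example 'AutoExec') to how
    many times it was seen in the file. Filenames containing spaces
    would need quoting in a real log format; this sketch ignores that.
    """
    parts = ["office-file", filename, "macros=" + ("yes" if has_macros else "no")]
    for kind in sorted(finding_counts):
        # Finding types may contain spaces, so make them log-friendly.
        parts.append("%s=%d" % (kind.replace(" ", "-"), finding_counts[kind]))
    return " ".join(parts)


def tally_records(records):
    """Count how many files had macros and each type of finding.

    `records` is an iterable of (has_macros, finding_counts) pairs,
    one pair per Microsoft Office file that was analyzed.
    """
    totals = Counter()
    for has_macros, finding_counts in records:
        totals["files"] += 1
        if has_macros:
            totals["files-with-macros"] += 1
        for kind in finding_counts:
            # Count files, not individual findings, to measure prevalence.
            totals["files-with-" + kind] += 1
    return dict(totals)
```

Counting files rather than individual findings is a choice aimed at the prevalence question: what matters first is how many legitimate files would be caught by a given rule, not how many suspicious keywords a single file contains.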

(I would dearly love to be able to pay someone for this, but as far as I know there's nothing available. Paying other people for malware detection is in my opinion better than trying to do it myself, partly because I'm never going to be a full time specialist at this and there's some chance that the people we pay will be.)


Last modified: Mon Mar 28 21:58:45 2022