Some pragmatics of blackbox and whitebox malware filtering
We've been looking into using ClamAV as part of our anti-spam and anti-malware filtering here, either in addition to our current commercial filter or as a replacement for it. While we've been operating ClamAV, we've been looking into what its signatures actually match, and discovering things such as that third party ClamAV signatures include a lot of signatures for phish and spam.
(Depending on the form of the email, some or a lot of the phish and spam is also recognized as 'malware' by our commercial filter.)
One of the differences between ClamAV and our commercial filter is that ClamAV's signature formats are documented, and a lot of signatures are for content matches that are relatively simple to decode. Once I read up on ClamAV's signature formats, I started looking at what various matched signatures were actually matching against and discovered that many of them were relatively simple and straightforward text chunks. Some of these chunks could plausibly show up in email that merely quotes parts of spam messages or mentions things that are spam signatures.
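To give a feel for what 'simple to decode' means, here is a minimal Python sketch that pulls the readable text out of one signature in ClamAV's extended (.ndb) format, where each line is 'Name:TargetType:Offset:HexBody'. The signature name and its text are made up for illustration, the function names are my own, and the wildcard handling is an assumption that covers only the common cases ('??', nibble wildcards, '*', and '{n-m}' style gaps), not the whole signature language.

```python
import re

HEX_DIGITS = set("0123456789abcdefABCDEF")
PRINTABLE = re.compile(rb"[\x20-\x7e]{4,}")  # runs of 4+ printable ASCII bytes

def parse_ndb_line(line):
    """Split an extended-format signature line into its first four fields."""
    name, target, offset, body = line.strip().split(":")[:4]
    return name, target, offset, body

def literal_runs(body):
    """Yield the literal byte runs that sit between wildcards in a hex body."""
    # Collapse {n}, {n-m} and [n-m] style gaps into a single separator.
    body = re.sub(r"\{[^}]*\}|\[[^\]]*\]", "*", body)
    run = bytearray()
    i = 0
    while i < len(body):
        pair = body[i:i + 2]
        if len(pair) == 2 and pair[0] in HEX_DIGITS and pair[1] in HEX_DIGITS:
            run.append(int(pair, 16))
            i += 2
            continue
        # Anything that is not a plain hex byte ends the current run.
        if run:
            yield bytes(run)
            run = bytearray()
        # Separators such as '*', '(', '|' are one character wide; wildcard
        # bytes such as '??', 'a?' and '?a' are two characters wide.
        i += 2 if (pair[0] in HEX_DIGITS or pair[0] == "?") else 1
    if run:
        yield bytes(run)

def readable_text(body):
    """Return the printable ASCII chunks embedded in a hex signature body."""
    chunks = []
    for run in literal_runs(body):
        chunks.extend(m.decode("ascii") for m in PRINTABLE.findall(run))
    return chunks

# A made-up signature (not from any real database): target type 4 is mail,
# offset '*' means 'anywhere', and the body is two text chunks with a gap.
sample = ("Example.Spam.Demo-1:4:*:"
          + b"Your account".hex() + "{-10}" + b"has been suspended".hex())

name, target, offset, body = parse_ndb_line(sample)
print(name, readable_text(body))
```

Run against a real third party database, something like this is enough to see which signatures are little more than a phrase or two of text.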
On the surface, this didn't seem great when I first started to understand it. Then I thought about it more. The first thing is that we don't actually know how the malware signatures of our commercial filter work, and because we don't know, we have very little idea if they're actually any more detailed than ClamAV's signatures. Because our commercial filter is a black box, we have to take all of its work on trust (and we mostly have been); we have no way to peer inside and see on what grounds (solid or otherwise) it's making its decisions. In theory, if ClamAV's signatures are accurate (especially in not incorrectly rejecting legitimate email as tainted by malware), the specifics of ClamAV's signatures don't matter. If a short text snippet really is a basically invariable sign of bad stuff, well, there it is.
(This also ties into the pragmatics of classifying phish spam as malware.)
But the second thing is that spam filtering is not a purely technical problem, and because of that, how much you know about how your spam filtering operates can matter in the squishy real world that we actually inhabit. When you know how ClamAV malware signatures work, can inspect them, and have found some that may be really quite simple text chunks, people may hold you to a higher standard for dealing with false positives than if you're simply operating a black box. Even if you don't look in advance, ClamAV's nature means that you could have, and people may hold you somewhat responsible for not doing so in a way that they won't for a black box system.
In short, in a black box system you can wash your hands in a way that you can't with a white box system. You don't know and you couldn't have known. Of course this isn't always a good thing; if your black box system misfires, you may not be able to understand why and do anything about it in the way that you can with a white box system.
(One conclusion for us is that I should figure out a way to control which ClamAV signature names we pay attention to and which ones we ignore. Probably I can do this in Exim with some control files for on the fly alterations.)
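As a rough illustration of the decision logic only (this is not Exim configuration), here is a small Python sketch of a control file of signature name patterns to ignore. The control file format, the function names, and the signature names are all hypothetical; the one thing taken from ClamAV itself is that it reports third party signatures with a '.UNOFFICIAL' suffix.

```python
from fnmatch import fnmatchcase

def load_patterns(text):
    """Parse control file text: one glob pattern per line, '#' starts a comment."""
    patterns = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            patterns.append(line)
    return patterns

def should_reject(signature_name, ignore_patterns):
    """Return True unless the matched signature name is on the ignore list."""
    return not any(fnmatchcase(signature_name, p) for p in ignore_patterns)

# Hypothetical control file contents; in practice this would be read from
# disk on every message so that edits take effect immediately.
control_file = """
# signature families we have decided not to act on
Example.Spam.*
*.Phishing.Generic.*
"""

ignore = load_patterns(control_file)
print(should_reject("Example.Spam.Demo-1.UNOFFICIAL", ignore))
print(should_reject("Example.Malware.Trojan-7.UNOFFICIAL", ignore))
```

An ignore list means that new signature families are acted on by default; an allow list would be the more cautious choice if false positives worry you more than missed malware.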