Some thoughts on the pragmatics of classifying phish spam as malware
We are currently looking at how ClamAV would do at checking our incoming email for malware and other things that it has signatures for. As I noticed once we started doing this, third party ClamAV signatures seem to include a lot of phish and other spam. One of the questions this raises is whether this matters on a pragmatic level, and how.
On the one hand, I am a big believer in the idea that most people don't care about how you break up the various components of your spam filtering and what you call them. If your spam filter works they are happy to ignore the whole issue, and if it doesn't work they're going to be unhappy no matter what you call the bits and what excuses you put forward (including the idea of 'spam levels'). At this level, it doesn't matter that ClamAV signatures recognize phish and other spam provided that the signatures work, which mostly means that they don't have false positives.
On the other hand, you don't want to deal with malware and viruses in the same way that you can deal with spam. Because malware and viruses are actively dangerous as opposed to simply being annoying the way spam is, I feel that it's unsafe to simply mark email as having malware and pass it through intact in the way that it's common to mark spam. If you detect malware, you want to reject the entire message because in modern email, malware in a message means the entire thing is bad. This means that people have no chance to retrieve a false positive email from their filters; false positives are gone for good (although the sender will find out about it).
As a corollary, you can offer people a lot more options about what to do about 'spam' email than you can for 'malware' email, and you can let people adjust these options themselves as their views on the quality of your spam filtering change. When you classify what is really 'spam' as 'malware', you take away this potential control and flexibility from people.
At the same time, phish spam can be actively dangerous in somewhat the same way that malware is; they can both lead to compromised machines and accounts, even if the mechanism is different. If what we care about is danger, we should probably reject phish spam that we can recognize (assuming no false positive risk). In this view, third party ClamAV signatures including phish spam is a good thing, as is rejecting email at SMTP time when the signatures match.
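To make the policy I'm describing concrete, here is a minimal sketch in Python of how a mail system might turn a scanner verdict into an action. Everything in it is an assumption for illustration: the `Action` names, the `action_for` function, and especially the idea that you can tell phish and spam signatures apart by fragments of their names. Real third party signature names vary by provider and would have to be checked before relying on anything like this.

```python
from enum import Enum


class Action(Enum):
    REJECT = "reject at SMTP time"
    TAG = "tag as spam and deliver"
    ACCEPT = "deliver normally"


# Hypothetical name fragments, chosen only for this example.
# Actual third party signature naming differs between providers.
DANGEROUS_HINTS = ("phish",)
SPAM_HINTS = ("spam", "junk")


def action_for(signature):
    """Pick an action from a matched signature name (None means no match)."""
    if signature is None:
        return Action.ACCEPT
    name = signature.lower()
    # Phish is treated as dangerous, so it gets the same handling as malware.
    if any(hint in name for hint in DANGEROUS_HINTS):
        return Action.REJECT
    # Ordinary spam is only annoying, so people can still retrieve it.
    if any(hint in name for hint in SPAM_HINTS):
        return Action.TAG
    # Anything else that matched is assumed to be real malware.
    return Action.REJECT


if __name__ == "__main__":
    for sig in (None, "Example.Phishing.Bank.1234", "Example.Junk.5678",
                "Win.Trojan.Example-1"):
        print(sig, "->", action_for(sig).value)
```

The design choice here is that only plain spam keeps the 'mark it and let people decide' treatment; anything that is dangerous, or that we can't classify, is rejected outright.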
Finally, if we assume that the ClamAV signatures for recognizing spam are carefully created, they should be a much stronger signal about the quality of an email message than various spam scoring heuristics are. Heuristics are necessarily uncertain, while 'this matches a spam sample' is much less so (although there are nuances, like people sending complaints to your abuse address that include the spam they got from you). My guess is that a lot of the risk of false positives in ClamAV spam signatures comes down to some combination of quoted email and just how much text from the message the signature incorporates.
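The quoted email issue can be illustrated with a toy model. This is not how ClamAV signatures actually work; it simply treats a 'signature' as a run of bytes that must appear somewhere in the message, and the sample messages, the `matches` function, and the `example.invalid` URL are all made up for the example.

```python
def matches(signature: bytes, message: bytes) -> bool:
    """Toy matcher: the signature is a byte string that must appear verbatim."""
    return signature in message


# A made-up spam message.
spam = (b"Subject: Verify your account\n\n"
        b"Your mailbox is over quota. "
        b"Click http://example.invalid/verify to keep it.\n")

# A made-up complaint to an abuse address that quotes the spam line by line.
quoted = b"".join(b"> " + line + b"\n" for line in spam.splitlines())
complaint = (b"Subject: Spam from your network\n\n"
             b"One of your users sent me this:\n\n" + quoted)

# A signature built from a small fragment versus one built from the whole sample.
short_sig = b"http://example.invalid/verify"
whole_sig = spam

if __name__ == "__main__":
    print("short signature vs spam:     ", matches(short_sig, spam))
    print("short signature vs complaint:", matches(short_sig, complaint))
    print("whole signature vs spam:     ", matches(whole_sig, spam))
    print("whole signature vs complaint:", matches(whole_sig, complaint))
```

In this toy model the short fragment matches both the spam and the complaint that quotes it, while the whole-sample signature matches only the original spam, because the quoting prefixes break up the text. The less of the message a signature incorporates, the more likely it is to also hit legitimate email that merely mentions or quotes the spam.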
(This is where I should read up on how ClamAV signatures work so that I can try to look at what some of these third party ClamAV signatures are really matching. Ideally they will be matching entire spam samples, but the reality may not live up to this.)