Wandering Thoughts archives

2015-08-15

Spam scoring systems are often not deliberately designed

In theory, my concerns about how other people's systems will react to us DKIM-signing only some of our email have a simple answer: if we don't add DMARC information that says to react to unsigned email in some way, they should do nothing. This is the spec-compliant behavior and you'd have to be really obnoxious to decide to do otherwise. But that assumes that spam scoring systems are in fact deliberately designed, and my current belief is that the custom systems major email providers use are not deliberately designed in that sense. By that I mean that no human being sat down to write out and set up more than a small fraction of the scoring rules they use.

In today's world, one obvious path to a sophisticated spam scoring system is through various forms of statistical reasoning and machine learning (of which Bayesian spam filtering is a simple starting point). All of these techniques uncover correlations between message features and outside spam scores (as determined in various ways, such as through users telling you), and they're all blind to what those features mean as such and whether or not they 'should' be used for some purpose or interpreted in some way.

I assume that every major email provider is running such a system as part of their overall spam filtering (and there's some evidence for this in the behavior of their systems). I further assume that they're all shoveling every message feature they can get their hands on into these systems, because why not; the more features the better. I also think it's extremely likely that one of these features is DKIM information. At this point it's not particularly hard to come up with scenarios where you can objectively find correlations between things like the lack of a DKIM signature in email From: a particular domain and the likelihood of such a message being seen as spam. That there are legitimate email messages like this doesn't matter to a machine learning system any more than the fact that you're not supposed to use lack of DKIM signatures this way; all it cares about is useful correlations.
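As a rough illustration of how such a correlation can emerge without anyone intending it, here is a minimal sketch of a Bayesian-style scorer. It is not a description of any real provider's system; the feature names (`from:example.org`, `dkim:valid`, `dkim:none`) and the training counts are entirely made up for the example.

```python
import math
from collections import Counter


def feature_log_odds(messages, smoothing=1.0):
    """Return per-feature log-odds of spam versus non-spam.

    `messages` is an iterable of (features, is_spam) pairs, where
    `features` is a set of opaque strings. The learner never looks at
    what a feature name means; it only counts co-occurrence.
    """
    spam_counts, ham_counts = Counter(), Counter()
    n_spam = n_ham = 0
    for features, is_spam in messages:
        if is_spam:
            n_spam += 1
            spam_counts.update(features)
        else:
            n_ham += 1
            ham_counts.update(features)

    scores = {}
    for feat in set(spam_counts) | set(ham_counts):
        # Smoothed estimate of P(feature present | class).
        p_spam = (spam_counts[feat] + smoothing) / (n_spam + 2 * smoothing)
        p_ham = (ham_counts[feat] + smoothing) / (n_ham + 2 * smoothing)
        scores[feat] = math.log(p_spam / p_ham)
    return scores


# Hypothetical training data: most signed mail is fine, and the spam that
# forges this domain happens to be unsigned. Some legitimate mail is
# unsigned too, but the learner has no way to know that this matters.
training = (
    [({"from:example.org", "dkim:valid"}, False)] * 40
    + [({"from:example.org", "dkim:none"}, False)] * 5
    + [({"from:example.org", "dkim:none"}, True)] * 15
)

scores = feature_log_odds(training)
for feat in sorted(scores):
    print(f"{feat:20s} {scores[feat]:+.2f}")
```

With this made-up data, `dkim:none` ends up with a positive (spam-leaning) score and `dkim:valid` with a negative one, purely because of the counts. Nobody wrote a rule saying that unsigned mail should be penalized.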

No one set out to create a system that (ab)used lack of DKIM signatures this way and the generated scoring system is not deliberately designed by anyone; the most that people did was design the machine learning meta-system that trained itself on the massive collection of accumulated message data in order to create the generated scoring system. No one understands the generated rules (even Bayesian systems are hard to peer into, never mind more sophisticated approaches) and so no one can even consider auditing them for things that shouldn't be done.

The only way to avoid having some message feature inadvertently become part of a signal deep inside a machine learning system is to exclude it. I can't make GMail's and Hotmail's and Yahoo's spam filtering systems exclude DKIM signature information from the set of message features that they train their systems on. The best I can do is not provide them with the signal in the first place by never doing DKIM signatures, making all of our email identical in this.
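For whoever runs the training system, exclusion is conceptually simple. The sketch below assumes a hypothetical setup in which features are plain strings and DKIM-derived ones share a `dkim:` prefix; real systems will represent features differently. The point is only that the filtering has to happen on the receiving side, before training, which is exactly the part a sender has no control over.

```python
def strip_excluded(features, excluded_prefixes=("dkim:",)):
    """Return a copy of `features` without any excluded feature.

    `excluded_prefixes` is a hypothetical naming convention for this
    sketch: every feature derived from DKIM starts with "dkim:".
    """
    prefixes = tuple(excluded_prefixes)
    return {f for f in features if not f.startswith(prefixes)}


message_features = {"from:example.org", "dkim:none", "has-attachment"}
print(sorted(strip_excluded(message_features)))
```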

(Of course, by doing so I'm also sending a signal, namely the total lack of any DKIM signatures for our domains. At the moment this seems like a less dangerous signal to send for various reasons.)

(I said a much shorter version of this in a comment on my previous entry, but I feel like writing it out in full as an entry.)

SpamScoringNotDeliberate written at 01:56:39

2015-08-14

My current views on using DomainKeys (DKIM) here

Almost five years ago I wrote about my then-new view of DKIM and how we might someday use it ourselves when we'd updated our mailers enough. Well, the mailers have been updated for a while and not only aren't we using DKIM, I'm not inclined to do so any time soon. Prompted by someone here asking for my opinions on DKIM today, here are my current views.

As far as inbound email goes, I've experimented with a Thunderbird extension to verify DKIM signatures, which showed me that a bunch of perfectly good email gets either warnings or outright failures. Given this result it's clear that our inbound mail gateway can't do anything active with DKIM results, like start rejecting or visibly marking such email; the false positives would swamp any genuine benefit or signal that might be present.

In terms of spam and DKIM, I've seen plenty of spam that has DKIM signatures (and I assume they're valid ones). I've also seen plenty that doesn't. If DKIM data provides some sort of useful signal about spam versus non-spam for email, making use of it is best left up to the black box commercial anti-spam system that we use.

(DKIM does have some clear use in anti-spam stuff since it's a component of DMARC and some people are actively using DMARC these days. But for a collection of reasons we're not going to start enforcing other people's DMARC policies on our inbound mail gateway, although the anti-spam system may take that into account when it scores email.)

For outgoing email, my major concern remains what it was before, namely how other people's systems will behave. I simply don't know how other systems will react to three kinds of email: our validly DKIM-signed email, email that we DKIM-signed but that then got changed in transit, and email 'From:' us but without a DKIM signature from us. Without confidence that adding DKIM signing will be harmless, I don't feel any enthusiasm for doing so. At this point I'd probably only enable DKIM if there was some significant recipient system that started more or less demanding that we provide it in order to get our email delivered to them.
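To make the distinction concrete, here is a minimal sketch of how a receiving system could tell "has a signature from this domain" apart from "has no signature from this domain", using only Python's standard `email` module. It reads the `d=` (signing domain) tag of any `DKIM-Signature` headers and does not verify anything cryptographically, so it cannot tell a valid signature from one that was broken in transit; that would need a real DKIM verifier. The sample messages and the `example.org` domain are made up.

```python
from email import message_from_string


def signing_domains(raw_message):
    """Return the set of d= domains named in DKIM-Signature headers.

    This only parses the tag list; it does NOT check that any
    signature is valid.
    """
    msg = message_from_string(raw_message)
    domains = set()
    for header in msg.get_all("DKIM-Signature", []):
        for tag in str(header).split(";"):
            name, sep, value = tag.partition("=")
            if sep and name.strip() == "d":
                domains.add(value.strip().lower())
    return domains


def has_our_signature(raw_message, our_domain):
    """True if some DKIM-Signature header claims to be from our_domain."""
    return our_domain.lower() in signing_domains(raw_message)


# Hypothetical messages with placeholder signature data.
signed = (
    "DKIM-Signature: v=1; a=rsa-sha256; d=example.org; s=sel;\n"
    " h=from:subject; bh=AAAA; b=BBBB==\n"
    "From: someone@example.org\n"
    "Subject: test\n"
    "\n"
    "body\n"
)
unsigned = "From: someone@example.org\nSubject: test\n\nbody\n"

print(has_our_signature(signed, "example.org"))
print(has_our_signature(unsigned, "example.org"))
```

Mail sent 'From:' our addresses through someone else's machines would fall into the second case, which is what makes the presence or absence of the header a potential signal.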

(I'm sure that e.g. GMail would like us to start doing DKIM signing, but that they'd like us to do that is exactly why I don't want to. Almost anyone who actively cares about us doing DKIM is going to use it as input into a spam scoring system, and since we consider it fully valid to send email From: our addresses but not through our machines, the last thing I want to do is enable that particular signal.)

DKIMViewII written at 01:50:41

