2015-08-15
Spam scoring systems are often not deliberately designed
In theory, my concerns about how other people's systems will react to us DKIM-signing only some of our email have a simple answer: if we don't add DMARC information that says to react to unsigned email in some way, they should do nothing. This is the spec-compliant behavior and you'd have to be really obnoxious to decide to do otherwise. But that assumes that spam scoring systems are in fact deliberately designed, and my current belief is that the custom systems major email providers use are not deliberately designed in that sense. By that I mean that no human being sat down to write out and set up more than a small fraction of the scoring rules they use.
In today's world, one obvious path to a sophisticated spam scoring system is through various forms of statistical reasoning and machine learning (of which Bayesian spam filtering is a simple starting point). All of these techniques uncover correlations between message features and outside spam scores (as determined in various ways, such as through users telling you), and they're all blind to what those features mean as such and whether or not they 'should' be used for some purpose or interpreted in some way.
I assume that every major email provider is running such a system
as part of their overall spam filtering (and there's some evidence
for this in the behavior of their systems). I further assume that
they're all shoveling every message feature they can get their hands
on into these systems, because why not; the more features the better.
I also think it's extremely likely that one of these features is
DKIM information. At this point it's not particularly hard to come
up with scenarios where you can objectively find correlations between
things like the lack of a DKIM signature in email From: a particular
domain and the likelihood of such a message being seen as spam.
That there are legitimate email messages like this doesn't matter
to a machine learning system any more than the fact that you're not
supposed to use lack of DKIM signatures this way; all it cares about
is useful correlations.
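To make the mechanism concrete, here is a minimal sketch of a naive Bayes style learner picking up this sort of correlation. It is an illustration only, not how any real provider's system works; the feature names ('dkim:none', 'dkim:pass', 'from:example.org'), the function name, and the training counts are all made up for the example.

```python
import math
from collections import Counter

def train_feature_weights(messages, smoothing=1.0):
    """Learn a log-odds weight per feature from labeled messages.

    messages is an iterable of (set_of_features, is_spam) pairs.
    A positive weight means the feature is seen more often in spam.
    The learner has no idea what any feature means.
    """
    spam_counts, ham_counts = Counter(), Counter()
    n_spam = n_ham = 0
    for features, is_spam in messages:
        if is_spam:
            n_spam += 1
            spam_counts.update(features)
        else:
            n_ham += 1
            ham_counts.update(features)

    weights = {}
    for feat in set(spam_counts) | set(ham_counts):
        # Laplace-smoothed estimates of P(feature | spam) and P(feature | ham).
        p_spam = (spam_counts[feat] + smoothing) / (n_spam + 2 * smoothing)
        p_ham = (ham_counts[feat] + smoothing) / (n_ham + 2 * smoothing)
        weights[feat] = math.log(p_spam / p_ham)
    return weights

# Hypothetical training data: most legitimate mail From: this domain is
# signed, some legitimate mail is not, and most spam forging the domain
# is unsigned.
training = (
    [({"from:example.org", "dkim:pass"}, False)] * 90
    + [({"from:example.org", "dkim:none"}, False)] * 10
    + [({"from:example.org", "dkim:none"}, True)] * 40
    + [({"from:example.org", "dkim:pass"}, True)] * 2
)

weights = train_feature_weights(training)
for feat, weight in sorted(weights.items()):
    print(f"{feat:20s} {weight:+.2f}")
```

With these invented numbers, 'dkim:none' ends up with a positive (spam-leaning) weight and 'dkim:pass' with a negative one, even though ten of the unsigned messages were perfectly legitimate. Nobody wrote a rule saying 'penalize unsigned mail'; it simply fell out of the counts.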
No one set out to create a system that (ab)used lack of DKIM signatures this way and the generated scoring system is not deliberately designed by anyone; the most that people did was design the machine learning meta-system that trained itself on the massive collection of accumulated message data in order to create the generated scoring system. No one understands the generated rules (even Bayesian systems are hard to peer into, never mind more sophisticated approaches) and so no one can even consider auditing them for things that shouldn't be done.
The only way to avoid having some message feature inadvertently become part of a signal deep inside a machine learning system is to exclude it. I can't make GMail's and Hotmail's and Yahoo's spam filtering systems exclude DKIM signature information from the set of message features that they train their systems on. The best I can do is not provide them with the signal in the first place by never doing DKIM signatures, making all of our email identical in this respect.
(Of course, by doing so I'm also sending a signal, namely the total lack of any DKIM signatures for our domains. At the moment this seems like a less dangerous signal to send for various reasons.)
(I said a much shorter version of this in a comment on my previous entry, but I feel like writing it out in full as an entry.)
2015-08-14
My current views on using DomainKeys (DKIM) here
Almost five years ago I wrote about my then-new view of DKIM and how we might someday use it ourselves when we'd updated our mailers enough. Well, the mailers have been updated for a while and not only aren't we using DKIM, I'm not inclined to do so any time soon. Prompted by someone here asking for my opinions on DKIM today, here are my current views.
As far as inbound email goes, I've experimented with a Thunderbird extension to verify DKIM signatures, which showed me that a bunch of perfectly good email gets either warnings or outright failures. Given this result it's clear that our inbound mail gateway can't do anything active with DKIM results, like start rejecting or visibly marking such email; the false positives would swamp any genuine benefit or signal that might be present.
In terms of spam and DKIM, I've seen plenty of spam that has DKIM signatures (and I assume they're valid ones). I've also seen plenty that doesn't. If DKIM data provides some sort of useful signal about spam versus non-spam for email, making use of it is best left up to the black box commercial anti-spam system that we use.
(DKIM does have some clear use in anti-spam stuff since it's a component of DMARC and some people are actively using DMARC these days. But for a collection of reasons we're not going to start enforcing other people's DMARC policies on our inbound mail gateway, although the anti-spam system may take that into account when it scores email.)
For outgoing email, my major concern remains what it was before, namely how other people's systems will behave. I simply
don't know how other systems will react to all of our valid DKIM
signed email, email we DKIM signed but that then got changed in
transit, and email 'From:' us but without a DKIM signature from
us. Without confidence that adding DKIM signing will be harmless,
I don't feel any enthusiasm for doing so. At this point I'd probably
only enable DKIM if there was some significant recipient system
that started more or less demanding that we provide it in order to
get our email delivered to them.
(I'm sure that e.g. GMail would like us to start doing DKIM signing,
but that they'd like us to do that is exactly why I don't want to.
Almost anyone who actively cares about us doing DKIM is going to
use it as input into a spam scoring system, and since we consider
it fully valid to send email From: our addresses but not through
our machines, the last thing I want to do is enable that particular
signal.)