SpamStatsComplications written at 02:30:44
Complications in spam filter stats in our environment
It seems useful to generate some kind of periodic stats on how our spam filtering system is doing, something like the one-off stats I've done before. But this raises a question: what actual numbers are meaningful, and how do we get them?
There are two problems, one general and one pragmatic. The general
problem is the difference between counting per-recipient events and
per-message events. Many of our spam filtering things happen only to
a single recipient and only get logged that way; others happen to an
entire message and are logged that way. Even if and when you can go
from per-recipient logs to per-message logs, you need to decide how to
count multi-recipient messages where what happened to the recipients
differs.
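To make the counting question concrete, here is a minimal Python sketch. It assumes we already have per-recipient events that carry a message identifier, which is exactly the part that is hard to get in practice. The event tuples, the outcome names and the two counting policies ("any" and "all") are all made up for illustration; they are not something our system produces.

```python
from collections import defaultdict

# Hypothetical per-recipient events: (message_id, recipient, outcome).
# Real logs would have to be parsed and correlated to get this far.
events = [
    ("m1", "a@example.org", "accepted"),
    ("m1", "b@example.org", "rejected"),
    ("m2", "c@example.org", "rejected"),
    ("m3", "d@example.org", "accepted"),
]

def outcomes_by_message(events):
    """Collapse per-recipient events into the set of outcomes per message."""
    by_msg = defaultdict(set)
    for msg_id, _recipient, outcome in events:
        by_msg[msg_id].add(outcome)
    return dict(by_msg)

def count_rejected_messages(events, policy):
    """Count 'rejected' messages under an explicit policy.

    policy="any": a message counts if at least one recipient was rejected.
    policy="all": a message counts only if every recipient was rejected.
    """
    total = 0
    for outcomes in outcomes_by_message(events).values():
        if policy == "any":
            hit = "rejected" in outcomes
        elif policy == "all":
            hit = outcomes == {"rejected"}
        else:
            raise ValueError(f"unknown policy: {policy!r}")
        total += hit
    return total

print(count_rejected_messages(events, "any"))
print(count_rejected_messages(events, "all"))
```

The same events give different totals depending on the policy, because message m1 had one accepted and one rejected recipient. Whichever policy you pick has to be stated alongside the numbers for them to mean anything.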
The pragmatic problem is that Exim's logs make it very difficult
(and perhaps impossible) to do this per-recipient to per-message
reconstruction in the first place, at least with our current logging
options. Exim just doesn't log enough information when it rejects an
(I sympathize with Exim somewhat, because the two sets of log messages
are produced at completely different levels. Exim's main logging is
focused on recording information about message routing and delivery,
and this whole routing process only starts happening once the message
has been accepted (eg, SMTP
What this means to me is that it's not really possible to count how
many messages we reject at SMTP time. I can count rejected recipients,
how many messages were rejected at
(I wouldn't be surprised if single-recipient email was by far the norm today. I can easily see how many accepted recipients we had on average, but that's not the same thing as how many total recipients.)
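Counting rejected recipients is the easy part. A rough Python sketch of it follows, assuming the rejections show up in the Exim main log as lines containing `rejected RCPT <address>:`. The sample lines are invented for illustration, and the exact layout of real lines depends on the Exim version and logging options.

```python
import re

# Matches the 'rejected RCPT <addr>:' portion of a rejection log line.
# Assumption: rejections are logged in this general shape.
REJECTED_RCPT = re.compile(r"\brejected RCPT <?([^>\s:]+)>?:")

# Invented sample lines; hostnames, addresses and the message ID are fake.
sample_log = """\
2012-05-11 02:30:44 H=(mail.example.net) [192.0.2.10] F=<x@example.net> rejected RCPT <a@example.org>: rejected by policy
2012-05-11 02:30:45 H=(mail.example.net) [192.0.2.10] F=<x@example.net> rejected RCPT <b@example.org>: rejected by policy
2012-05-11 02:31:02 1SSabc-0001Xy-2B <= y@example.com H=mx.example.com [198.51.100.7] P=esmtp S=1234
"""

def count_rejected_recipients(lines):
    """Count log lines that record a rejected RCPT TO."""
    return sum(1 for line in lines if REJECTED_RCPT.search(line))

print(count_rejected_recipients(sample_log.splitlines()))
```

Note that the two rejection lines in the sample carry no message ID, so nothing in them says whether they were one message with two recipients or two separate messages. That is the reconstruction problem in miniature.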
(This is one of those entries where I don't wind up with an answer after writing it, just more questions.)
CSLabSpamFilteringII written at 14:21:42
How CSLab currently does server side email anti-spam stuff (version 2)
What I wrote about the Computer Science department's spam filtering back in 2007 is still broadly correct, but as you might expect, several years have changed some of the details and added some things. I'm not going to repeat stuff from the original here, just supplement it with some additional notes that are current as of May 2012.
Most of the server side anti-spam stuff we do happens on our external MX gateway. These days it does a number of anti-spam related things:
After this the message is delivered to our central email
machine for actual
processing and delivery and so on. If it was tagged as spam,
this may lead to it getting automatically discarded by the mail
system due to things like our special
We deliberately expose only very generic and high-level server side spam filtering options to our users; for each of their addresses they can opt for 'moderate' or 'strong' filtering, with or without greylisting. Being generic means that we preserve our freedom to evolve just what each level of filtering does over time in a way that we wouldn't have if users had, for example, specifically opted in to or out of 'reject email if the sending IP is in zen.spamhaus.org'. We make only relatively generic promises about what each level does; the most important one is that moderate spam filtering always rejects at SMTP time so that if it misfires the sender knows about it.
(We do document what each level of filtering currently does, but we also specifically document that this can change, and that if you opt in to one of them you accept that the details may change over time.)
At a mailer level things are much more broken down, which means that we can hand-manipulate what filtering happens for specific addresses at a relatively specific level of detail. We use this power to apply special filtering (for example, strong milter filtering only) to some addresses. It's possible that we should expose some of this power to users, but doing so would present users with more and more choice complexity and also add constraints on our ability to evolve things in the future.
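One way to picture the split between generic user-facing levels and fine-grained mailer-level controls is the following Python sketch. Everything in it is hypothetical: the `MailerChecks` fields, what each level turns on, and the override mechanism are illustrations of the idea, not a description of our actual configuration.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class MailerChecks:
    """Hypothetical fine-grained switches that exist at the mailer level."""
    dnsbl_reject: bool   # reject if the sending IP is in a DNS blocklist
    milter_level: str    # "none", "moderate" or "strong"
    greylist: bool

# What users see: two generic levels. The contents are invented and can
# be changed later without users having to change their choice.
LEVELS = {
    "moderate": MailerChecks(dnsbl_reject=True, milter_level="moderate", greylist=False),
    "strong": MailerChecks(dnsbl_reject=True, milter_level="strong", greylist=False),
}

def checks_for(level, greylisting, overrides=None):
    """Expand a user's generic choice into mailer-level checks.

    overrides is an optional dict of hand-applied per-address changes.
    """
    checks = replace(LEVELS[level], greylist=greylisting)
    if overrides:
        checks = replace(checks, **overrides)
    return checks

# An ordinary user choice: strong filtering with greylisting.
print(checks_for("strong", True))

# A hand-manipulated address: strong milter filtering only.
print(checks_for("strong", False, {"dnsbl_reject": False}))
```

Users only ever name a level and a greylisting choice, so the table behind `LEVELS` can evolve freely. Exposing the individual fields to users would freeze them in place, which is the tradeoff described above.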
(At the same time we do want to offer users options that match the choices they want to make. One big question is what those choices are; we don't really know how our users think about spam filtering and so on.)