2012-05-27
Complications in spam filter stats in our environment
It seems useful to generate some kind of periodic stats on how our spam filtering system is doing, something like the one-off stats I've done before. But this raises a question: what actual numbers are meaningful, and how do we get them?
There are two problems, one general and one pragmatic. The general
problem is the difference between counting per-recipient events and
per-message events. Many of our spam filtering things happen only to
a single recipient and only get logged that way; others happen to an
entire message and are logged that way. Even if and when you can go
from per-recipient logs to per-message logs, you still need to decide
how to count multi-recipient messages where different things happened
to different recipients.
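To make the counting choice concrete, here is a small Python sketch. The event records, the outcome names and the two policies ("any" and "all") are all made up for illustration; nothing in our logs comes in this form.

```python
from collections import Counter

# Hypothetical per-recipient outcomes, keyed by a made-up message id.
events = [
    ("msg1", "a@example.org", "accepted"),
    ("msg2", "a@example.org", "rejected"),
    ("msg2", "b@example.org", "accepted"),
    ("msg3", "c@example.org", "rejected"),
    ("msg3", "d@example.org", "rejected"),
]

def per_recipient_counts(events):
    """Count every recipient outcome separately."""
    return Counter(outcome for _msg, _rcpt, outcome in events)

def per_message_counts(events, policy="any"):
    """Collapse recipient outcomes into one outcome per message.

    policy="any": a message counts as rejected if any recipient was rejected.
    policy="all": a message counts as rejected only if every recipient was.
    """
    by_msg = {}
    for msg, _rcpt, outcome in events:
        by_msg.setdefault(msg, []).append(outcome == "rejected")
    combine = any if policy == "any" else all
    return Counter(
        "rejected" if combine(flags) else "accepted"
        for flags in by_msg.values()
    )
```

The same five events give three different pictures: three rejected recipients out of five, two rejected messages out of three under "any", and one rejected message out of three under "all". Which one is the meaningful number depends on what question the stats are supposed to answer.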
The pragmatic problem is that Exim's logs make it very difficult
(and perhaps impossible) to do this per-recipient to per-message
reconstruction in the first place, at least with our current logging
options. Exim just doesn't log enough information when it rejects an
RCPT TO to associate it with the log messages for when it processes
the whole message with its accepted recipients (if any).
(I sympathize with Exim somewhat, because the two sets of log messages
are produced at completely different levels. Exim's main logging is
focused on recording information about message routing and delivery,
and this whole routing process only starts happening once the message
has been accepted (eg, SMTP DATA has succeeded). Exim also logs
certain exceptional events during the SMTP conversation, including
rejected RCPT TOs, but all of this is before the main message
processing starts.)
What this means to me is that it's not really possible to count how
many messages we reject at SMTP time. I can count rejected recipients,
how many messages were rejected at DATA time, and how many messages
were accepted, but I can't easily count how many messages had all of
their attempted recipients rejected. Or even some of their recipients
rejected, which complicates any numbers I could come up with for how
many messages we see that have multiple recipients in the first place.
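As a rough illustration of what can be counted, here is a Python sketch that tallies those three kinds of events. The sample lines are hand-written and simplified; they follow the general shape of Exim main log entries (`<=` for an accepted message, `rejected RCPT`, `rejected after DATA`), but the hosts, addresses and message ids are invented and real lines will differ in detail depending on logging options.

```python
from collections import Counter

# Hand-written, simplified lines in the general shape of Exim main log entries.
SAMPLE_LOG = """\
2012-05-27 10:00:01 H=mail.example.net [192.0.2.10] F=<a@example.net> rejected RCPT <x@example.org>: blocked
2012-05-27 10:00:02 H=mail.example.net [192.0.2.10] F=<a@example.net> rejected RCPT <y@example.org>: blocked
2012-05-27 10:00:05 1SYaaa-000001-AA <= b@example.com H=mx.example.com [192.0.2.20] P=esmtp S=2048
2012-05-27 10:00:09 1SYaab-000002-BB H=mx.example.com [192.0.2.20] F=<c@example.com> rejected after DATA: spam
"""

def classify(line):
    """Return the kind of event a log line records, or None if it is not one we count."""
    if " rejected RCPT " in line:
        return "rejected_recipients"
    if " rejected after DATA" in line:
        return "rejected_at_data"
    if " <= " in line:
        return "accepted_messages"
    return None

def count_events(text):
    """Tally the countable event kinds across all lines of a log."""
    counts = Counter()
    for line in text.splitlines():
        kind = classify(line)
        if kind:
            counts[kind] += 1
    return counts
```

In this sample the two `rejected RCPT` lines come from the same host and sender a second apart, but nothing in them says whether they were one message with two recipients or two separate messages. That is the gap that keeps the rejected-recipient count from turning into a rejected-message count.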
(I wouldn't be surprised if single-recipient email was by far the norm today. I can easily see how many accepted recipients we had on average, but that's not the same thing as how many total recipients.)
(This is one of those entries where I don't wind up with an answer after writing it, just more questions.)
2012-05-25
How CSLab currently does server side email anti-spam stuff (version 2)
What I wrote about the Computer Science department's spam filtering back in 2007 is still broadly correct, but as you might expect the passage of several years has changed some of the details and added some things. I'm not going to repeat stuff from the original here, just supplement it with some additional notes that are current as of May 2012.
Most of the server side anti-spam stuff we do happens on our external MX gateway. These days it does a number of anti-spam related things:
- We still wait a bit before giving initial greetings and responses to EHLO/HELO. I'm not convinced that this does any good these days (if it ever did), but the code is there so it's staying. Inertia is a powerful force sometimes.
- As everyone should, we insist on valid addresses in MAIL FROM and RCPT TO to the extent that we can verify them simply (we don't do any sort of callback verification, partly because it's evil and very hard to do right). We can fully verify our own addresses (for both senders and recipients) and we verify that outside domains actually exist.
- At RCPT TO time, addresses that have opted into server side spam handling immediately reject email from IP addresses in zen.spamhaus.org and apply greylisting if they have enabled it. These days we have a self-serve system where users can set email addresses under their control to either moderate or strong spam filtering, with greylisting as an option for either.
  (The self-serve system isn't well publicized but a certain number of people have taken advantage of it.)
- update: Also at RCPT TO time, each recipient address can have its own per-address blacklists of sending hosts and MAIL FROM addresses that are immediately rejected. This feature is not currently exposed to our users; it's primarily used to block certain spam sources from administrative addresses that we have to leave generally unscreened.
  (It's sufficiently obscure that I forgot about it when I first wrote this. Hopefully I haven't forgotten anything else.)
- At DATA time, and provided that all of the destination addresses have opted in to server side spam filtering, we call out to a milter interface on our Sophos PureMessage install in order to get a spam and virus indication for the message. If it scores enough, we immediately reject it. Otherwise we accept it and continue processing.
- If the sender is in zen.spamhaus.org we add a message header about it. This is at least theoretically useful for people's filtering and also gets used later in our processing for some things.
- The message is run through Sophos PureMessage again using a non-milter interface that allows message modification. This trip actually strips known viruses and, if the message has a high enough spam score, adds a note about it to the start of the Subject: header. Note that this means that some number of messages actually get run through Sophos PureMessage twice, once at DATA time to perform the milter check (the results of which are effectively thrown away) and then a second time to do the real filtering.
  (At the mechanical level this step uses SMTP, which is why it can modify the message when our hacked-together Exim milter setup can't. Our Sophos configuration does the same thing for the SMTP filtering and the milter interface; the only difference is the communication process.)
- If the message was tagged as spam (or a virus) and is to someone who has opted for strong spam filtering, it's discarded. Well, technically such recipients are dropped from the message's recipient list; unlike SMTP DATA time filtering, this can be done selectively for only some recipients.
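The main decision points in this list can be sketched in a few lines of Python. This is only a model of the logic as described above, not our actual Exim configuration; the `Recipient` class, the function names and the `seen_before` flag (a stand-in for whatever state greylisting keeps) are all hypothetical, and the per-address blacklists are left out.

```python
from dataclasses import dataclass

@dataclass
class Recipient:
    address: str
    level: str = "none"        # "none", "moderate" or "strong"
    greylisting: bool = False

def rcpt_time_action(rcpt, sender_in_zen, seen_before):
    """Decide what happens to one recipient at RCPT TO time."""
    if rcpt.level == "none":
        return "accept"        # not opted in to server side filtering
    if sender_in_zen:
        return "reject"        # sending IP is listed in zen.spamhaus.org
    if rcpt.greylisting and not seen_before:
        return "defer"         # greylisting: temporary failure for now
    return "accept"

def scan_at_data_time(recipients):
    """A DATA time rejection hits the whole message, so the milter check
    only runs when every accepted recipient has opted in."""
    return bool(recipients) and all(r.level != "none" for r in recipients)

def recipients_after_tagging(recipients, tagged_as_spam):
    """After acceptance, drop only the 'strong' recipients of tagged mail."""
    if not tagged_as_spam:
        return list(recipients)
    return [r for r in recipients if r.level != "strong"]
```

The contrast between the last two functions is the important part: at DATA time a single recipient who hasn't opted in turns off rejection for the whole message, while after acceptance the decision can be made separately for each recipient.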
After this the message is delivered to our central email machine for actual processing and delivery and so on. If it was tagged as spam, this may lead to it getting automatically discarded by the mail system due to things like our special .forward system to easily discard spam or a similar system for automatically diverting spam to local mailing lists. Otherwise, what happens to spam-tagged messages is still up to our users; each person gets to decide for themselves how they want to handle such emails and people have adopted a wide variety of measures.
We deliberately expose only very generic and high-level server side spam filtering options to our users; for each of their addresses they can opt for 'moderate' or 'strong' filtering, with or without greylisting. Being generic means that we preserve our freedom to evolve just what each level of filtering does over time in a way that we wouldn't have if users had, for example, specifically opted in to or out of 'reject email if the sending IP is in zen.spamhaus.org'. We make only relatively generic promises about what each level does; the most important one is that moderate spam filtering always rejects at SMTP time so that if it misfires the sender knows about it.
(We do document what each level of filtering currently does, but we also specifically document that this can change and if you opt in to one of them you understand that the details may change over time.)
At a mailer level things are much more broken down, which means that we can hand-manipulate what filtering happens for specific addresses at a relatively specific level of detail. We use this power to apply special filtering (for example, strong milter filtering only) to some addresses. It's possible that we should expose some of this power to users, but doing so would present users with more and more choice complexity and also add constraints on our ability to evolve things in the future.
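One way to picture the split between generic user-visible levels and fine-grained mailer settings is as a lookup table plus a table of hand-maintained exceptions. The Python sketch below is purely illustrative: the check names, the contents of each level and the example address are assumptions, not our real settings.

```python
# Hypothetical check names; what each level really maps to is not spelled out here
# and is expected to change over time.
LEVELS = {
    "moderate": {"zen_reject", "milter_reject"},
    "strong": {"zen_reject", "milter_reject", "discard_tagged"},
}

# Hand-maintained exceptions for specific addresses, for example an address
# that gets strong milter filtering only.
OVERRIDES = {
    "admin@example.org": {"milter_reject", "discard_tagged"},
}

def checks_for(address, level, greylisting=False):
    """Return the set of concrete checks applied to an address."""
    if address in OVERRIDES:
        return set(OVERRIDES[address])
    checks = set(LEVELS.get(level, set()))
    if greylisting:
        checks.add("greylist")
    return checks
```

Because users only ever pick a key of `LEVELS`, what sits behind each key can be changed without asking anyone to opt in again. Exposing the individual check names to users would give that freedom up.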
(At the same time we do want to offer users options that match the choices they want to make. One big question is what those choices are; we don't really know how our users think about spam filtering and so on.)