2012-05-25
How CSLab currently does server side email anti-spam stuff (version 2)
What I wrote about the Computer Science department's spam filtering back in 2007 is still broadly correct, but as you might expect, several years have changed some of the details and added some things. I'm not going to repeat stuff from the original here, just supplement it with some additional notes that are current as of May 2012.
Most of the server side anti-spam stuff we do happens on our external MX gateway. These days it does a number of anti-spam related things:
- We still wait a bit before giving initial greetings and responses to EHLO/HELO. I'm not convinced that this does any good these days (if it ever did), but the code is there so it's staying. Inertia is a powerful force sometimes.
- As everyone should, we insist on valid addresses in MAIL FROM and RCPT TO to the extent that we can verify them simply (we don't do any sort of callback verification, partly because it's evil and very hard to do right). We can fully verify our own addresses (for both senders and recipients) and we verify that outside domains actually exist.
- At RCPT TO time, addresses that have opted into server side spam handling immediately reject email from IP addresses in zen.spamhaus.org and apply greylisting if they have enabled it. These days we have a self-serve system where users can set email addresses under their control to either moderate or strong spam filtering, with greylisting as an option for either.
  (The self-serve system isn't well publicized but a certain number of people have taken advantage of it.)
- Update: also at RCPT TO time, each recipient address can have its own blacklist of sending hosts and MAIL FROM addresses that are immediately rejected. This feature is not currently exposed to our users; it's primarily used to block certain spam sources from administrative addresses that we have to leave generally unscreened.
  (It's sufficiently obscure that I forgot about it when I first wrote this. Hopefully I haven't forgotten anything else.)
- At DATA time, and provided that all of the destination addresses have opted in to server side spam filtering, we call out to a milter interface on our Sophos PureMessage install in order to get a spam and virus indication for the message. If it scores enough, we immediately reject it. Otherwise we accept it and continue processing.
- If the sender is in zen.spamhaus.org we add a message header about it. This is at least theoretically useful for people's filtering and also gets used later in our processing for some things.
- The message is run through Sophos PureMessage again using a non-milter interface that allows message modification. This trip actually strips known viruses and, if the message has a high enough spam score, adds a note about it to the start of the Subject: header. Note that this means that some number of messages actually get run through Sophos PureMessage twice, once at DATA time to perform the milter check (the results of which are effectively thrown away) and then a second time to do the real filtering.
  (At the mechanical level this step uses SMTP, which is why it can modify the message when our hacked-together Exim milter setup can't. Our Sophos configuration does the same thing for the SMTP filtering and the milter interface; the only difference is the communication process.)
- If the message was tagged as spam (or a virus) and is to someone who has opted for strong spam filtering, it's discarded. Well, technically that person is dropped from the message's recipient list; unlike SMTP DATA time filtering, this can be done selectively for only some recipients.
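The RCPT TO time checks in the list above can be sketched roughly as follows. This is an illustration in Python, not our actual Exim configuration: the function and class names, the order of the checks, the five-minute greylisting delay, and the response strings are all assumptions made for the example. The one well-established convention it relies on is how a DNS blocklist such as zen.spamhaus.org is queried, which is by reversing the IPv4 octets and looking the result up under the zone.

```python
import ipaddress
import time


def dnsbl_query_name(ip, zone="zen.spamhaus.org"):
    """Build the DNS name to look up for an IPv4 address in a DNSBL zone.

    The octets are reversed and the zone is appended; an answer means
    "listed" and NXDOMAIN means "not listed". IPv4 only in this sketch.
    """
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


class Greylist:
    """Defer mail from each (ip, sender, recipient) triplet until a delay passes."""

    def __init__(self, delay_seconds=300.0):  # hypothetical delay
        self.delay = delay_seconds
        self.first_seen = {}

    def should_defer(self, ip, sender, rcpt, now=None):
        now = time.time() if now is None else now
        key = (ip, sender.lower(), rcpt.lower())
        first = self.first_seen.setdefault(key, now)
        return (now - first) < self.delay


def rcpt_decision(ip, sender, rcpt, prefs, blacklists, in_dnsbl, greylist, now=None):
    """Return an SMTP-style verdict for a single RCPT TO.

    prefs:      recipient -> {"greylist": bool} for opted-in addresses only
    blacklists: recipient -> set of blocked sending hosts / MAIL FROM addresses
    in_dnsbl:   callable(ip) -> bool, e.g. a real DNS lookup of dnsbl_query_name(ip)
    """
    blocked = blacklists.get(rcpt, set())
    if ip in blocked or sender in blocked:
        return "550 rejected (per-address blacklist)"
    pref = prefs.get(rcpt)
    if pref is None:
        # This address has not opted in to server side spam handling.
        return "250 ok"
    if in_dnsbl(ip):
        return "550 rejected (sending IP is in the DNSBL)"
    if pref.get("greylist") and greylist.should_defer(ip, sender, rcpt, now):
        return "451 greylisted, try again later"
    return "250 ok"
```

Passing the DNSBL lookup in as a callable keeps the decision logic testable without doing DNS queries; a real lookup would resolve the name from `dnsbl_query_name()` and treat any answer as "listed".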
After this the message is delivered to our central email machine for actual processing and delivery and so on. If it was tagged as spam, this may lead to it getting automatically discarded by the mail system due to things like our special .forward system to easily discard spam or a similar system for automatically diverting spam to local mailing lists. Otherwise, what happens to spam-tagged messages is still up to our users; each person gets to decide for themselves how they want to handle such emails and people have adopted a wide variety of measures.
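The difference between DATA time rejection and the later per-recipient discarding described in the list above comes down to what SMTP allows: the response to DATA covers the whole message, so it can only be used when every recipient has opted in, while after acceptance individual recipients can be dropped. A minimal sketch of that logic, with hypothetical function names and a made-up numeric score and threshold standing in for whatever Sophos PureMessage actually reports:

```python
def data_time_verdict(recipients, opted_in, spam_score, reject_threshold):
    """Decide the single DATA-time response for the whole message.

    A rejection applies to every recipient at once, so it is only
    allowed when all of them have opted in to server side filtering.
    """
    everyone_opted_in = all(r in opted_in for r in recipients)
    if everyone_opted_in and spam_score >= reject_threshold:
        return "reject"
    return "accept"


def prune_recipients(recipients, strong_filtering, tagged_as_spam):
    """After acceptance, drop 'strong' recipients from a spam-tagged message.

    Recipients without strong filtering still get the (tagged) message.
    """
    if not tagged_as_spam:
        return list(recipients)
    return [r for r in recipients if r not in strong_filtering]
```

For example, a tagged message to one strong-filtering address and one moderate-filtering address would be delivered only to the second, with the Subject: note left in place for that person's own filtering.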
We deliberately expose only very generic and high-level server side spam filtering options to our users; for each of their addresses they can opt for 'moderate' or 'strong' filtering, with or without greylisting. Being generic means that we preserve our freedom to evolve just what each level of filtering does over time in a way that we wouldn't have if users had, for example, specifically opted in to or out of 'reject email if the sending IP is in zen.spamhaus.org'. We make only relatively generic promises about what each level does; the most important one is that moderate spam filtering always rejects at SMTP time so that if it misfires the sender knows about it.
(We do document what each level of filtering currently does, but we also specifically document that this can change and if you opt in to one of them you understand that the details may change over time.)
At a mailer level things are much more broken down, which means that we can hand-manipulate what filtering happens for specific addresses at a relatively specific level of detail. We use this power to apply special filtering (for example, strong milter filtering only) to some addresses. It's possible that we should expose some of this power to users, but doing so would present users with more and more choice complexity and also add constraints on our ability to evolve things in the future.
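One way to picture the split between the generic user-facing levels and the finer-grained mailer-level controls is as a small table that expands each level into individual checks, plus an administrative override. Everything here is an illustrative assumption: the names are invented, the contents of each level are only my reading of the steps described earlier, and the override shown is one guess at what something like "milter filtering only" could look like.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Checks:
    """Fine-grained, mailer-level switches (hypothetical names)."""
    dnsbl_reject: bool = False    # reject at RCPT TO if the IP is in the DNSBL
    milter_reject: bool = False   # reject at DATA based on the milter result
    discard_tagged: bool = False  # drop this recipient if the message is tagged
    greylist: bool = False


# What users choose. Because they pick a level rather than individual
# checks, this table can change over time without breaking a promise.
LEVELS = {
    "moderate": Checks(dnsbl_reject=True, milter_reject=True),
    "strong": Checks(dnsbl_reject=True, milter_reject=True, discard_tagged=True),
}


def effective_checks(level, greylist=False, override=None):
    """Resolve the checks for one address.

    An administrator-set override wins outright; otherwise the user's
    level is expanded and their greylisting choice is applied.
    """
    if override is not None:
        return override
    if level is None:
        return Checks()  # not opted in: no server side filtering
    return replace(LEVELS[level], greylist=greylist)
```

Note that in this sketch "moderate" contains only checks that reject at SMTP time, which matches the one firm promise made about that level.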
(At the same time we do want to offer users options that match the choices they want to make. One big question is what those choices are; we don't really know how our users think about spam filtering and so on.)