How CSLab currently does server side email anti-spam stuff (version 2)

May 25, 2012

What I wrote about the Computer Science department's spam filtering back in 2007 is still broadly correct, but as you might expect, the passage of several years has changed some of the details and added some things. I'm not going to repeat stuff from the original here, just supplement it with some additional notes that are current as of May 2012.

Most of the server side anti-spam stuff we do happens on our external MX gateway. These days it does a number of anti-spam related things:

  1. We still wait a bit before giving initial greetings and responses to EHLO/HELO. I'm not convinced that this does any good these days (if it ever did), but the code is there so it's staying. Inertia is a powerful force sometimes.

  2. As everyone should, we insist on valid addresses in MAIL FROM and RCPT TO to the extent that we can verify them simply (we don't do any sort of callback verification, partly because it's evil and very hard to do right). We can fully verify our own addresses (for both senders and recipients) and we verify that outside domains actually exist.

  3. At RCPT TO time, addresses that have opted into server side spam handling immediately reject email from IP addresses in zen.spamhaus.org and apply greylisting if they have enabled it. These days we have a self-serve system where users can set email addresses under their control to either moderate or strong spam filtering, with greylisting as an option for either.

    (The self-serve system isn't well publicized but a certain number of people have taken advantage of it.)

  • Update: Also at RCPT TO time, each recipient address can have separate per-address blacklists of sending hosts and MAIL FROM addresses that are immediately rejected. This feature is not currently exposed to our users; it's primarily used to block certain spam sources from administrative addresses that we have to leave generally unscreened.

    (It's sufficiently obscure that I forgot about it when I first wrote this. Hopefully I haven't forgotten anything else.)

  4. At DATA time, and provided that all of the destination addresses have opted in to server side spam filtering, we call out to a milter interface on our Sophos PureMessage install in order to get a spam and virus indication for the message. If it scores enough, we immediately reject it. Otherwise we accept it and continue processing.

  5. If the sender is in zen.spamhaus.org we add a message header about it. This is at least theoretically useful for people's filtering and also gets used later in our processing for some things.

  6. The message is run through Sophos PureMessage again using a non-milter interface that allows message modification. This trip actually strips known viruses and, if the message has a high enough spam score, adds a note about it to the start of the Subject: header. Note that this means that some number of messages actually get run through Sophos PureMessage twice, once at DATA time to perform the milter check (the results of which are effectively thrown away) and then a second time to do the real filtering.

    (At the mechanical level this step uses SMTP, which is why it can modify the message when our hacked-together Exim milter setup can't. Our Sophos configuration does the same thing for the SMTP filtering and the milter interface; the only difference is the communication process.)

  7. If the message was tagged as spam (or a virus) and is to someone who has opted for strong spam filtering, it's discarded for them. Well, technically such recipients are dropped from the recipient list; unlike SMTP DATA time filtering, this can be done selectively for only some recipients.
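
As a rough illustration of the decisions described above, here is a minimal Python sketch. It is not our actual Exim configuration; the `Recipient` class, the function names and the `seen_before` greylisting flag are all made up for the example, and the DNSBL and Sophos results are passed in as plain booleans instead of being looked up. The one part that is standard is how a DNSBL query name is formed: the IPv4 octets are reversed and the zone is appended.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address
from typing import List, Optional


def dnsbl_query_name(ip: str, zone: str = "zen.spamhaus.org") -> str:
    """Return the DNS name that is queried to check an IPv4 address in a DNSBL.

    A real check would resolve this name and treat any answer as "listed";
    this sketch only builds the name.
    """
    octets = str(IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


@dataclass(frozen=True)
class Recipient:
    """Hypothetical per-address settings (names are assumptions)."""
    address: str
    level: Optional[str] = None          # None (not opted in), "moderate" or "strong"
    greylist: bool = False
    blocked_hosts: frozenset = frozenset()    # per-address blacklist of sending hosts
    blocked_senders: frozenset = frozenset()  # per-address blacklist of MAIL FROMs


def rcpt_decision(rcpt: Recipient, sender_ip: str, mail_from: str,
                  listed: bool, seen_before: bool) -> str:
    """Decide one RCPT TO: 'reject', 'defer' or 'accept'."""
    # Per-address blacklists apply even to otherwise unscreened addresses.
    if sender_ip in rcpt.blocked_hosts or mail_from in rcpt.blocked_senders:
        return "reject"
    if rcpt.level is None:
        return "accept"
    if listed:                            # sending IP is in the DNSBL
        return "reject"
    if rcpt.greylist and not seen_before:
        return "defer"                    # greylisting: temporary failure on first sight
    return "accept"


def data_time_reject(recipients: List[Recipient], scored_as_spam: bool) -> bool:
    """A DATA-time rejection hits every recipient, so all must have opted in."""
    return bool(scored_as_spam and all(r.level is not None for r in recipients))


def surviving_recipients(recipients: List[Recipient], tagged: bool) -> List[Recipient]:
    """After acceptance, drop only the 'strong' recipients of a tagged message."""
    if not tagged:
        return list(recipients)
    return [r for r in recipients if r.level != "strong"]
```

For example, `dnsbl_query_name("192.0.2.10")` gives `10.2.0.192.zen.spamhaus.org`. The split between `data_time_reject` and `surviving_recipients` mirrors the asymmetry in the steps above: rejecting at DATA time is all or nothing for the message, while dropping recipients after acceptance can be done one address at a time.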

After this the message is delivered to our central email machine for actual processing and delivery and so on. If it was tagged as spam, this may lead to it getting automatically discarded by the mail system due to things like our special .forward system to easily discard spam or a similar system for automatically diverting spam to local mailing lists. Otherwise, what happens to spam-tagged messages is still up to our users; each person gets to decide for themselves how they want to handle such emails and people have adopted a wide variety of measures.

We deliberately expose only very generic and high-level server side spam filtering options to our users; for each of their addresses they can opt for 'moderate' or 'strong' filtering, with or without greylisting. Being generic means that we preserve our freedom to evolve just what each level of filtering does over time in a way that we wouldn't have if users had, for example, specifically opted in to or out of 'reject email if the sending IP is in zen.spamhaus.org'. We make only relatively generic promises about what each level does; the most important one is that moderate spam filtering always rejects at SMTP time so that if it misfires the sender knows about it.

(We do document what each level of filtering currently does, but we also specifically document that this can change and if you opt in to one of them you understand that the details may change over time.)
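
One way to picture the separation between what users choose and what actually happens is the following Python sketch. Everything in it is hypothetical: the check names, the `CURRENT_CHECKS` table and the exact contents of each level are assumptions for illustration, not a description of our real configuration. The point is only that users pick a level, the level expands to whatever concrete checks we currently use, and the one firm promise (moderate filtering only ever acts at SMTP time) can be stated as a property of that table.

```python
from enum import Enum


class Level(Enum):
    """The generic choices exposed to users."""
    MODERATE = "moderate"
    STRONG = "strong"


# Checks that act during the SMTP conversation, so a misfire is visible
# to the sender as a rejection or deferral. (Hypothetical check names.)
SMTP_TIME_CHECKS = frozenset({
    "dnsbl_reject_at_rcpt",
    "greylist_at_rcpt",
    "milter_reject_at_data",
})

# What each level expands to right now. This table is an assumption for the
# example and is exactly the part that is free to change over time.
CURRENT_CHECKS = {
    Level.MODERATE: frozenset({"dnsbl_reject_at_rcpt", "milter_reject_at_data"}),
    Level.STRONG: frozenset({
        "dnsbl_reject_at_rcpt",
        "milter_reject_at_data",
        "drop_tagged_after_accept",
    }),
}


def checks_for(level: Level, greylisting: bool = False) -> frozenset:
    """Expand a user's generic choice into the concrete checks applied."""
    checks = set(CURRENT_CHECKS[level])
    if greylisting:
        checks.add("greylist_at_rcpt")
    return frozenset(checks)


def keeps_moderate_promise() -> bool:
    """True if moderate filtering, with or without greylisting, acts only at SMTP time."""
    return all(checks_for(Level.MODERATE, g) <= SMTP_TIME_CHECKS
               for g in (False, True))
```

With this shape, changing what a level does means editing `CURRENT_CHECKS` and re-checking `keeps_moderate_promise()`; nothing that users have selected needs to change.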

At a mailer level things are much more broken down, which means that we can hand-manipulate what filtering happens for specific addresses at a relatively specific level of detail. We use this power to apply special filtering (for example, strong milter filtering only) to some addresses. It's possible that we should expose some of this power to users, but doing so would present users with more and more choice complexity and also add constraints on our ability to evolve things in the future.

(At the same time we do want to offer users options that match the choices they want to make. One big question is what those choices are; we don't really know how our users think about spam filtering and so on.)

Written on 25 May 2012.
Last modified: Fri May 25 14:21:42 2012