A quick look at some spam filtering stats from our system

April 26, 2011

It's been a while since I thought about generating statistics about what our anti-spam systems are doing and seeing, which probably means that it's about time to do it again. I'm going to look at the past week's statistics, mostly because we upgraded the spam filtering machine recently and we don't have old logs any more. Unfortunately this is not an ideal week to look at; Friday was a holiday here, so the numbers are going to be lower than usual.

First, the disclaimers: not all spam makes it to our spam tagging and filtering system. For example, some people immediately reject email from IP addresses that are in the Spamhaus Zen list; since this rejects at RCPT TO time, the actual message never makes it to the spam filtering system to be scored. At this time I haven't generated stats on how large an effect that is.
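For readers who haven't looked at how a DNS blocklist check works, here is a minimal sketch of how the lookup name for a Zen query is built. This is not our mailer's configuration; the function name `zen_query_name` is made up for illustration, and the sketch only constructs the name without doing any DNS lookup. A listed address resolves to an answer in 127.0.0.0/8 and an unlisted one gets no such record, which is what lets a mailer reject at RCPT TO time before it ever sees the message body.

```python
import ipaddress

def zen_query_name(ip, zone="zen.spamhaus.org"):
    """Build the DNSBL query name for an IPv4 address.

    Hypothetical helper for illustration: the octets are reversed and
    the blocklist zone is appended, as is conventional for DNSBLs.
    """
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on bad input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"

# 127.0.0.2 is the conventional DNSBL test address.
print(zen_query_name("127.0.0.2"))
```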

So, over the past seven days we saw:

  • 91,171 messages in total. The volume is mostly during weekdays, and once I wave my hands about the holiday Friday I'll call it flat during the weekdays and flat (at a lower level) on the weekend as well.

  • 557 messages that were identified as having some sort of virus payload. Apparently viruses are not very popular any more (or at least not viruses that our system can recognize).

  • 47,592 messages that scored high enough to be classified as spam by our system. I don't want to draw any conclusions about day of the week volume from the data I have so far.

At about 52% of all messages, this is well under the level of spam that most sources report. It's possible that our stats are skewed by various things; for example, it may be that most of the active targets of spam have opted in to spam rejection, and so spam to them never makes it to these numbers. (Trying to quantify the volume of rejections is a project for later.)
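To put the weekly counts in proportion, here is a quick back-of-the-envelope calculation using only the three totals above. The variable names are mine, and the shares are relative to messages that reached the scoring system, not to all email offered to us.

```python
# Weekly totals quoted above.
total_messages = 91_171
virus_messages = 557
spam_messages = 47_592

def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole

spam_pct = percent(spam_messages, total_messages)
virus_pct = percent(virus_messages, total_messages)

print(f"spam:  {spam_pct:.1f}% of scored messages")
print(f"virus: {virus_pct:.2f}% of scored messages")
```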

Our spam system gives messages a spam score from 0 to 100 (with some decimal places of precision allowed; theoretically this is some sort of probability measure). The breakdown of scores is somewhat interesting:

  • 22,448 messages scored 100 points.
  • 20,398 messages scored 90 to 99 points. Of those, 14,170 scored 99 points and 1,222 scored 98 points, so about three quarters of this scoring band was at the very top.
  • 4,131 messages scored 80 to 89 points.
  • 330 messages scored 70 to 79 points.
  • 285 messages scored 60 to 69 points.
  • 584 messages scored 50 to 59 points, 257 scored 40 to 49 points, and 279 messages scored 30 to 39 points.
  • 1,083 messages scored 20 to 29 points.
  • 12,899 messages scored 10 to 19 points.
  • 28,477 messages scored 0 to 9 points. The lowest score any message got was 7 points, with 17,807 messages there, followed by 8,109 messages scoring 8 points and 2,561 messages scoring 9 points.
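As a sanity check on the breakdown above, the band counts can be added up and compared against the weekly totals. This is a sketch with names of my own choosing; it works from the per-band counts as listed and does not look at raw logs.

```python
# Message counts per scoring band, as (low, high) -> count, from the list above.
band_counts = {
    (100, 100): 22_448,
    (90, 99): 20_398,
    (80, 89): 4_131,
    (70, 79): 330,
    (60, 69): 285,
    (50, 59): 584,
    (40, 49): 257,
    (30, 39): 279,
    (20, 29): 1_083,
    (10, 19): 12_899,
    (0, 9): 28_477,
}

# Every scored message should fall in exactly one band.
total = sum(band_counts.values())

# Bands at or above 60 points are the ones classified as spam.
spam_total = sum(n for (low, _high), n in band_counts.items() if low >= 60)

# Share of the 90 to 99 band that scored 98 or 99 points.
top_share = (14_170 + 1_222) / band_counts[(90, 99)]

print(f"all bands: {total:,}")
print(f"60 and up: {spam_total:,}")
print(f"98 or 99 as a share of the 90-99 band: {top_share:.1%}")
```

The bands sum to the 91,171 total and the bands from 60 up sum to the 47,592 spam count, so the breakdown is internally consistent.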

Our current threshold for calling something spam is 60 points or more. These numbers suggest that we could significantly raise the threshold without having a material effect on our spam filtering; only 615 messages scored from 60 to 79 points, which is about 1.3% of what we currently call spam. On the other hand, since it would have no material effect there seems no reason to do it (other than possibly user perception, and I don't know if users pay any attention to this).
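To see how little a higher threshold would change, here is a rough sketch that recounts the spam total at a few candidate thresholds. It only knows the ten-point bands from the breakdown, so it is only meaningful for thresholds that are multiples of ten; `spam_count_at` is a name I made up for this.

```python
# Messages per scoring band, keyed by the band's lowest score.
BAND_COUNTS = {
    100: 22_448, 90: 20_398, 80: 4_131, 70: 330, 60: 285,
    50: 584, 40: 257, 30: 279, 20: 1_083, 10: 12_899, 0: 28_477,
}

def spam_count_at(threshold):
    """Count messages in bands whose lowest score is at or above threshold.

    Only meaningful for thresholds that are multiples of ten, since the
    data is per band rather than per message.
    """
    return sum(n for low, n in BAND_COUNTS.items() if low >= threshold)

current = spam_count_at(60)  # today's threshold
for threshold in (60, 70, 80, 90):
    tagged = spam_count_at(threshold)
    lost = current - tagged
    print(f"threshold {threshold}: {tagged:,} tagged, "
          f"{lost:,} fewer ({100.0 * lost / current:.1f}%)")
```

Moving the threshold to 80 would stop tagging 615 messages, about 1.3% of the current spam count; going all the way to 90 would stop tagging 4,746, about 10%.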

(Note that this is not the same system that I did my old spam stats for, and so if I do regular reports they are going to look different and not be comparable to the old numbers.)

Written on 26 April 2011.