What broad hit rate the Spamhaus DBL might get for us

March 20, 2016

I took the past nine days' worth of logs from our commercial anti-spam black box and extracted the 'spam score' it assigns to each message along with the envelope sender (MAIL FROM) domain. I split the results into three categories based on the scores, which run from 0 to 100, and then checked all of the sender domains against the Spamhaus DBL.

(Because of how our overall anti-spam systems work, this excludes some but not all of the email from hosts that are in Spamhaus's IP based lists.)
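As an illustration of the 'check a domain against the DBL' step, here is a minimal sketch of a DBL lookup in Python. It is not the script used for these numbers. It assumes the public `dbl.spamhaus.org` DNS zone and my understanding of Spamhaus's documented return codes (listings answer in 127.0.1.x, with 127.0.1.255 and 127.255.255.x reserved for errors); the function names and the injectable `resolve` parameter are made up for the example.

```python
import socket

# Assumption: the public Spamhaus DBL zone.
DBL_ZONE = "dbl.spamhaus.org"


def dbl_query_name(domain):
    """Build the DNS name to look up for a sender domain."""
    return domain.strip().rstrip(".").lower() + "." + DBL_ZONE


def is_dbl_listed(domain, resolve=socket.gethostbyname):
    """Return True if the DBL appears to list `domain`.

    `resolve` is a hypothetical hook (name -> IPv4 string) so the logic
    can be exercised without touching real DNS.
    """
    try:
        answer = resolve(dbl_query_name(domain))
    except OSError:
        # No answer (typically NXDOMAIN): treat as not listed.
        return False
    # Assumption: 127.0.1.x means listed, except 127.0.1.255, which
    # signals a malformed query; 127.255.255.x are error codes (for
    # example, queries refused because they came via a public resolver).
    return answer.startswith("127.0.1.") and answer != "127.0.1.255"
```

Running `is_dbl_listed(d)` once per distinct sender domain is all the check needs; a real run would want a local caching resolver and some care about Spamhaus's usage terms.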

Based partly on previous stats and how we use the spam scores ourselves, my three categories were 'definitely spam' (scores of 98 to 100), 'enough to be spam' (scores of 60 through 97), and 'probably not spam' (below 60). The raw numbers are:

  • for 'definitely spam', 5,452 different MAIL FROM domains and only 812 in the DBL; a hit rate of about 15%.

  • for 'enough to be spam', 4,118 different domains and 1,744 in the DBL; a 42% hit rate.

  • for 'probably not spam', 5,268 different domains and 20 in the DBL; a hit rate of about 0.4%.
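The per-domain arithmetic behind this list can be sketched in a few lines of Python. The score thresholds and the counts are the ones given above; the function and variable names are invented for the example.

```python
def category(score):
    """Map a 0-100 spam score to one of the three categories."""
    if score >= 98:
        return "definitely spam"
    if score >= 60:
        return "enough to be spam"
    return "probably not spam"


def hit_rate(total_domains, listed_domains):
    """Percentage of distinct sender domains that are in the DBL."""
    return 100.0 * listed_domains / total_domains


# (distinct MAIL FROM domains, domains found in the DBL)
COUNTS = {
    "definitely spam": (5452, 812),
    "enough to be spam": (4118, 1744),
    "probably not spam": (5268, 20),
}

for name, (total, listed) in COUNTS.items():
    print(f"{name}: {hit_rate(total, listed):.1f}%")
# definitely spam: 14.9%
# enough to be spam: 42.4%
# probably not spam: 0.4%
```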

At one level, this is actually reassuring; it suggests that our commercial black box is doing a reasonably good job of finding much of the actual spam, even though it missed some things.

(It also suggests that the black box is not already using the DBL, or at least that if it does, it doesn't weight the envelope sender very heavily in its scoring. Otherwise those 20 domains wouldn't be there.)

The relatively low domain hit rate in the 'definitely spam' category is at least partly because many domains in that set were not used for very many messages to us. In fact, the median usage count for domains there is one. If I instead count DBL hits by usage, 44% of the actual messages had sender domains in the DBL.

The usage-based hit rate for the 'enough to be spam' category comes out higher, at 54% of the actual messages having sender domains in the DBL.

(As you might expect, the 'probably not spam' category doesn't improve when measured by actual usage. Percentage-wise it goes way, way down, in fact, as not very many messages came from those DBL-listed domains.)
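The difference between counting by domain and counting by usage is easy to see in code. This is only a sketch with invented names and made-up sample data, not the script behind the numbers above: it takes one sender domain per message plus the set of DBL-listed domains and reports both rates.

```python
from collections import Counter


def hit_rates(sender_domains, listed):
    """Return (per-domain rate, per-message rate) as fractions.

    sender_domains: iterable with one sender domain per message.
    listed: set of domains that are in the DBL.
    """
    usage = Counter(sender_domains)  # domain -> number of messages
    if not usage:
        return (0.0, 0.0)
    domain_hits = sum(1 for d in usage if d in listed)
    message_hits = sum(n for d, n in usage.items() if d in listed)
    return (domain_hits / len(usage), message_hits / sum(usage.values()))


# Made-up data: one listed domain sends six messages, and four
# unlisted domains send one message each.
messages = ["bulk.example"] * 6 + [
    "a.example", "b.example", "c.example", "d.example",
]
by_domain, by_message = hit_rates(messages, {"bulk.example"})
print(by_domain, by_message)  # 0.2 0.6
```

A long tail of use-once domains drags the per-domain rate down without changing how much mail the listed domains account for, which is the effect seen in the 'definitely spam' category.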

All of this means that I should definitely look at using the DBL in our overall anti-spam setup, because using the DBL would enable early rejection of a significant amount of spam that otherwise makes it as far as relatively expensive spam scoring.

Written on 20 March 2016.