Complications in spam filter stats in our environment

May 27, 2012

It seems useful to generate some kind of periodic stats on how our spam filtering system is doing, something like the one-off stats I've done before. But this raises a question: what actual numbers are meaningful, and how do we get them?

There are two problems, one general and one pragmatic. The general problem is the difference between counting per-recipient events and per-message events. Many of our spam filtering checks apply only to a single recipient and are only logged that way; others apply to an entire message and are logged that way. Even if you can go from per-recipient logs to per-message logs, you need to decide how to count multi-recipient messages where different recipients had different outcomes. The pragmatic problem is that Exim's logs make it very difficult (and perhaps impossible) to do this per-recipient to per-message reconstruction in the first place, at least with our current logging options. Exim just doesn't log enough information when it rejects an RCPT TO to associate that rejection with the log messages it later produces when it processes the whole message with its accepted recipients (if any).
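To make the counting question concrete, here is a minimal sketch in Python. It assumes the per-recipient outcomes have somehow already been grouped by message, which is exactly the step our Exim logs don't currently allow. The message ids, the `classify` helper, and the three category names are all hypothetical, chosen only for illustration.

```python
from collections import Counter

def classify(outcomes):
    """Classify one message by its per-recipient outcomes (True = accepted)."""
    if not outcomes:
        raise ValueError("message with no attempted recipients")
    if all(outcomes):
        return "all_accepted"
    if not any(outcomes):
        return "all_rejected"
    return "mixed"

# Hypothetical data: message id -> outcome for each attempted recipient.
messages = {
    "m1": [True],
    "m2": [False],
    "m3": [True, False, False],
    "m4": [False, False],
}

# Per-message view: one count per message, whatever its recipient count.
per_message = Counter(classify(o) for o in messages.values())

# Per-recipient view: one count per RCPT TO, ignoring message boundaries.
per_recipient = Counter(
    "accepted" if ok else "rejected"
    for outcomes in messages.values()
    for ok in outcomes
)

print(dict(per_message))
print(dict(per_recipient))
```

The same traffic reads as "two of four messages fully rejected" in one view and "five of seven recipients rejected" in the other, and the "mixed" bucket is where a counting policy has to be chosen.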

(I sympathize with Exim somewhat, because the two sets of log messages are produced at completely different levels. Exim's main logging is focused on recording information about message routing and delivery, and this whole routing process only starts happening once the message has been accepted (eg, SMTP DATA has succeeded). Exim also logs certain exceptional events during the SMTP conversation, including rejected RCPT TOs, but all of this is before the main message processing starts.)

What this means to me is that it's not really possible to count how many messages we reject at SMTP time. I can count rejected recipients, messages rejected at DATA time, and accepted messages, but I can't easily count how many messages had all of their attempted recipients rejected. I can't even count how many had only some of their recipients rejected, which complicates any numbers I could come up with for how many messages we see that have multiple recipients in the first place.
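One way to see why the totals aren't enough is the following Python sketch. It models an SMTP session as a small dict (a made-up representation, not anything Exim produces) and reduces two different histories to the three totals described above. The field names and both scenarios are assumptions for illustration only.

```python
def observable_counts(sessions):
    """Reduce sessions to the three totals that can be counted from logs:
    (rejected recipients, messages rejected at DATA, accepted messages)."""
    rejected_rcpts = 0
    data_rejects = 0
    accepted_msgs = 0
    for s in sessions:
        rejected_rcpts += s["rcpt_rejected"]
        if s["rcpt_accepted"] == 0:
            continue  # no accepted recipient, so DATA was never reached
        if s["data_ok"]:
            accepted_msgs += 1
        else:
            data_rejects += 1
    return (rejected_rcpts, data_rejects, accepted_msgs)

def fully_rejected(sessions):
    """The number we actually want: messages with every recipient rejected."""
    return sum(
        1 for s in sessions
        if s["rcpt_accepted"] == 0 and s["rcpt_rejected"] > 0
    )

# Scenario A: three single-recipient messages rejected outright,
# plus one ordinary accepted message.
scenario_a = [
    {"rcpt_rejected": 1, "rcpt_accepted": 0, "data_ok": False},
    {"rcpt_rejected": 1, "rcpt_accepted": 0, "data_ok": False},
    {"rcpt_rejected": 1, "rcpt_accepted": 0, "data_ok": False},
    {"rcpt_rejected": 0, "rcpt_accepted": 1, "data_ok": True},
]

# Scenario B: a single message with four attempted recipients,
# three rejected and one accepted.
scenario_b = [
    {"rcpt_rejected": 3, "rcpt_accepted": 1, "data_ok": True},
]

print(observable_counts(scenario_a), fully_rejected(scenario_a))
print(observable_counts(scenario_b), fully_rejected(scenario_b))
```

Both scenarios reduce to the same observable totals, yet one has three fully rejected messages and the other has none. Without something in the logs that ties a rejected RCPT TO to its message, the totals alone can't distinguish them.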

(I wouldn't be surprised if single-recipient email was by far the norm today. I can easily see how many accepted recipients we had on average, but that's not the same thing as how many total recipients.)

(This is one of those entries where I don't wind up with an answer after writing it, just more questions.)

Last modified: Sun May 27 02:30:44 2012