Complications in spam filter stats in our environment
It seems useful to generate some kind of periodic stats on how our spam filtering system is doing, something like the one-off stats I've done before. But this raises a question: what actual numbers are meaningful, and how do we get them?
There are two problems, one general and one pragmatic. The general
problem is the difference between counting per-recipient events and
per-message events. Many of our spam filtering actions apply only to
a single recipient and only get logged that way; others apply to an
entire message and are logged that way. Even if and when you can go
from per-recipient logs to per-message logs, you need to decide how to
count multi-recipient messages where different recipients had different
outcomes.
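To make the counting question concrete, here is a minimal sketch in Python. It assumes the per-message grouping already exists, which is exactly what our logs don't give us; the `messages` data, the outcome labels, and the `summarize` function are all made up for illustration.

```python
from collections import Counter

# Hypothetical data: each message maps to the outcome for each attempted
# recipient. Real logs do not hand you this grouping directly.
messages = {
    "msg-a": ["accepted"],
    "msg-b": ["rejected"],
    "msg-c": ["accepted", "rejected"],
    "msg-d": ["rejected", "rejected"],
}

def summarize(messages):
    """Count the same events per recipient and per message."""
    stats = Counter()
    for outcomes in messages.values():
        if not outcomes:
            # No attempted recipients at all; nothing to count.
            continue
        rejected = outcomes.count("rejected")
        stats["recipients"] += len(outcomes)
        stats["recipients_rejected"] += rejected
        stats["messages"] += 1
        if rejected == len(outcomes):
            stats["messages_all_rejected"] += 1
        elif rejected > 0:
            # Some recipients accepted, some rejected: which bucket
            # this belongs in is a policy decision, so keep it separate.
            stats["messages_mixed"] += 1
        else:
            stats["messages_all_accepted"] += 1
    return stats

stats = summarize(messages)
print(dict(stats))
```

With this sample data, four of six recipients were rejected, but the per-message rejection figure is either two of four or three of four depending on whether the mixed message counts as rejected.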
The pragmatic problem is that Exim's logs make it very difficult
(and perhaps impossible) to do this per-recipient to per-message
reconstruction in the first place, at least with our current logging
options. Exim just doesn't log enough information when it rejects an
RCPT TO to associate it with the log messages for when it processes
the whole message with its accepted recipients (if any).
(I sympathize with Exim somewhat, because the two sets of log messages
are produced at completely different levels. Exim's main logging is
focused on recording information about message routing and delivery,
and this whole routing process only starts happening once the message
has been accepted (eg, SMTP DATA has succeeded). Exim also logs
certain exceptional events during the SMTP conversation, including
rejected RCPT TOs, but all of this is before the main message
processing starts.)
What this means to me is that it's not really possible to count how
many messages we reject at SMTP time. I can count rejected recipients,
how many messages were rejected at DATA time, and how many messages
were accepted, but I can't easily count how many messages had all of
their attempted recipients rejected. I can't even count how many had
only some of their recipients rejected, which complicates any numbers
I could come up with for how many messages we see that have multiple
recipients in the first place.
(I wouldn't be surprised if single-recipient email was by far the norm today. I can easily see how many accepted recipients we had on average, but that's not the same thing as how many total recipients.)
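As a small illustration of why the two averages differ, here is a sketch with made-up numbers. The `(attempted, accepted)` pairs and the `recipient_averages` function are hypothetical; in our real logs only the accepted side is readily available.

```python
# Hypothetical (attempted, accepted) recipient counts for four messages.
messages = [(1, 1), (3, 1), (2, 0), (4, 2)]

def recipient_averages(messages):
    """Return (mean accepted recipients per accepted message,
    mean attempted recipients per message seen)."""
    # A message only shows up as "accepted" if at least one recipient was.
    accepted_msgs = [(att, acc) for att, acc in messages if acc > 0]
    avg_accepted = sum(acc for _, acc in accepted_msgs) / len(accepted_msgs)
    avg_attempted = sum(att for att, _ in messages) / len(messages)
    return avg_accepted, avg_attempted

avg_accepted, avg_attempted = recipient_averages(messages)
print(f"accepted: {avg_accepted:.2f}, attempted: {avg_attempted:.2f}")
```

Here the visible average is about 1.33 accepted recipients per accepted message, while senders actually attempted 2.5 recipients per message, so the accepted figure alone can make mail look more single-recipient than it is.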
(This is one of those entries where I don't wind up with an answer after writing it, just more questions.)