Learning a lesson about spam-related logging (again)
I recently mentioned some stats about how many clients do TLS with my sinkhole SMTP server. Today I was planning to present some broader stats from that server, showing how many clients made it to various points in the SMTP conversation with it. Then, unfortunately, I discovered that I'd shot myself in the foot as far as gathering this sort of stats was concerned.
(If I was running a pure 'accept everything' SMTP server these should be pretty boring stats. But as it happens I'm not; instead I'm going various things to get rid of uninteresting or overly noisy connections and uninteresting spam.)
My sinkhole SMTP server currently has two logs: a SMTP command log
(with additional notations for events like connections, TLS
negotiations, and so on) and a message log (which logs one line per
message that made it all the way through to fully receiving the
DATA
payload). What I would like to do is generate stats on things
like how many connections there were, how many of them made it as
far as an EHLO
, how many made it to MAIL FROM
, and so on.
When I started I thought that I could just grep
my SMTP command
log and count how many hits I got on various things.
Well, no, not so fast. Take EHLO
, for example; a proper client
that successfully negotiates TLS will issue two EHLO
commands.
A related issue happens with, say, RCPT TO
commands; because many
clients pipeline their input, they may well send a RCPT TO
even
though their MAIL FROM
failed. Bogus clients may equally send a
MAIL FROM
even if their EHLO
failed (or may try MAIL FROM
without even EHLO
ing first).
Given pure SMTP logs there are two ways to fix this. The first would
be to have a unique and distinct 'success' reply (or message) for
every different SMTP command; then I could search for the successful
replies instead of the commands being issued. The other would be
to use a process which reconstructs the state of each SMTP conversation
so it can tell successful commands from failed ones, suppress
post-TLS EHLO
s, and so on. You could even have the latter option
spit out a data record for each conversation with all of this
per-conversation information.
Unfortunately I do not have the former (successful SMTP reply
messages are both duplicated and varied) and creating something to
do the latter is too much work for now.
What this really points out to me is that I should have thought more about what sort of information I might want when designing my server's logging. In theory I knew from prior experience with Exim that raw logs could make it too complicated to generate interesting stats, but in practice it never occurred to me that I might be doing this to myself.
(Applications to what gets logged and so on by our real mail system are left as an application for me to think about, but this time around I do want to think about it and see if I can improve Exim;s logging or otherwise do anything interesting.)
|
|