Learning a lesson about spam-related logging (again)
I recently mentioned some stats about how many clients do TLS with my sinkhole SMTP server. Today I was planning to present some broader stats from that server, showing how many clients made it to various points in the SMTP conversation with it. Then, unfortunately, I discovered that I'd shot myself in the foot as far as gathering this sort of stats was concerned.
(If I was running a pure 'accept everything' SMTP server these should be pretty boring stats. But as it happens I'm not; instead I'm going various things to get rid of uninteresting or overly noisy connections and uninteresting spam.)
My sinkhole SMTP server currently has two logs: a SMTP command log
(with additional notations for events like connections, TLS
negotiations, and so on) and a message log (which logs one line per
message that made it all the way through to fully receiving the
DATA payload). What I would like to do is generate stats on things
like how many connections there were, how many of them made it as
far as an
EHLO, how many made it to
MAIL FROM, and so on.
When I started I thought that I could just
grep my SMTP command
log and count how many hits I got on various things.
Well, no, not so fast. Take
EHLO, for example; a proper client
that successfully negotiates TLS will issue two
A related issue happens with, say,
RCPT TO commands; because many
clients pipeline their input, they may well send a
RCPT TO even
MAIL FROM failed. Bogus clients may equally send a
MAIL FROM even if their
EHLO failed (or may try
Given pure SMTP logs there are two ways to fix this. The first would
be to have a unique and distinct 'success' reply (or message) for
every different SMTP command; then I could search for the successful
replies instead of the commands being issued. The other would be
to use a process which reconstructs the state of each SMTP conversation
so it can tell successful commands from failed ones, suppress
EHLOs, and so on. You could even have the latter option
spit out a data record for each conversation with all of this
Unfortunately I do not have the former (successful SMTP reply
messages are both duplicated and varied) and creating something to
do the latter is too much work for now.
What this really points out to me is that I should have thought more about what sort of information I might want when designing my server's logging. In theory I knew from prior experience with Exim that raw logs could make it too complicated to generate interesting stats, but in practice it never occurred to me that I might be doing this to myself.
(Applications to what gets logged and so on by our real mail system are left as an application for me to think about, but this time around I do want to think about it and see if I can improve Exim;s logging or otherwise do anything interesting.)