Learning a lesson about spam-related logging (again)

September 28, 2014

I recently mentioned some stats about how many clients do TLS with my sinkhole SMTP server. Today I was planning to present some broader stats from that server, showing how many clients made it to various points in the SMTP conversation with it. Then, unfortunately, I discovered that I'd shot myself in the foot as far as gathering this sort of stats was concerned.

(If I was running a pure 'accept everything' SMTP server these should be pretty boring stats. But as it happens I'm not; instead I'm going various things to get rid of uninteresting or overly noisy connections and uninteresting spam.)

My sinkhole SMTP server currently has two logs: a SMTP command log (with additional notations for events like connections, TLS negotiations, and so on) and a message log (which logs one line per message that made it all the way through to fully receiving the DATA payload). What I would like to do is generate stats on things like how many connections there were, how many of them made it as far as an EHLO, how many made it to MAIL FROM, and so on. When I started I thought that I could just grep my SMTP command log and count how many hits I got on various things.

Well, no, not so fast. Take EHLO, for example; a proper client that successfully negotiates TLS will issue two EHLO commands. A related issue happens with, say, RCPT TO commands; because many clients pipeline their input, they may well send a RCPT TO even though their MAIL FROM failed. Bogus clients may equally send a MAIL FROM even if their EHLO failed (or may try MAIL FROM without even EHLOing first).

Given pure SMTP logs there are two ways to fix this. The first would be to have a unique and distinct 'success' reply (or message) for every different SMTP command; then I could search for the successful replies instead of the commands being issued. The other would be to use a process which reconstructs the state of each SMTP conversation so it can tell successful commands from failed ones, suppress post-TLS EHLOs, and so on. You could even have the latter option spit out a data record for each conversation with all of this per-conversation information. Unfortunately I do not have the former (successful SMTP reply messages are both duplicated and varied) and creating something to do the latter is too much work for now.

What this really points out to me is that I should have thought more about what sort of information I might want when designing my server's logging. In theory I knew from prior experience with Exim that raw logs could make it too complicated to generate interesting stats, but in practice it never occurred to me that I might be doing this to myself.

(Applications to what gets logged and so on by our real mail system are left as an application for me to think about, but this time around I do want to think about it and see if I can improve Exim;s logging or otherwise do anything interesting.)

Written on 28 September 2014.
« Changing major version numbers does not fix compatibility issues
Don't split up error messages in your source code »

Page tools: View Source.
Search:
Login: Password:

Last modified: Sun Sep 28 22:50:13 2014
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.