== Learning a lesson about spam-related logging (again) I recently mentioned [[some stats about how many clients do TLS with my sinkhole SMTP server SpamAndTLSSpread]]. Today I was planning to present some broader stats from that server, showing how many clients made it to various points in the SMTP conversation with it. Then, unfortunately, I discovered that I'd shot myself in the foot as far as gathering this sort of stats was concerned. (If I was running a pure 'accept everything' SMTP server these should be pretty boring stats. But as it happens I'm not; instead I'm going various things to get rid of uninteresting or overly noisy connections and uninteresting spam.) My sinkhole SMTP server currently has two logs: a SMTP command log (with additional notations for events like connections, TLS negotiations, and so on) and a message log (which logs one line per message that made it all the way through to fully receiving the _DATA_ payload). What I would like to do is generate stats on things like how many connections there were, how many of them made it as far as an _EHLO_, how many made it to _MAIL FROM_, and so on. When I started I thought that I could just _grep_ my SMTP command log and count how many hits I got on various things. Well, no, not so fast. Take _EHLO_, for example; a proper client that successfully negotiates TLS will issue *two* _EHLO_ commands. A related issue happens with, say, _RCPT TO_ commands; because many clients pipeline their input, they may well send a _RCPT TO_ even though their _MAIL FROM_ failed. Bogus clients may equally send a _MAIL FROM_ even if their _EHLO_ failed (or may try _MAIL FROM_ without even _EHLO_ing first). Given pure SMTP logs there are two ways to fix this. The first would be to have a unique and distinct 'success' reply (or message) for every different SMTP command; then I could search for the successful replies instead of the commands being issued. The other would be to use a process which reconstructs the state of each SMTP conversation so it can tell successful commands from failed ones, suppress post-TLS _EHLO_s, and so on. You could even have the latter option spit out a data record for each conversation with all of this per-conversation information. Unfortunately I do not have the former (successful SMTP reply messages are both duplicated and varied) and creating something to do the latter is too much work for now. What this really points out to me is that I should have thought more about what sort of information I might want when designing my server's logging. In theory I knew from [[prior experience with Exim SpamStatsComplications]] that raw logs could make it too complicated to generate interesting stats, but in practice it never occurred to me that I might be doing this to myself. (Applications to what gets logged and so on by our real mail system are left as an application for me to think about, but this time around I do want to think about it and see if I can improve Exim;s logging or otherwise do anything interesting.)