2014-09-28
Learning a lesson about spam-related logging (again)
I recently mentioned some stats about how many clients do TLS with my sinkhole SMTP server. Today I was planning to present some broader stats from that server, showing how many clients made it to various points in the SMTP conversation with it. Then, unfortunately, I discovered that I'd shot myself in the foot as far as gathering this sort of stats was concerned.
(If I was running a pure 'accept everything' SMTP server these should be pretty boring stats. But as it happens I'm not; instead I'm going various things to get rid of uninteresting or overly noisy connections and uninteresting spam.)
My sinkhole SMTP server currently has two logs: a SMTP command log
(with additional notations for events like connections, TLS
negotiations, and so on) and a message log (which logs one line per
message that made it all the way through to fully receiving the
DATA payload). What I would like to do is generate stats on things
like how many connections there were, how many of them made it as
far as an EHLO, how many made it to MAIL FROM, and so on.
When I started I thought that I could just grep my SMTP command
log and count how many hits I got on various things.
Well, no, not so fast. Take EHLO, for example; a proper client
that successfully negotiates TLS will issue two EHLO commands.
A related issue happens with, say, RCPT TO commands; because many
clients pipeline their input, they may well send a RCPT TO even
though their MAIL FROM failed. Bogus clients may equally send a
MAIL FROM even if their EHLO failed (or may try MAIL FROM
without even EHLOing first).
Given pure SMTP logs there are two ways to fix this. The first would
be to have a unique and distinct 'success' reply (or message) for
every different SMTP command; then I could search for the successful
replies instead of the commands being issued. The other would be
to use a process which reconstructs the state of each SMTP conversation
so it can tell successful commands from failed ones, suppress
post-TLS EHLOs, and so on. You could even have the latter option
spit out a data record for each conversation with all of this
per-conversation information.
Unfortunately I do not have the former (successful SMTP reply
messages are both duplicated and varied) and creating something to
do the latter is too much work for now.
What this really points out to me is that I should have thought more about what sort of information I might want when designing my server's logging. In theory I knew from prior experience with Exim that raw logs could make it too complicated to generate interesting stats, but in practice it never occurred to me that I might be doing this to myself.
(Applications to what gets logged and so on by our real mail system are left as an application for me to think about, but this time around I do want to think about it and see if I can improve Exim;s logging or otherwise do anything interesting.)
2014-09-15
My collection of spam and the spread of SMTP TLS
One of the things that my sinkhole SMTP server does that's new on my workstation is that it supports TLS, unlike my old real mail server there (which dates from a very, very long time ago). This has given me the chance to see how much of my incoming spam is delivered with TLS, which in turn has sparked some thoughts about the spread of SMTP TLS.
The starting point is that a surprising amount of my incoming spam is actually delivered with TLS; right now about 30% of the successful deliveries have used TLS. This is somewhat more striking than it sounds for two reasons; first, the Go TLS code I'm relying on for TLS is incomplete (and thus not all TLS-capable sending MTAs can actually do TLS with it), and second a certain amount of the TLS connection attempts fail because the sending MTA is offering an invalid client certificate.
(I also see a fair number of rejected delivery attempts in my SMTP command log that did negotiate TLS, but the stats there are somewhat tangled and I'm not going to try to summarize them.)
While there are some persistent spammers, most of the incoming email is your typical advance fee fraud and phish spam that's send through various sorts of compromised places. Much of the TLS email I get is this boring sort of spam, somewhat to my surprise. My prejudice is that a fair amount of this spam comes from old and neglected machines, which are exactly the machines that I would expect are least likely to do TLS.
(Some amount of such spam comes from compromised accounts at places like universities, which can and do happen to even modern and well run MTAs. I'm not surprised when they use TLS.)
What this says to me is that support for initiating TLS is fairly widespread in MTAs, even relatively old MTAs, and fairly well used. This is good news (it's now clear that pervasive encryption of traffic on the Internet is a good thing, even casual opportunistic encryption). I suspect that it's happened because common MTAs have enabled client TLS by default and the reason they've been able to do that is that it basically takes no configuration and almost always works.
(It's clear that at least some client MTAs take note when STARTTLS
fails and don't try it again even if the server MTA offers it to
them, because I see exactly this pattern in my SMTP logs from some
clients.)
PS: you might wonder if persistent spammers use TLS when delivering their spam. I haven't done a systematic measurement for various reasons but on anecdotal spot checks it appears that my collection of them basically doesn't use TLS. This is probably unsurprising since TLS does take some extra work and CPU. I suspect that spammers may start switching if TLS becomes something that spam filtering systems use as a trust signal, just as some of them have started advertising DKIM signatures.