2013-03-24
Looking at how many external recipients inbound email goes to
My data on how many recipients our average inbound email has is in practice incomplete. It's quite possible for a single address here to expand into multiple destinations; some addresses are mailing lists and some people just forward their email to more than one place. So an interesting companion question is how many external recipients a typical email has. To make this more applicable to what I'm interested in, I'm looking at this only for email from the outside world.
As before, this covers 89 days of logs (but because it's a slightly different 89 days, the stats don't necessarily match up exactly). The first number is that out of 1.3 million inbound emails, only 30% had any external recipients at all; the remaining 70% went (directly or indirectly) only to internal recipients. The recipient count breaks down this way:
| 1 recipient | 91.7% |
| 2 recipients | 4.6% |
| 3 recipients | 1.9% |
| 4 recipients | 0.5% |
| 5 recipients | 0.4% |
As you might expect in an environment with mailing lists, some messages had very high external recipient counts. The champions were emails with between 247 and 266 external recipients, all of which seem to have been messages to department-wide mailing lists (which of course go to a whole lot of people who forward their email to outside addresses). But there weren't very many such emails; only 0.4% of the messages had 10 or more external recipients.
Unlike the inbound email case there doesn't seem to be any particular pattern for significant numbers of external recipients. This is what I'd expect given that the mapping between the number of inbound recipients and the number of external recipients is a fairly random one (since it depends on exactly who the email goes to).
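To make the expansion idea concrete, here is a minimal sketch of how one accepted address could be expanded into the external addresses it ultimately reaches. The `aliases` mapping, the `local_domains` set, and the example addresses are all hypothetical; this is not how our mailer actually stores or resolves forwarding.

```python
def external_destinations(address, aliases, local_domains, _seen=None):
    """Expand one accepted address into the set of external addresses it reaches.

    aliases: hypothetical dict mapping a local address to the addresses it
             forwards or expands to (mailing lists, personal forwards).
    local_domains: hypothetical set of domains considered internal.
    """
    seen = set() if _seen is None else _seen
    if address in seen:              # guard against forwarding loops
        return set()
    seen.add(address)

    domain = address.rsplit("@", 1)[-1].lower()
    if domain not in local_domains:
        return {address}             # leaves our systems: an external recipient

    result = set()
    # A local address with no alias entry is delivered locally (no externals).
    for target in aliases.get(address, ()):
        result |= external_destinations(target, aliases, local_domains, seen)
    return result


aliases = {
    "list@cs.example": ["a@cs.example", "b@cs.example"],
    # a keeps a local copy and also forwards outside
    "a@cs.example": ["a@cs.example", "a@mail.example"],
}
externals = external_destinations("list@cs.example", aliases, {"cs.example"})
```

Counting `len(externals)` per message (after taking the union over all of the message's accepted recipients) gives the kind of per-message external recipient count tabulated above.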
2013-03-22
Looking at how many recipients our average inbound email has
One of the niggling problems of SMTP in the modern world (at least for us) is the mixed address problem, the fact that at DATA time your answer applies to all recipients. It would be much more convenient if all email messages had only a single recipient; then you could always apply just that recipient's content filtering views and enable much more rejection at SMTP time. Which leads to the question: how many recipients does an average message here have, especially inbound messages?
(Inbound messages are the most interesting ones, because those are the ones that all of our anti-spam stuff is applied to.)
Today, I decided to answer that question for our external MX gateway. The answer turns out to be that the overwhelming majority of email has only one recipient. The stats break down like this:
| 1 recipient | 93% |
| 2 recipients | 3.6% |
| 3 recipients | 1.2% |
| 4 recipients | 0.6% |
| 5 recipients | 0.4% |
| 6 recipients | 0.2% |
| 10 recipients | 0.2% |
(I think I'll stop there.)
This is from 89 days of logs, totaling 1.29 million messages received.
It counts only actual accepted recipients so some of these messages may
have had some of their RCPT TOs rejected already (I suspect that this
is not a really big factor but I haven't looked).
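The tally itself is simple once the logs are reduced to one record per accepted recipient. A minimal sketch, assuming the logs have already been boiled down to `(message_id, recipient)` pairs (that reduction depends on the mailer's log format and is not shown; the sample data is made up):

```python
from collections import Counter


def recipient_distribution(accepted):
    """Map recipient count -> percentage of messages with that many recipients.

    accepted: iterable of (message_id, recipient) pairs, one per accepted
              RCPT TO (an assumed, pre-digested form of the logs).
    """
    per_message = Counter(msg_id for msg_id, _rcpt in accepted)
    sizes = Counter(per_message.values())    # recipient count -> message count
    total = sum(sizes.values())
    return {n: 100.0 * count / total for n, count in sorted(sizes.items())}


sample = [("m1", "a"), ("m2", "a"), ("m2", "b"), ("m3", "c"), ("m4", "d")]
dist = recipient_distribution(sample)
```

Because it only sees accepted recipients, this has the same blind spot mentioned above: rejected RCPT TOs never show up in the counts.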
The largest number of (accepted) recipients for a single message is 82 (one message). There is a similar handful of other messages with large recipient counts. Interestingly the largest 'large' message count is for 20 recipients (but it's still only 0.09% of all messages). There seems to be a hard break at 20 recipients; only 98 messages out of the 1.29 million had more recipients than that.
This has been interesting. Before I did these stats I would not have expected single-recipient messages to be so totally dominating (even though I'm familiar with things like VERP that strongly bias some traffic towards that). Possibly much more of our inbound email is mailing lists (including spam lists) than I expect.
Sidebar: detailed message counts for 7-20 recipients
This actually forms an interesting pattern so I'm going to give you the raw data:
| cnt | recipients |
| 1210 | 20 |
| 641 | 19 |
| 372 | 18 |
| 184 | 17 |
| 136 | 16 |
| 113 | 15 |
| 153 | 14 |
| 173 | 13 |
| 289 | 12 |
| 820 | 11 |
| 2081 | 10 |
| 1428 | 9 |
| 1568 | 8 |
| 1925 | 7 |
| 2156 | 6 |
(for 2-7 there is a steady dropoff.)
My guess is that a bunch of mailing list software really prefers to cut things at nice even (small) numbers of recipients.
2013-02-28
Looking at whether Zen-listed IPs keep trying to send us email
Here's a question: when an IP address listed in the Spamhaus Zen gets rejected, does it come back later or are most visits a one-time thing? This time I pulled 90 days worth of logs, extracted each day's rejections from Zen-listed IPs, and checked to see how many IPs showed up in more than one day's logs.
(Because an IP could be trying to deliver stuff right when the logs roll, the safe question is how many IPs show up in more than two days worth of logs.)
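The counting step can be sketched as follows, assuming each day's log has already been reduced to `(day, ip)` pairs for Zen rejections (the extraction is log-format specific and not shown). The `min_days=3` default is the 'more than two days' rule; the function name and sample data are illustrative only.

```python
from collections import defaultdict


def persistent_ips(rejections, min_days=3):
    """Return [(ip, days_seen), ...] for IPs rejected on at least min_days days.

    rejections: iterable of (day, ip) pairs; 'day' is any hashable label for
                one day's log file (an assumed, pre-digested input).
    """
    days_seen = defaultdict(set)
    for day, ip in rejections:
        days_seen[ip].add(day)       # repeat hits on one day count once
    counts = {ip: len(days) for ip, days in days_seen.items()
              if len(days) >= min_days}
    # Most persistent first; ties broken by IP string for stable output.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


sample = [(1, "192.0.2.1"), (1, "192.0.2.1"), (2, "192.0.2.1"),
          (3, "192.0.2.1"), (1, "192.0.2.2"), (2, "192.0.2.2")]
result = persistent_ips(sample)
```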
The first answer is that we have some persistent IPs but not anything that is really hammering on us. Well, at least if you look at the data this way. Here, have a table:
| 212.174.85.130 | 24 days | SBL107558 |
| 89.204.63.228 | 20 days | SBL168886 and the PBL |
| 189.112.34.215 | 18 days | SBL153384 |
| 82.165.159.34 | 15 days | web.de; SBL175032 |
| 82.165.159.35 | 13 days | web.de and SBL175032 again |
| 82.165.159.3 | 10 days | web.de but now SBL175030, which is basically the same as SBL175032; web.de is clearly good at getting SBL-listed. |
| 217.133.203.34 | 10 days | SBL157999 |
| 115.93.88.50 | 10 days | In the PBL |
| 82.165.159.2 | 9 days | web.de yet again, SBL175030 |
| 218.38.136.79 | 9 days | SBL146938 |
| 216.104.35.85, 216.104.35.86, 216.104.35.90 | 9 days | No longer listed. |
| 200.68.99.196 | 9 days | SBL CSS |
| 186.1.192.23 | 9 days | SBL172432 |
(This table probably doesn't look that nice in the syndication feed.)
Now things get interesting, because I noticed a pattern and went digging. All of the IPs from 216.104.35.83 through 216.104.35.94 got rejected by us at various times in the 90 days, and all of them were rejected on multiple days. Even more interesting, the rejections stretch from day 11 through day 90 (although not continuously).
(The gaps in rejections could be either because they stopped sending to email addresses that were rejecting them, because they dropped out of Zen temporarily, or both of the above.)
This prompted me to look at /24-based reoccurrence, and there things get more interesting:
| 173.242.121.0/24 | 46 days | One IP still in the SBL CSS |
| 198.64.159.0/24 | 45 days | 13 of 23 IPs still in the SBL CSS |
| 216.104.35.0/24 | 43 days | Nothing still listed out of the 12 IPs we rejected from this |
| 82.165.159.0/24 | 30 days | web.de, mentioned above; all four IPs still in their SBL listings |
| 177.47.102.0/24 | 27 days | SBL136747, a /24 listing dating from August 14, 2012 |
| 212.174.85.0/24 | 26 days | SBL107558; one of its IPs (212.174.85.130) made it into the single-IP list |
| 178.210.168.0/24 | 25 days | Multiple IPs still in the SBL CSS |
| 216.229.59.0/24 | 22 days | Multiple IPs still in the SBL CSS |
I'm going to stop here because the next '/24' is actually due to a single IP (89.204.63.228) so we're reaching the crossover point (besides, I'm doing this all more or less by hand).
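The /24 rollup can be done mechanically rather than by hand. A minimal sketch using Python's standard `ipaddress` module, again assuming the rejections are available as `(day, ip)` pairs (an assumed input shape, with made-up day labels):

```python
import ipaddress
from collections import defaultdict


def days_per_slash24(rejections):
    """Map each /24 (as a string) to the number of distinct days it was seen.

    rejections: iterable of (day, ip) pairs with IPv4 addresses as strings.
    """
    days_seen = defaultdict(set)
    for day, ip in rejections:
        # strict=False masks off the host bits, giving the enclosing /24.
        net = ipaddress.ip_network(f"{ip}/24", strict=False)
        days_seen[str(net)].add(day)
    return {net: len(days) for net, days in days_seen.items()}


sample = [(1, "216.104.35.85"), (2, "216.104.35.86"),
          (2, "216.104.35.90"), (1, "82.165.159.34")]
by_net = days_per_slash24(sample)
```

A /24 whose day count comes entirely from one address (as with 89.204.63.228) looks the same in this output as one with many active IPs, so the per-IP counts are still needed to tell the two apart.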
What really surprises me from looking at the by-/24 breakdown is how active the SBL CSS clearly is. If someone told me that the SBL CSS was now the largest single contributor for spam rejections, I wouldn't be surprised.
(I can't verify that without changing our mail configuration to add more logging (since SBL CSS listings expire, we'd have to capture the Zen results at the time of the actual rejection). Sadly my curiosity is not worth that.)
(This is kind of a followup to looking to see if IP addresses persist in Zen.)
Sidebar: a way in which these results may not be representative
We do Zen-based rejections only for some email addresses (only those that have opted in to it). So a Zen-listed sending IP wouldn't necessarily see continuous rejections if they kept sending to us. It depends on what email addresses they are sending to that day and they could have a day with no rejections.
I haven't tried to dig into the raw logs to see if this is happening for
some of these IPs, or in general if these IPs saw a mix of successful
deliveries and rejections or if they saw uniform rejections. I don't
know if I'll ever do this level of analysis, since it's going past what
I can easily bash together with shell scripts and awk. Past the land
of shell scripts lies the land of real work.
2013-02-26
Looking at whether (some) IP addresses persist in zen.spamhaus.org
After writing my entry on the shifting SBL I started to wonder how many IP addresses we reject for being SBL listed stop being SBL listed after a (moderate) while. I can't answer that directly, because we actually use the combined Zen Spamhaus list and we don't log the specific return codes, but I can answer a related question: how many Zen-listed IP addresses seem to stay in the Zen lists?
To check this, I pulled 10 days of records from January 18th through January 27th, extracted all of the distinct IPs that we found listed in zen.spamhaus.org, and re-queried Zen now to see how many of them are still there. Over that ten day period we had 613 Zen-listed IP addresses; today, 534 of them are still in the Zen. So a fairly decent number stay present for 30 days or more.
(Technically some of them could have disappeared and then reappeared.)
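For readers who haven't done a DNSBL lookup by hand: the query reverses the IPv4 octets and appends the zone name, and any `127.0.0.x` answer means 'listed'. A minimal sketch follows; the function names are made up for illustration, and treating every resolution failure as 'not listed' is a simplifying assumption (a real check should distinguish NXDOMAIN from resolver trouble).

```python
import ipaddress
import socket


def zen_query_name(ip, zone="zen.spamhaus.org"):
    """Build the DNS name to look up for an IPv4 address in a DNSBL zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")   # validates the input
    return ".".join(reversed(octets)) + "." + zone


def zen_return_codes(ip):
    """Return the set of A-record answers for ip, or an empty set if none.

    Assumption: any lookup failure is treated as 'not listed'.
    """
    try:
        _name, _aliases, addresses = socket.gethostbyname_ex(zen_query_name(ip))
    except (socket.gaierror, socket.herror):
        return set()
    return set(addresses)


name = zen_query_name("212.174.85.130")
```

Running `zen_return_codes()` over the saved list of IPs at a later date is the 're-query Zen now' step.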
I also pulled specific return codes for all of those IP addresses, so I can now give you a breakdown of why those 534 addresses are still present:
- 420 of them are in Spamhaus-maintained PBL data. There's no single really big source, but 46 of them are from Beltelecom in Belarus (AS6697) and 23 are from Chinanet (AS4134).
- 70 of them are in the XBL, specifically in the CBL.
- 56 are in the SBL. There's no really big source, but five IPs are from 177.47.102.0/24 aka SBL136747, four are from 5.135.106.0/27 aka SBL173923, and two are from 212.174.85.0/24 aka SBL107558.
  (Two of those SBL listings are depressingly old, not that I am really surprised by long-term SBL listings by this point.)
- 47 of them are in ISP-maintained PBL data.
- 9 of them are in the SBL CSS, which is pretty impressive and depressing because SBL CSS listings expire fairly fast.
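A breakdown like the one above comes from mapping Zen's return codes to the component lists. The sketch below uses the return-code assignments as I understand them from Spamhaus's published documentation; treat the table as an assumption to check against the current documentation before relying on it. The input shape (`codes_by_ip`) and the sample data are hypothetical.

```python
from collections import Counter

# Assumed mapping of zen.spamhaus.org return codes to component lists.
ZEN_LISTS = {
    "127.0.0.2": "SBL",
    "127.0.0.3": "SBL CSS",
    "127.0.0.4": "XBL",
    "127.0.0.5": "XBL",
    "127.0.0.6": "XBL",
    "127.0.0.7": "XBL",
    "127.0.0.10": "PBL (ISP-maintained)",
    "127.0.0.11": "PBL (Spamhaus-maintained)",
}


def zen_breakdown(codes_by_ip):
    """Count how many IPs fall under each component list.

    codes_by_ip: dict mapping an IP to the set of return codes it got.
    An IP with several codes is counted once under each list it is in.
    """
    counts = Counter()
    for codes in codes_by_ip.values():
        for label in {ZEN_LISTS.get(code, "other") for code in codes}:
            counts[label] += 1
    return counts


sample = {
    "192.0.2.1": {"127.0.0.11"},
    "192.0.2.2": {"127.0.0.2", "127.0.0.4"},
    "192.0.2.3": {"127.0.0.3"},
}
breakdown = zen_breakdown(sample)
```

Since a single IP can be in more than one component list at once, per-list counts from this kind of tally can add up to more than the number of IPs.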
An equally interesting question is how many of those 79 now-unlisted IPs are listed in some other DNS blocklist. The answer turns out to be a fair number; 60 are still listed on some DNS blocklist that I have in my program to check IPs against a big collection of DNSBls. Many but not all of the hits are for b.barracudacentral.org (which is not a DNSBl that I consider to be really high quality; it seems to be more of a hair-trigger lister).
(I'm out of touch with what's considered a high-quality DNSBl versus lower-quality ones so I'm not going to offer further reporting or opinions.)
2013-01-31
The shifting SBL, as experienced here
I still sort of run a mail server which gets a low enough connection volume that I can monitor the logs directly. This MTA rejects connections from SBL listed IPs, at a sufficiently low volume that I almost always look into the actual SBL listing (partly because I may want to apply my own blocks, including IP-level ones).
In the beginning, the volume of SBL hits was low but most of the actual SBL listings were for network ranges (not just single IPs) owned by what I privately characterized as 'the worst of the worst'. These were the people and organizations who spammed so many people so often that they finally convinced the SBL that they were very definitely dirty. Hits were rare partly because there never were really large numbers of these people, partly because I and other DNS blocklists blocked such people before the SBL, and perhaps partly because these people just didn't target me very often.
(I and a fair number of other people felt that the SBL was far too conservative and gave spammers way too many chances, but the SBL had its standards and that was it.)
I'm not sure when things started shifting, but this is not the pattern that I see today. The modern SBL experience is that most SBL hits are from single IPs that are listed as probably compromised or, to a lesser extent, from IPs that are on the SBL CSS. Hits from genuine SBL listed dirty blocks seem to be rare.
Out of curiosity I pulled eight days of records from the department's main mail gateway and looked through them for SBL rejections. Of the 80 IPs that (still) had SBL listings, the SBL CSS accounts for 35, 177.47.102.0/24's SBL136747 listing for four, and a random sampling of everything else shows single (compromised) IPs.
(Yesterday is a bit different. There are 27 IPs that are still SBL listed, with 21 of them on the SBL CSS. But two of the remaining were for bad netblocks and one IP was listed for spammer hosting. The other three were the usual single compromised machine pattern.)
I don't know what this means, if anything; I just find it interesting.
(I can come up with all sorts of potential theories but I will spare you all; they're generally obvious anyways. Just in case there's any doubt, I should note that I'm all for the SBL listing all sorts of spam sources and so I have no objection to the apparent new inclusion of compromised machines that are spewing advance fee fraud and phish spam and so on.)
2013-01-05
What I think changed to make spam deliveries not cost-free
As I covered in my entry on why stupid spamming is wasteful, I used to think that spam deliveries were basically free (and so spammers shotgunned everything because, well, why not) and now I feel otherwise. This is not just a shift of my view; I actually feel that the situation itself changed. Which raises the obvious question of what changed to do this.
My tentative answer is that spamming became commercialized, and specifically that it became a sophisticated business. As it did so, we saw it increasingly segment into subfields with specialists and services as people realized both that you could make money selling the specialized services and that it made more sense to buy the services than do the work yourself (or alternatively, the existence of buyable services drew people into spamming who previously would not have done so). In particular, one thing that happened is that people began to rent out and sell spam sending capacity in various forms; as the spam business became sophisticated, people could buy and sell so much time on so many compromised proxies or so many delivery attempts or the like. This put a value on sending capacity, even if it was your own organically developed sending capacity (since you could always make money by renting it out to other people instead of trying to send out your own spam).
I also think that sending may have gotten harder and more expensive (in terms of time and lost opportunities). Back in the early parts of the 00s, things were in a sense really bad; there were oceans of open proxies (and before them oceans of open relays), ISPs generally didn't care, anti-spam precautions were relatively undeveloped (even at large providers), and so on. Since then many things have shifted quite far. The open proxy problem has gotten much better on many fronts (ISP cooperation, effective DNS blocklists, etc), anti-spam precautions have gotten more sophisticated in ways that hinder rapid sending, and so on.
(One inobvious but important shift is that many mailers will now drop your SMTP connection if you try to do unauthorized pipelining. Back at the height of the open proxy era spam senders simply blasted an entire SMTP conversation at you in one go, ignoring return codes and speeding up their lives. Now that doesn't really work (and spammers have by and large stopped trying to do it as a result).)
2012-12-29
Why I think that stupid spamming is actively wasteful
In reaction to my last entry, a commentator wrote:
You assume it's more cost efficient for the spammer to fix his system rather than just have a slightly higher percentage of broken addresses in his list than otherwise. I'd guess the broken addresses cost the spammer virtually nothing in resources or time.
I used to feel this way, that spamming was basically free, but I've shifted my views over time. My current belief is that in today's Internet environment, sending spam to addresses is not so cheap that it's pointless to measure and I actually suspect that modern spammers are often email-rate-limited and so sending to bad addresses directly displaces email that could go to potentially good addresses.
First off, let's take an easy case, that of people exploiting webmail systems via compromised accounts (as happened with us). Whether the spammers are using 'mules' to enter things by hand or they're driving the webmail systems by automation, it seems extremely likely that the spammer will have a relatively low sending rate limit (either the mules can only type and click so fast, or the webmail server software can and will only respond so fast). Thus, every clearly bad email address emailed to is a possibly good email address not mailed to.
(I'm making what I feel is the safe assumption that spammers have basically an infinite supply of potentially good email addresses they could spam.)
But let's suppose that the spammer has no message submission problems; they can stuff the queue with as much email to as many addresses as they want. The next limitation is the sending mailer itself. Spammers very often use compromised machines with whatever MTA setup the machine already has, a setup that is extremely unlikely to be set up for high sending volumes. The MTA will likely only be able to do DNS lookups and route messages so fast and make so many simultaneous delivery attempts at once, either through software limits or through machine capacity limits. Here again, bad addresses clearly displace potentially good ones.
(It's not uncommon for me to connect to the SMTP port on a machine that's sending out spam and have it report a temporary failure because of resources exceeded.)
Finally we have the actual delivery. Ignoring greylisting, I've seen clear evidence that large mail providers pay attention to delivery volumes and especially delivery volumes to bad addresses. Even here we've periodically seen temporary SMTP failures from the likes of GMail with messages to the effect of 'slow down, you're trying to send us too much too fast'. Every address a spammer tries to send to at such providers is one more point in their internal scoring systems for 'this IP is probably sending spam', and probably even more so for bad addresses; again bad addresses are displacing potentially productive ones and pushing the sending IP that much closer to when the provider will choke it off. Greylisting has similar but smaller effects (since it won't necessarily choke off future potentially good email addresses, just delay things). The effects of all of this are going to be magnified if the spammer is hijacking a compromised machine with a normal MTA that's set up for normal mail volumes.
You can build very custom infrastructure that has no problems with all of this (although you're still going to run into issues with destinations choking you off for too much volume). But I don't think most spammers these days are using anything that sophisticated, so all of those spammers are very likely to be email-rate-limited in their spamming.
2012-12-28
A spammer that is not the brightest light in the box
I'm fond of saying that spammers are generally not stupid; they do what works and they're quite good at figuring out what that is. However, every so often a spammer comes along who quite clearly challenges or outright breaks this view.
Here's a snippet from a recent SMTP conversation that one of my machines logged:
remote from [208.86.167.19]
HELO postoffice.wieck.com
250 Hello postoffice.wieck.com
MAIL FROM:<REDACTED@wieck.com>
250 Ok (verified)
RCPT TO:<"d..."@REDACTED.org>
554 no such local user
What makes this stand out is the RCPT TO address. For those who've
never run into this, this (without the quotes) is how Google's Usenet
interface has presented poster email addresses for quite a while. Such
addresses are deliberately obfuscated and have never worked; we can
see how badly broken they are by the fact that they have to be quoted to
make them RFC-legal even as RCPT TO addresses. Any vaguely smart
spammer would not be dealing with these addresses.
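To see why such an address has to be quoted, here is a small sketch that checks whether a local part is a valid unquoted 'dot-atom' in the RFC 5321/5322 sense: runs of permitted characters separated by single dots, with no leading, trailing, or doubled dots. The function name is made up; the character class is the RFC's `atext` set.

```python
import re

# atext from RFC 5322: letters, digits, and these printable specials.
_ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"
# dot-atom: one or more atext runs joined by single dots.
_DOT_ATOM = re.compile(rf"{_ATEXT}+(?:\.{_ATEXT}+)*")


def needs_quoting(local_part):
    """True if local_part is not a valid dot-atom and so must be quoted."""
    return _DOT_ATOM.fullmatch(local_part) is None


obfuscated = needs_quoting("d...")        # trailing and doubled dots
ordinary = needs_quoting("john.smith")
```

A harvesting script could use a check like this to throw out Google-style truncated addresses before they ever reach a mailing list.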
Despite this, this spammer has wasted time and effort collecting these addresses and sending spam to them. This is a genuine waste; someone has carefully scraped and stored these addresses, someone else may have purchased them, and now someone is wasting resources attempting to deliver email to them (resources which could have been spent delivering spam to more viable addresses, ones that at least potentially could pay off). All of this is objectively stupid and worse, it's obviously so.
2012-11-17
Why Google's handling of multiple domains on inbound messages is okay
It started with a tweet by @xlerb (Jeb Davis):
Today I learned that Google thinks they can unilaterally redefine SMTP: <link> (warning: gratuitously shouty forum comments)
To summarize the link: Google is the MX target for all sorts of domains,
due to various services they offer. Google's MX servers will now only
accept destination addresses in a single domain per transaction; if you
try to RCPT TO to addresses at multiple (Google-hosted) domains in
the same transaction, all but the first domain will get 4xx temporary
failures.
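The behavior described can be sketched from the receiving side as below. This is only an illustration of the described policy, not Google's implementation; the specific `451` code and the reply text are placeholders (the description only says '4xx'), and the function name is hypothetical.

```python
def rcpt_responses(recipients):
    """Return (recipient, reply) pairs for the RCPT TOs of one transaction.

    The first recipient's domain is accepted; recipients in any other
    domain get a temporary failure so the sender retries them separately.
    """
    first_domain = None
    replies = []
    for rcpt in recipients:
        domain = rcpt.rsplit("@", 1)[-1].lower()
        if first_domain is None:
            first_domain = domain
        if domain == first_domain:
            replies.append((rcpt, "250 OK"))
        else:
            # Placeholder 4xx reply; the real code and wording are unknown.
            replies.append((rcpt, "451 try this domain in a separate transaction"))
    return replies


replies = rcpt_responses(["a@one.example", "b@two.example", "c@ONE.example"])
```

A sender that already groups recipients by domain before opening a transaction never sees the temporary failure at all.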
I'm not particularly fond of Google's handling of email, but as I tweeted I come down on Google's side here. First off, this can't possibly be called 'redefining SMTP'. Mail servers have always been allowed to temporarily defer some recipients for any reason whatsoever, including random software limitations, their own convenience, and obscure internal policies. Anyone who expects all recipients to always be accepted on the first delivery attempt has not been paying attention to the modern Internet mail environment for years; many, many systems behave otherwise (ours included). The only vaguely novel thing Google is doing is that they are being clear about why addresses are getting temporary failures.
(It would be redefining SMTP if Google was giving 5xx permanent failures in this case and telling everyone to fix their software to not do this, but not even Google is that stupid.)
Second, there is an excellent reason why Google might want to do this; it is my old friend the lack of partial success for message delivery. If different Google-hosted domains can have different policies on what message contents can be sent to them (perhaps Google allows the domain owners to control this), Google needs to make sure that it's never in a situation where some recipient domains would accept the message but others would refuse it. Giving each domain its own MX IP address (or set of them) is not exactly a scalable solution (not at Google's scale), so Google has to do it the other way; they can only ever accept a single domain per transaction, so only a single domain's policy will apply to the message contents.
Finally, my view is that any significant mailing list operation that's having problems about this is probably doing things wrong. For a start, any mailing list using VERP will not be affected by this, because with VERP each transaction has only a single recipient. And you should really, really be using VERP and automated bounce handling if you're running a mailing list of any appreciable size.
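For anyone unfamiliar with VERP, the idea is to encode each recipient into the envelope sender so that a bounce identifies the failing address by itself; that necessarily means one recipient per transaction. A minimal sketch using one common `prefix+user=domain@listhost` convention (the exact encoding varies between mailing list packages, and the names here are made up):

```python
def verp_encode(bounce_local, list_domain, recipient):
    """Build a per-recipient envelope sender address."""
    user, _, domain = recipient.rpartition("@")
    return f"{bounce_local}+{user}={domain}@{list_domain}"


def verp_decode(envelope_sender):
    """Recover the original recipient from a VERP envelope sender.

    Assumes bounce_local itself contains no '+' character.
    """
    local, _, _list_domain = envelope_sender.rpartition("@")
    _prefix, _, encoded = local.partition("+")
    user, _, domain = encoded.rpartition("=")
    return f"{user}@{domain}"


sender = verp_encode("users-bounces", "lists.example.org", "alice@example.com")
recovered = verp_decode(sender)
```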
(Note that in theory, ordinary people can run into this routinely; all you need to be doing is having a conversation with several people who are all hosted through Google but at different domains. Depending on how fast your mail system does retries, some of the people in the conversation may get messages much slower than others. In practice, who knows what special magic Google is doing.)
All of this is something that goes well beyond what Google is doing right now. Every mail server that wants to make accept/reject decisions based on both the message contents and the destination addresses (or domains) faces this issue, and there are no better solutions than what Google is doing. If you want to allow people or hosted domains to reject during SMTP (and you do), and you want to give them some control over what gets rejected, you're going to wind up doing the same thing.
(And you should not feel particularly broken up about it. Batching multiple addresses with different destination domains together into a single transaction when they all MX to the same thing is an optimization, not a fundamental feature of SMTP. It just happens to be a common optimization in mailers, partly because it's cool in a way that attracts programmers like bees to honey.)
2012-11-09
Some amusing cut and paste work from spammers
Recently I got a modest spate of advance fee fraud spam attempts with the interesting feature that they either claimed to be from 'Federal Bureau Of Investigation Seeking To Wiretap The Internet' or at least contained some variant of 'FBI seeking to wiretap the Internet' in addition to the agency name. Advance fee fraud come-on messages are almost never well written to start with but this text is relatively glaringly out of place, which is part of why it stood out and stuck in my mind. The messages have some other similarities but also their fair share of differences, so I'm not sure I can conclude that it's the work of a single group (I suspect that advance fee fraud spammers aggressively copy from each other's come-on messages). My records say that this is not a new thing; the oldest sample I could spot using a quick search pattern dates from the end of 2008.
(It looks like an Internet search for this phrase will turn up lots and lots of archived samples.)
What interests me is speculating on where this odd text comes from and what it implies about how spammers operate. In general, the 'seeking to wiretap' text is clearly out of place in the spam messages; there is no attempt to weave it into the come-on text and it's generally more or less positioned as part of the FBI's name. The obvious guess about what happened is that at some point an initial spammer was looking for the FBI's full name, did an Internet search, and wound up on a news story where this text was the main heading or the like instead of the FBI's own page. Operating without enough contextual knowledge they lifted the entire text, copied it into their spam, and it propagated from there. That the text continues to show up with some regularity suggests that it's become established in some mainline of advance fee fraud messages that lots of people copy from.
This is where I start thinking of similarities to evolutionary biology, where odd and unimportant features of a successful organism can sort of come along for the ride as it propagates. This bit of text feels like one of them; I doubt that it itself does anything to improve the spam's success rate, but it could well be part of a relatively successful initial advance fee fraud message that has been widely copied and imitated more or less wholesale since then. This is especially so because the text usually appears as an initial title block and I can certainly believe that those just get copied back and forth without anyone paying them much attention.
(While there are theories that advance fee fraud spammers deliberately make their come-on messages relatively extreme and obvious in order to hook only the most credulous, I don't believe that this text is being included deliberately as part of that filtering. To use the text as filtering seems more than a little bit too subtle and clever for both the spammers and the audience they are allegedly filtering for.)