Wandering Thoughts: Recent Entries For 2012

2012-04-27

The case of the Twitter spam I don't understand

It's probably not news to anyone on Twitter that Twitter has spammers (every popular service has spammers, it's a rule of nature). In fact Twitter has several forms of spam, mostly revolving around drawing your attention with @-mentions. Much of what these spammers are up to is pretty immediately obvious and thus uninteresting, which is the state of affairs I'm used to. With pretty much all forms of spam on all services, it's almost always pretty obvious what the spammer is up to and what benefit they hope to get out of their spamming.

But not always. Every so often I run into something that is clearly spammy, where the people involved are up to no good, but I don't understand what exactly they get out of their activities. On Twitter the spam I don't understand is certain sorts of follow-spamming, where accounts follow me without any attempt to message me or otherwise get my attention (some follow spam has relatively obvious purposes, for example to get me to look at the account's profile to see some advertising there). When I run into a situation like this, what it says to me is that I don't fully understand the service I'm using and its environment, and the spammers do. If spammers see some advantage to following my Twitter account without me ever following them back, then they understand Twitter better than I do; there's something about the situation that I'm missing.

(As I've said before, spammers are not stupid in the aggregate. If there are a bunch of spammers doing something, it is because it works; it achieves results that they want.)

The corollary to this is that if you run a service and you see spammers doing something mysterious on it that you don't understand, you probably have a problem. Unless you're absolutely sure that the spammer actions are having no effect at all on your service (ie the only thing they're doing is creating logfile entries in private logs), you should assume that the spammers have spotted something clever that you've missed.

(In my case I'm not sure I care enough about Twitter to go digging into what the follow spammers are up to. Note that Twitter is clearly aware that follow spamming is a potential problem, as I've noticed that I don't always get the email notices about Twitter accounts following mine.)

TwitterSpamIgnorance written at 00:17:57

2012-04-19

An interesting experience with IP-based SMTP blocks

As I've mentioned before, I still run a mailer on my office workstation. Since it gets almost no real email any more, I've become more and more aggressive about using kernel level IP-based blocks on my SMTP port and applying them to relatively large network areas when other bits of my anti-spam heuristics detect something they don't like from an IP address in the area (this follows a familiar pattern). I also reboot my workstation relatively frequently (Fedora releases a lot of kernel updates) and when I do this, all of the current blocks are re-established from scratch. This gives me an interesting way to assess how active various sources are; I can simply look at who bubbles up to the top of the packets-blocked counts.

Before I started paying attention to this recently, I expected the result to be roughly correlated with the size of the network area I was blocking. This may be generally true, but there are some sources that stand out as unusually active. In particular one source has been on top of my most packets dropped lists for quite a while now, and with remarkable consistency; I can reboot my machine and they show up to bang on the door again almost immediately.

(This is not a good sign for various reasons.)
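To make the "who bubbles up to the top" idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the setup described above: the real counts would come from the kernel's own per-rule packet counters, whereas this sketch tallies a plain list of source addresses against a list of blocked netblocks. The function name `rank_blocks` and the netblocks (RFC 5737 documentation ranges) are hypothetical.

```python
import ipaddress
from collections import Counter

def rank_blocks(blocked_cidrs, dropped_source_ips):
    """Count dropped packets per blocked netblock, busiest first."""
    networks = [ipaddress.ip_network(cidr) for cidr in blocked_cidrs]
    counts = Counter()
    for ip_text in dropped_source_ips:
        ip = ipaddress.ip_address(ip_text)
        for net in networks:
            if ip in net:
                counts[str(net)] += 1
                break  # charge each packet to the first matching block only
    return counts.most_common()

# Hypothetical blocks and packet sources, for illustration only.
blocks = ["192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24"]
sources = ["198.51.100.7", "192.0.2.1", "198.51.100.9", "198.51.100.7"]

ranking = rank_blocks(blocks, sources)
for block, packets in ranking:
    print(f"{packets:6d}  {block}")
```

Blocks that saw no packets simply do not appear in the ranking. Comparing a block's count with its size is what shows whether a source is unusually active for the amount of address space it covers.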

So today I would like to give, well, something to 81.92.112.0/20, a netblock assigned to one 'Emailvision'. According to their website, they are an 'Email & Social Marketing' firm; I have not looked for details, because there is a limit to how much I am willing to read from the website of anyone who calls themselves that. This is especially the case when the entire reason I know about them is that I have received unsolicited email from their address range.

On somewhat further investigation, it looks as if they are some sort of mailing list management firm that people use to send out bulk email of all sorts. Bulk email being bulk email, they attract spammers. Service providers being service providers, not taking these people's money (or noticing when they clearly have dirty lists) is unprofitable.

And so they remain the top source of rejected packets sent to my machine's SMTP port, as they have been for some time. I don't expect this to change any time soon.

(They do seem to send a certain amount of email to our regular mail system, from a variety of origin domains. On a casual inspection, our spam filtering system doesn't seem to consider it spam, which is what I would sort of expect in this situation.)

EmailvisionBlock written at 02:17:21

2012-03-28

Ultimately, abuse issues have to be handled by humans

Time and time again, people have tried to create entirely automated systems for detecting, identifying, and dealing with spam on their services. Time and time again, they've ultimately failed; their systems may stop a great deal of spam, but enough gets through despite it.

(Not infrequently the spam that gets through looks, from the outside, as if it should be trivial to recognize. I think there is a deep reason for this, which we'll get to.)

There are a shallow reason and a deep reason for this failure. The shallow reason is that humans (and spammers are humans) will relentlessly game any set of automated rules until they find weaknesses, and then drive as many trucks as possible through whatever weaknesses they've found. If your service is at all popular, there will be far more smart spammers trying to game the automation than there are smart people writing the automation, placing your automation writers in an arms race they almost certainly cannot win. The deep reason is that you are guaranteed to have weaknesses, because it's essentially impossible to make automated rules as smart as they need to be; the fundamental problem of spam is stopping bad content while letting good content through, whatever 'bad' and 'good' are, and deciding that is one reason you need people.

(As for why spam that gets through automated systems often looks obvious to people, it's because there's no reason for spammers to add variety once they've gotten past the automated systems. In fact they can be blindingly obvious so long as they evade the automation.)

All of this means that places really do need humans to handle their abuse issues; automation can help by getting obvious things, but it will never entirely replace humans paying attention. The corollary is that places need not just some people but enough people for the volume of abuse they get. This is an extremely unpopular view since abuse is a cost center and everyone loves the idea of automating your cost centers to make them go away, but by this point we have plenty of experience that this just doesn't work for abuse.

(A further corollary is that anyone who relies on automation instead of staffing up their abuse department to adequate levels is not actually serious about spam, regardless of what they say. They may not be actively for spam and spammers on their service, but to use the fine George Orwell phrase they are objectively pro-spam. Application to various Silicon Valley firms is left as an exercise for the reader.)

HumanAbuseHandling written at 00:52:53

2012-03-11

A CBL false positive reveals a significant issue with the CBL

We were notified today that one of our IPs, 128.100.1.90, had been listed on the CBL (and thus had been pulled in by Spamhaus in their XBL and Zen DNSBLs). There's only one problem with this: there's no machine at that IP address and never has been, and even if there was such a machine it would not have been allowed to do any external traffic by our firewall.

(This subnet is only present on a couple of switches in our machine room and is not exposed outside of it; it's not even carried on our general inside-department backbone.)

However, there is a long standing issue where some people out there in the world are using addresses in 128.100.0.* and 128.100.1.* on their internal networks. These addresses leak into Received: headers and provoke spam complaints when these companies are exploited to send spam. Now they apparently also cause CBL listings.

(Back when I first saw this it was primarily from machines in Europe, but this time it appears to be a bad machine and organization in Brazil.)

Unfortunately, this is very bad. The only way for the CBL to pick up these IP addresses is for CBL feeders to parse the Received: headers in the mail they receive. Let me repeat that: the CBL is listing IP addresses based on parsing Received: headers from untrusted third party machines. And demonstrably this parsing can be and has been fooled into false positives, listing machines that are not spam sources.
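To illustrate why this matters, here is a minimal Python sketch of pulling IP addresses out of Received: headers. It does not show how the CBL works (that is not known here). It only shows that the sole header a receiver can trust is the one its own server added, because every older header was written by some other machine. The message, hostnames, and bracketed-address header layout are assumptions made up for the example.

```python
import re
from email import message_from_string

# Matches a bracketed IPv4 literal such as [203.0.113.25].
IPV4 = re.compile(r"\[(\d{1,3}(?:\.\d{1,3}){3})\]")

def received_ips(raw_message):
    """Return one list of bracketed IPv4 addresses per Received: header, newest first."""
    msg = message_from_string(raw_message)
    return [IPV4.findall(hdr) for hdr in msg.get_all("Received", [])]

# Hypothetical message: the top header is the one our own server added.
raw = (
    "Received: from relay.example.net (relay.example.net [203.0.113.25])\n"
    "\tby mx.example.org; Sun, 11 Mar 2012 00:00:00 +0000\n"
    "Received: from pc (unknown [128.100.1.90])\n"
    "\tby relay.example.net; Sun, 11 Mar 2012 00:00:00 +0000\n"
    "From: someone@example.net\n"
    "Subject: test\n"
    "\n"
    "body\n"
)

hops = received_ips(raw)
trusted = hops[0]     # written by our own server: the peer it really saw
untrusted = hops[1:]  # written by other machines: may be wrong or forged
print("trusted:", trusted)
print("untrusted:", untrusted)
```

In this example the address that would get blamed, 128.100.1.90, appears only in a header written by the remote relay, describing its own internal network. Nothing the receiving server observed directly ties that address to the message.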

What we are seeing here is only one demonstration of what can go horribly wrong when you do this. As far as I am concerned, this significantly lowers the trustworthiness of CBL results. It used to be that I could trust that everything in the CBL was listed because CBL honeypots had direct experience with bad behavior from that IP. Now it is clear that for some or perhaps many listed IPs, the CBL has at best indirect 'evidence', evidence that can easily be wrong. Probably the CBL is still mostly correct and this sort of thing is rare, but I had previously thought that this sort of false positive was actively impossible in the CBL.

CBLFalsePositiveProblem written at 00:41:23

2012-02-26

How much spam is forged as being from who it's sent to?

After doing the stats for the most popular sender domains for spam and discovering that the most popular thing was to use our domains, I was left with a very related question: how much spam is forged to come from the victim themselves?

As near as I can tell, the answer is that almost all of the spam that's forged as from our domains is in fact forged as coming from the victim themselves (or, for multi-recipient messages, as coming from the first recipient). Based on our current set of 45 days of logfiles, that's about 8.3% of all messages that got spam-tagged. I suppose that this makes sense; after all, there's no need to take the risk of making up addresses on the remote system when you already have some, ie the ones you're sending spam to.
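The check itself is simple enough to sketch in a few lines of Python. This is a minimal illustration, not the script behind the numbers above: the function name, the sample addresses, and the choice to compare addresses case-insensitively are all assumptions.

```python
def forged_from_self(sender, recipients):
    """True if the sender address equals the first recipient.

    Assumption: addresses are compared case-insensitively, which is
    not strictly guaranteed for the local part of an address.
    """
    if not sender or not recipients:
        return False
    return sender.strip().lower() == recipients[0].strip().lower()

# Hypothetical (sender, recipients) pairs pulled from spam-tagged log entries.
messages = [
    ("user@example.org", ["user@example.org"]),
    ("other@example.com", ["user@example.org"]),
    ("a@example.org", ["a@example.org", "b@example.org"]),
    ("b@example.org", ["a@example.org", "b@example.org"]),
]

matches = sum(forged_from_self(s, r) for s, r in messages)
share = 100.0 * matches / len(messages)
print(f"{matches} of {len(messages)} messages ({share:.1f}%) are forged from self")
```

The last sample message is sent as from the second recipient, so it does not count; only the first recipient is checked, matching the multi-recipient pattern described above.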

(As before, I checked only high-rated spam.)

The obvious corollary question to ask is how many non-spam messages match this criterion. The answer appears to be that almost none do, which is not really surprising. Given ad-hoc mailing lists and the like, it's possible for legitimate email to loop around in this way or for people to copy themselves when they're sending email through an outside SMTP server, but it's probably not going to be very common in most user populations.

For a while, I've believed that spammers like forging system addresses, especially postmaster. This turns out to be wrong; vanishingly little (high-scoring) spam is sent as from anyone's postmaster, and none is forged as from our postmaster address. Virus spammers may do that, but viruses are still very rare in our mail stream. I admit that this surprises me.

(Working with the logfiles for our spam filtering and tagging system has shown me that I need a specialized matching and extracting program that works with log lines of the form 'key=value key=value key=value ...', especially with some keys repeated several times. Awk is not a really good fit for these files. Creative use of tr can help when I only want a single field, but things fall down when I want several.)
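As a rough idea of what such a matching and extracting program could look like, here is a minimal Python sketch. It assumes that values never contain whitespace and that tokens without an '=' can be ignored; the real log format is not shown here, so the key names (`from`, `rcpt`, `score`) are hypothetical.

```python
from collections import defaultdict

def parse_kv_line(line):
    """Parse 'key=value key=value ...' into {key: [values...]}, keeping repeated keys in order."""
    fields = defaultdict(list)
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep:  # skip tokens that have no '=' at all
            fields[key].append(value)
    return dict(fields)

def extract(line, *keys):
    """Return the list of values for each requested key (empty list if absent)."""
    fields = parse_kv_line(line)
    return [fields.get(key, []) for key in keys]

# Hypothetical log line with a repeated key.
line = "from=a@example.com rcpt=b@example.org rcpt=c@example.org score=97"
sender, rcpts = extract(line, "from", "rcpt")
print(sender, rcpts)
```

Keeping every value in a list, even for keys that normally appear once, means repeated keys need no special case, and several fields can be extracted from a line in a single pass.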

ForgedFromSelf-2012-02-26 written at 01:45:49

2012-02-18

The most popular sender domains for spam messages sent to here

Every so often I get curious about crazy spam-related statistics. Today's curiosity started out as a simple question: given that spammers generally forge the original addresses on their messages, do they like picking on some domains or do they distribute them randomly around? As it happens, identifying messages that have forged senders is a little bit too much work for a blog entry, so I am answering the closely related question of what are the most popular domains to appear as the sending domain on spam.

My data comes from the last 45 days of our spam tagging and filtering system. This system assigns messages a spam score; based on the analysis of the score distributions from back here, I decided to look only at messages that scored between 90 and 100 points. Over the past 45 days it turns out that there were just over 300,000 such messages.

The top sender domains for these messages break down as follows:

Sender domain               Messages
our own domains             27200+
yahoo.com                   27000
yahoo.co.jp                 17800
gmail.com                   14000
bbb.org                      7200
nacha.org                    6500
ymail.com                    6300
returns.groups.yahoo.com     4600
advertise-bz.cn              3500

In terms of top level domains, it shouldn't surprise anyone that .com is by far the most forged, followed by .jp, .net, .org, and then .cn.
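A tally like this is straightforward to sketch in Python. What follows is a minimal illustration, not the script used for the numbers above; it assumes the sender addresses have already been extracted from the logs for messages in the chosen score range, and the function names and sample addresses are hypothetical.

```python
from collections import Counter

def domain_of(address):
    """Return the lowercased part of an address after the last '@'."""
    return address.rpartition("@")[2].lower()

def top_domains(senders, n=10):
    """Most common sender domains, as (domain, count) pairs."""
    return Counter(domain_of(a) for a in senders).most_common(n)

def top_tlds(senders, n=5):
    """Most common top level domains, as (tld, count) pairs."""
    return Counter(domain_of(a).rpartition(".")[2] for a in senders).most_common(n)

# Hypothetical sender addresses from high-scoring messages.
senders = ["a@yahoo.com", "b@Yahoo.com", "c@gmail.com", "d@yahoo.co.jp"]

for domain, count in top_domains(senders):
    print(f"{domain:28s}{count:6d}")
for tld, count in top_tlds(senders):
    print(f".{tld:27s}{count:6d}")
```

Dividing a domain's count by the total number of messages gives its share; for the figures above, 27200 out of roughly 300,000 is about 9%.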

Before I did these numbers, I probably wouldn't have predicted that forging valid users on our own domains was so popular (it's almost 10% of the total high-scoring spam messages). This probably explains why my earlier rejection stats showed that we had a surprisingly high rate of sender addresses that were nonexistent local users.

Based on spot checking the distribution of origin IP addresses for these domains, most of them really are mostly forged. Unfortunately, the standout exception is Yahoo Groups; almost all of those messages really do come from Yahoo's mail servers. It appears that spammers have probably infested Yahoo Groups, much like they seem to have done on Google Groups.

The other exception is advertise-bz.cn. Messages claiming to be from it appear to be emitted from only a narrow set of IP address ranges in China. I spot-checked the destination addresses here and they don't appear to just be repeatedly spamming only a few unlucky people. Some investigation shows that this is actually a ROKSO-listed spammer with several SBL listings; given the SBL listings, this spam source is also having some amount of their email rejected outright at SMTP time.

MostAbusedDomains-2012-02-18 written at 23:57:57

2012-01-29

Thinking about spam rejection and abuse addresses

Somewhat recently we got a spate of spam messages to our abuse address, which set me to thinking about the mostly theoretical issue of how to treat email to it.

(It's a mostly theoretical issue for us because the volume of spam and other email to our abuse address is very low in general, so we're not at all likely to change anything about it.)

On the one hand, visible spam rejection of email to abuse addresses is one of the things that really gets on people's nerves; it's famous for rejecting real spam complaints because, of course, they contain spam. Your spam, that people are trying to complain about.

On the other hand, email to abuse is going to go through our spam scoring system and get tagged if the system thinks it's spam. Pretty much everyone here either discards spam-tagged email outright or filters it to a separate folder. My mail filtering deliberately excludes email to abuse (among a few other things), but I don't know if anyone else either bothered or even thought of it; it's not necessarily something that comes to mind when you're setting up personal email filtering.

And finally, I can't think of any actual real email to our abuse address that we've gotten in the last five years or so (since I moved here). It's all been spam. So as a practical matter, any filtering or rejection that we do on abuse email is unlikely to affect real complaints, because we don't get real complaints (hopefully because our users and machines don't generate spam, as opposed to people just not complaining about it).

(The other aspect of email to our abuse address is that I suspect most people are going to complain to the central university-wide abuse address instead of abuse at our specific subdomain. The central people will then get in touch with us through our internal contact address, not our abuse address.)

This is of course a specific instance of the general spam rejection versus spam filtering dilemma. If you reject email, people at least know; if you filter, there's at least a theoretical chance that you'll recover from filtering mistakes. The stakes are higher for the abuse address because it is one of the addresses that has a very high chance of false positives (non-spam classified as spam).

The most pragmatic thing to do in a situation like this is to apply spam-filtering to your abuse address. This blackholes real spam to keep it from bothering people while carefully not saying anything to real senders who had their messages misclassified. But this pragmatism sort of bothers me because it's lying to real senders just to pacify them (their email is being ignored either way but you're deliberately doing it silently so they don't know). It would be more honest to use spam rejection on the abuse address, and it might do some good to reduce the level of spam. If legitimate email to your abuse address really is vanishingly rare, it also shouldn't affect very many people.

So what's the right answer? I have no idea.

(My current approach of exempting the abuse address from my personal filtering would not be viable if it got a lot of spam. At that point I would probably remove the exemption and let spam-tagged email to the abuse address get quietly filtered away, mostly because it's easier than trying to persuade everyone that maybe we should do spam rejection for email to abuse.)

AbuseRejection written at 02:24:38

2012-01-08

The latest annoyance with Google Groups

The Google Groups spam attempts continue to roll in. This recently led me to yet another unpleasant discovery about Google Groups: as far as I can tell, there is no way to unsubscribe from a Google Groups mailing list (at least through the website). At least there's no way from the outside; it might be possible if you made a Google Groups account for yourself under the email address that is being spammed. For all of the obvious reasons I have no interest in doing that.

At this point I really don't know whether Google is evil or merely indifferent, and it doesn't really matter which. Providing people with no way to unsubscribe is yet another total failure of anything approaching responsible mailing list management. It's also a complete spammer (non-)feature; spammers never bother to implement unsubscription because of course they have no interest in ever seeing addresses go away.

Sidebar: how the spamming appears to work

I did some Groups searching using message IDs that I had, and it appears that the spammer is slightly sophisticated in their use of Groups. They have a 'distribution' mailing list, which is what I and more than 20,000 other people are on, and then they have a series of small, low-activity 'feeder' lists, which the main list is subscribed to. They send the spam to today's feeder list, the feeder list passes it to the main list, and then the main list spams us. I suspect that the feeder lists are all owned by different Google Groups identities than the main list.

This is an obvious exploit of automated anti-spam systems. From the perspective of a dumb system, clearly the problem is today's feeder list; it must have been set up by a spammer who is exploiting a legitimate main list. So eventually the system flags or does whatever to the feeder list but leaves the (apparently) innocently exploited main list alone, which just causes the spammer to make a new feeder list. This may seem stupid, but you can see why doing this the other way around would allow spammers to exploit an automated system to close down mailing lists that they don't like. Of course the real problem here is the automated abuse handling system, because you can't handle abuse reports entirely with automation.
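The pattern becomes visible once you look across several reports instead of one at a time. The following minimal Python sketch counts how many reported spam messages passed through each list. It is purely illustrative: the list names and delivery paths are hypothetical, and nothing here reflects how Google Groups actually records or handles abuse reports.

```python
from collections import Counter

def lists_by_spam_reports(delivery_paths):
    """Count how many reported spam messages passed through each list."""
    counts = Counter()
    for path in delivery_paths:
        counts.update(set(path))  # count each list at most once per message
    return counts.most_common()

# Hypothetical delivery paths for three days of reported spam:
# each day uses a fresh feeder list, but always the same main list.
reports = [
    ["feeder-mon", "main-list"],
    ["feeder-tue", "main-list"],
    ["feeder-wed", "main-list"],
]

ranking = lists_by_spam_reports(reports)
for name, count in ranking:
    print(f"{count:3d}  {name}")
```

The main list is the only one that shows up in every report. The count alone cannot say whether it is an innocent list being exploited or the spammer's own distribution list, which is exactly the judgment that still needs a person.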

GoogleGroupsNoUnsub written at 01:29:39

This is part of CSpace, and is written by ChrisSiebenmann.
