Wandering Thoughts archives

2009-07-28

Spammers are quite dedicated in their address scraping

This is one of those entries that require some apparently irrelevant background.

The Atom syndication feed format requires that each entry have a unique identifier assigned to it (the atom:id element, to use XML jargon). This identifier must be a valid URI, which can be formed using any of a number of schemes (see here). DWiki (the software behind WanderingThoughts) initially used the full URL of entries as the Atom ID, because this required no additional configuration or per-entry metadata. However, this causes serious problems if you ever want to move your blog, so not too long ago I switched new entries to using tag: URIs.

While you can read all the gory details here, the simple version of tag: URIs is that they look like this (without spaces and quotes):

"tag:" authorityName "," date ":" path

The authorityName is normally a domain; however, the spec says that you can use '<id>@<domain>' as well. For reasons beyond the scope of this entry, I decided to use the second format for the tag: URIs here, with the authorityName being cspace@<domain>.

(In brief: the advantage of this format is that you don't have to invent a new subdomain for everything you host; you use one domain and have a unique identifier as the <id> bit.)
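To make the format concrete, here is a minimal sketch of assembling such a tag: URI. The function name, the example.org domain, the date, and the path are all made up for illustration; this is not DWiki's actual code.

```python
from datetime import date


def make_tag_uri(authority: str, when: date, specific: str) -> str:
    """Assemble "tag:" authorityName "," date ":" path as one string.

    The authority may be a plain domain or the '<id>@<domain>' form.
    """
    return f"tag:{authority},{when.isoformat()}:{specific}"


# Hypothetical values; the real domain and paths are not shown here.
uri = make_tag_uri("cspace@example.org", date(2009, 6, 1), "/blog/example")
print(uri)
```

The important property is that nothing in the result depends on the URL the entry is currently served from, so the ID survives moving the blog.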

You can see where this is going. A bit over a month after I started using this format for Atom IDs, I started seeing attempts to send email to 'cspace@<domain>'. These were rejected; there is no requirement that such authorityNames actually be email addresses, and the domain I used doesn't even accept email to start with.

After talking about this with some people, the general speculation is not that spammers are specifically scraping Atom feeds for tag: URIs with email addresses (which would show true dedication and craziness), but that they are mining syndication feeds for anything that looks even vaguely like an email address. This sort of makes sense, especially if you assume that they're using brute-force regexp-based scanners instead of making any attempt to understand syndication formats. Either way, it makes a good illustration of how spammers will scrape anything in sight that might contain an email address somewhere.
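The brute-force theory is easy to illustrate. The sketch below is a guess at what such a scanner might look like, not anything recovered from actual spammer tooling; the regexp, the feed fragment, and the example.org domain are all invented for the example.

```python
import re

# A deliberately loose "looks like an email address" pattern. This is an
# assumption about how a naive scraper might work, nothing more.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# A made-up Atom fragment using the '<id>@<domain>' form of tag: URI.
FEED_SNIPPET = """<entry>
  <title>Example entry</title>
  <id>tag:cspace@example.org,2009-06-01:/blog/example</id>
</entry>"""


def scrape_addresses(text):
    """Return everything in text that the loose pattern matches."""
    return EMAIL_RE.findall(text)


found = scrape_addresses(FEED_SNIPPET)
print(found)
```

Because the scanner never parses the XML, it has no idea that the match sits inside an atom:id element; the 'tag:' prefix and the ',date:path' suffix simply fall outside the pattern, leaving something that looks like a perfectly good address.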

DedicatedScraping written at 10:23:01

2009-07-20

Minimalistic spam, another annoyance to worry about

I've started getting advance fee fraud spam messages whose entire contents are something like this:

You won Three Million Pounds.contact Anita Meyer : <email address elided>

At first I was amused by the minimalism and lack of effort on the spammer's part; it'd be hard to get an advance fee fraud attempt into fewer words. But the more I think about it, the more I think this may be more clever than it looks (whether or not it's deliberate).

Modern anti-spam filters are quite good at analyzing text and detecting signs of spam. But tiny, minimal messages like this give them a problem (and indeed this one passed the spam filters with a low score), because there's almost no text for anti-spam tools to sink their teeth into. The less text there is for textual analysis, the more you're going to have to rely on some sort of meaning analysis, which has problems.

(I'm fairly convinced that there is a general trend towards giving anti-spam tools less text to work on. I've been seeing spam where the real payload was a PDF or .doc file for a while; I presume this is done because it (currently) hides the spam text from anti-spam content analysis.)

This text still has markers that could sort of be matched on, and a pure Bayesian approach would probably work well (since there are a number of words in there that probably don't normally appear in your email). But I'm not convinced that either will hold up in the long term; smarter spammers can eliminate the obvious markers, and there's probably a lot of room for rephrasing the message and using a less distinctive set of words.
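As a rough sketch of why word-based scoring works on this message but is fragile, here is a toy naive Bayes style scorer. Everything in it is invented for illustration: the four training messages, the tokenizer, the equal-priors assumption, and the rephrased message. Real filters are trained on far more mail and do much more than this.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


def train(messages):
    """Count how often each token appears across a set of messages."""
    counts = Counter()
    for msg in messages:
        counts.update(tokenize(msg))
    return counts


def spam_log_odds(text, spam_counts, ham_counts):
    """Sum per-token log-likelihood ratios with add-one smoothing.

    Positive means 'looks more like the spam corpus'; equal priors assumed.
    """
    vocab_size = len(set(spam_counts) | set(ham_counts))
    spam_total = sum(spam_counts.values())
    ham_total = sum(ham_counts.values())
    score = 0.0
    for tok in tokenize(text):
        p_spam = (spam_counts[tok] + 1) / (spam_total + vocab_size + 1)
        p_ham = (ham_counts[tok] + 1) / (ham_total + vocab_size + 1)
        score += math.log(p_spam / p_ham)
    return score


# Toy corpora, made up for this example.
SPAM = train([
    "You have won one million pounds, contact our claims agent",
    "Claim your lottery winnings, contact the agent today",
])
HAM = train([
    "The mail server upgrade is scheduled for Tuesday",
    "Can you review my patch to the feed generator today",
])

original = "You won Three Million Pounds.contact Anita Meyer"
rephrased = "A large sum awaits, please write to Anita"

print(spam_log_odds(original, SPAM, HAM))   # positive: several spammy words
print(spam_log_odds(rephrased, SPAM, HAM))  # not positive: nothing to match on
```

The original message scores as spam only because 'won', 'million', 'pounds', and 'contact' happen to be in the spam corpus. The rephrased version says much the same thing with words the scorer has never associated with spam, which is exactly the long-term worry.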

MinimalisticSpam written at 00:01:54

