Spammers are quite dedicated in their address scraping

July 28, 2009

This is one of those entries that require some apparently irrelevant background.

The Atom syndication feed format requires that each entry have a unique identifier assigned to it (the atom:id element, to use XML jargon). This identifier must be a valid URI, which can be formed using any of a number of URI schemes (see here). DWiki (the software behind WanderingThoughts) initially used the full URL of entries as the Atom ID, because this required no additional configuration or per-entry metadata. However, this causes serious problems if you ever want to move your blog, so not too long ago I switched new entries to using tag: URIs.

While you can read all the gory details in RFC 4151, the simple version of tag: URIs is that they look like this (without spaces and quotes):

"tag:" authorityName "," date ":" path

The authorityName is normally a domain; however, the spec says that you can use '<id>@<domain>' as well. For reasons beyond the scope of this entry, I decided to use the second format for the tag: URIs here, with the authorityName being cspace@<domain>.

(In brief: the advantage of this format is that you don't have to invent a new subdomain for everything you host; you use one domain and have a unique identifier as the <id> bit.)
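To make the format concrete, here is a minimal sketch of assembling such a tag: URI in Python. The function name, the example.org domain, the bare year used as the date, and the path are all made up for illustration; this is not how DWiki actually generates its IDs.

```python
def make_tag_uri(authority, date, path):
    """Join the three parts in the order given above:
    "tag:" authorityName "," date ":" path
    """
    return "tag:%s,%s:%s" % (authority, date, path)


# Hypothetical values; the real domain and paths are different.
example = make_tag_uri("cspace@example.org", "2009", "/blog/some/entry")
print(example)
```

The result is a single string with the '<id>@<domain>' authorityName sitting right after the "tag:" prefix.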

You can see where this is going. A bit over a month after I started using this format for Atom IDs, I started seeing attempts to send email to 'cspace@<domain>'. These were rejected; there is no requirement that such authorityNames actually be working email addresses, and the domain I used doesn't even accept email to start with.

After talking about this with some people, the general speculation is not that spammers are specifically scraping Atom feeds for tag: URIs with email addresses in them (which would show true dedication and craziness), but that they are mining syndication feeds for anything that looks even vaguely like an email address. This sort of makes sense, especially if you assume that they're using brute-force regexp-based scanners instead of making any attempt to understand syndication formats. Either way, it makes a good illustration of how spammers will scrape anything in sight that might have an email address somewhere in it.
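As an illustration of how easily this could happen, here is a rough sketch of the sort of brute-force scanner I have in mind. This is purely my guess at the approach, not anything recovered from actual spammer software; the regular expression is a deliberately crude one, and the feed fragment uses a made-up example.org domain and path.

```python
import re

# A crude "looks like an email address" pattern: some local-part characters,
# an '@', and then a dotted hostname. It knows nothing about Atom or XML.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")

# Hypothetical fragment of an Atom feed with a tag: URI as the entry ID.
feed_fragment = """<entry>
  <id>tag:cspace@example.org,2009:/blog/some/entry</id>
  <title>An entry</title>
</entry>"""


def scrape_addresses(text):
    """Return everything in text that matches the crude pattern."""
    return EMAIL_RE.findall(text)


found = scrape_addresses(feed_fragment)
print(found)
```

Because ':' and ',' are not characters the pattern accepts, the match starts right after "tag:" and stops at the comma, so the authorityName gets pulled out as if it were a real address.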

Written on 28 July 2009.

Last modified: Tue Jul 28 10:23:01 2009