Wandering Thoughts archives

2006-04-03

An ugly spam attempt

Every so often I take a look at what user agents are visiting WanderingThoughts. Tonight that turned up a doozy, a single visit with a User-Agent of:

<script>window.open('<URL>')</script>

Presumably the intended attack vector was sites that summarize User-Agent traffic onto a web page without escaping the text; on such a page, this User-Agent string would become live JavaScript that makes the browser of anyone viewing the page open the spammer's URL.
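As a minimal sketch of the defense, here is how a log summarizer written in Python might escape a User-Agent before putting it on a page. The function name render_user_agent_row and the example.invalid URL are made up for illustration; html.escape is the standard library call doing the real work.

```python
import html


def render_user_agent_row(user_agent: str, hits: int) -> str:
    """Return one HTML table row for a (hypothetical) User-Agent summary page."""
    # html.escape turns <, >, & and quotes into character references, so
    # whatever the client sent is displayed as text instead of parsed as markup.
    safe_agent = html.escape(user_agent, quote=True)
    return f"<tr><td>{safe_agent}</td><td>{hits}</td></tr>"


# A stand-in for the hostile header; the URL is a placeholder, not the real one.
hostile = "<script>window.open('http://example.invalid/')</script>"
row = render_user_agent_row(hostile, 1)
print(row)
```

With the escaping in place the script tag shows up on the page as literal text; without it, the browser would run it.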

The attack is also noteworthy for how brazen it is. The URL in the request is for 'buy4cheap.brinkster.net/buy2/side-search.htm', and the request itself came from 65.182.100.121, aka 'orf-premium12a.brinkster.com'. Most spammers are far less willing to clearly sign their work like that.

(I vacillated between calling this 'clever' and 'ugly'; I am going with 'ugly' because I don't like the implications of what these people are doing, and attempting to inject JavaScript is not a sign of angels.)

web/UglyWebSpammer written at 03:36:39

Spiders should respect rel="nofollow"

If you're writing a web spider, what should you do when you see a link marked rel="nofollow"?

In theory, you don't have to treat it any differently from any other link. nofollow is not a formal specification, and the original description only talks about the resulting link not giving the target any credit.

In practice, on the Internet what people expect is in large part defined by what the 800-pound gorillas do. And both Google and MSN Search consider nofollow to be literal: don't follow this link. In fact Google explicitly documents this behavior; see one of the original postings on nofollow, or Google's description of it in their help pages.

So the real answer is: if you see a rel="nofollow" link, you shouldn't crawl the target.

Since Google (the original creator of nofollow) describes it this way, I will go so far as to say that respecting nofollow requires you to not crawl marked links.

Spider authors should do this not just because it's what people expect, but because it's genuinely useful for guiding spiders around web sites. (Especially dynamic web sites like wikis and blogs, which can have a lot of different ways of viewing more or less the same content.)
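For spider authors, a rough sketch of what this could look like in Python, using only the standard library. The names LinkExtractor and links_to_crawl and the sample markup are invented for illustration; a real spider would also want to deal with robots.txt, duplicate URLs and so on.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect <a href> targets, keeping rel="nofollow" links out of the crawl list."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.crawlable = []  # links the spider may fetch
        self.skipped = []    # links marked nofollow, recorded but never fetched

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        href = attr_map.get("href")
        if not href:
            return
        # rel is a space-separated token list and is matched case-insensitively,
        # so rel="external NoFollow" counts as nofollow too.
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        target = urljoin(self.base_url, href)
        if "nofollow" in rel_tokens:
            self.skipped.append(target)
        else:
            self.crawlable.append(target)


def links_to_crawl(page_html, base_url):
    """Return the absolute URLs on a page that are not marked nofollow."""
    parser = LinkExtractor(base_url)
    parser.feed(page_html)
    parser.close()
    return parser.crawlable


# Made-up markup: one ordinary link and two nofollow links.
sample = (
    '<a href="/blog/">Blog</a> '
    '<a rel="nofollow" href="/blog/?atom">Feed view</a> '
    '<a rel="external NoFollow" href="http://example.invalid/x">Elsewhere</a>'
)
print(links_to_crawl(sample, "http://example.invalid/"))
```

Only the ordinary link comes back; the alternate view and the external link are left alone, which is exactly the steering that site authors are counting on.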

web/RespectTheNofollow written at 02:54:03


