How to have your web spider irritate me intensely (part 2)

January 29, 2007

In the spirit of previous cleverness, here's a simple new technique:

Have your web spider make up random Referer headers.

This wasn't Referer spamming, since the Referer headers held what looked like completely random URLs, apparently drawn from legitimate sites around the Internet (and often repeated). Nor did those sites actually link to us, or have any relationship to the URLs that were being crawled.

Even in low volume this is a sure-fire ticket to our kernel-level IP filters, since it ensures that we're mostly unable to get anything useful from our Referer logs without a lot of additional work, and is therefore deeply irritating.
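To give a flavour of the 'additional work' involved, here is a minimal sketch of one way to spot this behaviour after the fact: count how many distinct outside Referer hosts each client IP has claimed. It assumes the logs are in Apache's Combined Log Format; the function names, the `own_host` parameter, and the `min_hosts` threshold are all made up for illustration, and the heuristic is rough (a busy proxy could trip it too).

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit

# Assumed format (Combined Log Format):
# ip ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "[^"]*" \d{3} \S+ '
    r'"(?P<referer>[^"]*)" "[^"]*"'
)


def referer_hosts_by_ip(lines, own_host):
    """Map each client IP to the set of outside hosts it sent as Referer."""
    hosts = defaultdict(set)
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # not a line in the assumed format
        referer = match.group("referer")
        if referer in ("", "-"):
            continue  # no Referer header was sent
        try:
            host = urlsplit(referer).hostname
        except ValueError:
            continue  # malformed URL in the header
        # Referers from our own site are normal internal navigation.
        if host and host != own_host:
            hosts[match.group("ip")].add(host)
    return hosts


def suspicious_ips(lines, own_host, min_hosts=3):
    """Return IPs that claimed at least min_hosts distinct outside Referer hosts.

    min_hosts is an arbitrary illustrative threshold, not a tuned value.
    """
    by_ip = referer_hosts_by_ip(lines, own_host)
    return sorted(ip for ip, seen in by_ip.items() if len(seen) >= min_hosts)
```

Something like `suspicious_ips(open("access.log"), "www.example.org")` would then list candidate IPs to look at by hand. It does nothing to repair the Referer logs themselves, which is rather the point of the complaint.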

Today's offender is the IP address 212.52.80.101, which is an unnamed iol.it IP address; it is using a User-Agent value of 'Mozilla/5.0 (arianna.libero.it,ariannaadm@pisa.iol.it)'. It does seem to have requested robots.txt, but of course the User-Agent string gives no clue as to which User-agent line in robots.txt will turn it off. Ironically, it appears to respect nofollow, unlike many otherwise better-behaved web spiders.
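The robots.txt problem can be sketched with Python's standard `urllib.robotparser`: a Disallow block only bites if its User-agent line names the token the spider actually looks for. The tokens `arianna` and `libero` below are pure guesses taken from the User-Agent string, not anything the spider is known to honour, and the matching shown is Python's parser's, which may differ from whatever this spider does.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that tries to shut out one guessed token.
# "arianna" is a guess; the spider's real token is unknown.
ROBOTS_TXT = """\
User-agent: arianna
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_fetch(token, path="/blog/"):
    """Report whether a crawler identifying as `token` may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(token, path)
```

Here `may_fetch("arianna")` is refused, while `may_fetch("libero")` falls through to the `*` block and is allowed. If the spider happens to call itself anything other than what you guessed, your carefully written block is decoration.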

Written on 29 January 2007.

Last modified: Mon Jan 29 12:56:32 2007