PlanetLab hammers on robots.txt

May 15, 2006

The PlanetLab consortium is, to quote its banner, 'an open platform for developing, deploying, and accessing planetary-scale services'. As a friend noticed, today's planetary-scale service appears to be repeatedly requesting robots.txt from people's webservers.

Here, they've made 523 requests (so far) from 323 different IP addresses (PlanetLab nodes are hosted around the Internet, mostly at universities; they usually have 'planetlab' or 'pl' or the like in their hostnames). The first request arrived at 03:56:11 (Eastern) on May 14th, and they're still rolling in. So far, they haven't requested anything besides robots.txt.
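
For anyone who wants to check their own logs for the same thing, a tally like the one above can be pulled out with a few lines of Python. This is only a rough sketch, not how the numbers here were produced. It assumes an Apache 'combined' format access log (this entry doesn't say what format the logs are in), and the names LOG_LINE, SUSPECT_AGENT, and summarize_robots_requests are invented for illustration. The User-Agent it matches is the one these requests send.

```python
import re
from collections import Counter

# Assumption: Apache "combined" log format, e.g.
#   IP - - [date] "GET /path HTTP/1.0" 200 1234 "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<when>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# The User-Agent string sent with every one of these requests.
SUSPECT_AGENT = (
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) "
    "Gecko/20050720 Fedora/1.0.6-1.1.fc3 Firefox/1.0.6"
)


def summarize_robots_requests(lines, agent=SUSPECT_AGENT):
    """Count /robots.txt requests made with the given User-Agent.

    Returns (total requests, number of distinct IPs, per-IP Counter).
    Lines that do not look like combined-format entries are skipped.
    """
    per_ip = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m is None:
            continue
        if m.group("path") == "/robots.txt" and m.group("agent") == agent:
            per_ip[m.group("ip")] += 1
    return sum(per_ip.values()), len(per_ip), per_ip
```

To use it, pass in any iterable of log lines, such as an open file object for the access log. The per-IP Counter makes it easy to see which hosts are the most persistent, and a reverse DNS lookup on those addresses is the natural next step for spotting 'planetlab' hostnames.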

All of the requests have had the User-Agent string:

Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050720 Fedora/1.0.6-1.1.fc3 Firefox/1.0.6

This User-Agent string is a complete lie, which is one of the things that angers me about this situation. The minimum standard for acceptable web spider behavior is to clearly label yourself; pretending that you are an ordinary browser is an automatic sign of evil. If PlanetLab had a single netblock, it would now be in our port 80 IP filters.

Apparently the PlanetLab project responsible for this abuse is called umd_sidecar, and has already been reported to the PlanetLab administration by people who have had better luck navigating their search interfaces than I have. (It looks like the magic is to ask for advanced search and then specify that you want TCP as the protocol.)

Written on 15 May 2006.
Last modified: Mon May 15 01:18:50 2006