Wandering Thoughts archives


PlanetLab hammers on robots.txt

The PlanetLab consortium is, to quote its banner, 'an open platform for developing, deploying, and accessing planetary-scale services'. Courtesy of a friend noticing, today's planetary-scale service appears to be repeatedly requesting robots.txt from people's webservers.

Here, they've made 523 requests (so far) from 323 different IP addresses (PlanetLab nodes are hosted around the Internet, mostly at universities; they usually have 'planetlab' or 'pl' or the like in their hostnames). The first request arrived at 03:56:11 (Eastern) on May 14th, and they're still rolling in. So far, they haven't requested anything besides robots.txt.
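Tallying this sort of thing from an access log is straightforward. Here is a minimal sketch, assuming Apache-style common log format lines (the field layout and the sample entries are illustrative assumptions, not our actual logs):

```python
import re
from collections import Counter

# Assumed Apache common-log-format layout: IP, identd, user, [timestamp],
# then the quoted request line. Only GETs are matched here.
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "GET (\S+) [^"]*"')

def robots_txt_hits(lines):
    """Count /robots.txt requests per client IP address."""
    hits = Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if m and m.group(2) == "/robots.txt":
            hits[m.group(1)] += 1
    return hits

# Hypothetical sample entries for illustration.
sample = [
    '192.0.2.10 - - [14/May/2006:03:56:11 -0400] "GET /robots.txt HTTP/1.0" 200 123',
    '192.0.2.10 - - [14/May/2006:04:10:02 -0400] "GET /robots.txt HTTP/1.0" 200 123',
    '198.51.100.7 - - [14/May/2006:05:00:00 -0400] "GET /robots.txt HTTP/1.0" 200 123',
]
hits = robots_txt_hits(sample)
print(len(hits), sum(hits.values()))  # → 2 3  (distinct IPs, total requests)
```

From there, reverse-resolving each IP and eyeballing the hostnames for 'planetlab' or 'pl' labels is what points the finger at PlanetLab nodes.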

All of the requests have had the User-Agent string:

Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050720 Fedora/1.0.6-1.1.fc3 Firefox/1.0.6

This User-Agent string is a complete lie, which is one of the things that angers me about this situation. The minimum standard for acceptable web spider behavior is to clearly label yourself; pretending that you are an ordinary browser is an automatic sign of evil. If PlanetLab had a single netblock, it would now be in our port 80 IP filters.
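Lacking a single netblock to filter, one can at least refuse the forged User-Agent at the application level. A minimal sketch, assuming a WSGI deployment and an exact match on the bogus string quoted above (both assumptions; real spiders vary their strings, so this is a narrow defense):

```python
# The forged User-Agent string observed in the logs above.
BAD_UA = ("Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) "
          "Gecko/20050720 Fedora/1.0.6-1.1.fc3 Firefox/1.0.6")

def block_bad_ua(app):
    """WSGI middleware that returns 403 for the known-bad User-Agent."""
    def middleware(environ, start_response):
        if environ.get("HTTP_USER_AGENT") == BAD_UA:
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware
```

The same exact-match test works equally well as a log-scanning filter if you would rather audit than block.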

Apparently the PlanetLab project responsible for this abuse is called umd_sidecar, and has already been reported to the PlanetLab administration by people who have had better luck navigating their search interfaces than I have. (It looks like the magic is to ask for advanced search and then specify that you want TCP as the protocol.)

web/PlanetLabGoesRobotic written at 01:18:50
