PlanetLab hammers on robots.txt

May 15, 2006

The PlanetLab consortium is, to quote its banner, 'an open platform for developing, deploying, and accessing planetary-scale services'. Courtesy of a friend noticing, today's planetary-scale service appears to be repeatedly requesting robots.txt from people's webservers.

Here, they've made 523 requests (so far) from 323 different IP addresses (PlanetLab nodes are hosted around the Internet, mostly at universities; they usually have 'planetlab' or 'pl' or the like in their hostnames). The first request arrived at 03:56:11 (Eastern) on May 14th, and they're still rolling in. So far, they haven't requested anything besides robots.txt.
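
Tallies like these can be pulled out of an ordinary Apache combined-format access log with a few lines of Python. This is only a sketch, and the sample log lines below are made up for illustration, not our real traffic:

```python
import re
from collections import Counter

# Combined log format: the client IP is the first field and the
# request line ("METHOD /path HTTP/x.y") is the first quoted field.
LINE_RE = re.compile(r'(\S+) \S+ \S+ \[[^\]]+\] "([A-Z]+) (\S+)')

def robots_txt_hits(lines):
    """Count robots.txt requests per client IP in Apache log lines."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m and m.group(3) == "/robots.txt":
            counts[m.group(1)] += 1
    return counts

# Hypothetical sample data; a real run would iterate over the log file.
sample = [
    '10.0.0.1 - - [14/May/2006:03:56:11 -0400] "GET /robots.txt HTTP/1.0" 200 0',
    '10.0.0.2 - - [14/May/2006:04:01:02 -0400] "GET /robots.txt HTTP/1.0" 200 0',
    '10.0.0.1 - - [14/May/2006:04:10:33 -0400] "GET /index.html HTTP/1.0" 200 512',
]
hits = robots_txt_hits(sample)
print(sum(hits.values()), "requests from", len(hits), "IPs")  # prints: 2 requests from 2 IPs
```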

All of the requests have had the User-Agent string:

Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050720 Fedora/1.0.6-1.1.fc3 Firefox/1.0.6

This User-Agent string is a complete lie, which is one of the things that angers me about this situation. The minimum standard for acceptable web spider behavior is to clearly label yourself; pretending that you are an ordinary browser is an automatic sign of evil. If PlanetLab had a single netblock, it would now be in our port 80 IP filters.
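
Since there is no single netblock to filter, about the best a sysadmin can do is pick the nodes out by reverse DNS, using the hostname patterns mentioned earlier. Here's a rough Python sketch; the hint list and helper names are my own invention, and the 'pl' check in particular is crude enough to catch innocent hostnames:

```python
import socket

# Hostname fragments that PlanetLab nodes commonly use (an assumption
# based on the patterns observed in our logs; deliberately loose).
PLANETLAB_HINTS = ("planetlab", "pl")

def looks_like_planetlab(hostname):
    """Crude check: does any label of the hostname start with a hint?"""
    labels = hostname.lower().split(".")
    return any(label.startswith(hint)
               for label in labels for hint in PLANETLAB_HINTS)

def classify(ip):
    """Reverse-resolve an IP and guess whether it's a PlanetLab node."""
    try:
        host = socket.gethostbyaddr(ip)[0]
    except OSError:
        return ip, None, False  # no reverse DNS; can't tell
    return ip, host, looks_like_planetlab(host)
```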

Apparently the PlanetLab project responsible for this abuse is called umd_sidecar, and has already been reported to the PlanetLab administration by people who have had better luck navigating their search interfaces than I have. (It looks like the magic is to ask for advanced search and then specify that you want TCP as the protocol.)

Comments on this page:

From at 2006-05-15 14:20:39:

The perpetrator of the wack traffic says he's all done now, and is sorry for the software and human error that made it alarming.


Apologies about the burst of web traffic. We are running an experiment to map the core of the Internet. Such a map will be extremely useful in understanding large scale Internet behavior and future Internet engineering projects. We have done everything we can think of to limit the load and intrusiveness of our experiments, for example, using the smallest file we can request (i.e., robots.txt) and limiting our request rate. A combination of software and human error has made our experiment appear less innocuous than intended, and we appreciate your understanding.

The HTTP requests for robots.txt are what allow us to map the path from a PlanetLab node into the core of the network. So while they appear useless, they actually provide a good bit of data about the underlying network. In any case, I have completed my scanning, so you won't see this problem again.

If you have any additional comments or concerns, please feel free to mail me directly.

Apologies again,

- Rob Sherwood, University of Maryland



This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.