Wandering Thoughts archives

2006-07-15

A robot wish

I've come to realize that I have a two-part wish about web robots and spiders and so on. To wit:

I wish that there were something that all robots put in their HTTP request headers (perhaps 'ROBOT' somewhere in their User-Agent string), and that there were a standard HTTP response code meaning 'request declined because robots should not crawl this resource'.

Part of the problem of dealing with (well-behaved) robots is that the only real robot signature is fetching robots.txt, and even that isn't a sure thing. You can look at User-Agent strings to recognize specific robots, but this doesn't scale and it's reactive, not proactive. (I say it doesn't scale because in the past 28 days, over 100 different robotic-looking User-Agent strings fetched robots.txt here.)
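(As a rough illustration of how such a count can be made, here is a minimal sketch that collects the distinct User-Agent strings that fetched robots.txt. It assumes the web server writes the common 'combined' access log format; the regular expression and the function name robots_txt_agents are my own invention for this example, not anything this site actually runs.)

```python
import re

# Assumption: logs are in the 'combined' format, which ends with
#   "METHOD path protocol" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def robots_txt_agents(lines):
    """Return the set of User-Agent strings that requested /robots.txt."""
    agents = set()
    for line in lines:
        m = LOG_LINE.search(line)
        if m is None:
            continue  # not a line we understand; skip it
        # Ignore any query string when comparing the path.
        if m.group("path").split("?", 1)[0] == "/robots.txt":
            agents.add(m.group("agent"))
    return agents
```

Feeding it the lines of an access log gives a set whose size is the number of distinct robot-looking agents; every one of them would then still need to be recognized and handled individually, which is the part that doesn't scale.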

Having a definite robot signature in each request would make all sorts of robot filtering much easier and more reliable (and we wouldn't have to depend on robots.txt to do it, which has problems). And with a specific error response for it, robots could unambiguously know what was going on and behave appropriately.
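(To make this concrete, here is a minimal sketch of what server-side filtering could look like if the signature existed. Everything in it is hypothetical: there is no standard 'robots declined' status code, so 403 Forbidden stands in for it, the case-insensitive match on 'ROBOT' is my assumption, and the names is_declared_robot and status_for are made up for the example.)

```python
from http import HTTPStatus

# Hypothetical convention from this entry: robots put 'ROBOT' in User-Agent.
ROBOT_MARKER = "ROBOT"

# Assumption: no dedicated 'robots declined' status exists, so 403 stands in.
DECLINED_STATUS = HTTPStatus.FORBIDDEN

def is_declared_robot(user_agent):
    """True if the User-Agent carries the robot marker (case-insensitive)."""
    return ROBOT_MARKER in (user_agent or "").upper()

def status_for(path, user_agent, no_robot_prefixes):
    """Pick a response status for a request.

    no_robot_prefixes lists URL path prefixes that robots should not crawl.
    """
    if is_declared_robot(user_agent) and any(
        path.startswith(prefix) for prefix in no_robot_prefixes
    ):
        return DECLINED_STATUS
    return HTTPStatus.OK
```

With this, status_for("/private/notes", "ExampleSpider/1.0 (ROBOT)", ["/private/"]) declines the request, while the same URL fetched by an ordinary browser is served normally. Note that the list of off-limits prefixes lives only on the server and is never published.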

(You could also avoid having to give away information in robots.txt about exactly what you don't want robots indexing, which can sometimes be very interesting to nosy people.)

At the dawn of the robot era, it would have been pretty easy to introduce at least the per-request robot signature (an extended 'no robots please' status code might have been more challenging). Unfortunately it's too late by now. Still, if you're writing a new web spider I urge you to start a new movement and put 'ROBOT' somewhere in your User-Agent string.

(PS: I'm not suggesting that this mechanism should replace robots.txt; robots.txt is very useful for efficient bulk removals when they can be expressed within its limits. I'd like to have both available.)

web/ARobotWish written at 02:22:53


