2006-07-15
A robot wish
I've come to realize that I have a two-part wish about web robots and spiders and so on. To wit:
I wish that there was something that all robots put in the HTTP request headers (perhaps having 'ROBOT' somewhere in their User-Agent string), and then that there was a standard HTTP response code for 'request declined because robots should not crawl this resource'.
Part of the problem of dealing with (well-behaved) robots is that the only real robot signature is fetching robots.txt, and even that isn't a sure thing. You can look at User-Agent strings to recognize specific robots, but this doesn't scale and it's reactive, not proactive. (I say it doesn't scale because in the past 28 days, over 100 different robotic-looking User-Agent strings fetched robots.txt here.)
Having a definite robot signature in each request would make all sorts of robot filtering much easier and more reliable (and we wouldn't have to depend on robots.txt to do it, which has its own problems). And with a specific error response for it, robots could unambiguously know what was going on and behave appropriately. (You could also avoid having to give away information in robots.txt about exactly what you don't want robots indexing, which can sometimes be very interesting to nosy people.)
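To make the idea concrete, here is a minimal server-side sketch of what this could look like, written as Python WSGI middleware. Everything scheme-specific here is invented for illustration: the '450 Robot Access Declined' status code does not exist in real HTTP, and the blocked path prefixes are just placeholders.

    # Sketch only: assumes robots put the literal string 'ROBOT' in their
    # User-Agent, and uses a made-up 450 status for 'robots should not
    # crawl this resource'. Neither convention actually exists.
    ROBOT_BLOCKED_PREFIXES = ("/private/", "/search/")  # example paths only

    def robot_decline_middleware(app):
        def wrapped(environ, start_response):
            ua = environ.get("HTTP_USER_AGENT", "")
            path = environ.get("PATH_INFO", "/")
            if "ROBOT" in ua and path.startswith(ROBOT_BLOCKED_PREFIXES):
                # Decline the request with the hypothetical robot status code.
                start_response("450 Robot Access Declined",
                               [("Content-Type", "text/plain")])
                return [b"Robots should not crawl this resource.\n"]
            # Not a robot (or not a restricted path): handle normally.
            return app(environ, start_response)
        return wrapped

You would wrap your real application with it, e.g. 'application = robot_decline_middleware(real_app)'; the point is just how simple filtering becomes once every robot request carries an unambiguous marker.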
At the dawn of the robot era, it would have been pretty easy to introduce at least the per-request robot signature (an extended 'no robots please' status code might have been more challenging). Unfortunately it's too late by now. Still, if you're writing a new web spider I urge you to start a new movement and put 'ROBOT' somewhere in your User-Agent string.
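Doing this from the spider side is trivial. A sketch in Python, with an entirely invented spider name and info URL:

    import urllib.request

    # Made-up spider identity; the only point is the literal 'ROBOT' marker.
    UA = "ExampleSpider/1.0 (ROBOT; +http://example.com/spider-info)"

    req = urllib.request.Request("http://example.com/some/page",
                                 headers={"User-Agent": UA})
    with urllib.request.urlopen(req) as resp:
        page = resp.read()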
(PS: I'm not suggesting that this mechanism should replace robots.txt; robots.txt is very useful for efficient bulk removals when they can be expressed within its limits. I'd like to have both available.)