A robot wish

I've come to realize that I have a two-part wish about web robots, spiders, and so on. To wit: I wish that there was something that all robots put in the HTTP request headers (perhaps having 'ROBOT' somewhere in their User-Agent string), and then that there was a standard HTTP response code for 'request declined because robots should not crawl this resource'.

Part of the problem of dealing with (well-behaved) robots is that the
only real robot signature is fetching [...]. Having a definite robot signature in each request would make all sorts of robot filtering much easier and more reliable (and we wouldn't have to depend on [...]).

(You could also avoid having to give away information in [...].)

At the dawn of the robot era, it would have been pretty easy to introduce at least the per-request robot signature (an extended 'no robots please' status code might have been more challenging). Unfortunately it's too late by now. Still, if you're writing a new web spider I urge you to start a new movement and put 'ROBOT' somewhere in your User-Agent string.

(PS: I'm not suggesting that this mechanism should replace [...].)
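
To make the wish concrete, here is a minimal sketch of what server-side filtering could look like if robots really did put 'ROBOT' in their User-Agent. Everything in it is an assumption for illustration: the function names are made up, the match is case-sensitive only because that is how the marker is written above, and since no 'robots should not crawl this' status code exists, plain 403 Forbidden stands in for it.

```python
# Hypothetical sketch: the 'ROBOT' marker is a wished-for convention,
# not an existing standard, and all names here are made up.
ROBOT_MARKER = "ROBOT"

# There is no standard "robots should not crawl this" status code;
# 403 Forbidden is used purely as a stand-in.
DECLINED_STATUS = 403


def is_declared_robot(headers):
    """Return True if the User-Agent header contains the ROBOT marker."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == "user-agent":
            return ROBOT_MARKER in value
    return False


def status_for(headers, path, no_robot_prefixes):
    """Decline self-declared robots on listed path prefixes, else serve."""
    if is_declared_robot(headers) and any(
        path.startswith(prefix) for prefix in no_robot_prefixes
    ):
        return DECLINED_STATUS
    return 200
```

With something like this, a request with `User-Agent: ExampleSpider/1.0 (ROBOT)` for a path under a listed prefix would be declined, while ordinary browsers and requests for other paths would be served normally.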
This is part of CSpace, and is written by ChrisSiebenmann.