An obvious thing about dealing with web spider misbehavior
Here's an important hint for operators of web spiders:
If I have to touch robots.txt or my web server configuration to deal with your spider's behavior, the easiest and safest change is to block you entirely.
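(To illustrate just how easy that block is, here is a sketch of what it looks like in robots.txt; 'ExampleBot' is a stand-in name for whatever your spider calls itself, not any real crawler:

    # Tell this one spider to go away entirely.
    User-agent: ExampleBot
    Disallow: /

Two directives per spider, and I never have to think about your crawling behavior again.)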
It is therefore very much in the interests of web spider operators to keep me from ever having to touch my robots.txt file in the first place, because you are not Google.
You should consider things like crawl-delay to be desperate, last-ditch workarounds, not things that you routinely recommend to the people you are crawling.
(Yes, this means that web spiders should notice issues like this for themselves. The best use for crawl-delay that I've seen suggested is as a way to tell spiders to speed up, not slow down, their usual crawling rate.)
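For the record, a Crawl-delay directive in robots.txt looks roughly like the sketch below. It is a nonstandard extension, crawlers interpret the number differently, and some ignore it entirely, so take it as an illustration rather than a guarantee; 'ExampleBot' is again a made-up name.

    User-agent: ExampleBot
    # Commonly read as 'wait about 10 seconds between requests'.
    Crawl-delay: 10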
A corollary to this is that your spider information page should concisely tell people what they get out of letting you crawl their web pages, because if people do have to change their robots.txt to deal with your spider, you want them to have as much reason as possible not to block you.
(You had better have a spider information page and mention it in your spider's User-agent. Ideally it will also rank highly in searches for your spider's official name and for its User-agent string.)