Wandering Thoughts archives

2009-03-18

An obvious thing about dealing with web spider misbehavior

Here's an important hint for operators of web spiders:

If I have to touch robots.txt or my web server configuration to deal with your spider's behavior, the easiest and safest change is to block you entirely.
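
For illustration, blocking a spider whose User-agent token is, say, ExampleBot (a made-up name here) takes only two lines in robots.txt:

    User-agent: ExampleBot
    Disallow: /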

It is therefore very much in the interests of web spider operators to keep me from ever having to touch my robots.txt file in the first place, because you are not Google and I lose very little by blocking you. You should consider things like crawl-delay to be desperate last-ditch workarounds, not something you routinely recommend to the people whose sites you are crawling.

(Yes, this means that web spiders should notice issues like this for themselves. The best use for crawl-delay that I've seen suggested is as a way to tell spiders to speed up, not slow down, their usual crawling rate.)
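
As a rough sketch of what self-regulation looks like on the spider's side, here is some illustrative Python using the standard urllib.robotparser module; it checks whether a (made-up) ExampleBot is allowed to fetch a URL at all and pauses for any Crawl-delay before fetching. This is a simplified illustration, not anyone's production crawler.

    import time
    import urllib.request
    import urllib.robotparser

    AGENT = "ExampleBot"   # made-up spider name, for illustration only

    def polite_fetch(url, robots_url, last_fetch):
        # See what the site's robots.txt says about us.
        rp = urllib.robotparser.RobotFileParser(robots_url)
        rp.read()
        if not rp.can_fetch(AGENT, url):
            return None, last_fetch        # we're blocked; go away entirely
        # Honor any Crawl-delay, with a modest default pause otherwise.
        delay = rp.crawl_delay(AGENT) or 1.0
        wait = last_fetch + delay - time.time()
        if wait > 0:
            time.sleep(wait)
        req = urllib.request.Request(url, headers={"User-Agent": AGENT})
        with urllib.request.urlopen(req) as resp:
            return resp.read(), time.time()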

A corollary to this is that your spider information page should do its best to concisely tell people what they get out of letting you crawl their web pages; if people do have to change their robots.txt to deal with your spider, you want them to have as much reason as possible not to block you.

(You had better have a spider information page and mention it in your spider's User-agent. Ideally it will also rank high in searches for your spider's official name and for its User-agent string.)
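
As an illustration, a User-agent string along these conventional lines (again with a made-up name and URL) points people straight at that page:

    Mozilla/5.0 (compatible; ExampleBot/1.0; +http://www.example.com/bot.html)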

SpiderRobotsTxtHint written at 22:59:34

