An obvious thing about dealing with web spider misbehavior

March 18, 2009

Here's an important hint for operators of web spiders:

If I have to touch robots.txt or my web server configuration to deal with your spider's behavior, the easiest and safest change is to block you entirely.
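To show how little effort a total block takes, here is a minimal sketch using Python's standard urllib.robotparser. The spider name ExampleBot and the example.org URLs are made up for illustration; the point is only that the whole change is two lines of robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical spider name; the entire "fix" is these two robots.txt lines.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /
"""


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Report whether robots_txt lets the given agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    print(allowed(ROBOTS_TXT, "ExampleBot", "https://example.org/blog/"))
    print(allowed(ROBOTS_TXT, "OtherBot", "https://example.org/blog/"))
```

Anything more selective than this (per-path rules, rate hints) takes more thought and more testing on my side, which is exactly why it loses out to the blanket block.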

It is therefore very much in the interests of web spider operators to keep me from ever having to touch my robots.txt file in the first place, because you are not Google. You should consider things like crawl-delay to be desperate last-ditch workarounds, not things that you routinely recommend to people you are crawling.

(Yes, this means that web spiders should notice issues like this for themselves. The best use for crawl-delay that I've seen suggested is as a way to tell spiders to speed up, not slow down, their usual crawling rate.)
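As a rough sketch of what 'noticing for themselves' could look like, a spider might adjust its own per-site delay from the responses it gets. Everything here is an assumption for illustration: the PoliteDelay class, its default values, and the specific policy (double the delay on 429 or 5xx responses, scale the delay to the server's response time, relax only gradually) are one plausible approach, not a description of any real spider.

```python
class PoliteDelay:
    """Per-site delay that the spider adjusts without being told to.

    All names and default values are illustrative assumptions.
    """

    def __init__(self, min_delay=1.0, max_delay=300.0, load_factor=10.0):
        self.min_delay = min_delay      # never go faster than this (seconds)
        self.max_delay = max_delay      # cap on how far we back off
        self.load_factor = load_factor  # wait this many times the response time
        self.delay = min_delay

    def record(self, status: int, elapsed: float) -> float:
        """Update the delay from one response and return the new delay."""
        if status == 429 or status >= 500:
            # The server is refusing or failing: back off hard.
            self.delay = min(self.max_delay, self.delay * 2)
        else:
            # A slow answer suggests a loaded server, so wait longer.
            target = max(self.min_delay, elapsed * self.load_factor)
            if target > self.delay:
                self.delay = min(self.max_delay, target)
            else:
                # Speed back up only gradually.
                self.delay = max(target, self.delay * 0.9)
        return self.delay


if __name__ == "__main__":
    delay = PoliteDelay()
    for status, elapsed in [(200, 0.05), (503, 0.1), (200, 2.0), (200, 0.05)]:
        print(status, elapsed, delay.record(status, elapsed))
```

The asymmetry is deliberate in this sketch: slowing down happens immediately, while speeding up takes many good responses, so a struggling site is not hammered again the moment it recovers.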

A corollary to this is that your spider information page should do its best to concisely tell people what they get out of letting you crawl their web pages, because if people have to change their robots.txt to deal with your spider you want them to have as much reason not to block you as possible.

(You had better have a spider information page and mention it in your spider's User-agent. Ideally it will also rank high in searches for your spider's official name and for its User-agent string.)
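For illustration, here is a minimal sketch of a spider putting its information page in its User-agent, using Python's standard urllib.request. The spider name, version, and URL are all hypothetical placeholders; the 'Name/version (+URL)' shape is a common convention, not a requirement.

```python
from urllib.request import Request

# Hypothetical values; substitute your spider's real name and info page.
SPIDER_NAME = "ExampleBot"
SPIDER_VERSION = "1.0"
INFO_URL = "https://example.org/bot.html"


def user_agent() -> str:
    """Build a User-agent that names the spider and links its info page."""
    return f"{SPIDER_NAME}/{SPIDER_VERSION} (+{INFO_URL})"


def make_request(url: str) -> Request:
    """Prepare (but do not send) a request carrying the User-agent."""
    return Request(url, headers={"User-Agent": user_agent()})


if __name__ == "__main__":
    print(make_request("https://example.com/").get_header("User-agent"))
```

This way the person reading their logs can get from the User-agent to the explanation in one step, without having to search for it.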


Last modified: Wed Mar 18 22:59:34 2009