My expectations for responsible spider behavior

December 1, 2007

My minimum technical requirements for real web spiders are deliberately quite black and white. But there are also a number of fuzzier things that I expect from responsible web spiders. Bear in mind that these aren't hard and fast rules; I can't give precise numbers and so on.

(As before, this only applies to what I'm calling 'real' or 'legitimate' web spiders; I can't expect any particular behavior from malicious web spiders.)

Disclaimers in place, here's what I expect of responsible web spiders:

  • check robots.txt frequently and adjust your behavior rapidly, say within no more than two days (one way to do this is sketched after this list).

    (I do not care what infrastructure you require to do this; the fact that robots.txt updates have to propagate through six layers of your internal topology before reaching the crawler logic is your problem, not mine.)

  • don't make requests more frequently than one every few seconds or so.
  • more importantly, notice when the website is slowing down and slow down yourself. If the website is responding more slowly than usual, that is a very big clue that your spider should space out its requests more (see the pacing sketch after this list).

  • don't rapidly re-crawl things that haven't changed. It's reasonable to check a few times just to make sure that what looks like unchanging content really is, but after that spiders should slow down (one backoff approach is sketched after this list). If you spend months revisiting a page three times a week when it hasn't changed in years, I get peeved.
  • URLs that get errors count as unchanged pages. Crawl them a few times to make sure that they stay errors, but after that you should immediately demote them to the bottom of your crawl rates.
  • this goes triple if the error you are getting is a 403 error, because you are being told explicitly that this is content you are not allowed to see.
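
To make the robots.txt point concrete, here is a minimal sketch in Python of caching robots.txt with a bounded lifetime, using only the standard library. The RobotsCache class and its names are my own illustration, not any particular crawler's API; the two-day figure is the one from above.

    import time
    import urllib.robotparser

    REFRESH_SECONDS = 2 * 24 * 3600  # re-fetch robots.txt at least every two days

    class RobotsCache:
        def __init__(self):
            # site URL -> (parsed robots.txt, time we fetched it)
            self._parsers = {}

        def allowed(self, site, url, agent):
            parser, fetched = self._parsers.get(site, (None, 0))
            if parser is None or time.time() - fetched > REFRESH_SECONDS:
                # assumes 'site' has no trailing slash, e.g. "https://example.org"
                parser = urllib.robotparser.RobotFileParser(site + "/robots.txt")
                parser.read()
                self._parsers[site] = (parser, time.time())
            return parser.can_fetch(agent, url)

A crawler would then ask something like allowed("https://example.org", "https://example.org/page", "mybot") before every fetch; the cache keeps the checks cheap while still picking up robots.txt changes within the deadline.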
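
For the pacing point, here is a sketch of one adaptive approach: keep a delay between requests and stretch it whenever the site takes longer than that delay to answer. The specific numbers and the polite_fetch name are illustrative assumptions, not measurements of anything.

    import time
    import urllib.request

    BASE_DELAY = 5.0  # "one request every few seconds or so"

    def polite_fetch(url, delay=BASE_DELAY):
        start = time.time()
        with urllib.request.urlopen(url) as resp:
            body = resp.read()
        elapsed = time.time() - start
        # A slow answer suggests the site is under load; back off
        # instead of piling on, up to a two-minute ceiling.
        if elapsed > delay:
            delay = min(delay * 2, 120.0)
        else:
            delay = max(delay * 0.9, BASE_DELAY)
        time.sleep(delay)
        return body, delay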
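
Finally, for unchanged pages and errors, one simple backoff scheme is to double a page's revisit interval every time it comes back unchanged or as an error, with 403s jumping straight to the slowest rate. This too is a sketch under assumed intervals, using conditional GETs from the standard library.

    import urllib.error
    import urllib.request

    MIN_INTERVAL = 1 * 24 * 3600    # revisit changed pages daily
    MAX_INTERVAL = 180 * 24 * 3600  # the very bottom of the crawl rates

    def next_interval(url, interval, last_modified=None):
        req = urllib.request.Request(url)
        if last_modified:
            # Ask the server to answer 304 if nothing has changed.
            req.add_header("If-Modified-Since", last_modified)
        try:
            with urllib.request.urlopen(req):
                pass
        except urllib.error.HTTPError as err:
            if err.code == 403:
                return MAX_INTERVAL  # told explicitly to go away; stop asking
            # 304s and other errors count as unchanged pages: back off.
            return min(interval * 2, MAX_INTERVAL)
        return MIN_INTERVAL  # the page changed, so crawl it normally again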

Disclaimer: as before, I reserve the right to amend this list as more things occur to me, probably as the result of seeing yet more badly behaved web spiders.

(Suggestions of more are welcome; I'm probably missing some obvious ones.)
