My expectations for responsible spider behavior
My minimum technical requirements for real web spiders are deliberately quite black and white. But there are also a number of more fuzzy things that I expect from responsible web spiders. Bear in mind that these aren't hard and fast rules and I can't give precise numbers and so on.
(As before, this only applies to what I'm calling 'real' or 'legitimate' web spiders; I can't expect any particular behavior from malicious web spiders.)
Disclaimers in place, here's what I expect of responsible web spiders:
- check robots.txt frequently and adjust your behavior rapidly, say within no more than two days. (I do not care what infrastructure you require to do this; the fact that robots.txt updates have to propagate around six layers of your internal topology before reaching the crawler logic is your problem, not mine.)
- don't make requests more frequently than one every few seconds or so.
- more importantly, notice when the website is slowing down and slow down yourself. If the website's response times are going up, this is a very big clue that your spider should space out its requests more.
- don't rapidly re-crawl things that haven't changed. It's reasonable to check a few times just to make sure that what looks like unchanging content really is, but after that spiders should slow down. If you spend months revisiting a page three times a week when it hasn't changed in years, I get peeved.
- URLs that get errors count as unchanged pages. Crawl them a few times to make sure that they stay errors, but after that you should immediately demote them to the bottom of your crawl priorities.
- this goes triple if the error you are getting is a 403 error, because you are being told explicitly that this is content you are not allowed to see.
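The robots.txt expectation amounts to "cache the rules, but treat them as stale quickly". Here's one way to sketch that using Python's stock urllib.robotparser; the FreshRobots name, the injectable fetch_lines callable, and the two-day default are my illustrative choices, not anything standard:

```python
import time
from urllib.robotparser import RobotFileParser

class FreshRobots:
    """Cache robots.txt rules, but treat them as stale after max_age seconds.

    fetch_lines is any callable returning robots.txt as a list of lines;
    in a real crawler it would HTTP GET the site's /robots.txt.
    """
    def __init__(self, fetch_lines, max_age=2 * 24 * 3600):  # two days
        self.fetch_lines = fetch_lines
        self.max_age = max_age
        self.fetched_at = float("-inf")   # force a fetch on first use
        self.parser = RobotFileParser()

    def can_fetch(self, agent, url):
        # Re-parse the rules whenever the cached copy is too old.
        if time.time() - self.fetched_at > self.max_age:
            self.parser.parse(self.fetch_lines())
            self.fetched_at = time.time()
        return self.parser.can_fetch(agent, url)
```

However many layers your infrastructure has, the observable behavior should look like this: a rule change shows up in the crawler's decisions within the staleness window.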
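The pacing and slow-down expectations together amount to an adaptive delay between requests. A minimal sketch, assuming the crawler records each response time; the class name, thresholds, and doubling factor are invented for illustration, not a prescription:

```python
import time

class PoliteRater:
    """Space requests at least base_delay seconds apart, and back off
    further when the site's responses slow down (a sign of overload)."""
    def __init__(self, base_delay=3.0, normal_response=0.5):
        self.base_delay = base_delay
        self.normal_response = normal_response  # assumed typical response time
        self.delay = base_delay

    def observe(self, response_seconds):
        if response_seconds > 2 * self.normal_response:
            # The site is slowing down: widen the gap between requests.
            self.delay = min(self.delay * 2, 120.0)
        else:
            # Responses are healthy again: ease back toward the base rate.
            self.delay = max(self.base_delay, self.delay / 2)

    def wait(self):
        time.sleep(self.delay)
```

The exact numbers matter much less than the shape: never below a few seconds between requests, and a gap that grows quickly when the site struggles.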
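The last three points boil down to a revisit-scheduling policy: unchanged pages and persistent errors get ever-longer intervals, and a 403 gets demoted immediately. A hypothetical sketch (next_interval and all of its numbers are my invention, purely to show the shape):

```python
def next_interval(current, status, changed,
                  base=3600.0, cap=90 * 24 * 3600.0):
    """Pick the next revisit interval (in seconds) for a URL.

    status is the HTTP status code of the last fetch; changed says
    whether the content differed from the previous fetch.
    """
    if status == 403:
        return cap                     # explicitly forbidden: demote at once
    if status >= 400 or not changed:
        return min(current * 2, cap)   # error or unchanged: back off
    return base                        # content changed: revisit normally
```

Under a policy like this, a page that hasn't changed in years drifts out to the cap instead of being hit three times a week forever.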
Disclaimer: as before, I reserve the right to amend this list as more things occur to me, probably as the result of seeing yet more badly behaved web spiders.
(Suggestions of more are welcome; I'm probably missing some obvious ones.)