My web spider technical requirements

November 23, 2007

I have been looking at the web server logs lately, which is usually an activity that's calculated to put me into a grumpy mood. As a result of this, I have decided to write down in one place my technical requirements for web spiders, the direct things that I expect them to do and not do.

(This excludes issues of basic functionality, like correctly parsing HTML, resolving relative links, and making correct HTTP requests, and basic proper web spider behavior like respecting robots.txt.)

So, my minimum technical requirements for spiders:

  • obey nofollow. This is not optional.
  • do not send Referer: headers, especially ones chosen at random.
  • do not issue GET requests for URLs you've only seen as POST targets.
  • do not crawl binary files repeatedly, especially when they are large.
  • do not crawl clearly marked syndication feeds, especially repeatedly.
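
As a rough illustration of what several of these rules mean on the spider side, here is a minimal Python sketch of a link extractor that drops nofollow links, POST form targets, and advertised feeds before anything gets queued. Everything in it is an assumption made for illustration: the PoliteLinkExtractor name is hypothetical, and treating "clearly marked" feeds as `rel="alternate"` links with an Atom or RSS type is only one plausible reading.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: a "clearly marked" feed is one advertised with these MIME types.
FEED_TYPES = {"application/atom+xml", "application/rss+xml"}


class PoliteLinkExtractor(HTMLParser):
    """Hypothetical helper: collect only the links a spider may crawl."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.page_nofollow = False  # set by <meta name="robots" content="nofollow">
        self.crawlable = []         # candidate URLs from plain <a href>
        self.skipped = set()        # URLs that must never be fetched with GET

    def handle_starttag(self, tag, attrs):
        a = {name: (value or "") for name, value in attrs}
        rel = set(a.get("rel", "").lower().split())
        if tag == "meta" and a.get("name", "").lower() == "robots":
            directives = [d.strip() for d in a.get("content", "").lower().split(",")]
            if "nofollow" in directives:
                self.page_nofollow = True
        elif tag == "form" and a.get("method", "").lower() == "post":
            # A POST target is not a link; remember it so it is never GET-ed.
            self.skipped.add(urljoin(self.base_url, a.get("action", "")))
        elif tag in ("a", "link") and a.get("href"):
            url = urljoin(self.base_url, a["href"])
            is_feed = "alternate" in rel and a.get("type", "").lower() in FEED_TYPES
            if "nofollow" in rel or is_feed:
                self.skipped.add(url)
            elif tag == "a":
                self.crawlable.append(url)

    def links(self):
        """Return crawlable URLs, honoring page-level and per-link nofollow."""
        if self.page_nofollow:
            return []
        return [u for u in self.crawlable if u not in self.skipped]
```

Filtering happens in links() rather than while parsing, so a URL that shows up both as an ordinary link and as a POST target or feed is still dropped regardless of the order in which the page mentions it. Not sending Referer: needs no code at all; Python's urllib, for one, does not add that header unless you ask it to.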

(What I consider a binary file instead of web content is fuzzy, but my minimal list is things like ISO images, tar files, RPMs, and audio and video files. I do not care if you want to build an MP3 or ISO search engine; such files are too large to be pulled automatically.)
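
A spider could approximate that minimal list with a check along these lines. This is only a sketch: the looks_binary name, the extension set, and the MIME types are my illustrative assumptions, not a complete or authoritative list.

```python
import posixpath
from urllib.parse import urlparse

# Assumed extensions for ISO images, tar files, RPMs, and audio/video files.
BINARY_EXTENSIONS = {
    ".iso", ".tar", ".gz", ".tgz", ".bz2", ".rpm",
    ".mp3", ".ogg", ".wav", ".avi", ".mpg", ".mpeg", ".mov", ".mp4",
}
# Assumed Content-Type values for the same kinds of files.
BINARY_TYPES = {
    "application/x-iso9660-image", "application/x-tar",
    "application/x-rpm", "application/gzip",
}
BINARY_TYPE_PREFIXES = ("audio/", "video/")


def looks_binary(url, content_type=None):
    """Guess whether a URL is a large binary file rather than web content."""
    path = urlparse(url).path.lower()
    if posixpath.splitext(path)[1] in BINARY_EXTENSIONS:
        return True
    if content_type:
        # Drop parameters such as "; charset=utf-8" before comparing.
        ctype = content_type.split(";")[0].strip().lower()
        return ctype in BINARY_TYPES or ctype.startswith(BINARY_TYPE_PREFIXES)
    return False
```

The extension test can run before any request is made; the Content-Type test only helps once a response has started to arrive, at which point the spider can drop the connection instead of reading the body. Skipping such files entirely is the conservative choice; the requirement itself is only about not fetching them repeatedly.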

Not following this list means that I will block your spider out of hand the moment I spot you in my logs, and violating any of these stands out like a sore thumb when I actually look. Following it doesn't mean that I'll actually like your spider, but that's another entry.

(So many spiders do not obey nofollow that I usually only do partial blocks for that.)

Disclaimer: I reserve the right to amend this list as more things occur to me, probably as the result of seeing yet more badly behaved web spiders.

Written on 23 November 2007.

Last modified: Fri Nov 23 23:41:33 2007