Wandering Thoughts archives


My web spider technical requirements

I have been looking at the web server logs lately, which is usually an activity that's calculated to put me into a grumpy mood. As a result of this, I have decided to write down in one place my technical requirements for web spiders, the direct things that I expect them to do and not do.

(This excludes issues of basic functionality, like correctly parsing HTML, resolving relative links, and making correct HTTP requests, and basic proper web spider behavior like respecting robots.txt.)

So, my minimum technical requirements for spiders:

  • obey nofollow. This is not optional.
  • do not send Referer: headers, especially ones chosen at random.
  • do not issue GET requests for URLs you've only seen as POST targets.
  • do not crawl binary files repeatedly, especially when they are large.
  • do not crawl clearly marked syndication feeds, especially repeatedly.
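As a rough illustration of what the list above asks of a spider's link handling, here is a minimal Python sketch using only the standard library. The class name, the `build_request` helper, the User-Agent string, and the choice of feed MIME types are all made up for this example; a real spider would need considerably more than this.

```python
from html.parser import HTMLParser
from urllib.request import Request

# Assumption: feeds are "clearly marked" by these MIME types on the link.
FEED_TYPES = {"application/atom+xml", "application/rss+xml"}


class PoliteLinkExtractor(HTMLParser):
    """Collect only the links that the list above allows a spider to GET."""

    def __init__(self):
        super().__init__()
        self.page_nofollow = False  # <meta name="robots" content="nofollow">
        self.candidates = []        # hrefs from ordinary <a> links
        self.post_targets = set()   # recorded only so they are never queued

    def handle_starttag(self, tag, attrs):
        a = {name: (value or "") for name, value in attrs}
        if tag == "meta" and a.get("name", "").lower() == "robots":
            if "nofollow" in a.get("content", "").lower():
                self.page_nofollow = True
        elif tag == "form" and a.get("method", "").lower() == "post":
            # A POST target is not a link; do not turn it into a GET.
            if a.get("action"):
                self.post_targets.add(a["action"])
        elif tag == "a" and a.get("href"):
            rel = set(a.get("rel", "").lower().split())
            is_feed = a.get("type", "").lower() in FEED_TYPES
            if "nofollow" not in rel and not is_feed:
                self.candidates.append(a["href"])
        # <link rel="alternate" ...> feed autodiscovery is deliberately ignored.

    def links(self):
        """Return the hrefs that may be crawled from this page."""
        return [] if self.page_nofollow else list(self.candidates)


def build_request(url):
    """Build a plain GET request that carries no Referer: header."""
    return Request(url, headers={"User-Agent": "ExampleSpider/0.1"})
```

Feeding a page to the extractor with `feed()` and then calling `links()` gives back only the plain links; nofollow links, feed links, and form targets never reach the crawl queue in the first place.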

(What I consider a binary file instead of web content is fuzzy, but my minimal list is things like ISO images, tar files, RPMs, and audio and video files. I do not care if you want to build an MP3 or ISO search engine; such files are too large to be pulled automatically.)
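One hedged way a spider could apply this is a check on the URL's extension and, when it is known, the Content-Type. The extension set and the `looks_binary` name below are assumptions for the sake of the sketch, not a definitive list; the fuzziness described above means any real list would be a judgment call.

```python
import posixpath
from urllib.parse import urlsplit

# Assumption: an illustrative, deliberately incomplete set of extensions.
BINARY_EXTENSIONS = {
    ".iso", ".tar", ".gz", ".tgz", ".bz2", ".rpm",
    ".mp3", ".ogg", ".wav", ".avi", ".mpg", ".mp4",
}
BINARY_TYPE_PREFIXES = ("audio/", "video/")


def looks_binary(url, content_type=None):
    """Guess whether a URL is a large binary file rather than web content."""
    path = urlsplit(url).path.lower()
    ext = posixpath.splitext(path)[1]  # "x.tar.gz" yields ".gz"
    if ext in BINARY_EXTENSIONS:
        return True
    if content_type:
        # Drop any parameters such as "; charset=utf-8".
        ctype = content_type.split(";")[0].strip().lower()
        return ctype.startswith(BINARY_TYPE_PREFIXES)
    return False
```

A spider could call this before queueing a URL, and again on the Content-Type of a HEAD response, so that such files are skipped instead of being pulled over and over.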

Not following this list means that I will block your spider out of hand the moment I spot you in my logs, and violating any of these stands out like a sore thumb when I actually look. Following it doesn't mean that I'll actually like your spider, but that's another entry.

(So many spiders do not obey nofollow that I usually only do partial blocks for that.)

Disclaimer: I reserve the right to amend this list as more things occur to me, probably as the result of seeing yet more badly behaved web spiders.

web/SpiderTechnicalRequirements written at 23:41:33
