My web spider technical requirements

November 23, 2007

I have been looking at the web server logs lately, which is usually an activity that's calculated to put me into a grumpy mood. As a result of this, I have decided to write down in one place my technical requirements for web spiders, the direct things that I expect them to do and not do.

(This excludes issues of basic functionality, like correctly parsing HTML, resolving relative links, and making correct HTTP requests, and basic proper web spider behavior like respecting robots.txt.)
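(For the record, respecting robots.txt is not hard; Python has shipped a parser for it in the standard library for ages. A minimal sketch, with a made-up spider name and URLs:)

```python
# Check robots.txt rules before fetching a URL, using Python's
# standard urllib.robotparser. The user agent name and URLs here
# are illustrative examples, not real ones.
from urllib.robotparser import RobotFileParser

# In a real spider you would fetch http://example.com/robots.txt;
# here we parse an inline example to show the mechanics.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# can_fetch() answers: may this user agent retrieve this URL?
ok = rp.can_fetch("ExampleSpider/1.0", "http://example.com/page")
blocked = rp.can_fetch("ExampleSpider/1.0", "http://example.com/private/x")
```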

So, my minimum technical requirements for spiders:

  • obey nofollow. This is not optional.
  • do not send Referer: headers, especially ones chosen at random.
  • do not issue GET requests for URLs you've only seen as POST targets.
  • do not crawl binary files repeatedly, especially when they are large.
  • do not crawl clearly marked syndication feeds, especially repeatedly.
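(The first item on the list is also not hard. A sketch of link extraction that honours nofollow, using Python's standard html.parser; the class name is my own invention, and a real spider would of course do much more:)

```python
# Extract crawlable links from HTML while skipping any <a> tag
# marked rel="nofollow". Uses only Python's standard library.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        d = dict(attrs)
        # rel may hold several space-separated tokens,
        # e.g. rel="external nofollow", so split before testing.
        rel = (d.get("rel") or "").split()
        if "nofollow" in rel:
            return  # obey nofollow: do not queue this link
        if d.get("href"):
            self.links.append(d["href"])

p = LinkExtractor()
p.feed('<a href="/ok">yes</a> <a rel="nofollow" href="/no">no</a>')
# p.links now contains only "/ok"
```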

(What I consider a binary file instead of web content is fuzzy, but my minimal list is things like ISO images, tar files, RPMs, and audio and video files. I do not care if you want to build an MP3 or ISO search engine; such files are too large to be pulled automatically.)
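(The cheap version of such a filter just looks at the URL's extension before fetching; the extension list below is my minimal one from above, and a careful spider would also check the Content-Type and size of responses:)

```python
# Skip URLs whose path looks like a large binary file, based on
# the file extension alone. The extension set is illustrative.
from urllib.parse import urlsplit

SKIP_EXTENSIONS = (".iso", ".tar", ".tar.gz", ".tgz", ".rpm",
                   ".mp3", ".avi", ".mpg")

def looks_binary(url):
    # Examine only the path component, lowercased, so query
    # strings and mixed-case extensions don't confuse the test.
    path = urlsplit(url).path.lower()
    return path.endswith(SKIP_EXTENSIONS)
```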

Not following this list means that I will block your spider out of hand the moment I spot you in my logs, and violating any of these stands out like a sore thumb when I actually look. Following it doesn't mean that I'll actually like your spider, but that's another entry.

(So many spiders do not obey nofollow that I usually only do partial blocks for that.)

Disclaimer: I reserve the right to amend this list as more things occur to me, probably as the result of seeing yet more badly behaved web spiders.

Comments on this page:

From at 2007-11-25 13:28:42:

One thing about blocking spiders: unless it is by subnet, blocking user agents may not work. There is no reason for them not to just say that they are Firefox or IE, especially if they aren't courteous enough to follow robots.txt.

By cks at 2007-11-26 23:05:47:

I'm really only talking about 'legitimate' or 'real' web spiders, things that are actually trying to be socially acceptable; these are going to read robots.txt and have non-lying user-agents and so on, because that is part of being a socially acceptable web spider these days. There's very little point in talking about how I want malicious web spiders to behave, partly because they aren't going to behave and partly because it wouldn't matter if they did, I'd still want to block them.

(I just blathered about the naming issue at length in WebSpiderMeaning.)



Last modified: Fri Nov 23 23:41:33 2007