My web spider technical requirements
I have been looking at the web server logs lately, which is usually an activity that's calculated to put me into a grumpy mood. As a result of this, I have decided to write down in one place my technical requirements for web spiders, the direct things that I expect them to do and not do.
(This excludes issues of basic functionality, like correctly parsing HTML, resolving relative links, and making correct HTTP requests, and basic proper web spider behavior like respecting robots.txt.)
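(As a rough illustration of that last bit, here is a minimal sketch of a robots.txt check using Python's standard urllib.robotparser; the "examplebot" user agent name is just a placeholder, not anything I actually run or endorse.)

```python
import urllib.robotparser
from urllib.parse import urlsplit, urlunsplit

def allowed_by_robots(url, user_agent="examplebot"):
    """Return True if the site's robots.txt permits this user agent to fetch url."""
    parts = urlsplit(url)
    robots_url = urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()  # fetch and parse the site's robots.txt
    return rp.can_fetch(user_agent, url)
```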
So, my minimum technical requirements for spiders:
- obey nofollow. This is not optional.
- do not send Referer: headers, especially ones chosen at random.
- do not issue GET requests for URLs you've only seen as POST targets.
- do not crawl binary files repeatedly, especially when they are large.
- do not crawl clearly marked syndication feeds, especially repeatedly.
(What I consider a binary file instead of web content is fuzzy, but my minimal list is things like ISO images, tar files, RPMs, and audio and video files. I do not care if you want to build an MP3 or ISO search engine; such files are too large to be pulled automatically.)
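(To make several items on this list concrete, here is a hedged sketch of the sort of link filtering I mean. The extension and feed lists are illustrative, not exhaustive, and should_crawl is a hypothetical name, not anyone's real API.)

```python
from urllib.parse import urlsplit

# Illustrative, not exhaustive: the kinds of large binary files I mean.
BINARY_EXTENSIONS = {".iso", ".tar", ".tgz", ".rpm", ".mp3", ".avi", ".mp4"}
# Illustrative markers of clearly labelled syndication feeds.
FEED_HINTS = ("/atom.xml", "/rss.xml", "/feed/")

def should_crawl(url, rel_attrs=(), seen_only_as_post_target=False):
    """Return True only if none of the requirements above forbid fetching url."""
    if "nofollow" in rel_attrs:         # obey nofollow; not optional
        return False
    if seen_only_as_post_target:        # never GET a URL only seen as a POST target
        return False
    path = urlsplit(url).path.lower()
    if any(path.endswith(ext) for ext in BINARY_EXTENSIONS):
        return False                    # large binary files: leave them alone
    if any(hint in path for hint in FEED_HINTS):
        return False                    # clearly marked syndication feeds
    return True
```

(And when a spider does fetch something, the Referer: requirement is simply a matter of not adding that header to the request at all, rather than inventing one.)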
Not following this list means that I will block your spider out of hand the moment I spot you in my logs, and violating any of these stands out like a sore thumb when I actually look. Following it doesn't mean that I'll actually like your spider, but that's another entry.
(So many spiders do not obey nofollow that I usually only do partial blocks for that.)
Disclaimer: I reserve the right to amend this list as more things occur to me, probably as the result of seeing yet more badly behaved web spiders.