== My web spider technical requirements

I have been looking at the web server logs lately, an activity that usually puts me into a grumpy mood. As a result, I have decided to write down in one place my technical requirements for web spiders, the direct things that I expect them to do and not do.

(This excludes issues of basic functionality, like correctly parsing HTML, resolving relative links, and making correct HTTP requests, and basic proper web spider behavior like respecting _robots.txt_.)

So, my minimum technical requirements for spiders:

* obey _nofollow_. This is [[not optional RespectTheNofollow]].
* do not send _Referer:_ headers, especially ones [[chosen at random HowToGetYourSpiderBannedIV]].
* do not issue _GET_ requests for URLs you've [[only seen as _POST_ targets URLNamespaces]].
* do not crawl binary files repeatedly, especially when they are [[large MSNbotBinariesProblem]].
* do not crawl clearly marked syndication feeds, especially repeatedly.

(What I consider a binary file instead of web content is fuzzy, but my minimal list is things like ISO images, tar files, RPMs, and audio and video files. I do not care if you want to build an MP3 or ISO search engine; such files are *too large* to be pulled automatically.)

Not following this list means that I will block your spider out of hand the moment I spot you in my logs, and a violation of any of these stands out like a sore thumb when I actually look. Following it doesn't mean that I'll actually like your spider, but that's another entry.

(So many spiders do not obey _nofollow_ that I usually only do partial blocks for that.)

Disclaimer: I reserve the right to amend this list as more things occur to me, probably as the result of seeing yet more badly behaved web spiders.
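
For spider authors, the requirements above might translate into link filtering along the following lines. This is only a rough sketch using Python's standard library, not a description of any real spider. The class name _PoliteLinkCollector_, the extension list, the feed MIME types, and the user agent string are all illustrative assumptions; in particular, it assumes _nofollow_ shows up as a _rel_ value on links and that a 'clearly marked' feed is a link carrying a feed MIME type.

```python
import posixpath
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed, deliberately short list based on the examples in the text
# (ISO images, tar files, RPMs, audio and video files).
BINARY_EXTENSIONS = {
    ".iso", ".tar", ".gz", ".tgz", ".bz2", ".rpm",
    ".mp3", ".ogg", ".wav", ".avi", ".mpg", ".mpeg", ".mov",
}

# Assumed markers for "clearly marked" syndication feeds.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class PoliteLinkCollector(HTMLParser):
    """Collect only the links on a page that the list above permits crawling."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.to_crawl = []         # URLs that may be fetched with GET
        self.post_targets = set()  # recorded for inspection, never queued

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            # A form's POST target is not a link; do not GET it.
            method = (attrs.get("method") or "get").lower()
            if method == "post" and attrs.get("action"):
                self.post_targets.add(urljoin(self.base_url, attrs["action"]))
            return
        if tag != "a" or not attrs.get("href"):
            return
        # Obey nofollow.
        if "nofollow" in (attrs.get("rel") or "").lower().split():
            return
        # Skip links explicitly typed as syndication feeds.
        if (attrs.get("type") or "").lower() in FEED_TYPES:
            return
        url = urljoin(self.base_url, attrs["href"])
        # Skip things that look like large binary files.
        extension = posixpath.splitext(urlparse(url).path)[1].lower()
        if extension in BINARY_EXTENSIONS:
            return
        self.to_crawl.append(url)


def build_request(url):
    """Build a GET request that identifies the spider and sends no Referer."""
    # Hypothetical user agent string, for illustration only.
    headers = {"User-Agent": "ExampleSpider/0.1"}
    return urllib.request.Request(url, headers=headers, method="GET")
```

Feeding a page to the collector with _feed()_ leaves the permitted URLs in _to_crawl_. This sketch does nothing about the 'repeatedly' part of the list; avoiding refetches needs per-URL state that is outside its scope.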