== Stupid web spider tricks

In the spirit of [[earlier HowToGetYourSpiderBanned]] [[entrants HowToGetYourSpiderBannedII]], but not as bad, here are some stupid web spider tricks.

The first stupid trick: crawling 'Add Comment' pages. Not only are the 'Add Comment' links marked _nofollow_ (so good little spiders shouldn't be going there), but it's also a great way to make me wonder if you're a would-be comment spammer and pay close attention to every CSpace page you hit. CSpace gets sufficiently few page views at the moment that I can read all of the server logs, so I *will* notice; a small log-scanning script, like the sketch at the end of this entry, is all it takes.

(All sorts of web spiders seem to find the 'Add Comment' links especially tasty for some reason; it's quite striking. I'm pretty sure they're the most common _nofollow_ links for web spiders to crawl.)

The second stupid trick: including a URL explaining your spider, but having [[that URL http://wortschatz.uni-leipzig.de/findlinks/]] be a '403 permission denied' error page. Fortunately for my irritation level, I could find a copy in [[Google's cache http://www.google.com/search?hl=en&q=findlinks+%22uni-leipzig.de%22]] (pick the cached version of the obvious web page), and it more or less explained what the web spider was doing.

Thus, today's entrant is the 'findlinks' web spider, crawling from various [[139.18.2.*|]] and [[139.18.13.*|]] IP addresses (which belong to uni-leipzig.de) plus a few hits from 80.237.144.96 (which doesn't seem to). The spider seems to be a distributed one, where any client machine that uses the software can crawl you. (I'm not sure I like distributed crawlers.)

On a side note, I derive a certain amount of amusement from seeing English Apache error messages on a foreign-language website.

(Other information on the findlinks spider: in [[this huge database of spiders http://www.psychedelix.com/agents/index.shtml]] or [[here http://www.internetofficer.com/web-robot/findlinks.html]].)
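
Sidebar for the curious: here is a minimal sketch of the sort of log reading that spots trick number one, assuming Apache's 'combined' log format. The log path and the 'writecomment' URL fragment are made-up stand-ins for illustration, not CSpace's actual setup.

  #!/usr/bin/env python
  # A minimal sketch: tally who is hitting 'Add Comment' style URLs
  # in an Apache 'combined' format access log. The log path and the
  # URL marker below are illustrative assumptions, not CSpace's real
  # configuration.
  import collections
  import re

  LOGFILE = "/var/log/apache2/access.log"   # hypothetical log location
  MARKER = "writecomment"                    # hypothetical comment-URL fragment

  # combined format: host ident user [date] "METHOD url PROTO" status size "referer" "agent"
  LOGLINE = re.compile(
      r'^(\S+) \S+ \S+ \[[^]]+\] "(?:GET|POST) (\S+)[^"]*" \d+ \S+ "[^"]*" "([^"]*)"')

  hits = collections.Counter()
  with open(LOGFILE) as fp:
      for line in fp:
          m = LOGLINE.match(line)
          if m and MARKER in m.group(2):
              ip, url, agent = m.groups()
              hits[(ip, agent)] += 1

  # Spiders crawling nofollow'd comment links stand out immediately.
  for (ip, agent), count in hits.most_common(20):
      print("%5d  %-15s  %s" % (count, ip, agent))

Anything that shows up repeatedly in a report like this is either a would-be comment spammer or a spider ignoring _nofollow_; either way, it gets my attention.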