Stupid web spider tricks
The first stupid trick: crawling 'Add Comment' pages. Not only are the
'Add Comment' links marked nofollow (so good little spiders shouldn't
be going there), but it's also a great way to make me wonder if you're
a would-be comment spammer and pay close attention to every CSpace page
you hit. CSpace gets sufficiently few page views at the moment that I
can read all of the server logs, so I will notice.
(All sorts of web spiders seem to find the 'Add Comment' links
especially tasty for some reason; it's quite striking. I'm pretty sure
they're the most common nofollow links for web spiders to crawl.)
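For spider authors wondering what being a 'good little spider' looks like in practice, here is a minimal sketch in Python that collects links from a page while skipping anything marked rel="nofollow". The names FollowableLinkParser and followable_links are made up for illustration, and the URLs in the usage note are hypothetical, not CSpace's real ones.

```python
from html.parser import HTMLParser


class FollowableLinkParser(HTMLParser):
    """Collect href values from <a> tags, skipping any marked rel="nofollow"."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands us tag and attribute names already lowercased.
        if tag != "a":
            return
        attr_map = dict(attrs)
        href = attr_map.get("href")
        # rel is a space-separated token list; compare case-insensitively.
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if href and "nofollow" not in rel_tokens:
            self.links.append(href)


def followable_links(html):
    """Return the hrefs in html that a polite spider may queue for crawling."""
    parser = FollowableLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links
```

Fed a page with one ordinary link to a hypothetical /blog/entry and one rel="nofollow" link to a hypothetical /blog/entry?addcomment, followable_links returns only the first. A real crawler would also want to honor robots.txt; this sketch only covers the link attribute.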
The second stupid trick: including a URL explaining your spider, but having that URL be a '403 permission denied' error page. Fortunately for my irritation level, I could find a copy in Google's cache (pick the cached version of the obvious web page) and it more or less explained what the web spider was doing.
Thus, today's entrant is the 'findlinks' web spider, from various 139.18.2.* and 139.18.13.* IP addresses (which belong to uni-leipzig.de) plus a few hits from 220.127.116.11 (which doesn't seem to). The spider seems to be a distributed one, where any client machine that uses the software can crawl you. (I'm not sure I like distributed crawlers.)
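If you want to pick this spider's hits out of your own logs by address, something like the following might do as a starting point. It is only a sketch: it assumes '139.18.2.*' and '139.18.13.*' can be read as two /24 networks, and the names FINDLINKS_NETWORKS and is_findlinks_address are invented for the example.

```python
import ipaddress

# Assumption: the two wildcard ranges are treated as /24 networks.
FINDLINKS_NETWORKS = [
    ipaddress.ip_network("139.18.2.0/24"),
    ipaddress.ip_network("139.18.13.0/24"),
]


def is_findlinks_address(ip_text):
    """Return True if ip_text falls inside one of the assumed findlinks ranges."""
    try:
        ip = ipaddress.ip_address(ip_text)
    except ValueError:
        # Not a parseable IP address (e.g. a hostname in the log field).
        return False
    return any(ip in network for network in FINDLINKS_NETWORKS)
```

Matching on the User-Agent string would be the other obvious approach, but with a distributed crawler the address check at least shows which hits really came from the university's own machines.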
On a side note, I derive a certain amount of amusement from seeing English Apache error messages on a foreign-language website.