Stupid web spider tricks

February 18, 2006

In the spirit of earlier entrants, but not as bad, here are some stupid web spider tricks.

The first stupid trick: crawling 'Add Comment' pages. Not only are the 'Add Comment' links marked nofollow (so good little spiders shouldn't be going there), but it's also a great way to make me wonder if you're a would-be comment spammer and pay close attention to every CSpace page you hit. CSpace gets sufficiently few page views at the moment that I can read all of the server logs, so I will notice.

(All sorts of web spiders seem to find the 'Add Comment' links especially tasty for some reason; it's quite striking. I'm pretty sure they're the most common nofollow links for web spiders to crawl.)
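For spider authors, skipping nofollow links takes very little code. Here is a minimal sketch in Python using only the standard library's html.parser; the names FollowableLinks and followable_links are made up for illustration, and this is not how any particular spider does it. It assumes the nofollow marker is in the link's rel attribute, which is a space-separated list of tokens.

```python
from html.parser import HTMLParser


class FollowableLinks(HTMLParser):
    """Collect href values from <a> tags, skipping rel="nofollow" links."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href = attrs.get("href")
        # rel holds space-separated tokens; compare them case-insensitively.
        rel = (attrs.get("rel") or "").lower().split()
        if href and "nofollow" not in rel:
            self.links.append(href)


def followable_links(html):
    """Return the links in html that a polite spider may crawl."""
    parser = FollowableLinks()
    parser.feed(html)
    parser.close()
    return parser.links
```

A spider would queue only what followable_links() returns for each page it fetches, so a link like an 'Add Comment' one marked nofollow never gets into the crawl queue in the first place.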

The second stupid trick: including a URL explaining your spider, but having that URL be a '403 permission denied' error page. Fortunately for my irritation level, I could find a copy in Google's cache (pick the cached version of the obvious web page) and it more or less explained what the web spider was doing.

Thus, today's entrant is the 'findlinks' web spider, from various 139.18.2.* and 139.18.13.* IP addresses (which belong to plus a few hits from (which doesn't seem to). The spider seems to be a distributed one, where any client machine that uses the software can crawl you. (I'm not sure I like distributed crawlers.)

On a side note, I derive a certain amount of amusement from seeing English Apache error messages on a foreign language website.

(Other information on the findlinks spider: in this huge database of spiders or here.)

Written on 18 February 2006.

Last modified: Sat Feb 18 02:35:29 2006