Wandering Thoughts archives


How to get your web spider banned from here

To get your web spider banned here, do as many of the following as possible:

  • make a lot of requests, so that we notice you. 1,500 over two days, for example.

  • make repeated rapid requests for the same unchanging pages, in case the sheer volume didn't get our attention. Over the same two days, fetch something that hasn't changed since June 18th eight times and a bunch of other similar pages seven times each.

  • to make sure we notice you, crawl through links marked nofollow. It's advisory, so never mind that the major search engines don't do this and people have come to expect nofollow can be used as a 'keep out' sign that's more flexible than robots.txt.

  • use an uninformative and generic user agent string, like "Jakarta Commons-HttpClient/3.0-rc4".

  • fetch our robots.txt file desultorily and only several days after you start visiting.

  • keep up the mystery of who you are by making all of your requests from machines with no reverse DNS, like the machines that hit us from

  • once we've identified your subnet as belonging to 'Meaningful Machines', on no account have your own contact information in the WHOIS data. I enjoy Googling to try to find the website of spiders crawling us; it makes my life more exciting.

  • once I have found out that meaningfulmachines.com is your domain, make sure that your website has no visible information on your spidering activities. For bonus points, try to have no real information at all.

  • extra bonus points are awarded for generic contact addresses that look suspiciously like autoresponders, or at least possible inputs to marketing email lists. (In this day and age, I don't mail 'information@<anywhere>' to reach technical people.)
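
For contrast, here is a minimal sketch of the two cheapest courtesies from the list above: checking robots.txt before fetching anything, and skipping links marked nofollow. It uses only the Python standard library. The user agent string, its URL, and the helper names (make_robots_checker, may_fetch, followable_links) are hypothetical placeholders of mine, not anything a real spider is known to use.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical user agent: it names the spider and says where to read about it.
USER_AGENT = "ExampleSpider/1.0 (+http://spider.example.com/about.html)"


def make_robots_checker(robots_txt):
    """Build a parser from the text of a robots.txt fetched before crawling."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser, url):
    """Return True if robots.txt allows our user agent to fetch this URL."""
    return parser.can_fetch(USER_AGENT, url)


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, skipping any marked rel="nofollow"."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        # rel can hold several space-separated tokens, or be missing entirely.
        rel = (attr.get("rel") or "").lower().split()
        if href and "nofollow" not in rel:
            self.links.append(href)


def followable_links(html):
    """Return the links in a page that a polite spider may follow."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links
```

A real spider would also need rate limiting and conditional GETs so that it doesn't re-fetch unchanged pages; this sketch leaves those out.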

Since I have no desire to block everyone using the Jakarta Commons code and no strong belief that Meaningful Machines is paying much attention to our robots.txt anyways, their subnet now resides in our permanent kernel-level IP blocks.

(PS: yes, I sent them email about this last week, to their domain contact address. I haven't received any reply, not that I really expected one.)

Some Googling suggests that I am not alone in having problems with them; one poster on webmasterworld.com (which blocks direct links, so I can't give you a URL) reported seeing 60,000 requests in an hour (and no fetching of robots.txt) in late May 2005. You may want to peruse your own logs for requests from the subnet and take appropriate action.
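
If you want to check your own logs, something like the following rough sketch may help. It assumes Apache-style "combined" (or "common") access log lines, and the function name summarize_subnet is my own invention. You have to supply the subnet yourself; the 192.0.2.0/24 in the usage note is a documentation-only placeholder, not their actual address range.

```python
import ipaddress
import re
from collections import Counter

# Assumes Apache "common" or "combined" log lines; adjust for other formats.
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]*\] '
    r'"(?:\S+) (?P<path>\S+)[^"]*" \d{3} \S+'
    r'(?: "[^"]*" "(?P<agent>[^"]*)")?'
)


def summarize_subnet(lines, subnet):
    """Count requests from one subnet, broken down by path and user agent.

    Returns (total, per_path, agents); the last two are Counters.
    """
    network = ipaddress.ip_network(subnet)
    per_path = Counter()
    agents = Counter()
    total = 0
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # not a line in the assumed format
        try:
            addr = ipaddress.ip_address(match.group("host"))
        except ValueError:
            continue  # the log has a hostname here, not an IP address
        if addr not in network:
            continue
        total += 1
        per_path[match.group("path")] += 1
        agents[match.group("agent") or "-"] += 1
    return total, per_path, agents
```

Calling summarize_subnet(open("access.log"), "192.0.2.0/24") gives you the total request count, and per_path.most_common(10) shows which pages were fetched repeatedly. per_path["/robots.txt"] tells you whether the spider ever asked for robots.txt at all.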

web/HowToGetYourSpiderBanned written at 01:27:02
