How to get your web spider banned from here

December 21, 2005

To get your web spider banned here, do as many of the following as possible:

  • make a lot of requests, so that we notice you. 1,500 over two days, for example.

  • make repeated rapid requests for the same unchanging pages, in case the sheer volume didn't get our attention. Over the same two days, fetch something that hasn't changed since June 18th eight times and a bunch of other similar pages seven times each.

  • to make sure we notice you, crawl through links marked nofollow. It's only advisory, so never mind that the major search engines don't follow such links and that people have come to expect nofollow can be used as a 'keep out' sign that's more flexible than robots.txt.

  • use an uninformative and generic user agent string, like "Jakarta Commons-HttpClient/3.0-rc4".

  • fetch our robots.txt file desultorily, and only several days after you start visiting. (A sketch of the well-behaved approach appears after this list.)

  • keep up the mystery of who you are by making all of your requests from machines with no reverse DNS, like the machines that hit us from 64.94.163.128/27.

  • once we've identified your subnet as belonging to 'Meaningful Machines', on no account have your own contact information in the WHOIS data. I enjoy Googling to try to find the website of spiders crawling us; it makes my life more exciting.

  • once I have found out that meaningfulmachines.com is your domain, make sure that your website has no visible information on your spidering activities. For bonus points, try to have no real information at all.

  • extra bonus points are awarded for generic contact addresses that look suspiciously like autoresponders, or at least possible inputs to marketing email lists. (In this day and age, I don't mail 'information@<anywhere>' to reach technical people.)
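
For contrast, none of this is hard to get right. Below is a minimal sketch of the polite version in Python; everything in it is illustrative, from the crawler name to the URLs, and the spider in question is actually Java code using Jakarta Commons HttpClient, but the same ideas apply in any language: announce who you are, read robots.txt before you crawl, make conditional requests so unchanged pages aren't re-downloaded, and space your requests out.

    import time
    import urllib.error
    import urllib.request
    import urllib.robotparser

    # All names and URLs here are made up for illustration.
    USER_AGENT = "ExampleCrawler/1.0 (+http://example.com/about-crawler.html)"
    BASE = "http://example.com"

    # Fetch and parse robots.txt before crawling anything else.
    robots = urllib.robotparser.RobotFileParser(BASE + "/robots.txt")
    robots.read()

    def polite_fetch(url, last_modified=None):
        """Fetch url only if robots.txt allows it, identifying ourselves and
        letting the server answer '304 Not Modified' for unchanged pages."""
        if not robots.can_fetch(USER_AGENT, url):
            return None
        req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        if last_modified:
            # Conditional request: don't re-download pages that haven't changed.
            req.add_header("If-Modified-Since", last_modified)
        try:
            return urllib.request.urlopen(req)
        except urllib.error.HTTPError as err:
            if err.code == 304:
                return None       # unchanged since last time, nothing to do
            raise

    for url in (BASE + "/", BASE + "/some/page"):
        polite_fetch(url)
        time.sleep(10)            # seconds between requests, not many requests a second

The remaining courtesy is to skip links marked rel="nofollow" when you extract URLs from the pages you fetch.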

Since I have no desire to block everyone using the Jakarta Commons code and no strong belief that Meaningful Machines is paying much attention to our robots.txt anyway, their subnet now resides in our permanent kernel-level IP blocks.
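
What a 'kernel-level IP block' looks like depends on the system; on a Linux machine, to pick one common case, blocking the whole subnet is a single iptables rule:

    iptables -I INPUT -s 64.94.163.128/27 -j DROP

Dropping the packets outright, rather than rejecting them, also has the pleasant side effect of making the spider sit through connection timeouts.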

(PS: yes, I sent them email about this last week, to their domain contact address. I haven't received any reply, not that I really expected one.)

Some Googling suggests that I am not alone in having problems with them; one poster on webmasterworld.com (which blocks direct links, so I can't give you a URL) reported seeing 60,000 requests in an hour (and no fetching of robots.txt) in late May 2005. You may want to peruse your own logs for requests from the 64.94.163.128/27 subnet and take appropriate action.
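
If you want to do that check, here's a quick sketch in Python; it assumes an Apache-style access log with the client address as the first whitespace-separated field on each line, so adjust it for whatever your logs actually look like:

    import ipaddress
    import sys

    BAD_NET = ipaddress.ip_network("64.94.163.128/27")

    hits = 0
    for line in open(sys.argv[1]):
        fields = line.split(None, 1)
        if not fields:
            continue                # skip blank lines
        try:
            if ipaddress.ip_address(fields[0]) in BAD_NET:
                hits += 1
        except ValueError:
            pass                    # first field wasn't an IP address
    print(hits, "requests from", BAD_NET)

Run it against each access log in turn; anything much above zero is a sign you may want to reach for your own IP blocks.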


Comments on this page:

From 192.88.60.254 at 2005-12-21 11:22:31:

The weird thing is, over the past month the only thing they seem to be getting from snowplow.org is /robots.txt.

Well, no, I take that back - there is one instance in December of them fetching another url. However, they still saw fit to grab /robots.txt 32 times this month. Other months show similar patterns, with the few non-robots.txt urls being the urls on snowplow that I've seen other spiders hit routinely - that is, the ones that are linked from elsewhere. (Though oddly enough they never hit /.)

From 67.23.37.4 at 2006-01-26 00:43:25:

2006/01/25

Three out of five of my webservers serving one site were taken out today by their spider before I could put in my own IP filter. For example, they requested robots.txt from the same website 55 times in one second or "cart.cgi" (a resource-intensive shopping cart) 132 times in one minute. They use multiple machines to gang up on my servers simultaneously (which bypasses my load balancer traffic hashing algorithm) and a different cookie for each request to keep my servers from using cached data. They're evil.

I tried calling their main line, but instead of a receptionist, I got a "leave a message" answering machine with no options to get to a live person.

I have a complaint active with their ISP, InterNAP. If you have similar problems call InterNAP at 1-877-843-4662 and reference ticket number 192194 or email noc@internap.net and put the ticket number on the subject line.

Written on 21 December 2005.