How to get your web spider banned from here
To get your web spider banned here, do as many of the following as
possible:
- make a lot of requests, so that we notice you: 1,500 over two days,
for example.
- make repeated rapid requests for the same unchanging pages, in
case the sheer volume didn't get our attention. Over the same two
days, fetch something that hasn't changed since June 18th eight
times and a bunch of other similar pages seven times each.
- to make sure we notice you, crawl through links marked
nofollow. It's advisory, so never mind that the major search
engines don't do this and people have come to expect that it can
be used as a 'keep out' sign that's more flexible than robots.txt.
- use an uninformative and generic user agent string, like
the default one from the Jakarta Commons code.
- fetch our robots.txt file desultorily and only
several days after you start visiting.
- keep up the mystery of who you are by making all of your requests
from machines with no reverse DNS, like the machines that hit us.
- once we've identified your subnet as belonging to 'Meaningful
Machines', on no account have your own contact information in the
WHOIS data. I enjoy Googling to try to find the website of spiders
crawling us; it makes my life more exciting.
- once I have found out that meaningfulmachines.com is your domain,
make sure that your website has no visible information on your
spidering activities. For bonus points, try to have no real
information at all.
- extra bonus points are awarded for generic contact addresses that
look suspiciously like autoresponders, or at least possible inputs
to marketing email lists. (In this day and age, I don't
mail 'information@<anywhere>' to reach technical people.)
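For contrast, here is roughly what the opposite of several items on this list might look like: a minimal Python sketch of a spider that identifies itself, reads robots.txt before anything else, and leaves nofollow links alone. Everything named here is made up for illustration (the ExampleSpider user agent, its contact URL and address, the example.org base URL, and the helper names); it is a sketch of the idea, not anyone's actual crawler.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical identity: a real spider would put its own name, an
# information URL, and a working contact address here.
USER_AGENT = ("ExampleSpider/0.1 "
              "(+https://spider.example.org/about; spider-admin@example.org)")


class FollowableLinks(HTMLParser):
    """Collect href values from <a> tags that are not marked rel="nofollow"."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        # rel can hold several space-separated tokens, or be missing.
        rel = (attrs.get("rel") or "").lower().split()
        href = attrs.get("href")
        if href and "nofollow" not in rel:
            self.links.append(href)


def load_robots(robots_txt):
    """Parse the text of robots.txt, fetched before any other page."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def links_to_crawl(page_html, robots, base="https://www.example.org"):
    """Return site-relative links that are neither nofollow nor disallowed."""
    collector = FollowableLinks()
    collector.feed(page_html)
    return [href for href in collector.links
            if robots.can_fetch(USER_AGENT, base + href)]
```

This deliberately leaves out request pacing and conditional GETs for unchanged pages, both of which a well-behaved spider would also want.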
Since I have no desire to block everyone using the Jakarta Commons
code and no strong belief that Meaningful Machines is paying much
attention to our robots.txt anyways, their subnet now resides in
our permanent kernel level IP blocks.
(PS: yes, I sent them email about this last week, to their domain
contact address. I haven't received any reply, not that I really
expected one.)
Some Googling suggests that I am not alone in having problems with
them; one poster on webmasterworld.com (which blocks direct links, so
I can't give you a URL) reported seeing 60,000 requests in an hour
(and no fetching of robots.txt) in late May 2005. You may want to
peruse your own logs for requests from the 18.104.22.168/27 subnet and
take appropriate action.
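If you want to check your own logs, something like the following Python sketch would do the counting. It assumes a common-log-format style access log where the client address is the first whitespace-separated field; the function name and that layout are my assumptions, so adjust for whatever your web server actually writes.

```python
import ipaddress
from collections import Counter


def hits_from_subnet(log_lines, subnet):
    """Count requests per client address that falls inside the given subnet."""
    # strict=False accepts a CIDR string even if its host bits are set.
    network = ipaddress.ip_network(subnet, strict=False)
    counts = Counter()
    for line in log_lines:
        fields = line.split(None, 1)
        if not fields:
            continue  # blank line
        try:
            client = ipaddress.ip_address(fields[0])
        except ValueError:
            continue  # first field is a hostname or something unparseable
        if client in network:
            counts[client] += 1
    return counts
```

Feed it an open log file and the subnet as a string, for example `hits_from_subnet(open("access.log"), "192.0.2.0/27")` (a documentation-only subnet used here as a placeholder), and look at which addresses come back with large counts.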