How to get your web spider banned from here
To get your web spider banned here, do as many of the following as possible:
- make a lot of requests, so that we notice you; 1,500 over two days,
for example. (A sketch of polite request pacing follows this list.)
- make repeated rapid requests for the same unchanging pages, in
case the sheer volume didn't get our attention. Over the same two
days, fetch something that hasn't changed since June 18th eight
times, and a bunch of other similar pages seven times each.
(Conditional GETs, sketched after this list, are how you avoid this.)
- to make sure we notice you, crawl through links marked `nofollow`.
It's advisory, so never mind that the major search engines don't follow
such links and that people have come to expect `nofollow` can be used as
a 'keep out' sign that's more flexible than `robots.txt`. (A `nofollow`
check is sketched after this list.)
- use an uninformative and generic user agent string, like
"Jakarta Commons-HttpClient/3.0-rc4". (Setting an informative one takes
a couple of lines; see below.)
- fetch our `robots.txt` file desultorily, and only several days after
you start visiting. (A minimal `robots.txt` check is also sketched below.)
- keep up the mystery of who you are by making all of your requests
from machines with no reverse DNS, like the machines that hit us
from 64.94.163.128/27. (Checking for that is the last sketch below.)
- once we've identified your subnet as belonging to 'Meaningful
Machines', on no account have your own contact information in the
WHOIS data. I enjoy Googling to try to find the website of spiders
crawling us; it makes my life more exciting.
- once I've found out that meaningfulmachines.com is your domain, make sure that your website has no visible information about your spidering activities. For bonus points, try to have no real information at all.
- extra bonus points are awarded for generic contact addresses that look suspiciously like autoresponders, or at least like possible inputs to marketing email lists. (In this day and age, I don't mail 'information@<anywhere>' to reach technical people.)
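For contrast, polite pacing is not hard, even with the same Jakarta Commons HttpClient that hit us. This is a minimal sketch against the 3.x API; the URLs and the five-second delay are made-up illustrative values:

```java
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;

public class PoliteFetcher {
    // A few seconds between requests keeps even a small site from
    // noticing you for all the wrong reasons; the value is arbitrary.
    private static final long DELAY_MS = 5000;

    public static void main(String[] args) throws Exception {
        HttpClient client = new HttpClient();
        String[] urls = { "http://example.org/a", "http://example.org/b" };
        for (String url : urls) {
            GetMethod get = new GetMethod(url);
            try {
                client.executeMethod(get);
                // ... process get.getResponseBodyAsStream() here ...
            } finally {
                get.releaseConnection();
            }
            Thread.sleep(DELAY_MS); // pause before the next request
        }
    }
}
```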
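Avoiding refetches of unchanged pages is what conditional GET is for. A sketch, again against the HttpClient 3.x API, with a hypothetical URL; a real spider would persist the Last-Modified (or ETag) value between runs instead of holding it in one process:

```java
import org.apache.commons.httpclient.Header;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;

public class ConditionalFetch {
    public static void main(String[] args) throws Exception {
        HttpClient client = new HttpClient();
        String url = "http://example.org/some/page"; // hypothetical URL

        // First visit: fetch the page and remember its Last-Modified date.
        GetMethod first = new GetMethod(url);
        client.executeMethod(first);
        Header lastModified = first.getResponseHeader("Last-Modified");
        first.releaseConnection();

        // Later visits: send If-Modified-Since. A 304 response means the
        // page hasn't changed and the server sent no body at all.
        GetMethod revisit = new GetMethod(url);
        if (lastModified != null) {
            revisit.setRequestHeader("If-Modified-Since", lastModified.getValue());
        }
        int status = client.executeMethod(revisit);
        if (status == HttpStatus.SC_NOT_MODIFIED) {
            System.out.println("unchanged; reuse the copy we already have");
        }
        revisit.releaseConnection();
    }
}
```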
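Honoring `nofollow` doesn't take much either. This is a crude regex-based sketch; a real spider would use an HTML parser, but the idea is the same:

```java
import java.util.regex.Pattern;

public class NofollowCheck {
    // Crude check for a rel attribute containing 'nofollow' inside an
    // <a ...> tag's attribute text.
    private static final Pattern REL_NOFOLLOW =
            Pattern.compile("rel\\s*=\\s*[\"']?[^\"'>]*\\bnofollow\\b",
                            Pattern.CASE_INSENSITIVE);

    public static boolean shouldFollow(String anchorTag) {
        return !REL_NOFOLLOW.matcher(anchorTag).find();
    }

    public static void main(String[] args) {
        System.out.println(shouldFollow("<a href=\"/x\">ok</a>"));                    // true
        System.out.println(shouldFollow("<a rel=\"nofollow\" href=\"/x\">no</a>"));   // false
    }
}
```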
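Setting an informative user agent string is a couple of lines in HttpClient 3.x. The bot name, URL, and contact address here are obviously placeholders:

```java
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.params.HttpMethodParams;

public class NamedBot {
    public static void main(String[] args) {
        HttpClient client = new HttpClient();
        // Replace the default "Jakarta Commons-HttpClient/3.x" string with
        // one that names the bot and tells admins where to find out more.
        client.getParams().setParameter(HttpMethodParams.USER_AGENT,
                "ExampleBot/0.1 (+http://example.org/bot.html; bot-admin@example.org)");
    }
}
```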
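A minimal `robots.txt` check is not much more work. This sketch handles only the `User-agent: *` section and ignores niceties like `Allow` lines and wildcards, so treat it as an illustration rather than a complete parser:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;

public class RobotsCheck {
    /** Collects Disallow path prefixes from the 'User-agent: *' section. */
    public static List<String> disallowedPrefixes(HttpClient client, String site)
            throws Exception {
        List<String> prefixes = new ArrayList<String>();
        GetMethod get = new GetMethod(site + "/robots.txt");
        try {
            if (client.executeMethod(get) != HttpStatus.SC_OK) {
                return prefixes; // no robots.txt: nothing is disallowed
            }
            boolean applies = false;
            for (String line : get.getResponseBodyAsString().split("\n")) {
                line = line.trim();
                if (line.toLowerCase().startsWith("user-agent:")) {
                    applies = line.substring(11).trim().equals("*");
                } else if (applies && line.toLowerCase().startsWith("disallow:")) {
                    String path = line.substring(9).trim();
                    if (path.length() > 0) prefixes.add(path);
                }
            }
        } finally {
            get.releaseConnection();
        }
        return prefixes;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(disallowedPrefixes(new HttpClient(), "http://example.org"));
    }
}
```

A real spider would also cache this result per site instead of refetching `robots.txt` for every URL, and would match its own user-agent token as well as `*`.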
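Finally, checking whether one of your machines has reverse DNS is nearly a one-liner in plain Java; `getCanonicalHostName()` does the PTR lookup (live, over the network) and falls back to the literal address text when there's no verifiable record. The address below is just the start of that subnet:

```java
import java.net.InetAddress;

public class ReverseDnsCheck {
    public static void main(String[] args) throws Exception {
        InetAddress addr = InetAddress.getByName("64.94.163.128");
        String name = addr.getCanonicalHostName();
        // With no (verifiable) PTR record this is just the IP text again.
        if (name.equals(addr.getHostAddress())) {
            System.out.println(name + " has no reverse DNS");
        } else {
            System.out.println(name);
        }
    }
}
```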
Since I have no desire to block everyone using the Jakarta Commons code, and no strong belief that Meaningful Machines is paying much attention to our robots.txt anyway, their subnet now resides in our permanent kernel-level IP blocks.
(PS: yes, I sent them email about this last week, to their domain contact address. I haven't received any reply, not that I really expected one.)
Some Googling suggests that I am not alone in having problems with
them; one poster on webmasterworld.com (which blocks direct links, so
I can't give you a URL) reported seeing 60,000 requests in an hour
(and no fetching of `robots.txt`) in late May 2005. You may want to
peruse your own logs for requests from the 64.94.163.128/27 subnet and
take appropriate action.