Why Nutch-based web spiders are now blocked here
Apache Nutch is, to quote its web page, "a well matured, production ready Web crawler". More specifically, it's a web crawler engine, which people can take and use to create web crawlers themselves. However, it has a little issue, which I put in a tweet:
If you write a web crawler engine, you should make it very hard for people to not fill in all of the information for a proper user-agent string (such as a URL explaining the crawling). Apache Nutch, I'm looking at you, given UA's of eg "test search engine/Nutch-1.19-SNAPSHOT".
I have some views on what
User-Agent headers should include. Including an explanatory URL for your web
crawler is one of my requirements; web crawlers that don't have it
and draw my attention tend to get blocked here. In the case of
Nutch, my attention was first drawn to a specific aggressive crawler,
but then, as I looked further, more and more Nutch-based crawlers
came out of the woodwork, including the example in the tweet, none
of them with proper identification. Since this is
a systematic issue with Nutch-based crawlers, I decided that I was
not interested in playing whack-a-mole with whatever people came
up with next and I was instead going to deal with the whole issue
in one shot.
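To make the requirement concrete, here is a small sketch of the idea (the bot name and URL below are invented examples, and this check is only an illustration, not code I actually run here):

```python
import re

def has_explanatory_url(user_agent: str) -> bool:
    """Return True if the User-Agent carries a URL explaining the crawler."""
    return re.search(r"https?://\S+", user_agent) is not None

# A properly identified crawler includes a pointer to information about it
# (hypothetical example):
good_ua = "ExampleBot/2.1 (+https://example.org/bot-info.html)"
# The Nutch example from the tweet has no such URL:
bad_ua = "test search engine/Nutch-1.19-SNAPSHOT"

print(has_explanatory_url(good_ua))  # True
print(has_explanatory_url(bad_ua))   # False
```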
This puts Apache Nutch in the same category of dangerous power tools that are abused too often for my tolerances, much like my block of Wget-based crawling. Someone could use Nutch competently for purposes that I don't object to donating resources to, but the odds are against them, and if they're competent enough, perhaps they will take the '/Nutch-...' string out of their user agent.
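Mechanically, a blanket block like this is simple to express. As a hedged sketch in Apache 2.4 terms (the exact directives and their placement are an assumption for illustration, not a description of this site's actual configuration):

```apache
# mod_setenvif: flag any client whose User-Agent admits to being
# Nutch-based, using the telltale '/Nutch-' version string.
BrowserMatchNoCase "/Nutch-" nutch_crawler

# Refuse flagged requests (Apache 2.4 authorization syntax).
<RequireAll>
    Require all granted
    Require not env nutch_crawler
</RequireAll>
```

The virtue of matching the framework's version string rather than individual crawler names is exactly the point above: it catches whatever people come up with next, without whack-a-mole.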
People may object that I'm making it harder for new web search engines to get established. This is a nice theory, but it doesn't match the reality of today's Internet; the odds that a new web crawler is going to be used for a new public search engine are almost zero. It's far more likely that, at best, someone is establishing a new SEO or "brand optimization" company. At worst, the web crawler will be used to find targets for far less desirable activity.