Wandering Thoughts archives

2021-09-06

Why Nutch-based web spiders are now blocked here

Apache Nutch is, to quote its web page, "a well matured, production ready Web crawler". More specifically, it's a web crawler engine, which people can take and use to create web crawlers themselves. However, it has a little issue, which I put in a tweet:

If you write a web crawler engine, you should make it very hard for people to not fill in all of the information for a proper user-agent string (such as a URL explaining the crawling). Apache Nutch, I'm looking at you, given UA's of eg "test search engine/Nutch-1.19-SNAPSHOT".

I have some views on what User-Agent headers should include. Including an explanatory URL for your web crawler is one of my requirements; web crawlers that don't have it and draw my attention tend to get blocked here. In the case of Nutch, my attention was first drawn to a specific aggressive crawler, but when I started looking, more and more Nutch-based crawlers came out of the woodwork, including the example in the tweet, and none of them had proper identification. Since this is a systematic issue with Nutch-based crawlers, I decided that I was not interested in playing whack-a-mole with whatever people came up with next and I was instead going to deal with the whole issue in one shot.

This puts Apache Nutch in the same category of dangerous power tools that are abused too often for my tolerances, much like my block of Wget-based crawling. Someone could use Nutch competently for purposes that I don't object to donating resources to, but the odds are against them, and if they're competent enough, perhaps they will take the '/Nutch-...' string out of their user agent.
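
As an illustration only (this is not necessarily how the block here is implemented), a User-Agent check along these lines could be sketched in Python. The function names and both regular expressions are hypothetical; the only detail taken from the examples above is the '/Nutch-...' token that these crawlers carry in their User-Agent.

```python
import re

# Hypothetical pattern: matches a '/Nutch-<version>' token, as seen in
# User-Agent strings such as "test search engine/Nutch-1.19-SNAPSHOT".
NUTCH_TOKEN = re.compile(r"/Nutch-\S+", re.IGNORECASE)

# Hypothetical pattern: any http(s) URL, treated here as a rough sign that
# the crawler points at some explanation of itself.
URL_TOKEN = re.compile(r"https?://\S+", re.IGNORECASE)


def is_nutch_based(user_agent: str) -> bool:
    """Return True if the User-Agent advertises a Nutch version token."""
    return NUTCH_TOKEN.search(user_agent) is not None


def has_explanatory_url(user_agent: str) -> bool:
    """Return True if the User-Agent contains something that looks like a URL."""
    return URL_TOKEN.search(user_agent) is not None


def should_block(user_agent: str) -> bool:
    """Block anything that advertises Nutch, whatever else it says."""
    return is_nutch_based(user_agent)
```

Under these assumptions, `should_block("test search engine/Nutch-1.19-SNAPSHOT")` is true, and a crawler that strips the '/Nutch-...' token from its User-Agent passes this particular check, which is the escape hatch mentioned above.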

People may object that I'm making it harder for new web search engines to get established. This is a nice theory, but it doesn't match the reality of today's Internet; the odds that a new web crawler is going to be used for a new public search engine are almost zero. It's far more likely that, at best, someone is establishing a new SEO or "brand optimization" company. At worst, the web crawler will be used to find targets to drive far less desirable activity.

web/NutchNoMoreHere written at 23:24:08

