Why Nutch-based web spiders are now blocked here

September 6, 2021

Apache Nutch is, to quote its web page, "a well matured, production ready Web crawler". More specifically, it's a web crawler engine, which people can take and use to create web crawlers themselves. However, it has a little issue, which I put in a tweet:

If you write a web crawler engine, you should make it very hard for people to not fill in all of the information for a proper user-agent string (such as a URL explaining the crawling). Apache Nutch, I'm looking at you, given UA's of eg "test search engine/Nutch-1.19-SNAPSHOT".

I have some views on what User-Agent headers should include. Including an explanatory URL for your web crawler is one of my requirements; web crawlers that don't have one and draw my attention tend to get blocked here. In the case of Nutch, my attention was first drawn to a specific aggressive crawler, but then, when I started looking, more and more Nutch-based crawlers started coming out of the woodwork, including the example in the tweet, and none of them had proper identification. Since this is a systematic issue with Nutch-based crawlers, I decided that I was not interested in playing whack-a-mole with whatever people came up with next and was instead going to deal with the whole issue in one shot.
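
As an illustration only (the crawler name and URL here are invented), this is roughly what a properly identified fetch looks like in Python; the important part is a User-Agent that names the crawler, gives a version, and points at a page explaining the crawling:

    # Hypothetical example: "ExampleCrawler" and the URL are made up.
    # What matters is the shape of the User-Agent string itself.
    import urllib.request

    UA = "ExampleCrawler/1.0 (+https://crawler.example.org/about)"

    req = urllib.request.Request("https://example.org/",
                                 headers={"User-Agent": UA})
    with urllib.request.urlopen(req) as resp:
        body = resp.read()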

This puts Apache Nutch in the same category as other dangerous power tools that get abused too often for my tolerance, much like my block of Wget-based crawling. Someone could use Nutch competently for purposes that I don't object to donating resources to, but the odds are against them, and if they're competent enough, perhaps they will take the '/Nutch-...' string out of their user agent.
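
For the curious, here is a minimal sketch (written as Python WSGI middleware, which is not necessarily how the blocking is actually done here) of rejecting anything with that '/Nutch-' marker in its user agent:

    # A sketch only: refuse requests whose User-Agent contains "/Nutch-",
    # as in "test search engine/Nutch-1.19-SNAPSHOT".
    import re

    NUTCH_UA = re.compile(r"/Nutch-", re.IGNORECASE)

    class BlockNutch:
        def __init__(self, app):
            self.app = app

        def __call__(self, environ, start_response):
            ua = environ.get("HTTP_USER_AGENT", "")
            if NUTCH_UA.search(ua):
                start_response("403 Forbidden",
                               [("Content-Type", "text/plain")])
                return [b"Nutch-based crawlers are blocked here.\n"]
            return self.app(environ, start_response)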

People may object that I'm making it harder for new web search engines to get established. This is a nice theory, but it doesn't match the reality of today's Internet; the odds that a new web crawler is going to be used for a new public search engine are almost zero. It's far more likely that, at best, someone is establishing a new SEO or "brand optimization" company. At worst, the web crawler will be used to find targets for far less desirable activity.


Comments on this page:

People may object that I'm making it harder for new web search engines to get established. This is a nice theory, but it doesn't match the reality of today's Internet; the odds that a new web crawler is going to be used for a new public search engine are almost zero. It's far more likely that, at best, someone is establishing a new SEO or "brand optimization" company. At worst, the web crawler will be used to find targets for far less desirable activity.

There exists a large number of indexing search engines, many of which are somewhat recent and usable. I'd argue that because getting a new engine off the ground is difficult yet still feasible, it's even more important to make sites amenable to crawling.

I personally find that rate-limiting by IP and maybe user-agent is a much better option than user-agent bans.

By John Gwot at 2021-09-17 11:06:05:

I totally agree with this sentiment. If a crawler/spider/SEO etc. can't be bothered to adhere at least minimally to Web standards (such as they are), like robots.txt and proper UA strings, then why should I "give them resources" (which is what websites are to them)?
