I'm not sure what I feel about this web spider's User-Agent value

May 19, 2017

Every so often I do the unwise thing of turning over rocks in the web logs for this blog. Today, one of the things that I found under there was a web spider with the claimed User-Agent of:

BuckyOHare/1.4 (Googlebot/2.1; +https://hypefactors.com/webcrawler)

The requests all came from AWS IP address space, so I have no idea if this spider actually belongs to the people that it claims to belong to. As is typical for these spiders, it got my attention primarily by attempting to access URLs that no crawler should.

The bit that raised my eyebrows a lot is the mention of Googlebot. On the one hand, there is a long tradition of browsers including the name of other browsers in their User-Agents in order to persuade web sites to do the right thing and serve them the right content. On the other hand, the biggest reason that I can think of to claim to be Googlebot is so that web sites that give Googlebot special allowances for crawling things will extend those allowances to you, and that's a rather different kind of fakery.

(Ironically this backfired for these people because I already had Googlebot blocked off from almost all of the URLs that they tried to access. It does raise my eyebrows again that almost all of the pages they tried to access were Atom feeds or 'write a comment' pages. For now I've decided that I don't trust these people enough to allow them any access to Wandering Thoughts, so they're now totally blocked.)

I wouldn't be surprised if other web spider operators have also experimented with this clever idea already. If not, I rather suspect that more people will in the future. There are websites that are willing (or reluctantly forced) to allow Googlebot access but would rather block everyone else, and more than a few of them are probably using User-Agent matching instead of anything more sophisticated.

(Partly this is because more sophisticated methods take some combination of more work to maintain and more time to check in the web server itself.)
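
To make the difference concrete, here is a minimal sketch in Python of both approaches. It is not what this blog actually runs. The function names are made up for illustration, and the 'more sophisticated' check is the commonly documented one for Googlebot: do a reverse DNS lookup on the client IP, require a hostname under googlebot.com or google.com, then do a forward lookup and confirm that it maps back to the same IP. The resolver functions are passed in as parameters so that the logic can be exercised without touching DNS, and the default forward lookup here only handles IPv4.

```python
import socket

# Domains that a genuine Googlebot reverse DNS name is expected to fall under.
# (Assumption for this sketch: only these two suffixes are accepted.)
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def claims_googlebot(user_agent):
    """The cheap check: does the User-Agent mention Googlebot at all?"""
    return "googlebot" in user_agent.lower()


def is_verified_googlebot(ip,
                          reverse=socket.gethostbyaddr,
                          forward=socket.gethostbyname_ex):
    """The expensive check: reverse DNS, then forward-confirm the result.

    'reverse' and 'forward' default to the real resolver calls; both
    return (hostname, aliases, addresses) tuples.
    """
    try:
        host = reverse(ip)[0]
    except OSError:
        # No PTR record or a resolver failure; treat it as unverified.
        return False
    host = host.rstrip(".").lower()
    if not host.endswith(GOOGLE_SUFFIXES):
        return False
    try:
        addresses = forward(host)[2]
    except OSError:
        return False
    # Anyone can put any name in their own PTR record, so the name must
    # resolve back to the IP we started from.
    return ip in addresses
```

A web server would only bother with `is_verified_googlebot()` when `claims_googlebot()` is true. Even then it costs up to two DNS lookups per new client IP, so in practice you would want to cache the answers, which is a good part of the extra work and extra time. The User-Agent from this spider passes the cheap check without any trouble, and requests from AWS address space would not be expected to pass the expensive one.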

Written on 19 May 2017.

Last modified: Fri May 19 00:52:28 2017