Baidu's web spider ignores robots.txt (at least sometimes)

November 24, 2013

On my personal site I've had a robots.txt entry that totally disallows Baidu's web spider for a long time, for various reasons (including that it doesn't respect nofollow). Recently I was looking at my logs for another reason and, well, imagine my surprise when I saw requests from something with a user-agent of Baiduspider/2.0. Further investigation showed that this has been going on for several months, although the request volume was not high. Worse, not only is Baidu's spider crawling when it shouldn't be, it seems to request robots.txt only very occasionally (and when it does, it usually uses a blandly non-robot user-agent, or at least I assume that the fetches of robots.txt from Baidu's IP address range are from their robot).
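As a rough illustration of the two checks described above, here is a minimal Python sketch. It is not the tooling actually used on the site. It assumes a robots.txt that disallows Baiduspider entirely and a web server log in the common "combined" format; the sample log lines, IP addresses, and the names `is_allowed` and `summarize` are all made up for the example.

```python
import re
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content: a blanket disallow for Baidu's spider.
ROBOTS_TXT = """\
User-agent: Baiduspider
Disallow: /
"""

def is_allowed(user_agent, path, robots_txt=ROBOTS_TXT):
    """Return True if robots.txt permits user_agent to fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)

# Combined log format: ip, ident, user, [time], "request", status,
# size, "referer", "user-agent". Only some fields are captured.
LOG_RE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]*\] "(\S+) (\S+)[^"]*" \d{3} \S+ "[^"]*" "([^"]*)"'
)

def summarize(lines, ua_marker="baiduspider"):
    """Count requests from the spider and how many were for robots.txt."""
    total = 0
    robots_fetches = 0
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue  # skip lines that are not in the expected format
        _ip, _method, path, user_agent = m.groups()
        if ua_marker in user_agent.lower():
            total += 1
            if path == "/robots.txt":
                robots_fetches += 1
    return {"spider_requests": total, "spider_robots_fetches": robots_fetches}

# Hypothetical log lines (documentation-only IP addresses).
SAMPLE_LOG = [
    '192.0.2.10 - - [24/Nov/2013:01:00:00 -0500] "GET /blog/ HTTP/1.1" 200 5120 "-" "Baiduspider/2.0"',
    '192.0.2.11 - - [24/Nov/2013:01:05:00 -0500] "GET /robots.txt HTTP/1.1" 200 64 "-" "Mozilla/5.0"',
    '198.51.100.7 - - [24/Nov/2013:01:06:00 -0500] "GET /blog/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
]

if __name__ == "__main__":
    print(is_allowed("Baiduspider/2.0", "/blog/"))
    print(summarize(SAMPLE_LOG))
```

Note that a robots.txt fetch made with a generic browser-like user-agent (the second sample line) cannot be attributed to the spider by user-agent alone, which is why the post falls back on the source IP address range to guess who made it.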

All of this seems to have started when I switched my personal site to all-HTTPS, but that's not an excuse for Baidu (or anyone else). Yes, there are redirections from the HTTP version involved, but things still work (and I actually wound up making an exemption for robots.txt). The plain fact is that Baidu is flagrantly ignoring robots.txt and not even fetching it.

I don't tolerate this sort of web spider behavior. As a result, on my personal site Baidu is now blocked at the web server level (based on both IP address and user-agent) and I've just added similar blocks for it here on Wandering Thoughts. I'm aware that Baidu doesn't care about a piddly little site like mine blocking them, but I do, so I'm doing this no matter how quixotic it feels.
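The post doesn't show the actual web server configuration, so the following is only a sketch of the general idea of blocking on both IP address and user-agent, written as a standalone Python function. The network ranges are documentation placeholders, not Baidu's real address ranges, and `should_block` and both constants are hypothetical names.

```python
import ipaddress

# Placeholder networks only (reserved documentation ranges); a real block
# list would need the spider's actual source ranges, which are not given here.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("203.0.113.0/24"),
]

# Case-insensitive substrings that mark a request as coming from the spider.
BLOCKED_UA_SUBSTRINGS = ("baiduspider",)

def should_block(remote_ip, user_agent):
    """Return True if the request matches either the IP or the user-agent block."""
    addr = ipaddress.ip_address(remote_ip)
    if any(addr in net for net in BLOCKED_NETWORKS):
        return True
    ua = user_agent.lower()
    return any(marker in ua for marker in BLOCKED_UA_SUBSTRINGS)

if __name__ == "__main__":
    print(should_block("192.0.2.55", "Mozilla/5.0"))
    print(should_block("198.51.100.1", "Baiduspider/2.0"))
    print(should_block("198.51.100.1", "Mozilla/5.0"))
```

Checking both conditions matters here because, as noted earlier, some requests from the spider's address range arrive with a non-robot user-agent; a user-agent match alone would miss those.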

(I'm writing this entry about the situation because Baidu's behavior makes me genuinely angry (for a small amount of anger). And bad behavior by a major search engine should be called out.)

Comments on this page:

Robots.txt is not a standard, and there is no regulation behind it. None of the spiders out there are obliged to conform to it. If you publish it, they will come :)

Written on 24 November 2013.

Last modified: Sun Nov 24 02:01:35 2013