2013-11-24

Baidu's web spider ignores robots.txt (at least sometimes)

On my personal site, my robots.txt has had an entry that totally disallows Baidu's web spider for a long time, for various reasons (including that it doesn't respect nofollow). Recently I was looking at my logs for another reason and, well, imagine my surprise when I saw requests from something with a user-agent of Baiduspider/2.0. Further investigation showed that this has been going on for several months, although the request volume was not high. Worse, not only is Baidu's spider crawling when it shouldn't be, it seems to request robots.txt only very occasionally. When it does, it usually uses a blandly non-robot user-agent (or at least I assume that the fetches of robots.txt from Baidu's IP address range are from their robot).
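For anyone who wants to check their own logs for the same thing, here is a minimal sketch of this sort of log digging. It assumes the access log is in the common 'combined' format, and the function name `summarize_baidu` is made up for illustration. It only matches on the user-agent, so robots.txt fetches made with a non-robot user-agent would have to be found by IP address instead.

```python
import re
from collections import Counter

# Pulls the client IP, timestamp, request path, and user-agent out of a
# combined-format access log line (the log format is an assumption).
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def summarize_baidu(lines):
    """Count Baiduspider page requests and robots.txt fetches per month."""
    pages = Counter()
    robots = Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if m is None or "baiduspider" not in m.group("agent").lower():
            continue
        # The timestamp looks like 24/Nov/2013:02:01:35 -0500;
        # keep only the "Nov/2013" part as the bucket key.
        month = m.group("time").split(":", 1)[0].split("/", 1)[1]
        if m.group("path").split("?", 1)[0] == "/robots.txt":
            robots[month] += 1
        else:
            pages[month] += 1
    return pages, robots
```

Feeding it the lines of a log file (for example `summarize_baidu(open("access.log"))`, where the file name is a placeholder) gives two counters. Months with page requests but no robots.txt fetches are the interesting ones.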

All of this seems to have started when I switched my personal site to all-HTTPS, but that's not an excuse for Baidu (or anyone else). Yes, there are redirections from the HTTP version involved, but things still work (and I actually wound up making an exemption for robots.txt). The plain fact is that Baidu is flagrantly ignoring robots.txt and not even fetching it.

I don't tolerate this sort of web spider behavior. As a result, on my personal site Baidu is now blocked at the web server level (based on both IP address and user-agent) and I've just added similar blocks for it here on Wandering Thoughts. I'm aware that Baidu doesn't care about a piddly little site like mine blocking them but I do, so I'm doing this no matter how quixotic it feels.
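To illustrate the idea of blocking on both IP address and user-agent, here is a rough sketch written as a Python WSGI wrapper. This is not the actual web server configuration used on either site, which this entry doesn't show. The names `is_blocked` and `block_spider` are invented for the example, and the network listed is a documentation placeholder, not one of Baidu's real address ranges.

```python
import ipaddress

# Placeholder network from the RFC 5737 documentation ranges. These are
# NOT Baidu's real address ranges; substitute whatever your logs show.
BLOCKED_NETS = [ipaddress.ip_network("192.0.2.0/24")]
BLOCKED_AGENT = "baiduspider"

def is_blocked(remote_addr, user_agent):
    """True if the request matches the blocked user-agent or IP ranges."""
    if BLOCKED_AGENT in (user_agent or "").lower():
        return True
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Not a parseable IP address, so it can't match a network.
        return False
    return any(addr in net for net in BLOCKED_NETS)

def block_spider(app):
    """Wrap a WSGI app so that blocked requests get a 403 response."""
    def wrapper(environ, start_response):
        if is_blocked(environ.get("REMOTE_ADDR", ""),
                      environ.get("HTTP_USER_AGENT", "")):
            body = b"Forbidden\n"
            start_response("403 Forbidden",
                           [("Content-Type", "text/plain"),
                            ("Content-Length", str(len(body)))])
            return [body]
        return app(environ, start_response)
    return wrapper
```

Checking both things matters here because the robots.txt fetches appear to come from Baidu's IP range with a non-robot user-agent, so a user-agent match alone would let those through.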

(I'm writing this entry about the situation because Baidu's behavior makes me genuinely angry (for a small amount of anger). And bad behavior by a major search engine should be called out.)

web/BaiduIgnoresRobotsTxt written at 02:01:35

