== Baidu's web spider ignores _robots.txt_ (at least sometimes)

On [[my personal site https://cks.mef.org/]] I've had an entry in my _robots.txt_ to totally disallow Baidu's web spider for a long time, for various reasons (including that [[it doesn't respect _nofollow_ RespectTheNofollow]]). Recently I was looking at my logs for another reason, and, well, imagine my surprise when I saw requests from something with a user-agent of _Baiduspider/2.0_. Further investigation showed that this has been going on for several months, although the request volume was not high.

Worse, not only is Baidu's spider crawling when it shouldn't be, it seems not to be requesting _robots.txt_ more than very occasionally (and it usually requests _robots.txt_ with a blandly non-robot user-agent, or at least I assume that the fetches of _robots.txt_ from Baidu's IP address range are from their robot).

All of this seems to have started when [[I switched my personal site to all-HTTPS PragmaticHTTPtoHTTPS]], but that's not an excuse for Baidu (or anyone else). Yes, there are redirections from the HTTP version involved, but things still work ([[and I actually wound up making an exemption for _robots.txt_ HTTPSTransitionLessonsLearned]]). The plain fact is that Baidu is flagrantly ignoring _robots.txt_ and not even fetching it.

I don't tolerate this sort of web spider behavior. As a result, on [[my personal site]] Baidu is now blocked at the web server level (based on both IP address and user-agent), and I've just added similar blocks for it here on [[Wandering Thoughts /blog]]. (Illustrative sketches of both the _robots.txt_ entry and this sort of server-level block are at the end of this entry.) I'm aware that Baidu doesn't care about a piddly little site like me blocking them, but I do, so I'm doing this no matter how quixotic it feels.

(I'm writing this entry about the situation because Baidu's behavior makes me genuinely angry, if only a modest amount of anger. And bad behavior by a major search engine should be called out.)
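As an illustration, the _robots.txt_ side of this is trivial. A minimal sketch of a total disallow for Baidu's spider (assuming _Baiduspider_ is the user-agent token that Baidu's spider matches against, which is consistent with the _Baiduspider/2.0_ user-agent in my logs):

  User-agent: Baiduspider
  Disallow: /

Of course, as this whole episode demonstrates, a _robots.txt_ entry only works against spiders that actually fetch and respect the file.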
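The web server level block is almost as simple. This is an illustrative sketch only, not my actual configuration; it assumes Apache 2.4, and the IP range shown is just an example of a network you might attribute to Baidu, not a complete list:

  # Tag requests that claim to be Baidu's spider.
  SetEnvIfNoCase User-Agent "Baiduspider" baidu_bot
  <Location "/">
    <RequireAll>
      Require all granted
      # Refuse requests by user-agent and by (example) Baidu IP range.
      Require not env baidu_bot
      Require not ip 180.76.0.0/16
    </RequireAll>
  </Location>

Blocking on both attributes matters here because, as noted, some of Baidu's fetches come with a blandly non-robot user-agent; those you can only catch by IP address range.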