Baidu's web spider ignores robots.txt
(at least sometimes)
On my personal site I've had an entry in my robots.txt to totally disallow Baidu's web spider for a long time, for various reasons (including that it doesn't respect nofollow). Recently I was looking at my logs for another reason, and, well, imagine my surprise when I saw requests from something with a user-agent of Baiduspider/2.0. Further investigation showed that this has been going on for several months, although the request volume was not high. Worse, not only is Baidu's spider crawling when it shouldn't be, it seems to fetch robots.txt no more than very occasionally (and it usually requests robots.txt with a blandly non-robot user-agent, or at least I assume that the fetches of robots.txt from Baidu's IP address range are from their robot).
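(For reference, the standard robots.txt form for totally disallowing one particular spider is as simple as the sketch below; this is the idea, not my file verbatim:

    User-agent: Baiduspider
    Disallow: /

Baiduspider is the user-agent token Baidu uses for its crawler, and 'Disallow: /' tells a compliant robot to stay away from the entire site.)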
All of this seems to have started when I switched my personal site to all-HTTPS, but that's not an excuse for Baidu (or anyone else). Yes, there are redirections from the HTTP version involved, but things still work (and I actually wound up making an exemption for robots.txt). The plain fact is that Baidu is flagrantly ignoring robots.txt and not even fetching it.
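(Serving robots.txt directly over HTTP while redirecting everything else is straightforward in most web servers. A minimal sketch with Apache's mod_rewrite, not my exact configuration, looks like:

    RewriteEngine On
    # Redirect all HTTP requests to HTTPS, except robots.txt,
    # so even robots that refuse to follow redirects can see it.
    RewriteCond %{HTTPS} off
    RewriteCond %{REQUEST_URI} !^/robots\.txt$
    RewriteRule ^ https://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]

Even without such an exemption, any competent crawler will follow a simple 301 redirect in order to fetch robots.txt.)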
I don't tolerate this sort of web spider behavior. As a result, on my personal site Baidu is now blocked at the web server level (based on both IP address and user-agent), and I've just added similar blocks for it here on Wandering Thoughts. I'm aware that Baidu doesn't care about a piddly little site like mine blocking them, but I do, so I'm doing it no matter how quixotic it feels.
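(If you want to do something similar, a sketch of an Apache 2.4 block on both user-agent and source address looks like the following; these are not my exact rules, and the CIDR range below is a documentation placeholder rather than Baidu's actual range, so substitute whatever shows up in your own logs:

    # Refuse requests that either claim to be Baiduspider or come
    # from a blocked address range.
    SetEnvIfNoCase User-Agent "Baiduspider" deny_spider
    <RequireAll>
        Require all granted
        Require not env deny_spider
        Require not ip 203.0.113.0/24
    </RequireAll>

Blocking on both attributes matters here, since Baidu's fetches don't always identify themselves as a robot.)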
(I'm writing this entry about the situation because Baidu's behavior makes me genuinely angry, if only mildly so. And bad behavior by a major search engine should be called out.)