It feels surprisingly good to block Bingbot from my blog front page
Last year I wrote about how Microsoft's Bingbot relentlessly crawled the front page of Wandering Thoughts. Pretty much every day, a single Bingbot IP would request the front page of Wandering Thoughts a thousand times or more. Back in that entry I said I was tempted to block Bingbot from doing that; recently, my vague irritation with Bingbot's ongoing behavior reached a boiling point, and I actually did it.
Since Wandering Thoughts is served through an Apache with mod_rewrite enabled, the block was relatively simple to implement. I just check for an exact match of the request URI (since Bingbot never uses variations) and then match the user agent. By default, RewriteCond conditions must all be true, so this just works.
(The hardest bit was re-reading the mod_rewrite documentation yet again to determine what I wanted to match against. This would have been faster if I'd actually fully read the documentation and followed the cross reference to expression variables.)
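As a rough illustration of the kind of block described above, here is a minimal sketch rather than the actual configuration in use. The path `/blog/` is a placeholder for the real front page URL, matching on the substring `bingbot` with `[NC]` is an assumption about how the user agent is identified, and the choice of `%{REQUEST_URI}` and `%{HTTP_USER_AGENT}` as the variables to test is likewise an assumption.

```apache
RewriteEngine On

# Exact string comparison against the request path. The leading "="
# makes the pattern a literal match instead of a regular expression.
# "/blog/" is a placeholder for the real front page URL.
RewriteCond %{REQUEST_URI} =/blog/

# Case-insensitive regex match anywhere in the User-Agent header.
# The "bingbot" substring is an assumption for illustration.
RewriteCond %{HTTP_USER_AGENT} bingbot [NC]

# Consecutive RewriteCond lines are ANDed by default, so the rule only
# fires when both match. "-" means no substitution; [F] returns a 403.
RewriteRule ^ - [F]
```

Using an exact match on the URI keeps the block narrow: only the one heavily crawled URL is refused, and every other page stays available to the crawler.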
That Wandering Thoughts' front page now gives Bingbot 403s hasn't particularly slowed it down. Over almost the past 24 hours, Bingbot has made just under 1,400 requests for the URL from two different IPs (one of which made most of them). It doesn't yet seem to have latched on to any other page with a similar death grip, although my Linux category is somewhat popular with it right now (with 40 requests today). I'm probably going to have to keep an eye on this.
It's felt surprisingly nice to have this little irritation pushed out of my life. I know, I shouldn't care that Bingbot is doing bad and annoying things, but I do look at what IP addresses are the most active here (excluding blocked requests) and always having Bingbot show up there was this little poke. And while the operators of Bingbot probably will never notice or know, I can feel that I did a little tiny bit to hold badly behaved web spiders to account.
PS: So far today Bingbot has made just over 1,900 successful requests (HTTP 200 result), just over 1,500 requests that were 403'd, 53 requests that got 304 Not Modified responses, and six '404 no such thing' requests. I'm most surprised at the 304 requests, seeing as Bingbot will routinely repeatedly bang on unchanging URLs without getting 304s. If it could at least conditionally request the same thing over and over so it would mostly get 304s, I would probably feel slightly happier with it. Doing 304s for a few things but not the heavily requested URLs is a bit irritating.