It feels surprisingly good to block Bingbot from my blog front page

July 11, 2022

Back in 2018 I wrote about how Microsoft's Bingbot relentlessly crawled the front page of Wandering Thoughts. On pretty much every day, a single Bingbot IP would request the front page of Wandering Thoughts a thousand times or more. Back in that entry I said I was tempted to block Bingbot from doing that; recently, my vague irritation with Bingbot's ongoing behavior reached a boiling point, and I actually did that.

Since Wandering Thoughts is served through an Apache with mod_rewrite enabled, the block was relatively simple to implement. I just check for an exact match of the request URI (since Bingbot never uses variations) and then match the user agent. By default, successive RewriteCond conditions all must be true so this just works.
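
(As an illustration only, here is a minimal sketch of what a block like this might look like. It is not the actual configuration used here: the /blog/ path is a hypothetical stand-in for the real front page URL, and matching the user agent on a case-insensitive 'bingbot' substring is an assumption about how to identify the crawler.)

```apache
# Sketch only: '/blog/' is a hypothetical front page path.
RewriteEngine On
# Exact string comparison against the request URI (the '=' prefix
# makes this a plain equality test instead of a regular expression).
RewriteCond %{REQUEST_URI} =/blog/
# Assumed user agent match: any User-Agent containing 'bingbot',
# compared case-insensitively.
RewriteCond %{HTTP_USER_AGENT} bingbot [NC]
# Both conditions above must hold (RewriteCond lines are implicitly
# ANDed). '-' means no substitution and [F] returns a 403 Forbidden.
RewriteRule ^ - [F]
```

The equality form of the first condition fits the case where the crawler always requests exactly the same URL; a regular expression would be needed if variations had to be caught too.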

(The hardest bit was re-reading the mod_rewrite documentation yet again to determine that I wanted to match against REQUEST_URI. This would have been faster if I'd actually fully read the documentation and followed the cross reference to expression variables.)

That Wandering Thoughts' front page now gives Bingbot 403s hasn't particularly slowed it down. In the almost 24 hours since the block went in, Bingbot has made just under 1,400 requests for the URL from two different IPs (one of which made most of them). It doesn't yet seem to have latched on to any other page with a similar death grip, although my Linux category is somewhat popular with it right now (with 40 requests today). Probably I'm going to have to keep an eye on this.

It's felt surprisingly nice to have this little irritation pushed out of my life. I know, I shouldn't care that Bingbot is doing bad and annoying things, but I do look at what IP addresses are the most active here (excluding blocked requests) and always having Bingbot show up there was this little poke. And while the operators of Bingbot probably will never notice or know, I can feel that I did a little tiny bit to hold badly behaved web spiders to account.

PS: So far today Bingbot has made just over 1,900 successful requests (HTTP 200 result), just over 1,500 requests that were 403'd, 53 requests that got 304 Not Modified responses, and six '404 no such thing' requests. I'm most surprised at the 304 requests, seeing as Bingbot will routinely bang on unchanging URLs over and over without getting 304s. If it would at least make conditional requests when it asks for the same thing over and over, so that it mostly got 304s, I would probably feel slightly happier with it. Doing 304s for a few things but not the heavily requested URLs is a bit irritating.


Comments on this page:

Did you try any of the things suggested in comments on your 2018 article? And is there any way to file a bug report against the Bingbot?

By cks at 2022-07-14 11:32:16:

Unfortunately, I can't really do any of the things suggested in the comments in the 2018 entry because I'm not the only thing on this Apache server and I don't control the server configuration. I do have a top level sitemap, but it's (obviously) not the web server top level sitemap.

(I don't currently have a <meta> header that points at the sitemap. I think you can do that these days and someday I probably should.)

Written on 11 July 2022.

Last modified: Mon Jul 11 23:10:38 2022