Microsoft's Bingbot crawler is on a relative rampage here

April 30, 2018

For some time, people in various places have been reporting that Microsoft Bing's web crawler is hammering them; for example, Discourse has throttled Bingbot (via). It turns out that Wandering Thoughts is no exception, so I thought I'd generate some numbers on what I'm seeing.

Over the past 11 days (including today), Bingbot has made 40,998 requests, amounting to 18% of all requests. In that time it's asked for only 14,958 different URLs. Obviously many pages have been requested multiple times, including pages with no changes; the most popular unchanging page was requested almost 600 times. Quite a lot of unchanging pages have been requested several times over this interval (which isn't surprising, since most pages here change only very rarely).

Over this time, Bingbot has been the single largest source of requests by user-agent (second place is claimed by a bot that is completely banned; after that come some syndication feed fetchers). For scale, Googlebot has made only 2,800 requests over the same 11 days.

Traffic fluctuates from day to day but there is clearly a steady volume. Going backward from today, the last 11 days saw 5,154 requests, then 2,394, 2,664, 3,855, 1,540, 2,021, 3,265, 7,575, 2,516, 3,592, and finally 6,432 requests.

As far as bytes transferred go, Bingbot came in at 119.8 Mbytes over those 11 days. Per-day volume, in the same backward order, is 14.9 Mbytes, then 6.9, 7.3, 11.5, 4.6, 5.8, 8.8, 22.9, 6.7, 10.8, and finally 19.4 Mbytes. On the one hand, the total Bingbot volume by bytes is only 1.5% of my total traffic. On the other hand, syndication feed fetches account for about 94% of my volume; if you ignore them and look only at regular web pages, which make up the remaining 6% or so of bytes, Bingbot jumps up to 26.9% of the total.

I think that all of this crawling is excessive. It's one thing to want current information; it's another thing to be hammering unchanging pages over and over again. Google has worked out how to get current information with far fewer repeat visits to fewer pages (in part by pulling my syndication feed, presumably using it to drive further crawling). The difference between Google and Bing is especially striking considering that far more people seem to come to Wandering Thoughts from Google searches than come from Bing ones.

(Of course, people coming from Bing could be hiding their Referers far more than people coming from Google do, but I'm not sure I consider that very likely.)

I'm not going to ban Bing(bot), but I certainly do wish I had a useful way to answer their requests very, very slowly, both to discourage them from visiting so much and to nudge them into being smarter about what they do visit.


Comments on this page:

By Twirrim at 2018-04-30 00:31:28:

It's possible to slow down Bing's bot by using the Crawl-delay value in your robots.txt file. It's a "standard" non-standard setting for robots.txt; most of the major search engines support it, including Bing (Google is the notable exception, as Googlebot ignores Crawl-delay).
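For example, something like this in robots.txt should do it (the value here is just an illustration; Bing reads Crawl-delay as roughly the number of seconds to wait between requests):

    User-agent: bingbot
    Crawl-delay: 10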

There are a few approaches with Nginx that you can try. I've done a variation of the following in the past (I can't remember exactly what I did; this came up in a search):

http://alex.mamchenkov.net/2017/05/17/nginx-rate-limit-user-agent-control-bots/
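The gist of that article, as a rough sketch (the zone name, rate, and burst values are made up for illustration), is to key a limit_req zone off the User-Agent so that only matching bot traffic gets throttled:

    # in the http {} context; requests whose key is empty are not limited
    map $http_user_agent $limit_bots {
        default   "";
        ~*bingbot $binary_remote_addr;
    }
    limit_req_zone $limit_bots zone=bots:10m rate=30r/m;

    # then in the relevant server or location block:
    limit_req zone=bots burst=5;

Without nodelay, excess requests get queued and answered slowly rather than rejected, which is pretty much what you said you wanted.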

Based on server headers, though, it looks like you're using Apache, and the main option there is mod_qos, https://unix.stackexchange.com/a/37483. That looks pretty straightforward to use (arguably more so than Nginx).
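Going by the mod_qos documentation, the core of it is only a couple of lines (untested on my part; "slow_bots" is an arbitrary environment variable name):

    # tag Bingbot requests with an environment variable, then have
    # mod_qos delay matching requests to at most one per second
    BrowserMatchNoCase "bingbot" slow_bots
    QS_EventPerSecLimit slow_bots 1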

A way to "fix" this is to provide a sitemap. On my site, bingbot behaves nicely (it accounts for only 0.25% of total requests).
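For anyone who wants to try this, a sitemap is just an XML file that you point crawlers at with a "Sitemap: https://example.com/sitemap.xml" line in robots.txt (example.com is a placeholder); the <lastmod> field is what lets a crawler skip pages that haven't changed:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://example.com/blog/SomeEntry</loc>
        <lastmod>2018-04-28</lastmod>
      </url>
    </urlset>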

By sam at 2018-04-30 12:58:16:

It's 'only' 2.5 requests/minute on average, which is not that much in the grand scheme of things, though I guess the peaks could be higher (the worst daily breakdown works out to 5 requests/minute on average). Setting Crawl-delay to a minute might still be sensible, since there's no reason a web crawler should be hitting a site like this that often.


