Microsoft's Bingbot crawler is relentless for changing pages (it seems)

September 19, 2021

I look at the web logs for Wandering Thoughts every so often. There are many variations from day to day, but regardless of what other things change, one thing is as predictable as the sun rising in the morning: every day some MSN Search IP address will be the top single source of traffic, as Bingbot crawls through here. This isn't at all new, as I wrote about Bingbot being out of control back in 2018, but it's somewhere between impressive and depressing just how long this has gone on.

(There are days when Bingbot isn't the top source of traffic, but those are days when someone has turned an abusive crawler loose.)

As it turns out, there is an interesting pattern to what Bingbot is doing. While it's pretty relentless and active in general, one specific URL stands out. Yesterday Bingbot requested the front page of Wandering Thoughts a whopping 1,400 times (today isn't over but it's up to 1,300 times so far). This is a running theme; my blog's front page is by far Bingbot's most requested page regardless of the day.

(Bingbot is also obsessed with things that it can't crawl; today, for example, it made 92 requests for a page that it's barred from with an HTTP 403 response.)
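
A minimal sketch of how per-URL counts like these could be pulled out of a web server log. It assumes the logs are in Apache's "combined" format and that Bingbot identifies itself with "bingbot" in its User-Agent string; the function name and the log file name in the usage note are hypothetical, not anything this site actually uses.

```python
import re
from collections import Counter

# Apache "combined" log format (an assumption about how the logs look).
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def count_bingbot_requests(lines):
    """Count requests per (path, status) pair for Bingbot user agents."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m is None:
            continue  # skip lines that don't parse as combined format
        if "bingbot" not in m.group("agent").lower():
            continue  # not a request claiming to be Bingbot
        counts[(m.group("path"), int(m.group("status")))] += 1
    return counts
```

Something like `count_bingbot_requests(open("access.log")).most_common(5)` would then list the most requested pages, and entries with a status of 403 would show how often the crawler keeps asking for things it's barred from. Matching on the User-Agent alone trusts whatever the client claims to be; checking the source IP against the crawler's published ranges would be a separate step.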

The front page of Wandering Thoughts changes at least once a day (more or less) when a new entry is published, and more often if people leave comments on recent entries (as this updates the count of comments for the entry). However, it doesn't update a hundred times a day even when people are very active with their comments, and Bingbot is being ten times more aggressive than that. I was going to say that Bingbot has other options to discover updates to Wandering Thoughts, such as my Atom syndication feeds, but it turns out that I long ago barred it from fetching a category of URLs here that includes those feeds.

(I have ambivalent feelings about web crawlers fetching syndication feeds. At a minimum, they had better do it well and not excessively, which based on present evidence I suspect Bingbot would not manage.)

Now that I've discovered this Bingbot pattern, I'm tempted to bar it from fetching the front page. The easiest thing to do would be to bar Bingbot entirely, but Bing is a significant enough search engine that I'd feel bad about that (although they don't seem to send me very much search traffic). Of course that might just transfer Bingbot's attention to another of the more or less equivalent pages here that it's currently neglecting, so perhaps I should just leave things as they are even if Bingbot's behavior irritates me.
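
For illustration, here is a rough sketch of what barring a single crawler from a single page could look like if the site were served through a Python WSGI application. That is an assumption made for the example, and the wrapper name, the "bingbot" token, and the path list are all hypothetical rather than how this blog actually implements its blocks.

```python
def block_crawler(app, agent_token="bingbot", blocked_paths=("/",)):
    """Wrap a WSGI app so a matching crawler gets a 403 on selected paths."""
    def wrapper(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "").lower()
        path = environ.get("PATH_INFO", "")
        if agent_token in agent and path in blocked_paths:
            # Refuse only this crawler, and only for the listed paths.
            body = b"Forbidden\n"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Everyone else, and every other path, goes through untouched.
        return app(environ, start_response)
    return wrapper
```

The block is deliberately narrow (exact path matches only), which reflects the tradeoff above: the crawler can still reach everything else, so its attention may simply move to some other page.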

PS: Of course there could be something else about the front page of Wandering Thoughts that has attracted Bingbot's relentless attention. The reasons for web crawlers to behave as they do are ultimately opaque; all I can really do is come up with reasonable sounding theories.


Comments on this page:

A few years ago, I had a site get hacked, adding a ton of pages of pure SEO spam. Needless to say, those pages 404 now, and have done so for years.

Bingbot still requests hundreds of them per day.

Written on 19 September 2021.


Last modified: Sun Sep 19 22:02:39 2021