The MSN search spider has gone crazy

September 7, 2005

I'm not the first person to notice this, but the MSN search spider ('msnbot' if you want to search for it in the user agent portion of weblogs, or just look for requests from the 207.46.98.* subnet) seems to have gone crazy.

Specifically, MSNbot is repeatedly and aggressively crawling completely unchanging URLs, while paying much less attention to changing ones. On this server, the past 29 days' worth of logs show:

  • the number one hit, with more than 60 fetches attempted, is a URL that doesn't exist and hasn't existed for months at least.
  • of the next ten most-often fetched web pages, the most recently changed one was last updated in 2003. Some have not changed since before the turn of the century. Each was fetched by the MSNbot more than 20 times; the most popular one was fetched almost 58 times (just about twice a day).
  • MSNbot is perfectly willing to repeatedly fetch huge files, transferring 2.5 gigabytes worth of them from us. The most recently changed huge file that MSNbot gave its love to (four times) last changed in 2004. The most popular three files (15 fetches each on average) last changed in 1996, 1996, and 1997 respectively; that was good for 363 megabytes of data transfers.

Overall, MSNbot requested over 5100 pages from us; judging from Referer logs, exactly 72 MSN searches brought visitors here.
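Tallies like the ones above can be pulled out of the web server logs with a short script. What follows is only a minimal sketch, assuming the logs are in Apache's "combined" format; the function name `tally_bot` and the sample log lines are made up for illustration and are not taken from the real logs.

```python
import re
from collections import Counter

# Apache "combined" log format (an assumption about how the server logs).
LOG_RE = re.compile(
    r'\S+ \S+ \S+ \[[^\]]+\] '
    r'"\S+ (?P<url>\S+)[^"]*" '
    r'\d{3} (?P<size>\d+|-) '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

def tally_bot(lines, bot="msnbot"):
    """Count fetches and bytes sent per URL for requests whose
    User-Agent mentions `bot` (hypothetical helper)."""
    fetches = Counter()
    bytes_sent = Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if m is None or bot not in m.group("agent").lower():
            continue
        url = m.group("url")
        fetches[url] += 1
        if m.group("size") != "-":
            bytes_sent[url] += int(m.group("size"))
    return fetches, bytes_sent

# Made-up sample lines, purely for illustration.
SAMPLE_LINES = [
    '207.46.98.10 - - [01/Sep/2005:00:00:01 -0400] "GET /gone.html HTTP/1.0" 404 300 "-" "msnbot/1.0 (sample)"',
    '207.46.98.10 - - [01/Sep/2005:12:00:01 -0400] "GET /gone.html HTTP/1.0" 404 300 "-" "msnbot/1.0 (sample)"',
    '207.46.98.11 - - [02/Sep/2005:03:10:00 -0400] "GET /big.iso HTTP/1.0" 200 1000000 "-" "msnbot/1.0 (sample)"',
    '192.0.2.7 - - [02/Sep/2005:04:00:00 -0400] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (sample)"',
]

fetches, bytes_sent = tally_bot(SAMPLE_LINES)
# Most-fetched URLs first: fetch count, total bytes sent, URL.
for url, count in fetches.most_common():
    print(count, bytes_sent[url], url)
```

Comparing the top URLs from such a tally against each file's last modification time is what shows the mismatch between how often a page is crawled and how often it changes.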

On a second system I have logs going back to May 29th. Since then, MSNbot requested 364,000 pages from the website, with about 5,000 MSN searches bringing people to us. The most popular MSNbot pages to request are again completely crazy:

  • the most popular URL hasn't existed for years (550+ requests)
  • eight actually existing web pages got 100 requests or more from the MSNbot. The newest last changed in 2001; the oldest was last changed in 1994, and it got 199 requests (making it the third most requested page, narrowly beaten out by two pages last changed in 1999).
  • MSNbot made 12 requests for a 15 megabyte PDF file last changed in January.

On both websites, many of the most requested URLs don't exist. Periodically checking whether nonexistent URLs that people still link to have reappeared is a good idea, but I don't see why it should be MSNbot's most popular thing to do.

It's clear that how often the MSN spider looks at web pages has very little to do with how often they change. For example, the index page for WanderingThoughts, which changes at least every day, was fetched only 14 times over 29 days (and in a completely uneven pattern, with several skips). Meanwhile, my top level home page, unchanged since May of 2001, was fetched 45 times (more than once a day).
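The per-day rates behind that comparison are easy to check. This is just a small sketch using the counts quoted above; the dictionary labels are illustrative.

```python
# Fetch counts over the 29-day log window, taken from the text above.
DAYS = 29
fetch_counts = {
    "WanderingThoughts index (changes daily)": 14,
    "top level home page (unchanged since May 2001)": 45,
}

# Average number of MSNbot fetches per day for each page.
rates = {page: count / DAYS for page, count in fetch_counts.items()}

for page, rate in rates.items():
    print(f"{page}: {rate:.2f} fetches/day")
```

So the page that changes every day was fetched roughly once every two days, while the page that never changes was fetched about three times as often.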

Fortunately the University of Toronto has lots of bandwidth to spare, and neither web server is exactly straining under the load.

(The good news is that this issue may reach the ears of some MSN Search people and they'll hopefully fix whatever is wrong. If so, I'll update this entry with appropriate information.)

Update, September 9th or so: some people from MSN Search have been in contact with me and now have various details (like specific URLs and so on). There are no other developments (including no particularly apparent change in MSNbot's crawling patterns).

Update, September 30th: the MSN Search people have gotten back in contact with me again. Unfortunately, MSNbot continues to have various issues with how it crawls us, including significantly excessive transfers of ISO images. Since the MSN Search people are talking with me, I am not currently planning to take aggressive action against MSNbot.

Update, November 14th: with no contact from MSN Search in over a month and continued bad MSNbot behavior, I have given up and banned MSNbot from crawling our website. See BanningMSNBot.

Comments on this page:

From at 2005-09-14 14:00:06:

I once had a back-and-forth with someone at msn search over the broken HTML parser in msnbot. On my pages I put my mailto: link in a form that most spambots could decipher, yet don't seem to bother with for one reason or another:

<a href="m&#97;ilto&#X3a;&#37;6&#100;&#x61;rtin&#64;snowplow&#X2e;org">email me</a>.

That's pretty tame compared to some mailto-link-hiding stuff that's out there, but this uses no javascript and validates perfectly.

This results in msnbot (and quite a few other broken bots, too) sending requests like this in the logs:

GET /martin/mailto& HTTP/1.0
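As a rough illustration of what a correct parser should make of that link, here is a minimal Python sketch using only the standard library. It resolves the HTML character references first and then the URL percent-encoding; the variable names are illustrative, and it says nothing about how msnbot's parser actually works internally.

```python
import html
from urllib.parse import unquote, urlsplit

# The href value from the link above, as written in the page source.
raw_href = "m&#97;ilto&#X3a;&#37;6&#100;&#x61;rtin&#64;snowplow&#X2e;org"

# Step 1: resolve the HTML character references, as an HTML parser must.
decoded = html.unescape(raw_href)

# Step 2: split off the URL scheme. A non-empty scheme means this is
# an absolute URL, not a path relative to the current page.
parts = urlsplit(decoded)

# Step 3: undo the percent-encoding in the address part.
address = unquote(parts.path)

print(decoded)       # mailto:%6dartin@snowplow.org
print(parts.scheme)  # mailto
print(address)       # martin@snowplow.org
```

Once the character references are resolved, the link has a `mailto` scheme, so there is no reason to request it as a path under /martin/.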

Anyway, that was almost a year ago now. They confirmed the problem, and the people I was in touch with seemed very technically with-it and knowledgeable, yet the MSN spider is still doing it. So don't hold your breath.

By cks at 2005-09-18 02:32:08:

I'm being an optimist, since this actually affects the quality of MSN's search results, and people's willingness to let the MSN spider crawl their site (which also affects the quality of MSN's search results).

Written on 07 September 2005.


Last modified: Wed Sep 7 02:19:19 2005