Wandering Thoughts archives


Banning MSNBot: an open letter to MSN Search

I understand that MSN Search wants to be bigger than Google in search. One necessary step towards this is that people must be willing to let MSNBot, your search spider, index their content. Today, I changed our robots.txt to ban MSNBot, joining various other people, and so you're a little bit further from your goal. (Probably not enough further that you care very much.)

I'm not doing this because I dislike Microsoft. I'm doing this for a simple reason, the same reason other people have: right now, MSNBot is not a responsible search spider. Here is an incomplete list of the sorts of things MSNBot routinely does on this site:

  • repeatedly fetches large binary files, including 500 megabyte ISO images, that are properly served as binary files and have not changed in some time; 21 fetches for 4 files accounting for 3.7 gigabytes of transfers this week. (See MSNbotBinariesProblem)
  • aggressively fetches syndication feeds, many of them unchanging; 1,615 fetches of 329 feeds amounting to 45 megabytes of transfers this week. Half of the top 10 requested feeds have not changed within the past week, yet were requested 12 times or more. (See MSNbotCrazyRSSBehavior)
  • never uses conditional GET, even when aggressively fetching syndication feeds. (See AtomReadersAndCondGet)
  • aggressively recrawls unchanging content and error pages, while neglecting changed content, although this is better than it used to be. (See CrazyMSNCrawler)
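For readers who have not met conditional GET, here is a minimal sketch of what a well-behaved feed fetcher could do. The function names and the idea of passing cached validators as arguments are my own illustration, not anything MSNBot or this site actually uses; the sketch only relies on the standard `If-None-Match` and `If-Modified-Since` request headers and the `304 Not Modified` response.

```python
import urllib.error
import urllib.request


def conditional_headers(etag=None, last_modified=None):
    """Build request headers from validators saved on a previous fetch.

    Both arguments are hypothetical cache fields: the ETag and
    Last-Modified values the server sent last time, if any.
    """
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(url, etag=None, last_modified=None):
    """Fetch url, returning (body, etag, last_modified).

    body is None when the server answers 304 Not Modified, in which
    case nothing but headers crossed the wire.
    """
    request = urllib.request.Request(
        url, headers=conditional_headers(etag, last_modified)
    )
    try:
        with urllib.request.urlopen(request) as response:
            return (
                response.read(),
                response.headers.get("ETag"),
                response.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as err:
        # urllib reports 304 as an HTTPError; treat it as "unchanged".
        if err.code == 304:
            return None, etag, last_modified
        raise
```

A crawler that remembered the validators for each feed and called something like `fetch_if_changed` would turn most of those repeat fetches of unchanged feeds into tiny 304 responses.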

All of these behaviors are undesirable. Most of them are aggressively antisocial. None of them should be news to MSN Search, because two months ago when I first started noticing these issues someone I know who worked at Microsoft put me in email contact with some members of the MSN Search team. They got in touch with me, got information from me, and then disappeared; the last time I heard from them was September 30th.

The only change in MSNBot's behavior since September 7th is that it has become a little less enthusiastic about crawling unchanging and error URLs and it stopped pulling our large binary files for a week or two. As far as I can tell, these are routine fluctuations in MSNBot's crawling behavior.

That's why I've finally banned MSNBot; not just because it does antisocial things, but also because after two months of waiting I no longer believe that MSNBot will get fixed any time soon. On the Internet, two months is a very long time to tolerate antisocial behavior, far longer than you should expect people to wait. (And if not for the hope created by my brief and fleeting contact with MSN Search programmers, I would not have waited this long.)

So, in summary: I don't enjoy banning MSNBot, but I have lost patience with its bad behavior and don't expect it to change. Enough is enough; out it goes.
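As an illustration of what such a blanket ban looks like, here is a sketch that checks a two-line robots.txt stanza with Python's standard `urllib.robotparser`. The `msnbot` user-agent token, the `example.org` URLs, and the `parse_robots` helper are assumptions for the example; this is not a copy of our actual robots.txt.

```python
import urllib.robotparser

# Assumed stanza: ban one spider from everything, leave others alone.
ROBOTS_TXT = """\
User-agent: msnbot
Disallow: /
"""


def parse_robots(text):
    """Return a RobotFileParser loaded from robots.txt text."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = parse_robots(ROBOTS_TXT)
msnbot_allowed = rules.can_fetch("msnbot/1.0", "http://example.org/blog/")
other_allowed = rules.can_fetch("Googlebot", "http://example.org/blog/")
```

With this stanza `msnbot_allowed` comes out false and `other_allowed` true, since no rule applies to other spiders.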

Why not just ban MSNBot only from crawling the various bits of bad stuff? Three reasons:

  • it would do nothing to change MSNBot's habit of repeatedly crawling unchanging pages.
  • it makes me play whack-a-mole, where I chase MSNBot around to see what new problem it's grown that I need to hammer down.
  • I don't feel inclined to do MSNBot any more favours, and it would be a favour since it's not a popular search engine here. (See also OnBanningSearchEngines)

(There would be a fourth reason, 'you can't do that in robots.txt's syntax', but MSNBot supports wildcard matching in Disallow: so you can do things like disallow crawling specific URL extensions. See here.)
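To make the wildcard idea concrete, here is a rough sketch of how extension-style Disallow patterns could be matched against URL paths. It assumes the commonly documented convention where `*` matches any run of characters and a trailing `$` anchors the end of the path; the example patterns and the helper names are hypothetical, and the exact rules MSNBot applies may differ.

```python
import re


def rule_to_regex(pattern):
    """Translate a wildcard Disallow pattern into a compiled regex.

    Assumed semantics: '*' matches any characters, a trailing '$'
    means the path must end there, and everything else is literal.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_disallowed(path, patterns):
    """True if any pattern matches the path from its start."""
    return any(rule_to_regex(p).match(path) for p in patterns)


# Hypothetical rules that would keep a spider away from large binaries.
BINARY_RULES = ["/*.iso$", "/*.rpm$"]
```

Under these assumptions `is_disallowed("/pub/cd.iso", BINARY_RULES)` is true, while `/pub/cd.iso.txt` and ordinary pages are left crawlable.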

web/BanningMSNBot written at 00:51:54

