Wandering Thoughts archives

2005-09-28

MSNbot goes crazy with RSS feeds

Unfortunately, the issue I discussed in ADynamicSitePeril is not just a theoretical problem. And it's my old friend MSNbot, the MSN Search web crawler, that's responsible.

It's been known for some time that MSNbot is very aggressive with syndication feeds (for example, see here). The official explanation is that they want a really up-to-date search index, and to get breaking news they have to poll RSS feeds frequently.

Well, bully for them. However, it's their job to do this in a way that doesn't slam the people whose content they index. And they're visibly failing at this job; as it stands today, MSNbot isn't a responsible RSS crawler.

MSNbot has several significant problems that compound each other in its crawling of my syndication feeds:

  1. MSNbot pulls lots of Atom syndication feeds from me despite duplicate content.
  2. MSNbot rechecks feeds quite often, regardless of how infrequently they change.
  3. MSNbot always refetches feeds, regardless of whether or not they've changed.

Mine is not the only blog in the world with multiple subset feeds, and I use a feed format that gives each post a globally unique identifier. It is well within MSNbot's power to notice this sort of duplication and eliminate it, so that it pulls only the most general syndication feed. Indeed, they had better be working to eliminate duplicate content from their search index as it is, especially given spam blogs.

(And who knows if their index has decided to consider my blog a spam blog? After all, from their perspective they are pulling huge amounts of 'duplicate' content from all of my various feeds. Certainly WanderingThoughts doesn't seem to come up in MSN searches that I would expect to find it in.)
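As a rough illustration of how a crawler could notice subset feeds from entry identifiers, here is a minimal sketch. It assumes the crawler has already parsed each feed into the set of entry IDs it contains; the function name, the feed URLs, and the IDs are all made up for the example, and nothing here describes how MSNbot actually works.

```python
from typing import Dict, Set


def redundant_feeds(feeds: Dict[str, Set[str]]) -> Dict[str, str]:
    """Map each feed whose entries all appear in a larger feed to that feed.

    `feeds` maps a feed URL to the set of entry IDs seen in it.
    """
    covered: Dict[str, str] = {}
    # Look at the biggest feeds first, so that subset feeds get mapped to
    # the most general feed rather than to another subset.
    by_size = sorted(feeds, key=lambda url: len(feeds[url]), reverse=True)
    for i, url in enumerate(by_size):
        for bigger in by_size[:i]:
            if bigger not in covered and feeds[url] <= feeds[bigger]:
                covered[url] = bigger
                break
    return covered


# Hypothetical feed URLs and entry IDs, purely for illustration.
example = {
    "/blog/?atom": {"tag:example,2005:a1", "tag:example,2005:a2", "tag:example,2005:a3"},
    "/blog/web/?atom": {"tag:example,2005:a1", "tag:example,2005:a2"},
    "/blog/2005/07/?atom": {"tag:example,2005:a3"},
}
for feed, general in redundant_feeds(example).items():
    print(f"{feed} is covered by {general}")
```

A real crawler would have to be more careful than this, since feeds usually carry only the most recent entries and so a general feed may no longer include old posts that a subset feed still shows.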

The bandwidth wasted to give people's feed readers pointless copies of unchanged feeds has been a concern since at least 2002 (see for example here and here). Fortunately, HTTP already has two methods for avoiding this and using at least one of them has been 'Best Practices' for feed readers for several years. MSNbot ignores both, behavior that would get the author of a feed reader program pilloried.
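The two HTTP methods are presumably the conditional GET validators: `Last-Modified` paired with `If-Modified-Since`, and `ETag` paired with `If-None-Match`. A minimal sketch of a polite poller using Python's standard library follows; the `Validators` class and the function names are invented for this example.

```python
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Dict, Mapping, Optional


@dataclass
class Validators:
    """What we remember about a feed between polls."""
    etag: Optional[str] = None
    last_modified: Optional[str] = None


def conditional_headers(v: Validators) -> Dict[str, str]:
    """Build the request headers that make a GET conditional."""
    headers: Dict[str, str] = {}
    if v.etag:
        headers["If-None-Match"] = v.etag
    if v.last_modified:
        headers["If-Modified-Since"] = v.last_modified
    return headers


def remember(v: Validators, response_headers: Mapping[str, str]) -> None:
    """Save the validators the server sent with a full response."""
    v.etag = response_headers.get("ETag")
    v.last_modified = response_headers.get("Last-Modified")


def poll(url: str, v: Validators) -> Optional[bytes]:
    """Fetch a feed; return None if the server says it is unchanged."""
    req = urllib.request.Request(url, headers=conditional_headers(v))
    try:
        with urllib.request.urlopen(req) as resp:
            remember(v, resp.headers)
            return resp.read()
    except urllib.error.HTTPError as err:
        # urllib reports "304 Not Modified" as an HTTPError.
        if err.code == 304:
            return None
        raise
```

The first call to `poll()` for a feed sends no validators and gets the full feed. Later calls send back whatever the server supplied, and a server that supports conditional GET can answer with a bodiless 304 instead of the whole feed.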

I could hold my nose about either problem on its own, but put together they go well over the edge. Then MSNbot makes the situation worse with its frequent rechecks, generally of old data (the same problem as in CrazyMSNCrawler).

As an example, on Monday MSNbot made 42 feed requests for 18 different URLs. At most 3 of those URLs (6 requests total) had changed in the last, oh, week. The five most popular URLs (4 requests each) were for the calendar navigation pages for days in July. Since it is not July any more and has not been for some time, those feeds have not exactly been changing recently. And Monday was a comparatively slow day for MSNbot's RSS crawling.
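One common way to avoid hammering feeds that never change is to back off: check a feed less often each time it turns out to be unchanged. The sketch below shows the idea; the function name, the one-hour minimum, and the one-week maximum are arbitrary assumptions for illustration, not anything MSNbot or any particular crawler is known to use.

```python
def next_interval(
    current: float,
    changed: bool,
    minimum: float = 3600.0,         # assumed floor: one hour
    maximum: float = 7 * 86400.0,    # assumed ceiling: one week
) -> float:
    """Return how many seconds to wait before the next check of a feed."""
    if changed:
        # The feed is active again, so go back to checking it often.
        return minimum
    # The feed was unchanged: double the wait, up to the ceiling.
    return min(current * 2, maximum)


# Count how many checks a never-changing feed gets in its first week.
elapsed, interval, checks = 0.0, 3600.0, 0
while elapsed + interval <= 7 * 86400.0:
    elapsed += interval
    checks += 1
    interval = next_interval(interval, changed=False)
print(checks)
```

Under these assumptions a feed for a long-gone July day would quickly settle down to one check a week, instead of four requests in a single day.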

Excess rechecking matters particularly to me because my syndication feeds are dynamically generated at every request, so MSNbot is costing us both wasted bandwidth and wasted CPU cycles. And the more fun it decides to have with all of the Atom feeds it can autodiscover from me, the worse it'll get. (So far it's up to 276 of them.)

web/MSNbotCrazyRSSBehavior written at 03:17:19;

