MSNbot goes crazy with RSS feeds

September 28, 2005

Unfortunately, the issue I discussed in ADynamicSitePeril is not just a theoretical problem. And it's my old friend MSNbot, the MSN Search web crawler, that's responsible.

It's been known for some time that MSNbot is very aggressive with syndication feeds (for example, see here). The official explanation is that they want to have a really up-to-date search index, and to get breaking news they have to poll RSS feeds frequently.

Well, bully for them. However, it's their job to do this in a way that doesn't slam the people whose content they index. And they're visibly failing at this job; as it stands today, MSNbot isn't a responsible RSS crawler.

MSNbot has several significant problems that compound together in its crawling of my syndication feeds:

  1. MSNbot pulls lots of Atom syndication feeds from me despite duplicate content.
  2. MSNbot rechecks feeds quite often, regardless of how infrequently they change.
  3. MSNbot always refetches feeds, regardless of whether or not they've changed.

Mine is not the only blog in the world with multiple subset feeds, and I use a feed format that gives each post a globally unique identifier. It is well within MSNbot's power to notice this sort of duplication and eliminate it, so that they pull only the most general syndication feed. Indeed, they had better be working to eliminate duplicate content from their search index as it is, especially given spam blogs.

(And who knows if their index has decided to consider my blog a spam blog? After all, from their perspective they are pulling huge amounts of 'duplicate' content from all of my various feeds. Certainly WanderingThoughts doesn't seem to come up in MSN searches that I would expect to find it in.)
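Noticing the duplication is not hard, either; Atom's per-entry ids hand a crawler everything it needs. Here's a minimal sketch of entry-level deduplication (the function and the seen-ids set are my own illustration, not anyone's actual crawler code):

```python
import xml.etree.ElementTree as ET

ATOM_NS = '{http://www.w3.org/2005/Atom}'

def new_entries(feed_xml, seen_ids):
    """Parse an Atom feed and return only the entries whose <id> we
    haven't seen before, recording the new ids in seen_ids."""
    root = ET.fromstring(feed_xml)
    fresh = []
    for entry in root.iter(ATOM_NS + 'entry'):
        eid = entry.findtext(ATOM_NS + 'id')
        if eid and eid not in seen_ids:
            seen_ids.add(eid)
            fresh.append(entry)
    return fresh
```

A crawler that ran every subset feed through something like this would discover very quickly that the general feed is a superset of all the rest.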

The bandwidth wasted giving people's feed readers pointless copies of unchanged feeds has been a concern since at least 2002 (see for example here and here). Fortunately, HTTP already has two mechanisms for avoiding this, conditional requests with If-Modified-Since/Last-Modified and with If-None-Match/ETag, and using at least one of them has been 'Best Practices' for feed readers for several years. MSNbot ignores both, behavior that would get the author of a feed reader program pilloried.

I could hold my nose about any one of these problems on its own, but put together they go well over the edge. Then MSNbot makes the situation worse with its frequent rechecks, generally of old data (the same problem as in CrazyMSNCrawler).

As an example, on Monday MSNbot made 42 RSS feed requests for 18 different URLs. At most 3 of those URLs (6 requests total) had changed in the last, oh, week. The five most popular URLs (4 requests each) were the calendar navigation pages for days in July. Since it is not July any more and has not been for some time, those feeds have not exactly been changing recently. And Monday was a comparatively slow day for MSNbot's RSS crawling.

Excess rechecking matters particularly to me because my syndication feeds are dynamically generated at every request, so MSNbot is costing us both wasted bandwidth and wasted CPU cycles. And the more fun it decides to have with all of the Atom feeds it can autodiscover from me, the worse it'll get. (So far it's up to 276 of them.)

Comments on this page:

From at 2005-09-28 23:55:32:

Of course, I'll point out that DWiki doesn't really support If-Modified-Since correctly, because it only honors an If-Modified-Since header that exactly matches the Last-Modified header it sent.

Also, any page that includes one of the many macros that call context.unrel_time won't support If-Modified-Since at all, and neither will their atom feeds.

Are there any pages that MSNbot is starting to get, but then getting a 304 (Not modified) status from? If so, I'd look for commonalities between the pages it is grabbing over and over again.

If not, then I guess I'd double-check that If-Modified-Since is indeed properly supported on atom feeds, and if so conclude that msnbot is just screwed up. (again)

Then again, any web spider that can't parse perfectly valid html doesn't warrant huge expectations of doing other things right.

-- DanielMartin

By cks at 2005-09-29 00:57:07:

Atom feeds deliberately don't evaluate macros (they go through wikirend.terserend instead of the usual renderer). I did this to somewhat bound the amount of computation and flailing an Atom feed would be responsible for.

That's a good point about DWiki requiring an exact If-Modified-Since match, though. I should read the specification and do a better implementation at some point.

However, it doesn't seem to matter; 28 days of our logs show no instance of MSNbot getting a 304 response to any request, even for static pages where Apache does all the work itself and presumably supports the full specification. I suspect that MSN Search simply doesn't bother to store ETag and Last-Modified data.
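(The log scan I mean is simple enough; a sketch, assuming Apache combined log format, with the function being illustrative rather than DWiki code:)

```python
import re

# The request line is in quotes; the HTTP status code is the first
# three-digit number right after it in a common/combined log line.
LOGLINE = re.compile(r'"[^"]*" (\d{3}) ')

def msnbot_status_counts(lines):
    """Count HTTP status codes on log lines that mention msnbot
    (the user-agent field of a combined-format line)."""
    counts = {}
    for line in lines:
        if 'msnbot' not in line.lower():
            continue
        m = LOGLINE.search(line)
        if m:
            status = m.group(1)
            counts[status] = counts.get(status, 0) + 1
    return counts
```

Run over our 28 days of logs, the '304' bucket for MSNbot comes up empty.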

By cks at 2005-09-29 01:53:09:

The issues with If-Modified-Since are kind of tricky. Since DWiki pages are subject to arbitrary time shifts (in particular, the timestamp can go backwards if you revert to an older version of the page), DWiki has to insist on an exact timestamp match in order to generate a 304. I believe this is within the spirit of 'If-Modified-Since'; if the timestamp is different the resource has been modified 'since' the timestamp was issued, even if the new timestamp is older.

DWiki does technically err in not attempting to parse the HTTP date string into an actual timestamp and comparing that, instead of the current mere string comparison. However, RFC 2616 does suggest that clients should use the literal text they got from the server for maximum reliability.
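A better implementation would try the cheap exact string match first and only then fall back to comparing parsed dates, still insisting on equality rather than ordering. A sketch of what I mean (illustrative, not DWiki's actual code):

```python
from email.utils import parsedate_to_datetime

def not_modified(if_modified_since, last_modified):
    """Return True when a 304 response is safe to send.  The exact
    string match is the cheap common case; parsed-date equality then
    catches clients that reformatted the date.  Equality, not <=,
    because this resource's timestamp can legitimately move backwards."""
    if if_modified_since == last_modified:
        return True
    try:
        return (parsedate_to_datetime(if_modified_since)
                == parsedate_to_datetime(last_modified))
    except (TypeError, ValueError):
        # Unparseable or missing date: play it safe, send the full page.
        return False
```
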

From at 2005-09-30 16:37:36:

Okay, this is minorly evil, and a bit petty, but since the people behind MSN search can't seem to be bothered with parsing standard html, why not use that against them? That is, use MSNbot's own html parsing bugs to prevent access to your atom feeds. The idea is that MSNbot can't handle links that are put onto the page in a certain way, even though anything that claims to be able to parse HTML should have no problem with this. So, act on that idea and form your atom links so as to contain a &#X..; character escape.

Specifically, change gendisclink to something like:

def gendisclink(url):
    # Swap the URL's second character for an uppercase-X hexadecimal
    # character reference: perfectly valid HTML that MSNbot fails to parse.
    if len(url) > 2:
        url = url[0] + ('&#X%02x;' % ord(url[1])) + url[2:]
    return '<link rel="alternate" type="application/atom+xml" href="%s">' % url

Then, tools that know how to actually parse html will have no problem since this is a perfectly valid html escape. MSNbot, however, will ask for the url /& from your server instead of the atom feed.

You can use ('&#x%02X;' % ord(url[1])) instead if there's some badly behaved RSS reader that can't handle &#X escapes but can handle &#x (not inconceivable - &#X is valid in HTML but NOT valid in XML; both accept &#x escapes). However, for all I know MSNbot may actually handle &#x escapes correctly.

Of course you could just deny it access flat-out, but this is geekier.

By cks at 2005-10-01 00:43:42:

There are a number of technical issues, but they're dancing around the real one: fundamentally, I'm not interested in going to that much work to help MSN Search. If I need to do anything at all to deal with MSNbot, I'll be banning it entirely.
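(Banning it is at least cheap to implement; assuming MSNbot honors robots.txt, as it claims to, it's two lines:

    User-agent: msnbot
    Disallow: /

That's about the level of effort MSN Search has earned from me.)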

Written on 28 September 2005.

