Wandering Thoughts archives

2005-09-28

MSNbot goes crazy with RSS feeds

Unfortunately, the issue I discussed in ADynamicSitePeril is not just a theoretical problem. And it's my old friend MSNbot, the MSN Search web crawler, that's responsible.

It's been known for some time that MSNbot is very aggressive with syndication feeds (for example, see here). The official explanation is that they want to have a really up to date search index, and to get breaking news they have to poll RSS feeds frequently.

Well, bully for them. However, it's their job to do this in a way that doesn't slam the people whose content they index. And they're visibly failing at this job; as it stands today, MSNbot isn't a responsible RSS crawler.

MSNbot has several significant problems that compound together in its crawling of my syndication feeds:

  1. MSNbot pulls lots of different Atom syndication feeds from me, even though they largely duplicate each other's content.
  2. MSNbot rechecks feeds quite often, regardless of how infrequently they change.
  3. MSNbot always refetches feeds, regardless of whether or not they've changed.

Mine is hardly the only blog in the world with multiple overlapping subset feeds, and I use a feed format (Atom) that gives each post a globally unique identifier. It is well within MSNbot's power to notice this sort of duplication and eliminate it, so that they pull only the most general syndication feed. Indeed, they had better be working to eliminate duplicate content from their search index as it is, especially given spam blogs.

(And who knows if their index has decided to consider my blog a spam blog? After all, from their perspective they are pulling huge amounts of 'duplicate' content from all of my various feeds. Certainly WanderingThoughts doesn't seem to come up in MSN searches that I would expect to find it in.)
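To be concrete about 'well within MSNbot's power': a crawler that parses the feeds it fetches can tell when one feed is just a subset of another by comparing entry ids. Here is a minimal sketch of that idea (the Atom 1.0 namespace and the feed URLs are illustrative assumptions, not anything MSNbot actually does):

    # Sketch: detect that one Atom feed's entries are a subset of another's
    # by comparing each entry's globally unique <id>.
    import urllib.request
    import xml.etree.ElementTree as ET

    ATOM = '{http://www.w3.org/2005/Atom}'   # Atom 1.0 namespace (assumed)

    def entry_ids(feed_url):
        with urllib.request.urlopen(feed_url) as resp:
            root = ET.fromstring(resp.read())
        return {e.findtext(ATOM + 'id') for e in root.iter(ATOM + 'entry')}

    # Hypothetical URLs standing in for a general feed and a subset feed.
    general = entry_ids('http://example.org/blog/?atom')
    subset = entry_ids('http://example.org/blog/category/?atom')

    if subset <= general:
        print('the subset feed adds nothing; crawl only the general one')

A crawler doing even this much bookkeeping would quickly learn to pull just the most general feed from a site like mine.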

The bandwidth wasted to give people's feed readers pointless copies of unchanged feeds has been a concern since at least 2002 (see for example here and here). Fortunately, HTTP already has two methods for avoiding this, conditional GETs based on If-Modified-Since and on ETag/If-None-Match, and using at least one of them has been 'Best Practices' for feed readers for several years. MSNbot ignores both, behavior that would get the author of a feed reader program pilloried.
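For illustration, a conditional fetch looks something like this (a sketch in current Python, not what any particular feed reader or crawler of the time used):

    # Sketch: refetch a feed only if it has changed, using HTTP conditional
    # GETs (If-Modified-Since against Last-Modified, If-None-Match against ETag).
    import urllib.request
    import urllib.error

    def fetch_if_changed(url, last_modified=None, etag=None):
        req = urllib.request.Request(url)
        if last_modified:
            req.add_header('If-Modified-Since', last_modified)
        if etag:
            req.add_header('If-None-Match', etag)
        try:
            with urllib.request.urlopen(req) as resp:
                # Remember Last-Modified and ETag for the next fetch.
                return (resp.read(), resp.headers.get('Last-Modified'),
                        resp.headers.get('ETag'))
        except urllib.error.HTTPError as e:
            if e.code == 304:
                # Not modified: the server sent nothing but a status line.
                return None, last_modified, etag
            raise

A client that remembers the Last-Modified and ETag values from its previous fetch gets back a tiny 304 response instead of the whole feed when nothing has changed, which is exactly what you want for feeds that are polled often but change rarely.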

I could hold my nose about either problem on its own, but put together they go well over the edge. Then MSNbot makes the situation worse with its frequent rechecks, generally of old data (the same problem as in CrazyMSNCrawler).

As an example, on Monday MSNbot made 42 feed fetches covering 18 different URLs. At most 3 of those URLs (6 requests total) had changed in the last, oh, week. The five most popular URLs (4 requests each) were for the calendar navigation pages for days in July. Since it is not July any more and has not been for some time, those feeds have not exactly been changing recently. And Monday was a comparatively slow day for MSNbot's RSS crawling.

Excess rechecking matters particularly to me because my syndication feeds are dynamically generated at every request, so MSNbot is costing us both wasted bandwidth and wasted CPU cycles. And the more fun it decides to have with all of the Atom feeds it can autodiscover from me, the worse it'll get. (So far it's up to 276 of them.)

MSNbotCrazyRSSBehavior written at 03:17:19

2005-09-26

A peril of having a highly dynamic web site

WanderingThoughts, this blog, is built on top of DWiki, my sprawling wiki-oid program. DWiki is layered on top of a normal Unix directory hierarchy and does things by having different ways of looking at it (called 'views' in DWiki terminology); all of DWiki's support for blogs is implemented with general features that can be used on any directory. The only difference between WanderingThoughts and, say, the DWiki help area is that WanderingThoughts is set up to default to the 'blog' view and has some template skinning to add the sidebar.

(This way I could tell myself I was just writing some small additional features for my existing program, instead of yet another blogging system.)

Other things are also done as general features. The calendar and range-based blog navigation is implemented as 'virtual directories' that can be applied to any real directory. Atom syndication feeds are just another view of a directory hierarchy, any directory hierarchy. Because you can compose this generality together, it's trivial to do things like get an Atom feed of the five most recently changed pages in all of CSpace; just tack on '/latest/5/' to the root CSpace URL of /~cks/space/, then add '?atom' to select the Atom syndication view, and it all works.
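(For the curious, the mechanics of this composition are roughly the following toy sketch. The names and structure here are invented for illustration; DWiki's real internals are considerably more involved.)

    # Toy sketch of composing virtual directories with views: the path picks
    # a set of pages, and the view only decides how that set is rendered.
    import os

    def pages_under(root):
        """All files under a directory, most recently changed first."""
        found = []
        for dirpath, _, files in os.walk(root):
            found.extend(os.path.join(dirpath, f) for f in files)
        return sorted(found, key=os.path.getmtime, reverse=True)

    def handle(root, path, view='normal'):
        parts = [p for p in path.split('/') if p]
        # 'latest/N' is a virtual directory: the N most recently changed
        # pages under whatever real directory comes before it.
        if len(parts) >= 2 and parts[-2] == 'latest':
            pages = pages_under(os.path.join(root, *parts[:-2]))[:int(parts[-1])]
        else:
            pages = pages_under(os.path.join(root, *parts))
        # The view (selected with '?atom', '?normal', ...) changes rendering only.
        if view == 'atom':
            return 'atom feed of: ' + ', '.join(pages)
        return 'html listing of: ' + ', '.join(pages)

    # e.g. handle('/path/to/cspace', 'latest/5', view='atom')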

All of this is very general and dynamic (since everything is generated on the fly). And therein lies our peril, right at the intersection of all of these dynamic website features. Follow along:

  1. Every directory has an Atom feed.
  2. Blog calendar navigation creates lots of links to lots of (virtual) directories.
  3. Every page with a regular Atom feed has automatic feed discovery enabled, because this is the friendly thing to do.

Between the top level directory and the category subdirectories, multiplied by the day, month, and year pages for every day with posts, WanderingThoughts probably has thousands of subdirectories. Each of these directories has its own Atom syndication feed, each of which can be autodiscovered by anything that crawls CSpace through those handy links.

Boy, I hope that any crawlers doing that are smart enough to realize they have a bunch of duplicate feeds.

Boy am I an optimist.

(DWiki can't mark the calendar navigation links 'nofollow', because I want web spiders to follow them to find older blog entries. How else are they going to do it? (Web spiders not infrequently shy away from links with '?' or other URL parameters in them, which makes me nervous about counting on the 'See As Normal' link to lead spiders to plain directory traversal.))

ADynamicSitePeril written at 01:23:02

2005-09-23

The (probable) importance of accurate Content-Types

As a result of the MSN search spider going crazy, I am actually paying some attention to our web server logs for a change. This led me to look up which URLs were responsible for the largest amounts of bandwidth.

To my surprise, the six largest bandwidth sources were some CD-ROM images in ISO format that we happen to have lying around, the oldest one dating back to 2002. In the last week alone, there were eight requests totaling 3.5 gigabytes of transfers. Who could be that interested in some relatively ratty old ISO images?

Search engines, it turned out. All of the off-campus requests for the ISO images over the past 28 days came from MSNbot, Googlebot, and Yahoo! Slurp. I already knew about the crazy MSN spider, but Googlebot is well behaved; what possible reason could it have for fetching the same 600 megabyte image (last changed May 27th) three times between September 17th and September 21st?

I had previously noticed (while researching CrazyMSNCrawler) that our web server was serving these ISO images with the Content-Type of text/plain. (I didn't think much about it at the time, except to become less annoyed at MSNbot repeatedly looking at them.)

Suddenly the penny dropped: Googlebot probably thought the URL was a huge text file, not an ISO image. Worse, the web server was claiming that the 'text file' was in UTF-8, even though it certainly contained non-UTF-8 byte sequences.

If my theory is right, no wonder search engines repeatedly fetched the URLs. Each time they were hoping that this time around the text file would have valid UTF-8 that they could use. (Certainly I'd like search engines to re-check web pages that have invalidly encoded content, in the hopes that it gets fixed sooner or later.)

Our web server is now serving files that end in .iso as Content-Type application/octet-stream. Time will tell if my theory is right and the search engines lay off the ISO images. (Even if it does no good with search engines, unwary people who click on the links now won't have their browser trying to show them the 600 megabyte 'text' of the ISO file, which is a good thing.)
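(The Apache side of this is a one-line directive, something along these lines; exactly where it goes depends on your configuration, and this is illustrative rather than a copy of ours:)

    # Serve .iso files as opaque binary data instead of text/plain.
    AddType application/octet-stream .iso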

The obvious moral: every so often, take a look at your web server logs. You never know what interesting things you'll find.

(Maybe you'll discover that you're hosting a bulletin board system that averages a couple of hits a second that you hadn't previously noticed. Don't laugh; it happened to us. (It was a legitimate bulletin board system; we just hadn't realized it was quite that active.))

Sidebar: the bonus round of CPU usage

In an effort to speed up transfers to clients by reducing the amount of data transferred to them, I recently configured the web server to compress outgoing pages on the fly for various Content-Types if the client advertised it was cool with this. (Using the Apache mod_deflate module.)
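(The mod_deflate configuration for this looks something like the following; the specific list of types is illustrative, not our exact setup:)

    # Compress responses with these Content-Types on the fly, for clients
    # whose Accept-Encoding header says they can handle it.
    AddOutputFilterByType DEFLATE text/html text/plain text/css text/xml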

Of course, one of those Content-Types was text/plain.

So not only were we doing huge pointless transfers, we were probably burning extra CPU to compress them on the fly. And for ISO images, where most of the content is likely already compressed, further compression passes are at best pointless and at worst make the data slightly larger.

AccurateContentTypeImportance written at 02:27:46

2005-09-16

Web browsers make bad text editors

As editors, web browsers have all sorts of problems, such as a narrow view of the text (squeezed into a small to tiny text box), only (very) basic editing operations, and a severe lack of features. Some of the missing features, like spellchecking and saving drafts in progress, can be somewhat made up for by the website, but even then they tend to be awkward. (You can read one person's grumbles about the problem in the context of blogging here.)

This makes it puzzling that more and more people are designing systems that call for web browsers to fill the role of text editors. Often the web browsers are the only available text editors. 'Web-based' is big (blogs, wikis, bug tracking systems, and so on) and all too often web-based means 'only accessible through the web'.

Every time I see this, I wince.

Bad software creates a kind of friction. In the face of friction, people have to work harder and be more motivated in order to use your software. Some of them won't bother; some of them will wind up grumpy. The more friction your systems have, the larger this effect.

Most people have a finite amount of energy and time that they're willing to devote to writing things. The more work the text editing takes, the less they have left to spend on creating and refining the actual content. And the content is the important thing, so effort spent merely editing it is basically lost.

(Certainly this is the case for me. More than once I've concluded that fighting with text editing in my browser would simply take more energy than I had available for writing at the moment, and so I haven't written comments on this or that.)

In many cases the people writing the content are probably the most important users of your system (especially in the case of bug tracking systems). The corollaries are obvious.

DWiki has deliberately adopted the contrarian position of making file editing with a real editor the primary (and so far only) way to work on pages. I feel strongly that this is a big part of why I've been willing to keep writing at least an entry a day for almost three months now; it is simply that much easier.

(The extra features enabled by real file editing are very nice, too, like drafts and notes and outlines that stick around as long as I want them, and an ideas file.)

(Updated: my apologies to people who are seeing this twice. I realized I had given this entry the wrong name, and in DWiki changing an entry's name also changes its identity in syndication feeds.)

BrowsersMakeBadEditors written at 01:20:57

2005-09-07

The MSN search spider has gone crazy

I'm not the first person to notice this, but the MSN search spider ('msnbot' if you want to search for it in the user agent portion of weblogs, or just look for requests from the 207.46.98.* subnet) seems to have gone crazy.
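(If you want to do the same sort of counting against your own logs, the gist of it is something like this sketch; the log file name and the matching rules are assumptions, not exactly what I used:)

    # Count requests per URL made by msnbot, from an Apache combined-format log.
    import collections
    import re

    logline = re.compile(r'^(\S+) \S+ \S+ \[[^]]+\] "(\S+) (\S+)[^"]*" (\d+) (\S+) "[^"]*" "([^"]*)"')

    counts = collections.Counter()
    with open('access_log') as logf:
        for line in logf:
            m = logline.match(line)
            if not m:
                continue
            host, method, url, status, size, agent = m.groups()
            if 'msnbot' in agent.lower() or host.startswith('207.46.98.'):
                counts[url] += 1

    for url, n in counts.most_common(10):
        print(n, url)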

Specifically, the MSNbot is repeatedly and aggressively crawling completely unchanging URLs, while paying much less attention to changing ones. On this server, the past 29 days' worth of logs show:

  • the number one hit, with more than 60 fetches attempted, is to an URL that doesn't exist and hasn't existed for a minimum of months.
  • of the next ten most-often fetched web pages, the most recently changed one was last updated in 2003. Some have not changed since before the turn of the century. Each was fetched by the MSNbot more than 20 times; the most popular one was fetched almost 58 times (just about twice a day).
  • MSNbot is perfectly willing to repeatedly fetch huge files, transferring 2.5 gigabytes worth of them from us. The most recently changed huge file that MSNbot gave its love to (four times) last changed in 2004. The most popular three files (15 fetches each on average) last changed in 1996, 1996, and 1997 respectively; that was good for 363 megabytes of data transfers.

Overall, MSNbot requested over 5100 pages from us; judging from Referer logs, exactly 72 MSN searches brought visitors here.

On a second system I have logs going back to May 29th. Since then, MSNbot has requested 364,000 pages from the website, with about 5,000 MSN searches bringing people to us. The pages MSNbot requests most often are again completely crazy:

  • the most popular URL hasn't existed for years (550+ requests)
  • eight actually existing web pages got 100 requests or more from the MSNbot. The newest last changed in 2001; the oldest was last changed in 1994, and it got 199 requests (making it the third most requested page, narrowly beaten out by two pages last changed in 1999).
  • MSNbot made 12 requests for a 15 megabyte PDF file last changed in January.

On both websites, many of the most requested URLs don't exist. While looking periodically to see if nonexistent URLs that people are still linking to have reappeared is a good idea, I don't see why it should be the MSNBot's most popular thing to do.

It's clear that how often the MSN spider looks at web pages has very little to do with how often they change. For example, the index page for WanderingThoughts, which changes at least every day, was fetched only 14 times over 29 days (and in a completely uneven pattern, with several skips). Meanwhile, my top level home page, unchanged since May of 2001, was fetched 45 times (more than once a day).

Fortunately the University of Toronto has lots of bandwidth to spare, and neither web server is exactly straining under the load.

(The good news is that this issue may reach the ears of some MSN Search people and they'll hopefully fix whatever is wrong. If so, I'll update this entry with appropriate information.)

Update, September 9th or so: some people from MSN Search have been in contact with me and now have various details (like specific URLs and so on). There have been no other developments (including no particularly apparent change in MSNbot's crawling patterns).

Update, September 30th: the MSN Search people have gotten back in contact with me again. Unfortunately, MSNbot continues to have various issues with how it crawls us, including significantly excessive transfers of ISO images. Since the MSN Search people are talking with me, I am not currently planning to take aggressive action against MSNbot.

Update, November 14th: with no contact from MSN Search in over a month and continued bad MSNbot behavior, I have given up and banned MSNBot from crawling our website. See BanningMSNBot.

CrazyMSNCrawler written at 02:19:19



This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.