Wandering Thoughts archives

2005-11-15

The scope of the peril of having a highly dynamic web site

In ADynamicSitePeril, I wrote about how dynamically generating various aspects of this blog means that WanderingThoughts has a lot of Atom feeds. Since MSN Search's aggressive fetching of several hundred of those feeds has been on my mind recently, I thought it would help to put some concrete numbers on this.

WanderingThoughts is built on top of a directory hierarchy, where each entry is a file and categories are subdirectories. Right now (before I post this entry), it has 10 directories (nine 'categories' and the top level), 186 entries, and four administrative files (two entry indexes, the recent comments index, and the sidebar).
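
To picture the structure, it looks something like this (the category and entry names here are invented for illustration, not the real ones):

    WanderingThoughts/        <- the top level, with its own Atom feed
        linux/                <- a category: a subdirectory with its own feed
            SomeEntry         <- an individual entry is just a file
        web/
            AnotherEntry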

A web spider that doesn't crawl through links marked 'nofollow' will see 375 directories (each with an Atom feed). A web spider that crawls through all links will see 964 directories, with the additional directories coming mostly from crawling all of the range-based 'previous N entries' links.

(It's a popular belief that marking links 'nofollow' means spiders will never crawl through them. Technically this is false; the original description just calls for the link to give no credit to the target. In practice, all of the common web spiders just don't follow 'nofollow' links, so you can abuse them this way.)
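
In HTML terms, a 'nofollow' link is just an ordinary link with an extra rel attribute, along these lines (the URL is made up for illustration):

    <a href="/some/page/" rel="nofollow">some page</a>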

Over the course of the last 28 days, MSNBot fetched 365 different Atom feeds in WanderingThoughts, with the most recently created feed dating from November 11th. Since each entry typically creates two new virtual directories, MSNBot came very close to finding every non-nofollow virtual directory that existed as of last Friday.
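
(As rough arithmetic: 186 entries at about two virtual directories each comes to roughly 372, which is in the same ballpark as the 375 directories a polite spider sees; that's why 365 fetched feeds amounts to very nearly complete coverage.)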

DynamicSitePerilScope written at 01:15:42

2005-11-14

Banning MSNBot: an open letter to MSN Search

I understand that MSN Search wants to be bigger than Google in search. One necessary step towards this is that people must be willing to let MSNBot, your search spider, index their content. Today, I changed our robots.txt to ban MSNBot, joining various other people, and so you're a little bit further from your goal. (Probably not enough further that you care very much.)

I'm not doing this because I dislike Microsoft. I'm doing this for a simple reason, the same reason other people have: right now, MSNBot is not a responsible search spider. Here is an incomplete list of the sort of things MSNBot routinely does on this site:

  • repeatedly fetches large binary files, including 500 megabyte ISO images, that are properly served as binary files and have not changed in some time; 21 fetches for 4 files accounting for 3.7 gigabytes of transfers this week. (See MSNbotBinariesProblem)
  • aggressively fetches syndication feeds, many of them unchanging; 1,615 fetches of 329 feeds amounting to 45 megabytes of transfers this week. Half of the top 10 requested feeds have not changed within the past week, yet were requested 12 times or more. (See MSNbotCrazyRSSBehavior)
  • never uses conditional GET, even when aggressively fetching syndication feeds. (See AtomReadersAndCondGet)
  • aggressively recrawls unchanging content and error pages, while neglecting changed content, although this is better than it used to be. (See CrazyMSNCrawler)

All of these behaviors are undesirable. Most of them are aggressively antisocial. None of them should be news to MSN Search, because two months ago, when I first started noticing these issues, someone I know who worked at Microsoft put me in email contact with some members of the MSN Search team. They got in touch with me, got information from me, and then disappeared; the last time I heard from them was September 30th.

The only change in MSNBot's behavior since September 7th is that it has become a little less enthusiastic about crawling unchanging and error URLs, and that it stopped pulling our large binary files for a week or two. As far as I can tell, these are just routine fluctuations in MSNBot's crawling behavior.

That's why I've finally banned MSNBot: not just because it does antisocial things, but also because after two months of waiting I no longer believe that MSNBot will get fixed any time soon. On the Internet, two months is a very long time to tolerate antisocial behavior, far longer than you should expect people to wait. (And if not for the hope created by my brief and fleeting contact with MSN Search programmers, I would not have waited this long.)

So, in summary: I don't enjoy banning MSNBot, but I have lost patience with its bad behavior and don't expect it to change. Enough is enough; out it goes.

Why not just ban MSNBot from crawling the various problem areas, instead of banning it outright? Three reasons:

  • it would do nothing to change MSNBot's habit of repeatedly crawling unchanging pages.
  • it makes me play whack-a-mole, where I chase MSNBot around to see what new problem it's grown that I need to hammer down.
  • I don't feel inclined to do MSNBot any more favours, and it would be a favour, since it's not a popular search engine here. (See also OnBanningSearchEngines)

(There would be a fourth reason, 'you can't do that in robots.txt's syntax', but MSNBot supports wildcard matching in Disallow: lines, so you can do things like disallow crawling specific URL extensions. See here.)
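
As an illustration, using that wildcard syntax to block just our ISO images might look something like this (the path pattern here is invented for the example):

    User-agent: msnbot
    Disallow: /*.iso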

BanningMSNBot written at 00:51:54

2005-11-02

How well do some Atom feed fetchers do conditional GETs?

'Conditional GET' is the HTTP technique used to save bandwidth by not re-fetching unchanged pages. Using conditional GET is especially important for things that fetch syndication feeds (RSS or Atom), because people usually check feeds much more often than they revisit web pages. (This is another good reference for syndication feed reader authors.)
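
To make this concrete, here is a minimal sketch of the client side of conditional GET in Python; the function and its calling convention are my own invention for illustration, not any particular feed reader's code, and it assumes the caller persists the returned validators somewhere between runs:

    import urllib2

    def fetch_feed(url, etag=None, last_modified=None):
        # Send back the cache validators we saved from the last
        # successful fetch, if we have any.
        req = urllib2.Request(url)
        if etag:
            req.add_header('If-None-Match', etag)
        if last_modified:
            req.add_header('If-Modified-Since', last_modified)
        try:
            resp = urllib2.urlopen(req)
        except urllib2.HTTPError, e:
            if e.code == 304:
                # Unchanged; the server sent us no feed body at all.
                return None, etag, last_modified
            raise
        # New or changed feed: remember its validators for next time.
        info = resp.info()
        return resp.read(), info.getheader('ETag'), info.getheader('Last-Modified')

The crucial part is remembering the ETag and Last-Modified values between fetches; as we'll see, that persistence is exactly where some fetchers fall down.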

WanderingThoughts has a lot of syndication feeds, and the main ones are quite big. Recently, partly prompted by issues with MSNBot, I decided to take a look at what was fetching my syndication feeds and how well they did conditional GET. So I looked at my data for about the past week, chosen in part because I had recently added detailed logging of what conditional GET headers things fetching my Atom feeds send.

(First, I have to say that I like having readers and we have a lot of spare bandwidth. If your syndication feed reader does badly here, it is absolutely not a request for you to unsubscribe.)

Conditional GET can be done with ETag / If-None-Match, or with If-Modified-Since; ETag is better. Perfect scores go to the feed fetchers that always use it: SharpReader, Bloglines, LiveJournal, Feedster Crawler, and NetNewsWire.

A few feed fetchers lose some points from the East German judge:

  • liferea lost out on a perfect score because while it always uses If-Modified-Since, it only sometimes uses If-None-Match (specifically, only if it has fetched a changed feed since the program was started; it doesn't store the ETag value in its persistent database).
  • Yahoo Slurp and PubSub-RSS-Reader only use If-Modified-Since, which works but is not ideal.

The 'nice try, but...' award goes to:

  • Rojo 1.0, which supports ETag but unfortunately makes up its own timestamps for If-Modified-Since, and sends both headers. This doesn't work, for reasons explained here and here (and sketched in the code just after this list).
  • BlogSearch, which sends If-None-Match, but with the quotes stripped from the ETag value that DWiki supplies. (This may be RFC-compliant, in which case I need to fix DWiki.)
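
To make the Rojo failure concrete, here is a simplified Python sketch of the usual server-side rule (this is not DWiki's actual code): every conditional header the client sent must pass before the server may answer 304. The sketch compares timestamp strings exactly, a simplification; real servers parse the dates, but a made-up timestamp generally fails the check either way.

    def can_send_304(req_headers, etag, last_modified):
        # All of the conditions the client supplied must succeed
        # before we may answer 304 Not Modified.
        inm = req_headers.get('If-None-Match')
        ims = req_headers.get('If-Modified-Since')
        if inm is None and ims is None:
            return False  # not a conditional request at all
        if inm is not None and inm != etag:
            return False
        if ims is not None and ims != last_modified:
            # A fabricated If-Modified-Since value fails here, so
            # sending a correct ETag plus a bogus timestamp gets a
            # full 200 response instead of a 304.
            return False
        return True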

A number of syndication feed fetchers don't support conditional GET; they don't even bother to send If-Modified-Since headers, and always wind up re-fetching my syndication feeds (when they fetch the main one, this is 300K or so a shot). They are:

  • everyone's friend MSNbot, who is by far the most active fetcher of my Atom feeds.
  • 'madicon RSS Reader', which appears to be a syndication feed reader addon for Lotus Notes. Working in the Notes environment may make it difficult to store the per-feed information necessary to support conditional GET.
  • 'Waggr_Fetcher)', http://www.waggr.com/; this appears to be a web-based feed reader.
  • kinjabot, another web-based aggregator thing.
  • FeedFetcher-Google and 'Googlebot/2.1' (fetching as a browser); these surprised me, because I expected Google to do better.
  • BlogPulse, although to be fair it only visited three times in the last week. (It's an interesting blog search engine; I wish it indexed WanderingThoughts more. Unfortunately they want an email address to submit blog URLs, which is an immediate turnoff these days.)

AtomReadersAndCondGet written at 02:16:15

