MSNbot (still) has problems with binary files

October 21, 2005

Dating back to our first experiences with msnbot, the MSN Search web crawler, I've known that it was kind of crazy about repeatedly fetching large binary files. Since then, we have pointed this issue out to MSN Search people more than once and switched to using accurate Content-Types. Recently we had a week of MSNbot not refetching those large binaries, so it looked like MSNbot had finally been fixed.

So much for that. Since 7pm Wednesday night, MSNbot has fetched 3.1 gigabytes of various large, unchanging 'application/<definitely not text>' files from us. Highlights of the experience include MSNbot fetching the same 537 megabyte ISO image six times (once less than twenty minutes after the previous fetch).

It is clear that MSNbot simply does not deal correctly with binary files, meaning things served with various 'application/<whatever>' content types. There are a few application/* content types that are appropriate to index (PDFs, for example), but for us MSNbot definitely goes far beyond that.

From things I've heard, it would not surprise me if MSNbot ignores the content-type and just relies on a hard-coded list of URL extensions to not crawl. (Presumably things like .exe and .zip are in there.)

This is completely brain-damaged, since extensions on URLs don't necessarily have anything to do with their content-type. For example, you can search high and low without finding a .html extension in DWiki. (Yes, some web servers use the file extension as part of the process to decide on what Content-Type: header to generate. This is an internal implementation detail.)
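
As a rough illustration of deciding from the Content-Type instead of the URL, here is a minimal sketch. The function names and the set of indexable application/* types are assumptions made up for this example; nothing here is known to reflect how MSNbot actually works.

```python
# Hypothetical crawler-side policy; the names and the type list are
# illustrative assumptions, not MSNbot's real logic.
INDEXABLE_APPLICATION_TYPES = {
    "application/pdf",
    "application/xhtml+xml",
}


def media_type(content_type_header):
    """Return the bare lowercase media type, dropping parameters such as charset."""
    return content_type_header.split(";", 1)[0].strip().lower()


def worth_fetching(content_type_header):
    """Decide from the Content-Type header alone; the URL is never consulted."""
    mtype = media_type(content_type_header)
    if mtype.startswith("text/"):
        return True
    return mtype in INDEXABLE_APPLICATION_TYPES
```

With a policy like this, a URL with no extension at all that is served as text/html is still indexed, while a large ISO image is skipped no matter what its URL looks like.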

I doubt we're the only site experiencing this issue. If you have large binary files on your site, I strongly urge you to check your server logs for similar behavior.
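
One way to do that check is to total up fetches and bytes per URL for the crawler. The sketch below assumes your server writes the common 'combined' log format and that the crawler's User-Agent contains "msnbot"; both are assumptions, and the function name is made up for this example.

```python
import re
from collections import defaultdict

# Assumes the 'combined' access log format:
# host ident user [date] "METHOD url PROTO" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]*\] "(?:\S+) (\S+)[^"]*" (\d{3}) (\d+|-) "[^"]*" "([^"]*)"'
)


def bytes_by_url(lines, agent_substring="msnbot"):
    """Return {url: [fetch_count, total_bytes]} for matching User-Agents."""
    totals = defaultdict(lambda: [0, 0])
    needle = agent_substring.lower()
    for line in lines:
        m = LOG_LINE.match(line)
        if m is None:
            continue  # not in the assumed format; skip it
        url, _status, size, agent = m.groups()
        if needle not in agent.lower():
            continue
        entry = totals[url]
        entry[0] += 1
        # A size of "-" means no body was sent.
        entry[1] += 0 if size == "-" else int(size)
    return dict(totals)
```

Passing an open log file to bytes_by_url() and sorting the result by total bytes should put any repeatedly fetched large binaries at the top.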


Comments on this page:

From 192.88.60.254 at 2005-10-21 17:06:40:

Here's what I want to know - why does msnbot never use HTTP/1.1? The only queries we ever get from them are all HTTP/1.0, which must mean that they have trouble indexing people who are virtual hosted, or maybe that they rely on servers to accept the Host: header even in an HTTP/1.0 request.

Compare this to google, which flings at us both HTTP/1.1 and HTTP/1.0 queries in about equal measure.

-- DanielMartin

By cks at 2005-10-21 18:11:29:

I think almost all servers will accept Host: even in HTTP/1.0 requests (it's apparently strongly recommended that HTTP/1.0 clients and servers both use it). The pragmatic odds of running into a web server that has virtual hosts but ignores Host: in HTTP/1.0 requests are probably pretty low.

(Such a server is going to cause problems for people besides MSN. There are a number of web proxies that only do HTTP/1.0, partly because supporting HTTP/1.1 is complicated.)
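
To make the point concrete, here is a minimal sketch of a server picking a virtual host from the Host: header without caring whether the request says HTTP/1.0 or HTTP/1.1. The function names and host names are hypothetical, and the parsing is deliberately simplified (it does not handle IPv6 literals or folded headers).

```python
def parse_request_head(raw):
    """Split a raw request head into (method, target, version, headers)."""
    head = raw.split(b"\r\n\r\n", 1)[0].decode("latin-1")
    request_line, *header_lines = head.split("\r\n")
    method, target, version = request_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return method, target, version, headers


def pick_virtual_host(raw, known_hosts, default_host):
    """Choose a virtual host from Host:, regardless of the HTTP version."""
    _method, _target, _version, headers = parse_request_head(raw)
    # Drop any ":port" suffix; this simplification breaks on IPv6 literals.
    host = headers.get("host", "").split(":", 1)[0].lower()
    return host if host in known_hosts else default_host
```

An HTTP/1.0 request that carries Host: gets routed to the right virtual host; only a request with no Host: at all falls back to the default.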

Written on 21 October 2005.

Last modified: Fri Oct 21 01:22:57 2005