Wandering Thoughts archives

2005-10-21

How ETags and If-Modified-Since headers interact

Part of the fun of writing programs that deal with HTTP is decoding things like RFC 2616 to answer somewhat obscure questions about how various things interact. Today's case is the following question:

When can your web server generate a 304 'content not modified' response if it receives a request with both an If-None-Match and an If-Modified-Since header?

If-None-Match and If-Modified-Since are HTTP request headers used to implement 'conditional GET', a bandwidth-saving technique that avoids re-fetching unchanged pages.

(ETag headers come into this because the server's ETag value is what the client will use as its If-None-Match value in the conditional GET request.)

The answer turns out to be in section 13.3.4 of RFC 2616. It is (de-RFC-ized):

You can only generate a 304 response if both headers match: the If-None-Match value matches the response's ETag and the If-Modified-Since value matches the response's Last-Modified.

In the case of If-Modified-Since and Last-Modified, servers may require an exact match instead of merely Last-Modified being no later than If-Modified-Since. As RFC 2616 notes in 14.25, client authors should really just store the Last-Modified result as a string and hork it up in their If-Modified-Since header.

This came up when I threw debugging code into DWiki to see exactly what was being sent by the various people who repeatedly pull my Atom feed without getting bandwidth-efficient 304 responses. One feed reader was sending both headers but making up its own If-Modified-Since value instead of just repeating Last-Modified. (DWiki requires an exact match for technical reasons.)

(Whether by accident or by reading RFC 2616 carefully when I wrote the code and then forgetting it, DWiki does the right thing when both headers are present.)
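As a rough illustration of the rule from section 13.3.4, here is a minimal Python sketch. It is not DWiki's actual code; the function name `may_send_304` and the lower-cased header dictionary are assumptions made up for this example. It takes the strict approach described above and requires If-Modified-Since to be exactly the Last-Modified string, and it simplifies ETag handling (plain string comparison, no weak validator logic, and a naive comma split of If-None-Match).

```python
def may_send_304(request_headers, etag, last_modified):
    """Decide whether a 304 is allowed for a conditional GET.

    request_headers: dict mapping lower-cased header names to values
                     (a hypothetical representation for this sketch).
    etag, last_modified: the exact strings the server would send as the
                     ETag and Last-Modified headers of a full response.
    """
    if_none_match = request_headers.get("if-none-match")
    if_modified_since = request_headers.get("if-modified-since")

    # With neither header present this is not a conditional request.
    if if_none_match is None and if_modified_since is None:
        return False

    # If-None-Match may list several entity tags, or be "*".
    # Simplification: split on commas and compare strings exactly.
    if if_none_match is not None:
        candidates = [tag.strip() for tag in if_none_match.split(",")]
        if "*" not in candidates and etag not in candidates:
            return False

    # Strict policy: the client must echo Last-Modified back verbatim.
    if if_modified_since is not None:
        if if_modified_since.strip() != last_modified:
            return False

    # Every conditional header that was sent has matched.
    return True
```

A request that sends a correct If-None-Match but an invented If-Modified-Since gets a full response from this sketch, which is the situation the feed reader above ran into.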

web/ETagAndIfModSinceInteraction written at 17:52:23

MSNbot (still) has problems with binary files

Dating back to our first experiences with msnbot, the MSN Search web crawler, I've known that it was kind of crazy about repeatedly fetching large binary files. Since then, we have pointed this issue out to MSN Search people more than once and switched to using accurate Content-Types. Recently we had a week of MSNbot not refetching those large binaries, so it looked like MSNbot had finally been fixed.

So much for that. Since 7pm Wednesday night, MSNbot has fetched 3.1 gigabytes of various large, unchanging 'application/<definitely not text>' files from us. Highlights of the experience include MSNbot fetching the same 537 megabyte ISO image six times (once less than twenty minutes after the previous fetch).

It is clear that MSNbot simply does not deal correctly with binary files, meaning things served with various 'application/<whatever>' content types. There are a few application/* content types that are appropriate to index (PDFs, for example), but for us MSNbot definitely goes far beyond that.

From things I've heard, it would not surprise me if MSNbot ignores the content-type and just relies on a hard-coded list of URL extensions to not crawl. (Presumably things like .exe and .zip are in there.)

This is completely brain-damaged, since extensions on URLs don't necessarily have anything to do with their content-type. For example, you can search high and low in DWiki without finding a .html extension. (Yes, some web servers use the file extension as part of the process to decide on what Content-Type: header to generate. This is an internal implementation detail.)
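To make the point concrete, here is a small Python sketch of deciding whether a response is worth indexing from its Content-Type header alone, without ever looking at the URL. This is purely hypothetical; I have no idea what MSNbot actually does internally, and both the function name `worth_indexing` and the short allowlist are invented for the example.

```python
# Hypothetical allowlist of application/* types that hold indexable text.
INDEXABLE_APPLICATION_TYPES = {
    "application/pdf",
    "application/xhtml+xml",
}


def worth_indexing(content_type):
    """Return True if a response with this Content-Type looks indexable."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type.startswith("text/"):
        return True
    return media_type in INDEXABLE_APPLICATION_TYPES
```

Since the URL never enters into it, a page without a .html extension and an ISO image without a telltale extension are both classified by what the server says they are.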

I doubt we're the only site experiencing this issue. If you have large binary files on your site, I strongly urge you to check your server logs for similar behavior.
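If you want a starting point for checking your own logs, here is a rough Python sketch. It assumes your server writes the common 'combined' log format and that the crawler's User-Agent contains the string "msnbot"; the function name `repeated_large_fetches` and the one megabyte default threshold are arbitrary choices for this example, so adjust them to your own setup.

```python
import re
from collections import Counter

# Matches one line of the 'combined' access log format (an assumption
# about the server's configuration).
LOG_RE = re.compile(
    r'\S+ \S+ \S+ \[[^\]]*\] "(?P<method>\S+) (?P<url>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) "[^"]*" "(?P<agent>[^"]*)"'
)


def repeated_large_fetches(lines, agent_substring="msnbot", min_size=1_000_000):
    """Map URL -> (fetch count, total bytes) for large repeat fetches.

    Only successful GETs (status 200) of at least min_size bytes by
    agents whose User-Agent contains agent_substring are counted, and
    only URLs fetched more than once are reported.
    """
    counts = Counter()
    total_bytes = Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if m is None:
            continue  # skip lines that are not in the expected format
        if agent_substring not in m.group("agent").lower():
            continue
        if m.group("method") != "GET" or m.group("status") != "200":
            continue
        size = m.group("size")
        if size == "-" or int(size) < min_size:
            continue
        url = m.group("url")
        counts[url] += 1
        total_bytes[url] += int(size)
    return {url: (n, total_bytes[url]) for url, n in counts.items() if n > 1}
```

You would feed it the lines of an access log, for example `repeated_large_fetches(open("access.log"))` where `access.log` stands in for wherever your logs actually live, and then look at which URLs come back.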

web/MSNbotBinariesProblem written at 01:22:57

