The (probable) importance of accurate Content-Types

September 23, 2005

As a result of the MSN search spider going crazy, I am actually paying some attention to our web server logs for a change. This led me to look up which URLs were responsible for the most bandwidth.
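As a rough illustration of this kind of log digging, here is a minimal Python sketch that totals bytes sent per URL. It assumes the access log is in Apache's Common or Combined Log Format; the function name, the regular expression, and the sample log lines are all made up for the example and do not come from our actual logs.

```python
import re
from collections import Counter

# Pulls the request path, status code, and byte count out of a
# Common/Combined Log Format line (assumed format, see above).
LOG_LINE = re.compile(r'"(?:[A-Z]+) (\S+)[^"]*" (\d{3}) (\d+|-)')


def bytes_by_url(lines):
    """Return a Counter mapping each requested URL to total bytes sent."""
    totals = Counter()
    for line in lines:
        m = LOG_LINE.search(line)
        if m is None:
            continue  # skip anything that does not look like a log entry
        url, _status, size = m.groups()
        if size != "-":  # "-" means no response body was sent
            totals[url] += int(size)
    return totals


# Hypothetical sample entries, invented for illustration.
sample = [
    '192.0.2.1 - - [21/Sep/2005:10:00:00 -0400] "GET /images/cd1.iso HTTP/1.1" 200 629145600',
    '192.0.2.2 - - [21/Sep/2005:10:05:00 -0400] "GET /index.html HTTP/1.1" 200 5120',
    '192.0.2.1 - - [21/Sep/2005:11:00:00 -0400] "GET /images/cd1.iso HTTP/1.1" 200 629145600',
    '192.0.2.3 - - [21/Sep/2005:11:30:00 -0400] "GET /index.html HTTP/1.1" 304 -',
]

# Print the biggest bandwidth consumers first.
for url, total in bytes_by_url(sample).most_common(5):
    print(total, url)
```

On a real log you would pass an open file object instead of the sample list; sorting by total bytes rather than by hit count is what makes a handful of huge transfers stand out.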

To my surprise, the six largest bandwidth sources were some CD-ROM images in ISO format that we happen to have lying around, the oldest one dating back to 2002. In the last week alone, there were eight requests totaling 3.5 gigabytes of transfers. Who could be that interested in some relatively ratty old ISO images?

Search engines, it turned out. All of the off-campus requests for the ISO images over the past 28 days came from MSNbot, Googlebot, and Yahoo! Slurp. I already knew about the crazy MSN spider, but Googlebot is well behaved; what possible reason could it have for fetching the same 600 megabyte image (last changed May 27th) three times between September 17th and September 21st?

I had previously noticed (while researching CrazyMSNCrawler) that our web server was serving these ISO images with the Content-Type of text/plain. (I didn't think much about it at the time, except to become less annoyed at MSNbot repeatedly looking at them.)

Suddenly the penny dropped: Googlebot probably thought the URL was a huge text file, not an ISO image. Worse, the web server was claiming that the 'text file' was in UTF-8, even though it certainly contained byte sequences that are not valid UTF-8.
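To make the mismatch concrete, here is a small Python sketch of the kind of check a consumer of 'text/plain; charset=UTF-8' content could run. This is only an illustration; nothing here is known about how any search engine actually validates content, and the function name and sample bytes are invented.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Genuine text, encoded as UTF-8.
text_sample = "plain ASCII and a few accents: café".encode("utf-8")

# Made-up binary bytes standing in for a chunk of an ISO image.
# The byte 0xFF can never appear in valid UTF-8.
binary_sample = bytes([0x43, 0x44, 0x30, 0x30, 0x31, 0xFF, 0xFE, 0x80, 0x00])

print(is_valid_utf8(text_sample))    # True
print(is_valid_utf8(binary_sample))  # False
```

Arbitrary binary data almost always fails this test within the first few kilobytes, so a 600 megabyte ISO image labeled as UTF-8 text is about as invalid as 'text' gets.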

If my theory is right, no wonder search engines repeatedly fetched the URLs. Each time they were hoping that this time around the text file would have valid UTF-8 that they could use. (Certainly I'd like search engines to re-check web pages that have invalidly encoded content, in the hopes that it gets fixed sooner or later.)

Our web server is now serving files that end in .iso with a Content-Type of application/octet-stream. Time will tell if my theory is right and the search engines lay off the ISO images. (Even if it does no good with search engines, unwary people who click on the links now won't have their browser trying to show them the 600 megabyte 'text' of the ISO file, which is a good thing.)

The obvious moral: every so often, take a look at your web server logs. You never know what interesting things you'll find.

(Maybe you'll discover that you're hosting a bulletin board system that averages a couple of hits a second that you hadn't previously noticed. Don't laugh; it happened to us. (It was a legitimate bulletin board system; we just hadn't realized it was quite that active.))

Sidebar: the bonus round of CPU usage

In an effort to speed up transfers to clients by reducing the amount of data transferred to them, I recently configured the web server to compress outgoing pages on the fly for various Content-Types if the client advertised it was cool with this. (Using the Apache mod_deflate module.)

Of course, one of those Content-Types was text/plain.

So not only were we doing huge pointless transfers, we were probably burning extra CPU to compress them on the fly. And this was for ISO images where most of the content was likely already compressed, so further compression passes are at best pointless and at worst result in the content expanding.
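The expansion effect is easy to see with Python's zlib module, which implements the same deflate compression that mod_deflate applies. In this sketch, seeded pseudo-random bytes stand in for already-compressed ISO content; that stand-in, the sizes, and the sample text are assumptions for illustration, not measurements from our server.

```python
import random
import zlib

# Seeded pseudo-random bytes: a stand-in (assumption) for content that
# has already been compressed and so has no redundancy left to remove.
rng = random.Random(2005)
n = 64 * 1024
incompressible = rng.getrandbits(8 * n).to_bytes(n, "big")

# Highly repetitive text, the kind of content on-the-fly compression
# is actually meant for.
text = b"The quick brown fox jumps over the lazy dog.\n" * 1000

for label, data in (("repetitive text", text), ("random bytes", incompressible)):
    packed = zlib.compress(data)
    print(label, len(data), "->", len(packed))
```

The repetitive text shrinks dramatically, while the random bytes come out slightly larger than they went in, because deflate still has to add its own framing around data it cannot shrink. The CPU time is spent either way.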

Written on 23 September 2005.