The (probable) importance of accurate Content-Types
As a result of the MSN search spider going crazy, I am actually paying some attention to our web server logs for a change. This led to me looking up which URLs are responsible for the largest amounts of bandwidth used.
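A minimal sketch of that kind of lookup, assuming the logs are in Apache's Common (or Combined) Log Format. The post doesn't say what tool was actually used; the `bandwidth_by_url` helper and the sample log lines here are made up for illustration.

```python
import re
from collections import Counter

# Matches the request, status, and response-size fields of a Common Log Format line,
# e.g.: "GET /some/url HTTP/1.1" 200 12345
LOG_LINE = re.compile(r'"(?:[A-Z]+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\d+|-)')

def bandwidth_by_url(lines):
    """Sum the bytes sent per URL across an iterable of log lines."""
    totals = Counter()
    for line in lines:
        m = LOG_LINE.search(line)
        if m is None:
            continue  # skip lines that don't look like log entries
        size = m.group("size")
        if size != "-":  # "-" means no body was sent
            totals[m.group("url")] += int(size)
    return totals

# Hypothetical sample lines; real input would come from the access log file.
sample = [
    '10.0.0.1 - - [17/Sep/2005:10:00:00 -0400] "GET /cdrom/old.iso HTTP/1.1" 200 600000000',
    '10.0.0.2 - - [18/Sep/2005:11:00:00 -0400] "GET /index.html HTTP/1.1" 200 5120',
    '10.0.0.1 - - [21/Sep/2005:12:00:00 -0400] "GET /cdrom/old.iso HTTP/1.1" 200 600000000',
    '10.0.0.3 - - [21/Sep/2005:12:30:00 -0400] "GET /index.html HTTP/1.1" 304 -',
]

for url, total in bandwidth_by_url(sample).most_common(10):
    print(f"{total:>12} {url}")
```

Sorting by total bytes rather than by hit count is the point: a handful of requests for a huge file can outweigh thousands of ordinary page views.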
To my surprise, the six largest bandwidth sources were some CD-ROM images in ISO format that we happen to have lying around, the oldest one dating back to 2002. In the last week alone, there were eight requests totaling 3.5 gigabytes of transfers. Who could be that interested in some relatively ratty old ISO images?
Search engines, it turned out. All of the off-campus requests for the ISO images over the past 28 days came from MSNbot, Googlebot, and Yahoo! Slurp. I already knew about the crazy MSN spider, but Googlebot is well behaved; what possible reason could it have for fetching the same 600 megabyte image (last changed May 27th) three times between September 17th and September 21st?
I had previously noticed (while researching CrazyMSNCrawler) that our
web server was serving these ISO images with the Content-Type of
text/plain. (I didn't think much about it at the time, except to
become less annoyed at MSNbot repeatedly looking at them.)
Suddenly the penny dropped: Googlebot probably thought the URL was a huge text file, not an ISO image. Worse, the web server was claiming that the 'text file' was in UTF-8, despite it certainly containing non-UTF-8 byte sequences.
If my theory is right, no wonder search engines repeatedly fetched the URLs. Each time they were hoping that this time around the text file would have valid UTF-8 that they could use. (Certainly I'd like search engines to re-check web pages that have invalidly encoded content, in the hopes that it gets fixed sooner or later.)
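To make the theory concrete, here is a small sketch of the check a client might apply to a body labeled as UTF-8 text. This is only an illustration: what the crawlers actually do is unknown, and the `is_valid_utf8` helper and the sample bytes are invented for the example.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

# Genuine UTF-8 text passes the check.
print(is_valid_utf8("plain old text, caf\u00e9 included".encode("utf-8")))

# Arbitrary binary data (a made-up stand-in for a chunk of an ISO image)
# almost always contains byte sequences that are not legal UTF-8.
print(is_valid_utf8(b"\x01CD001\x01\x00\xff\xfe\x80\x81"))
```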
Our web server is now serving files that end in .iso as Content-Type application/octet-stream. Time will tell if my theory is right and the search engines lay off on the ISO images. (Even if it does no good with search engines, unwary people who click on the links now won't have their browser trying to show them the 600 megabyte 'text' of the ISO file, which is a good thing.)
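The post doesn't show the actual server configuration, so the following is only a Python stand-in for the idea: map the .iso extension explicitly to application/octet-stream instead of letting it fall through to a text default. The `content_type_for` helper and its fallback value are assumptions for illustration.

```python
import mimetypes

# A private type table, so the module-wide defaults are left untouched.
TYPES = mimetypes.MimeTypes()

# Explicitly register .iso as opaque binary data.
TYPES.add_type("application/octet-stream", ".iso")

def content_type_for(url_path: str, default: str = "application/octet-stream") -> str:
    """Pick a Content-Type for a URL path, falling back to a binary default.

    Falling back to a binary type (an assumption here) is the safer choice:
    an unknown file is offered for download rather than rendered as text.
    """
    guessed, _encoding = TYPES.guess_type(url_path)
    return guessed or default

print(content_type_for("/cdrom/old.iso"))
print(content_type_for("/index.html"))
```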
The obvious moral: every so often, take a look at your web server logs. You never know what interesting things you'll find.
(Maybe you'll discover that you're hosting a bulletin board system that averages a couple of hits a second that you hadn't previously noticed. Don't laugh; it happened to us. (It was a legitimate bulletin board system; we just hadn't realized it was quite that active.))
Sidebar: the bonus round of CPU usage
In an effort to speed up transfers to clients by reducing the amount of data transferred to them, I recently configured the web server to compress outgoing pages on the fly for various Content-Types if the client advertised it was cool with this.
(Using the Apache
Of course, one of those Content-Types was
So not only were we doing huge pointless transfers, we were probably burning extra CPU to compress them on the fly. For ISO images, most of the content was likely already compressed, so further compression passes are at best pointless and at worst result in the content expanding.
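A quick way to see the effect, sketched with Python's zlib (the same DEFLATE family of compression that on-the-fly HTTP compression typically uses). Random bytes stand in for already-compressed ISO content, which is an assumption; a real image will have some compressible parts.

```python
import os
import zlib

def compression_ratio(data: bytes, level: int = 6) -> float:
    """Return compressed size divided by original size (below 1.0 means it shrank)."""
    return len(zlib.compress(data, level)) / len(data)

# Highly repetitive text, the kind of content on-the-fly compression is meant for.
text_like = b"GET /index.html HTTP/1.1\r\n" * 40_000

# Random bytes as a stand-in (assumption) for already-compressed ISO content.
iso_like = os.urandom(1 << 20)

print(f"text-like ratio: {compression_ratio(text_like):.4f}")
print(f"iso-like ratio:  {compression_ratio(iso_like):.4f}")
```

The text-like input shrinks dramatically, while the incompressible input comes out no smaller than it went in (framing overhead makes it slightly larger), and the CPU time is spent either way.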