Why file extensions in URLs are a hack

January 5, 2008

When a web server answers your request for a URL, you get both the contents of the URL and its type; i.e., whether you are getting HTML, plain text, a PNG image, a PDF, and so on. Specifically, this type information is returned in the form of a MIME type in the Content-Type: HTTP reply header.
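
As a concrete illustration, here is a small Python sketch (using only the standard library, with example.com standing in for whatever URL you care about) of fetching a URL and looking at the type the server claims for it:

    # Ask the server for a URL and see what Content-Type it labels the
    # reply with; the URL here is just a placeholder.
    import urllib.request

    with urllib.request.urlopen("https://www.example.com/") as resp:
        print(resp.headers.get("Content-Type"))  # eg "text/html; charset=UTF-8"
        data = resp.read()                       # the actual contents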

(Some web browsers then sniff the data themselves and second guess the web server, generally with ruinous results.)

Now, the web server has to get this type information from somewhere. Ideally (at least for web servers) all files would have metadata attached to them, including their type, and the web server could just use this directly. However, the world is not like that, especially on Unix (the home of the first web servers), where files had contents and that was it.

There are a number of plausible ways around this; for example, you could have a file (or a bunch of files) that mapped URLs (or filenames) to content types, or the web server itself could sniff the contents of files to work out their type. But early web servers took the simple way out: they just declared that if filenames had certain extensions, they were certain content types. If your filename ended in .html it would be served as HTML, if it ended in .gif it would be served as a GIF, and so on.
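
As it happens, Python's mimetypes module implements exactly this sort of extension-to-type table, so a small sketch of the approach (with made-up filenames) looks like this:

    # Map filenames to MIME types purely by extension, the way early web
    # servers did; the file's actual contents are never looked at.
    import mimetypes

    for name in ("index.html", "logo.gif", "notes.txt", "mystery-file"):
        ctype, _ = mimetypes.guess_type(name)
        # servers commonly fall back to a generic type for unknown extensions
        print(name, "->", ctype or "application/octet-stream")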

However, all of this is nothing but a hack (a useful hack, admittedly) to make up for the lack of real type metadata about files. If Unix filesystems had had content type metadata in 1992 or so, we would probably find the idea of a .html at the end of many of our web URLs laughable.

One corollary is that this is in no way required; a web server can send you any content type with any URL extension, or with none at all. Thus, web browsers and spiders that make content type decisions based on the URL extension are wrong and broken.

(However, people have expectations and will probably get confused and irritated if your .html URLs are, say, PDFs.)
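
To make the corollary concrete, here is a toy Python sketch (standard http.server, with a made-up port and reply body) of a server that completely ignores the URL's extension when labelling what it sends back:

    # Answer every GET, whatever the URL ends in, with the same content type.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    CONTENT_TYPE = "text/plain"    # could just as easily be application/pdf
    BODY = b"The URL may say .html, but the server gets the last word.\n"

    class AnyExtensionHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(200)
            self.send_header("Content-Type", CONTENT_TYPE)
            self.send_header("Content-Length", str(len(BODY)))
            self.end_headers()
            self.wfile.write(BODY)

    if __name__ == "__main__":
        # http://localhost:8080/anything.html will be served as plain text
        HTTPServer(("localhost", 8080), AnyExtensionHandler).serve_forever()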

Sidebar: about .php and .aspx and so on

Generically, web servers don't just need to know what content type to label data as; they need to know how to process a file in general when it is requested: are they supposed to just send the file out as data, or do they need to do something more complicated with it?

Since web servers didn't have better metadata, they used file extensions as a convenient way to control this too, and so they grew the knowledge that .php files should not be sent out as data but instead handed to the PHP module to be interpreted, and so on.

(This is inconsistently handled in Apache, since there are also ways to say that all files in certain areas are to be executed as programs, not used as normal content. The advantage of the .php approach is that you can freely mix special .php files and regular content files.)
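
A rough Python sketch of the extension-based dispatch idea, with hypothetical handler functions standing in for Apache's (or any real server's) configuration machinery; the point is that the extension now picks how the file is processed, not just how it gets labelled:

    # Dispatch on the filename extension: some extensions mean "send the
    # file as data with this type", others mean "run it through a handler".
    # The handler functions here are hypothetical stand-ins.
    import pathlib

    def run_php(path):
        # stand-in for handing the file to the PHP interpreter
        return b"<html>output of the PHP interpreter</html>", "text/html"

    def send_as_is(path, ctype):
        return pathlib.Path(path).read_bytes(), ctype

    HANDLERS = {
        ".php":  run_php,
        ".html": lambda p: send_as_is(p, "text/html"),
        ".gif":  lambda p: send_as_is(p, "image/gif"),
    }

    def serve(path):
        ext = pathlib.Path(path).suffix
        default = lambda p: send_as_is(p, "application/octet-stream")
        return HANDLERS.get(ext, default)(path)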


Comments on this page:

By nothings at 2008-01-06 03:42:29:

My standard argument is basically the opposite: MIME types are a bad hack (for static files--the real reason they exist is dynamic files), because if a browser can't handle a particular file type, it doesn't matter that the server knows the right MIME type for it; but if the browser can handle it but the server doesn't know about it, it gets served as, say, text/plain and we get the same old ftp binary/ascii corruption once again.

This actually happens, to ruinous effect as well.

You might argue that the server is broken if it has a file and doesn't know its type. And if every static file has an associated full-fledged webmaster, then maybe it's reasonable. But when the web is populated by tons of individuals posting their own files, it's the wrong call. (This was a notorious problem with geocities back in the day, for instance.)

The problem is two competing forces: unknown file types (on the server) should be served unaltered (i.e. as binary), but unknown file types should possibly be displayed as text (on the browser).

The solution is that translation should happen on the browser side, not the server side, but oh well.

This (IMO) mistake is surprisingly common. Not even SVN gets this right (server vs. browser translation; obviously no mime types are involved); if you accidentally check in a binary file as ascii you're hosed. If the SVN server simply stored whatever data the client handed it, unmodified, and then translated it to the desired format on demand, it wouldn't be a problem. Instead the server canonicalizes before storing it. (Although non-canonicalizing could cause problems for storing diffs when the line endings alternate each revision. Still, you can either accept the repository bloat, or make a smarter diff that's aware of all possible endings and encodes that change separately; either solution seems better than accepting occasional corruption of binary files from your system dedicated to, you know, storing your data defensively.)

Bottom line, in both cases: the server doesn't actually care what type the data is, only the browser does. So why make the server have to know?

From 70.254.120.52 at 2008-01-06 22:48:03:

In my (admittedly limited) experience, MIME types are frequently deceptive. Take the case of grabbing the images shown on a set of pages. If you want to store those images locally, going by the URL extension won't work, because dynamically-generated images may have no extension, or even worse something like .php.

The naive response would be to use the MIME type the server gives you. image/extn and there you go. You quickly run into a problem: about 1% of images are returned with a text/plain Content-type. At that point you may as well drop MIME types altogether and just look at the returned data to determine its type.
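
For illustration, "look at the returned data" can be as simple as checking a few well-known magic byte prefixes; a minimal Python sketch:

    # Decide an image's type from its first bytes instead of trusting the
    # server's Content-Type or the URL extension.
    MAGIC = [
        (b"\x89PNG\r\n\x1a\n", "image/png"),
        (b"\xff\xd8\xff",      "image/jpeg"),
        (b"GIF87a",            "image/gif"),
        (b"GIF89a",            "image/gif"),
    ]

    def sniff_image_type(data):
        for prefix, ctype in MAGIC:
            if data.startswith(prefix):
                return ctype
        return None   # not an image format we recognize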

So I must agree with the previous commenter. Let's not try to make the server (and its administrator) spend effort that doesn't really benefit them; the browser benefits and it doesn't need a content-type anyway.

By cks at 2008-01-06 23:44:05:

My answer got long enough that I made it a new entry, WhyServerContentType.


