Why file extensions in URLs are a hack

January 5, 2008

When a web server answers your request for a URL, you get both the contents of the URL and its type; i.e., whether you are getting HTML, plain text, a PNG image, a PDF, and so on. Specifically, this type information is returned as a MIME type in the Content-Type HTTP response header.
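To make the header concrete, here is a minimal sketch of pulling the MIME type and any parameters out of a Content-Type value. The function name parse_content_type is made up for illustration; it leans on the standard library's email.message.Message, since HTTP borrowed this header syntax from MIME mail.

```python
from email.message import Message


def parse_content_type(header_value):
    """Split a Content-Type header value into (mime_type, params).

    parse_content_type is a hypothetical helper name, not a standard API.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    mime_type = msg.get_content_type()
    # get_params() returns [(mime_type, ''), (name, value), ...];
    # the first entry is the type itself, so skip it.
    params = dict(msg.get_params()[1:])
    return mime_type, params
```

For example, parse_content_type("text/html; charset=UTF-8") separates the type text/html from its charset parameter.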

(Some web browsers then sniff the data themselves and second-guess the web server, generally with ruinous results.)

Now, the web server has to get this type information from somewhere. Ideally (at least for web servers) all files would have metadata attached to them, including their type, and the web server could just use this directly. However, the world is not like that, especially on Unix (the home of the first web servers); files had contents and that was it.

There are a number of plausible ways around this; for example, you could have a file (or a bunch of files) that mapped URLs (or filenames) to content types, or the web server itself could sniff the contents of files to work out their type. But early web servers took the simple way out: they just declared that if filenames had certain extensions, they were certain content types. If your filename ended in .html it would be served as HTML, if it ended in .gif it would be served as a GIF, and so on.
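The extension approach is simple enough to sketch in a few lines. This is only an illustration, not how any particular server implements it; the table, the fallback type, and the name content_type_for are all assumptions of the example.

```python
import os.path

# A deliberately tiny, hypothetical table; real servers ship far larger ones.
EXTENSION_TYPES = {
    ".html": "text/html",
    ".txt": "text/plain",
    ".gif": "image/gif",
    ".png": "image/png",
    ".pdf": "application/pdf",
}

# Assumed fallback for unknown or missing extensions.
DEFAULT_TYPE = "application/octet-stream"


def content_type_for(filename):
    """Guess a content type purely from the filename's extension."""
    _, ext = os.path.splitext(filename)
    return EXTENSION_TYPES.get(ext.lower(), DEFAULT_TYPE)
```

Note that the file's contents are never examined; rename a GIF to end in .html and this function will happily call it HTML. Python's standard mimetypes.guess_type() works on the same principle with a much bigger table.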

However, all of this is nothing but a hack (a useful hack, admittedly) to make up for the lack of real type metadata about files. If Unix filesystems had had content type metadata in 1992 or so, we would probably find the idea of a .html at the end of many of our web URLs to be laughable.

One corollary is that the link between extension and content type is in no way required; a web server can send you any content type with any URL extension, or with none. Thus, web browsers and spiders that make content type decisions based on the URL extension are wrong and broken.

(However, people have expectations and will probably get confused and irritated if your .html URLs are, say, PDFs.)
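As a sketch of how little the URL extension binds the server, here is a small WSGI application in which the content type comes from a table keyed by path and has nothing to do with how the path ends. The routes, the placeholder bodies, and the names ROUTES and app are invented for the example.

```python
# Placeholder bytes only; this is not a complete, valid PDF.
PDF_BYTES = b"%PDF-1.4\n% placeholder body for illustration\n"

# Hypothetical routes: each path maps to (content type, body).
ROUTES = {
    "/report.html": ("application/pdf", PDF_BYTES),
    "/notes": ("text/plain; charset=utf-8", b"No extension at all.\n"),
    "/logo.gif": ("text/html; charset=utf-8", b"<p>Not a GIF.</p>\n"),
}


def app(environ, start_response):
    """WSGI app whose Content-Type ignores the URL's extension entirely."""
    path = environ.get("PATH_INFO", "/")
    if path not in ROUTES:
        start_response("404 Not Found",
                       [("Content-Type", "text/plain; charset=utf-8")])
        return [b"not found\n"]
    content_type, body = ROUTES[path]
    start_response("200 OK", [
        ("Content-Type", content_type),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

To try it, something like wsgiref.simple_server.make_server("localhost", 8000, app).serve_forever() should do. A client that trusts the Content-Type header will treat /report.html as a PDF; one that trusts the extension will get it wrong.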

Sidebar: about .php and .aspx and so on

More generally, web servers don't just need to know what content type to label data with; they need to know how to process a file when it is requested. Are they supposed to just send the file out as data, or do they need to do something more complicated with it?

Since web servers didn't have better metadata, they used file extensions as a convenient way to control this too, and so they grew the knowledge that .php files should not be sent out as data but instead handed to the PHP module to be interpreted, and so on.

(This is inconsistently handled in Apache, since there are also ways to say that all files in certain areas are to be executed as programs, not used as normal content. The advantage of the .php approach is that you can freely mix special .php files and regular content files.)
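The two ways of choosing a handler can be sketched side by side. Everything here is hypothetical: the handler functions only return a label instead of doing real work, and /cgi-bin/ is just an assumed example of an area where everything is executed.

```python
import os.path


def send_as_data(path):
    """Stand-in for serving the file's bytes unchanged."""
    return ("static", path)


def run_php(path):
    """Stand-in for handing the file to a PHP interpreter."""
    return ("php", path)


def run_program(path):
    """Stand-in for executing the file as a program."""
    return ("program", path)


# Per-extension handlers: special files can sit next to ordinary content.
EXTENSION_HANDLERS = {".php": run_php}

# Per-area handlers: everything under these prefixes is executed (assumed path).
EXECUTABLE_PREFIXES = ("/cgi-bin/",)


def dispatch(path):
    """Pick a handler by area first, then by extension, else send as data."""
    if path.startswith(EXECUTABLE_PREFIXES):
        return run_program(path)
    ext = os.path.splitext(path)[1].lower()
    handler = EXTENSION_HANDLERS.get(ext, send_as_data)
    return handler(path)
```

With this table, /blog/index.php and /blog/style.css can live in the same directory and still be treated differently, which is the advantage of the extension approach.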

