2008-01-09
There are two different situations for content-types
It's struck me that there are two different situations for content-type sniffing in web agents: when the web browser knows that it is requesting a specific sort of thing, and when it can have no expectations about what it will get. The first situation happens when the browser is doing things like loading CSS stylesheets or retrieving inlined images; the second situation usually happens when the user clicks on a link.
When the browser knows what it is supposed to be getting, it already knows what it's trying to do with the data, and either it's good data or it's not (and the browser has to check). Because the situation is already unambiguous, insisting on the web server sending the right content type as well as valid data is just being legalistic.
(Even with inlined images, a given browser has a relatively small number of image formats it supports and generally all of them have good validity and sanity checks.)
As a result, I believe that in practice pretty much all web agents are fairly forgiving of content types in this situation, and may outright ignore them no matter how technically incorrect this is. (In a sense this is a rerun of the strict validation versus loose validation arguments, and we know how those ended.)
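As an illustration of what "checking the data" can look like in this situation, here is a minimal hypothetical sketch (not any real browser's actual code) of validating inlined image data against known format signatures, ignoring whatever type the server claimed:

    # A hypothetical sketch: an agent that already knows it wants an image can
    # check the bytes itself against the magic numbers of the formats it
    # supports, and ignore the server's declared Content-Type entirely.
    IMAGE_SIGNATURES = {
        "image/png":  [b"\x89PNG\r\n\x1a\n"],
        "image/gif":  [b"GIF87a", b"GIF89a"],
        "image/jpeg": [b"\xff\xd8\xff"],
    }

    def sniff_image_type(data):
        """Return the image type the bytes actually look like, or None."""
        for mime_type, signatures in IMAGE_SIGNATURES.items():
            if any(data.startswith(sig) for sig in signatures):
                return mime_type
        return None  # not a format we support, so it's just bad data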
When the browser does not know what it is supposed to be getting, the options are many, the ambiguity is huge (as is the potential harm), and many of the potential formats themselves generally lack good validity checks. Here, browsers use the server content type to disambiguate the situation, and if the results are wrong, well, at least it's not their fault.
(Sometimes the ambiguity is unresolvable. If I send you a valid HTML document as text/plain, is it because I want you to read the HTML source or because I screwed up?)
2008-01-06
Why the server is the right place to determine web content types
There are two parties that could be responsible for determining what type of content you are getting in an HTTP response, whether the data is HTML or a PDF or a JPEG and so on: the server or the client. The web has opted to have the server determine it, and this is the correct place for a simple reason: consistency.
Imagine a world where the clients determine what the content type is. If there is no standard for how clients determine the content type, the consistency problems are obvious; the interpretation of your content is at the mercy of the differing content sniffing algorithms of all the clients out there in the field, and worse, you have no convincing argument that any particular client is incorrect. Introducing new content types is hell, because you have to design every new format so that it cannot possibly be interpreted (by any client) as some other, undesirable content type.
(The inevitable result is that everyone spends a great deal of time reverse engineering the content sniffing algorithms of first Netscape and then IE, and perhaps someone patents some of the especially good bits.)
If you have a clear standard but it is possible for the same chunk of data to validly be several different content types, you cannot publish arbitrary documents and be sure of how they will be interpreted (short of running the standard content sniffing algorithm against them yourself, perhaps only to discover that clients will see your X as a Y). You also have much the same problem with introducing new content types as before.
If you have a clear standard and it is impossible for one document to validly be of more than one content type, congratulations; you have recreated an awkward version of server-side Content-Type: MIME headers, except built into all of the clients and thus subject to all of their bugs. And you still have serious issues adding new content types.
Regardless of its problems, the advantage of the server sending the content type is that it is unambiguous and thus consistent; every client is sure to see the same view of the content, and if something goes wrong it is clear who is at fault (depending on what went wrong). It is also much easier to upgrade things incrementally, because at least in theory you can be sure that no one changes their view of old content and that everyone will have the correct view of the new content.
(And if any client deviates from this in practice, it is clearly at their own risk and their own fault.)
Or to put it another way: the server determining content type is safer than the clients doing it because only one program (the server) has to get it right, instead of lots of programs (all of the clients).
(And if something goes wrong it does so all the time and in an obvious way.)
2008-01-05
Why file extensions in URLs are a hack
When a web server answers your request for a URL, you get both the contents of the URL and its type; i.e., whether you are getting HTML, plain text, a PNG image, a PDF, and so on. Specifically, this type information is returned in the form of a MIME type in the Content-Type: HTTP reply header.
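For example, the start of the reply for an ordinary HTML page might look something like this (the values here are purely illustrative):

    HTTP/1.1 200 OK
    Content-Type: text/html; charset=UTF-8
    Content-Length: 5120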
(Some web browsers then sniff the data themselves and second guess the web server, generally with ruinous results.)
Now, the web server has to get this type information from somewhere. Ideally (at least for web servers) all files would have metadata attached to them, including their type, and the web server could just use this directly. However, the world is not like that, especially on Unix (the home of the first web servers), where files had contents and that was it.
There are a number of plausible ways around this; for example, you could have a file (or a bunch of files) that mapped URLs (or filenames) to content types, or the web server itself could sniff the contents of files to work out their type. But early web servers took the simple way out: they just declared that if filenames had certain extensions, they were certain content types. If your filename ended in .html it would be served as HTML, if it ended in .gif it would be served as a GIF, and so on.
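A hypothetical sketch of that mapping (not any particular server's actual table; real servers read a much larger list, such as Apache's mime.types file):

    import os

    # Hypothetical extension-to-type table, in the spirit of early web servers.
    EXTENSION_TO_TYPE = {
        ".html": "text/html",
        ".txt":  "text/plain",
        ".gif":  "image/gif",
        ".jpg":  "image/jpeg",
        ".pdf":  "application/pdf",
    }

    def content_type_for(filename):
        _, ext = os.path.splitext(filename)
        # Unknown extensions fall back to 'just a bag of bytes'.
        return EXTENSION_TO_TYPE.get(ext.lower(), "application/octet-stream")

(Python's standard mimetypes module is essentially this idea with a much bigger table.)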
However, all this is nothing but a hack (a useful hack, admittedly) to make up for the lack of real type metadata about files. If Unix filesystems had had content type metadata in 1992 or so, we would probably find the idea of a .html at the end of many of our web URLs laughable.
One corollary is that this is in no way required; a web server can send you any content type with any URL extension, or with none. Thus, web browsers and spiders that make content type decisions based on the URL extension are wrong and broken.
(However, people have expectations and will probably get confused and irritated if your .html URLs are, say, PDFs.)
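For instance, Apache lets you force a content type explicitly, regardless of what the filenames look like; a hypothetical configuration fragment:

    # Hypothetical example: everything under /reports/ is served as PDF,
    # whatever its file extension says (or even if it has none at all).
    <Location "/reports/">
        ForceType application/pdf
    </Location>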
Sidebar: about .php and .aspx and so on
More generally, web servers don't just need to know what content type to label data as; they need to know how to process a file when it is requested. Are they supposed to just send the file out as data, or do they need to do something more complicated with it?
Since web servers didn't have better metadata, they used file extensions as a convenient way to control this too, and so they grew the knowledge that .php files should not be sent out as data but instead handed to the PHP module to be interpreted, and so on.
(This is inconsistently handled in Apache, since there are also ways to say that all files in certain areas are to be executed as programs, not used as normal content. The advantage of the .php approach is that you can freely mix special .php files and regular content files.)
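To make the contrast concrete, here is roughly what the two approaches look like as Apache configuration (simplified; the exact directives vary between Apache and PHP versions):

    # Extension-based: .php files are handed to mod_php for interpretation
    # instead of being sent out as plain data.
    AddType application/x-httpd-php .php

    # Area-based: everything under /cgi-bin/ is run as a program, regardless
    # of its name or extension.
    ScriptAlias /cgi-bin/ "/usr/lib/cgi-bin/"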