Why the server is the right place to determine web content types

January 6, 2008

There are two parties that could be responsible for determining what type of content you are getting in an HTTP response, whether the data is HTML, a PDF, a JPEG, or so on: the server or the client. The web has opted to have the server determine it, and this is the correct place for a simple reason: consistency.

Imagine a world where clients determine the content type. If there is no standard for how they do so, the consistency problems are obvious: the interpretation of your content is at the mercy of the varying content sniffing algorithms of all the clients out there in the field, and worse, you have no convincing argument that any particular client is incorrect. Introducing new content types is hell, because you have to design every new format so that it cannot possibly be interpreted (by any client) as some other, undesirable content type.
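To illustrate how two reasonable clients could disagree, here is a minimal sketch in Python of two toy sniffing algorithms (the rules are invented for illustration; real browser sniffers are far more involved) that classify the same bytes differently:

```python
def sniffer_a(data: bytes) -> str:
    # Toy rule: any '<html' in the first 512 bytes means HTML.
    if b"<html" in data[:512].lower():
        return "text/html"
    return "text/plain"

def sniffer_b(data: bytes) -> str:
    # Toy rule: only data that *starts* with a tag is HTML.
    if data.lstrip().startswith(b"<"):
        return "text/html"
    return "text/plain"

doc = b"Notes on writing <html> documents by hand."
# The same bytes, two different answers -- and neither client is
# provably 'wrong', which is exactly the consistency problem.
print(sniffer_a(doc))  # text/html
print(sniffer_b(doc))  # text/plain
```

Neither toy rule is unreasonable on its own; the trouble is that without a standard there is nothing to appeal to when they disagree.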

(The inevitable result is that everyone spends a great deal of time reverse engineering the content sniffing algorithms of first Netscape and then IE, and perhaps someone patents some of the especially good bits.)

If you have a clear standard but it's possible for the same chunk of data to validly be several different content types, you cannot publish arbitrary documents and be sure of how they will be interpreted (without running the standard content sniffing algorithm against them and discovering that clients will see your X as a Y). You also have much the same problem with introducing new content types as before.

If you have a clear standard and it is impossible for one document to validly be of more than one content type, congratulations; you have recreated an awkward version of server-side Content-Type: MIME headers, except built into all of the clients and thus subject to all of their bugs. And you still have serious issues adding new content types.

Whatever its problems, the advantage of the server sending the content type is that it is unambiguous and thus consistent; every client is sure to see the same view of the content, and if something goes wrong it is clear who is at fault (depending on what went wrong). It is also much easier to upgrade things incrementally, because at least in theory you can be sure that no one changes their view of old content and that everyone will have the correct view of the new content.

(And if any client deviates from this in practice, it is clearly at their own risk and their own fault.)

Or to put it another way: the server determining content type is safer than the clients doing it because only one program (the server) has to get it right, instead of lots of programs (all of the clients).

(And if something goes wrong it does so all the time and in an obvious way.)

Comments on this page:

From at 2008-01-07 11:11:05:

But that's just not true; Apache httpd has sent text/plain for "unknown" content for so long it's not even funny. So there's a very good argument that any client that always obeys that is fundamentally broken.

Then there is the problem that MIME types are flat, i.e. you can only give a single value for any one response. So if you are serving .c or .py files, half the websites will send text/plain and the other half text/x-c or text/x-csrc, where each and every client does randomly broken things depending on the choice (the main problem being that Content-Type is tied to usage way too much). Ditto .css, .html, .vcf ... and that's without thinking about problems like http://example.com/foo.tar.gz, which is "application/x-gzip", NOT.
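The .tar.gz case the comment raises can be seen directly in Python's standard mimetypes module, which treats gzip as a content *encoding* layered over the real (tar) content type rather than as the content type itself:

```python
import mimetypes

# guess_type returns (type, encoding); gzip shows up as the encoding
# over the underlying tar content type, not as the content type:
print(mimetypes.guess_type("foo.tar.gz"))  # ('application/x-tar', 'gzip')

# Guesses for source files vary with the system's mime.types files,
# which is the text/plain versus text/x-csrc split in action:
print(mimetypes.guess_type("prog.c"))
```

Since the Content-Type header has no standard second slot for the encoding-versus-content distinction beyond Content-Encoding (which many deployments ignore), the flatness complaint stands.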

HTTP did some things well, but Content-Type wasn't one of them, at least in the main current implementations.

By nothings at 2008-01-09 08:52:34:

Yeah, I still just don't buy this at all, but there's little I can say to it without merely restating my comments from the previous post.

If the browser is going to do anything with the file, and the file contains a magic number, then the server didn't need to provide a content type; the only thing it could do to affect the behavior is get it wrong. (If the server says it's type A, and the magic number says it's type B because it's actually type B, and the browser/platform tries to process it as type A, it should immediately fail when the code for processing As sees a magic number of type B in the file; it will look at best like a corrupt A. No value is added by having the server attempt to determine its type.)

If the browser is going to do anything with the file, and the file has an extension, and the browser is on a platform that keys on an extension, then the server didn't need to provide a content type; the only thing it could do to affect the behavior is get it wrong.

Etc., enumerating all the cases for the browser/platform processing it or not processing it.
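The magic number check the comment describes is cheap to sketch. Here is a minimal, hedged Python version covering a handful of well-known signatures (real detectors, like file(1), know hundreds):

```python
def sniff_magic(data: bytes):
    """Return a content type guessed from a leading magic number,
    or None when the data carries no recognizable signature."""
    magics = [
        (b"\x89PNG\r\n\x1a\n", "image/png"),
        (b"%PDF-",             "application/pdf"),
        (b"GIF87a",            "image/gif"),
        (b"GIF89a",            "image/gif"),
        (b"\x1f\x8b",          "application/x-gzip"),
    ]
    for magic, ctype in magics:
        if data.startswith(magic):
            return ctype
    return None  # e.g. HTML or plain text, which have no magic number

print(sniff_magic(b"%PDF-1.4\n..."))  # application/pdf
print(sniff_magic(b"<html>..."))      # None
```

The None case is the interesting one for this argument: types without magic numbers are exactly where extension or server information has to carry the load.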

As far as I know the vast, vast, vast majority of distinguishable file types do have magic numbers or extensions. (Surely, this must be true since file servers need to determine MIME types somehow in the first place, as you were saying.)

The only case that I can see where anything good happens is for dynamic content (no filename, hence no extension) on a file type with no magic number (e.g. text/html). But that's why I caveated my original comment with "for static content".

Your claim about consistency is, to me, entirely backwards. My argument is also driven by consistency. As a single user with a single browser, what I want is consistent behavior for files regardless of what server they come from. Server-content-type-determination prevents that. (And, again, I've seen this problem in the wild. Sometimes my browser trusts the server and the server gets it wrong.)

The problem with server-chooses is that it's not simply a matter of the server software having to get it right (there's only so many pieces of server software), but that the types change and grow over time, so every deployed server needs to get it right, and this involves individual webmasters updating and deploying (and can have the case that I mentioned previously, that some files have no corresponding webmaster). Whereas, on the other side of the fence, as the types change and grow over time, yes the browser or OS need to also recognize and handle the new types; but they have to do that anyway to use them in the first place. If they don't, the user can't do anything with the file but download it and store it somewhere... in which case they're discarding the MIME type anyway.

Server-determined types introduce two places for failure: the server has to recognize the type and the browser has to be able to process it. Browser-determined types have only one place for failure: the browser/OS has to be able to process it. Even if processing and recognizing on the browser/OS are separate and thus 'two' points of failure, they are far more likely to be updated in sync than the browser and some random server out there.

By cks at 2008-01-09 23:16:41:

Part of the issue may be that there are two views of consistency: the user's view and the author's view. The user wants things to work the same from site to site, while the author wants things to work the same from browser to browser. Browser content sniffing seems clearly right from the user's view and I think it is clearly wrong from the author's view (this entry is sort of my argument about why).

I think that taking the author's view of consistency is more important and works better in the long run, but I should clearly write an actual entry about that rather than treat it as a background presumption.


Last modified: Sun Jan 6 23:43:06 2008