Where to find specifications on HTTP POST behavior

September 5, 2007

Some IP addresses (probably not friendly ones) have recently taken to making POST submissions to various 'write comments' URLs here with a Content-Type of 'application/x-www-form-urlencoded; charset=UTF-8'. DWiki rejects these, because I was quite paranoid when I wrote the POST handling code and so it is quite conservative about what it will accept.

While I was pretty certain that I wasn't losing anything by rejecting these requests, I did get curious about whether adding a character set to a form POST Content-Type this way is actually legal, which meant running down where form POST behavior is actually specified.

(In general including a charset in the content-type on POST is unambiguously allowed by the HTTP specification, so the only question is whether you are allowed to do it specifically in HTTP form POSTs.)
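To make the distinction concrete, here is a minimal sketch of splitting a Content-Type header into its media type and charset parameter using only the Python standard library. The function name parse_content_type is my own invention for illustration; this is not DWiki's actual code.

```python
from email.message import Message


def parse_content_type(header_value):
    """Split a Content-Type header into (media type, charset or None).

    Hypothetical helper for illustration; not taken from DWiki.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # Both results come back lowercased; charset is None when absent.
    return msg.get_content_type(), msg.get_content_charset()


if __name__ == "__main__":
    print(parse_content_type("application/x-www-form-urlencoded; charset=UTF-8"))
    print(parse_content_type("application/x-www-form-urlencoded"))
```

A conservative server that compares the whole header value against 'application/x-www-form-urlencoded' will reject the first form, while one that compares only the media type will accept both.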

The primary specification of form POST behavior is in the HTML 4.01 specification, which should not have surprised me but did (I looked at the HTTP spec first). Section 17.13.3 describes the process of submitting a form, but you also need 17.13.4 and the definition of the enctype attribute. Unfortunately this doesn't clearly answer the question, since the specification uses very general language.

However, I think that adding a charset parameter has to be allowed by implication. Forms may specify that the server can accept more than one character encoding and leave it up to the client to decide which one to use (the accept-charset <form> attribute). This implies that the client must tell the server which character set it picked, and the form encoding rules provide no place to put this except as a charset parameter on the POST's Content-Type.

(Browsers are encouraged to interpret a missing accept-charset as implying the character set of the HTML page with the form, which is UTF-8 in the case of WanderingThoughts. However, including a charset at all in this case is vanishingly rare.)

I'm still not going to fix DWiki's code right away, since I want to think through what I can and should do if the character set doesn't match. (Bearing in mind that my tolerance for people playing weird HTTP and HTML games is fairly low, since most of them are up to no good.)
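One possible policy, sketched here purely as an assumption about what a fix might look like, is to accept a declared charset only when it names the same encoding as the page the form came from, and to reject anything that does not decode cleanly. The names decode_form, RejectPost and PAGE_CHARSET are hypothetical and do not come from DWiki.

```python
import codecs
from urllib.parse import parse_qs

PAGE_CHARSET = "utf-8"  # assumption: the encoding the form's page is served in


class RejectPost(Exception):
    """Raised when a form POST body should be refused (hypothetical)."""


def decode_form(body, declared_charset=None, page_charset=PAGE_CHARSET):
    """Decode a urlencoded POST body (bytes) into a dict of value lists."""
    if declared_charset is not None:
        try:
            # codecs.lookup() normalizes aliases such as "UTF8" and "utf-8".
            declared = codecs.lookup(declared_charset).name
        except LookupError as exc:
            raise RejectPost("unknown charset") from exc
        if declared != codecs.lookup(page_charset).name:
            raise RejectPost("charset does not match the page")
    try:
        # A urlencoded body should be pure ASCII on the wire.
        text = body.decode("ascii")
        # errors="strict" makes invalid percent-escaped bytes raise.
        return parse_qs(text, keep_blank_values=True,
                        encoding=page_charset, errors="strict")
    except UnicodeDecodeError as exc:
        raise RejectPost("body is not valid in the expected charset") from exc


if __name__ == "__main__":
    print(decode_form(b"comment=caf%C3%A9&name=x", "UTF-8"))
```

This is the strict end of the range; a more tolerant server could instead decode with whatever charset the client declared, at the cost of trusting the client more.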


Comments on this page:

From 70.17.41.6 at 2007-09-06 20:28:58:

First off, note that a few actual browsers will do this - albeit, the only ones I've heard of in the wild have been WAP browsers on mobile phones.

What you're seeing is likely either trackback spammers (the trackback spec. recommends sending a charset) or comment spammers that are reusing their trackback-spamming code.

Also note that the W3C, when they came up with XForms, explicitly said that all characters will be represented in UTF-8, albeit percent-escaped (making a single Japanese character into 9 bytes on the wire), and distinctly omitted any possibility of putting charset on the application/x-www-form-urlencoded MIME type. Since they explicitly mention and discuss charset on other MIME types, I take this to mean that user agents should not transmit a charset parameter. Whether POST recipients should ignore any supplied charset or should consider the submission an error probably goes back to the MIME specs.

-- DanielMartin, who's too lazy to go dig up his password

Written on 05 September 2007.

Last modified: Wed Sep 5 23:19:15 2007