A gotcha with <textarea>

January 29, 2007

Textareas are one of those treacherous areas of web programming, because it is really easy to get them 95% right and then never notice that you've fumbled the remaining 5%. The problem area is textareas with initial content, such as blog comment previews. What almost completely works is to just put the raw content into the textarea in your HTML. This approach makes intuitive sense and even works fine if you test by throwing markup in, like '<h1>this is a test</h1>'.

There are only two noticeable problems with this, both of them obscure:

  • any valid entity references in the text will be decoded to their real character values, so '&lt;' turns into '<'.
  • if there's a literal '</textarea>' in the text, it will become the end of the textarea (and your page layout may explode).

Since most people using your website don't do either of these, the simple solution works almost all of the time.
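
To make the two failure cases concrete, here is a small Python sketch. It is only a toy model of what a browser does, not a real HTML parser: the helper names (naive_textarea, browser_initial_value) are made up for illustration, and I'm assuming the only things that matter are 'stop at the first </textarea>' and 'decode entity references'.

```python
import html

def naive_textarea(content: str) -> str:
    """Build the textarea the 'obvious' way: raw content, no quoting."""
    return "<textarea>" + content + "</textarea>"

def browser_initial_value(markup: str) -> str:
    """Toy model of the browser (an assumption, not a real parser):
    the textarea ends at the first '</textarea>' and any entity
    references before that point are decoded."""
    start = markup.index("<textarea>") + len("<textarea>")
    end = markup.index("</textarea>", start)
    return html.unescape(markup[start:end])

# Ordinary markup survives, which is why casual testing looks fine.
print(browser_initial_value(naive_textarea("<h1>this is a test</h1>")))
# prints: <h1>this is a test</h1>

# An entity reference in the user's text is silently decoded.
print(browser_initial_value(naive_textarea("use &lt; for less-than")))
# prints: use < for less-than

# A literal </textarea> in the text ends the textarea early.
print(repr(browser_initial_value(naive_textarea("a </textarea> b"))))
# prints: 'a '
```

In the last case the leftover ' b</textarea>' lands in the page outside the textarea, which is where the layout explosion comes from.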

The real problem is that people develop the wrong mental model of what <textarea> does. They think (just as I thought when I started to write DWiki) that <textarea> means 'the browser treats as completely literal all the initial content that I insert here'. The defect with this mental model is exposed by putting a '</textarea>' in the initial content you insert into a textarea: how is the browser supposed to tell the </textarea> you inserted (that it is supposed to ignore) apart from the real </textarea> in your HTML that closes the textarea? The answer is that it can't, and thus that the mental model is wrong.

What is actually going on is that browsers treat the contents of <textarea> as what the HTML 4.01 specification calls #PCDATA: 'document text', in which character entities are allowed and interpreted (technically markup is forbidden; in practice browsers treat it as literal text). It has to be this way; since HTML has no quoting mechanism other than character entities, allowing character entities is the only way to escape your inserted '</textarea>' so it doesn't terminate the textarea.

This means that you need to quote at least some things in your textarea initial content; minimally '&' and '<', but if you already have a general HTML quoting function (and you should), just use it and be done. (The browser will strip this quoting when it creates the actual initial contents, and thus you will get back the unquoted version when the user POSTs for the next round.)
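
As a minimal sketch of that quoting in Python, here is one way to do it with the standard library's html.escape as the general HTML quoting function. The helper names and the 'comment' field name are hypothetical, and html.unescape merely stands in for the browser stripping the quoting again.

```python
import html

def quote_textarea_content(content: str) -> str:
    """Quote text so it can be placed between <textarea> and </textarea>."""
    # quote=False escapes only '&', '<' and '>'. The '&' is handled first, so
    # an existing '&lt;' in the text becomes '&amp;lt;' instead of passing
    # through unchanged and being decoded by the browser.
    return html.escape(content, quote=False)

def textarea_with_initial(content: str) -> str:
    """Return a textarea (hypothetical field name 'comment') with initial content."""
    return '<textarea name="comment">' + quote_textarea_content(content) + "</textarea>"

original = "if a &lt; b, write </textarea> & carry on"
print(textarea_with_initial(original))

# The round trip: what the user sees (and later POSTs) is the original text.
print(html.unescape(quote_textarea_content(original)) == original)  # prints: True
```

The only '</textarea>' left in the generated HTML is the real one that closes the element, so the browser no longer has to guess.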


Comments on this page:

By DanielMartin at 2007-01-30 22:31:45:

Not only does failing to do this mean that certain comments get messed up between previewing and submitting the comment again; it also opens your blog up to an XSS attack.

Getting the encoding of the interior of <textarea> right isn't just a matter of making your pages "pure" or valid HTML; failing to do so has real, serious security implications.

It surprises me that even people who should know better still screw this up - apparently Movable Type had this XSS vulnerability sitting in it until version 3.32, released last August.

(Since Six Apart already issued a vendor notice that there's an (unspecified) XSS vulnerability in MT versions prior to 3.34, I don't feel too bad about disclosing this vulnerability now.)
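
For illustration only, the kind of breakout this comment describes can be sketched in Python. The payload, the preview_page helper, and the form markup are all hypothetical and are not taken from any particular blog software; the point is just that unquoted content can close the textarea and inject its own markup.

```python
import html

# A comment body that closes the textarea early and then adds a script.
PAYLOAD = "</textarea><script>alert('xss')</script>"

def preview_page(comment: str, quote: bool) -> str:
    """Hypothetical comment-preview fragment, with or without quoting."""
    body = html.escape(comment) if quote else comment
    return '<form><textarea name="comment">' + body + "</textarea></form>"

unsafe = preview_page(PAYLOAD, quote=False)
safe = preview_page(PAYLOAD, quote=True)

# Without quoting, a live <script> element ends up in the page.
print("<script>" in unsafe)  # prints: True
# With quoting, the same text is just inert textarea content.
print("<script>" in safe)    # prints: False
```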

Written on 29 January 2007.

Last modified: Mon Jan 29 23:57:52 2007