Wandering Thoughts archives

2007-01-29

A gotcha with <textarea>

Textareas are one of those treacherous areas of web programming, because it is really easy to get them 95% right and then never notice that you've fumbled the remaining 5%. The problem area is textareas with initial content, for example blog comment previews; what almost completely works is to just put the raw content into the textarea in your HTML. This approach makes intuitive sense and even works fine if you test by throwing markup in, like '<h1>this is a test</h1>'.

There are only two noticeable problems with this, both of them obscure:

  • any valid entity references in the text will be decoded to their real character values, so '&lt;' turns into '<'.
  • if there's a literal '</textarea>' in the text, it will become the end of the textarea (and your page layout may explode).

Since most people using your website don't do either of these, the simple solution works almost all of the time.

The real problem is that people develop the wrong mental model of what <textarea> does. They think (just as I thought when I started to write DWiki) that <textarea> means 'the browser treats as completely literal all the initial content that I insert here'. The defect with this mental model is exposed by putting a '</textarea>' in the initial content you insert into a textarea: how is the browser supposed to tell the </textarea> you inserted (that it is supposed to ignore) apart from the real </textarea> in your HTML that closes the textarea? The answer is that it can't, and thus that the mental model is wrong.

What is actually going on is that browsers treat the contents of <textarea> as what the HTML 4.01 specification calls #PCDATA: 'document text', in which character entities are allowed and interpreted (technically markup is forbidden; in practice browsers treat it as literal text). It has to be this way; since HTML has no other quoting mechanism besides character entities, allowing character entities is the only way to escape your inserted '</textarea>' so it doesn't terminate the textarea.
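To make the #PCDATA behaviour concrete, here is a crude Python model of what a browser does with a textarea's initial content. It is only a sketch, not a real HTML parser: the function name browser_initial_value is made up for illustration, and the model simply assumes that the textarea ends at the first '</textarea>' and that entity references in between are decoded.

```python
import html
import re

def browser_initial_value(page_html):
    """Crude model of the initial value a browser gives the first <textarea>.

    Assumption: the content runs up to the first '</textarea>', markup in
    between is kept as literal text, and entity references are decoded.
    """
    start = re.search(r"<textarea\b[^>]*>", page_html, re.IGNORECASE)
    if start is None:
        return None
    rest = page_html[start.end():]
    end = re.search(r"</textarea\s*>", rest, re.IGNORECASE)
    body = rest if end is None else rest[:end.start()]
    # Entity references are interpreted, so '&lt;' becomes '<'.
    return html.unescape(body)

# Raw content dropped into the page unquoted, the '95% right' approach.
raw = "use &lt; and </textarea> here"
page = "<textarea name='c'>" + raw + "</textarea>"
print(repr(browser_initial_value(page)))
```

In this model, markup such as '<h1>this is a test</h1>' comes back unchanged, which is why the usual quick test passes, while the sample above loses both the '&lt;' and everything from the embedded '</textarea>' onward.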

This means that you need to quote at least some things in your textarea initial content; minimally '&' and '<', but if you already have a general HTML quoting function (and you should), just use it and be done. (The browser will strip this quoting when it creates the actual initial contents, and thus you will get back the unquoted version when the user POSTs for the next round.)
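As a minimal sketch of that quoting, assuming Python and its standard html module as the general HTML quoting function (the helper name textarea_with_content is hypothetical, not anything from DWiki):

```python
import html

def textarea_with_content(name, content):
    """Build a <textarea> whose initial content is safely quoted."""
    # html.escape quotes '&', '<' and '>'; quote=False leaves '"' and "'"
    # alone, since they are harmless in element content.
    quoted = html.escape(content, quote=False)
    # The attribute value does need its quote characters escaped.
    return '<textarea name="{}">{}</textarea>'.format(html.escape(name), quoted)

content = "a &lt; b </textarea> c"
out = textarea_with_content("comment", content)
print(out)

# What the browser does when it strips the quoting again.
inner = out[len('<textarea name="comment">'):-len("</textarea>")]
print(html.unescape(inner) == content)
```

Only one '</textarea>' survives in the output, the real one, and decoding the quoted content gives back exactly the original text, which is what the user's browser will POST back.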

web/TextareaGotcha written at 23:57:52

How to have your web spider irritate me intensely (part 2)

In the spirit of previous cleverness, here's a simple new technique:

Have your web spider make up random Referer headers.

This wasn't Referer spamming, since the websites in the Referer headers were completely random URLs, apparently drawn from legitimate sites around the Internet (often repeated). Nor were the websites ones that actually linked to us, or had any relationship to the URLs that were being crawled.

Even in low volume this is a sure-fire ticket to our kernel-level IP filters, since it ensures that we're mostly unable to get anything useful from our Referer logs without a lot of additional work and is therefore deeply irritating.

Today's offender is the IP address 212.52.80.101, which is an unnamed iol.it IP address; it is using a User-Agent value of 'Mozilla/5.0 (arianna.libero.it,ariannaadm@pisa.iol.it)'. It does seem to have requested robots.txt, but of course the User-Agent string gives no clues as to what User-Agent setting in there will turn it off. Ironically it appears to respect nofollow, unlike many other better-behaved web spiders.

web/HowToGetYourSpiderBannedIV written at 12:56:32


