2007-01-29
A gotcha with <textarea>
Textareas are one of those treacherous areas of web programming, because it is really easy to get them 95% right and then never notice that you've fumbled the remaining 5%. The problem area is textareas with initial content, for example blog comment previews; what almost completely works is to just put the raw content into the textarea in your HTML. This approach makes intuitive sense and even works fine if you test by throwing markup in, like '<h1>this is a test</h1>'.
There are only two noticeable problems with this, both of them obscure:
- any valid entity references in the text will be decoded to their real character values, so '&lt;' turns into '<'.
- if there's a literal '</textarea>' in the text, it will become the end of the textarea (and your page layout may explode).
Since most people using your website don't do either of these, the simple solution works almost all of the time.
The real problem is that people develop the wrong mental model of what <textarea> does. They think (just as I thought when I started to write DWiki) that <textarea> means 'the browser treats as completely literal all the initial content that I insert here'. The defect with this mental model is exposed by putting a '</textarea>' in the initial content you insert into a textarea: how is the browser supposed to tell the </textarea> you inserted (that it is supposed to ignore) apart from the real </textarea> in your HTML that closes the textarea? The answer is that it can't, and thus that the mental model is wrong.
What is actually going on is that browsers treat the contents of <textarea> as what the HTML 4.01 specification calls #PCDATA: 'document text', in which character entities are allowed and interpreted (technically markup is forbidden; in practice browsers treat it as literal text). It has to be this way; since HTML has no other quoting mechanism besides character entities, allowing character entities is the only way to escape your inserted '</textarea>' so it doesn't terminate the textarea.
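To make the two failure modes concrete, here is a rough sketch that uses Python's standard html.parser as a stand-in for a browser. This is an assumption for illustration only: real browsers have their own parsers, and the TextareaContent class and initial_content function are hypothetical names, not part of any library. It only shows that entities in the content get decoded and that the first '</textarea>' ends the element.

```python
from html.parser import HTMLParser


class TextareaContent(HTMLParser):
    """Collect the text found inside the first-opened <textarea> regions."""

    def __init__(self):
        super().__init__()  # convert_charrefs defaults to True: entities are decoded
        self.inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "textarea":
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "textarea":
            self.inside = False

    def handle_data(self, data):
        # Only keep text seen while a textarea is open.
        if self.inside:
            self.chunks.append(data)


def initial_content(page):
    """Return what the parser considers the textarea's initial content."""
    parser = TextareaContent()
    parser.feed(page)
    parser.close()
    return "".join(parser.chunks)


# Raw content with an entity reference: the entity is decoded.
print(initial_content("<textarea>1 &lt; 2</textarea>"))
# Raw content with a literal </textarea>: everything after it falls outside.
print(initial_content("<textarea>before</textarea>after</textarea>"))
```

In both cases what ends up in the textarea is not what was inserted, which is exactly the 5% that the 'just put the raw content in' approach gets wrong.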
This means that you need to quote at least some things in your textarea initial content; minimally '&' and '<', but if you already have a general HTML quoting function (and you should), just use it and be done.
(The browser will strip this quoting when it creates the actual initial contents, and thus you will get back the unquoted version when the user POSTs for the next round.)
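As a minimal sketch of the quoting step, assuming a Python backend, the standard library's html.escape can serve as the general HTML quoting function. The render_textarea helper and the 'comment' field name are made up for illustration; DWiki's actual code may look nothing like this.

```python
import html


def render_textarea(name, initial):
    """Return a <textarea> element with its initial content HTML-quoted."""
    # html.escape quotes &, < and > (and, by default, both quote characters),
    # which is more than the minimum of '&' and '<' but harmless here.
    return '<textarea name="{}">{}</textarea>'.format(
        html.escape(name), html.escape(initial)
    )


# Content containing both problem cases: an entity reference and a
# literal </textarea>.
comment = "1 &lt; 2 </textarea>"
print(render_textarea("comment", comment))

# Decoding the entities, as the browser does when it builds the initial
# contents, gives back the original text.
assert html.unescape(html.escape(comment)) == comment
```

The '&' has to be quoted along with '<'; otherwise an '&lt;' that the user typed literally would be decoded into '<' on the next round.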
How to have your web spider irritate me intensely (part 2)
In the spirit of previous cleverness, here's a simple new technique:
Have your web spider make up random Referer headers.
This wasn't Referer spamming, since the websites in the Referer headers were completely random URLs, apparently drawn from legitimate sites around the Internet (often repeated). Nor were the websites ones that actually linked to us, or had any relationship to the URLs that were being crawled.
Even in low volume this is a sure-fire ticket to our kernel-level IP filters, since it ensures that we're mostly unable to get anything useful from our Referer logs without a lot of additional work, and is therefore deeply irritating.
Today's offender is the IP address 212.52.80.101, which is an unnamed iol.it IP address; it is using a User-Agent value of 'Mozilla/5.0 (arianna.libero.it,ariannaadm@pisa.iol.it)'. It does seem to have requested robots.txt, but of course the User-Agent string gives no clues as to what User-Agent setting in there will turn it off. Ironically it appears to respect nofollow, unlike many other better-behaved web spiders.