The HTML <pre> element doesn't do very much

June 5, 2019

These days I don't do too much with HTML, so every so often I wind up in a situation where I have to reach back and reconstruct things that once were entirely well known to me. Today, I wound up talking with someone about the <pre> element and what you could and couldn't safely put in it, and it took some time to remember most of the details.

The simple version is that <pre> doesn't escape markup; it only changes formatting. Many simple examples you'll see only use it on plain text, so this isn't immediately clear. It would be nice if <pre> were a general container that you could pour almost arbitrary text into and have it escaped, but it's not. If you're writing HTML by hand and you have something to put into a <pre>, you need to escape anything that looks like markup or an HTML entity, which in practice means '<' and '&' (much like a <textarea>, although even more so). On the other hand, you can deliberately use this to write <pre> blocks that contain markup, for example links or text emphasis (you might use bold inside a <pre> to denote generic placeholders that the reader fills in with their specifics).
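As a rough illustration of what a tool has to do here, this is a minimal Python sketch using the standard library's html.escape. The function names pre_block and pre_with_placeholder are made up for this example; they are not taken from DWikiText or any other real renderer.

```python
from html import escape


def pre_block(text: str) -> str:
    """Wrap plain text in a <pre>, escaping &, < and > so it displays literally."""
    # quote=False leaves " and ' alone; they only need escaping
    # inside attribute values, not in element content.
    return "<pre>" + escape(text, quote=False) + "</pre>"


def pre_with_placeholder(before: str, placeholder: str, after: str) -> str:
    """Build a <pre> where one part is deliberately real markup (bold)."""
    # Only the text is escaped; the <b> tags are intentional markup.
    return (
        "<pre>"
        + escape(before, quote=False)
        + "<b>" + escape(placeholder, quote=False) + "</b>"
        + escape(after, quote=False)
        + "</pre>"
    )


if __name__ == "__main__":
    print(pre_block("sort <input.txt >output.txt && echo done"))
    print(pre_with_placeholder("grep -c '<td>' ", "FILE", " | less"))
```

The second function shows the flip side: because <pre> doesn't escape anything, markup you put in on purpose still works, as long as everything around it has been escaped.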

As with <textarea>, it's easy to overlook this for straightforward cases and to get away without doing any text escaping, especially in modern browsers. A lot of the command lines or code or whatever that we often put into <pre> don't contain things that can be mistaken for HTML markup or HTML entities, and modern browsers will often silently re-interpret things as plain text for you if they aren't validly formatted entities or markup. I myself have written and altered any number of <pre> blocks over the past few years without ever thinking about it, and I'm sure that some of them included '<' or '>' and perhaps '&' (all as part of Unix command lines).

(The MDN page on <pre> includes an example with unescaped < and >. If you play around with similar cases, you'll probably find that what is rendered intact and what is considered to be an unrecognized HTML element that is silently swallowed is quite sensitive to details of formatting and what is included within the '< ... >' run of raw text. Browsers clearly have a lot of heuristics here, some of which have been captured in HTML5's description of tag open state. In HTML5, a '<' only starts a tag if it's followed by an ASCII letter (or by '/' for an end tag); '!' and '?' lead into comments and similar constructs, and anything else leaves the '<' as plain text. This applies in any context, not just in a <pre>.)
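To make the tag open state rule concrete, here is a simplified Python sketch that looks only at the character after a '<'. The function name classify_less_than and its return strings are invented for this illustration, and it only covers this one decision point from the HTML Standard's tokenizer, not what happens afterward.

```python
import string


def classify_less_than(text: str, pos: int) -> str:
    """Roughly classify a '<' at text[pos] the way the tag open state would."""
    if text[pos] != "<":
        raise ValueError("text[pos] must be '<'")
    nxt = text[pos + 1 : pos + 2]  # empty string if '<' is the last character
    if nxt and nxt in string.ascii_letters:
        return "start tag"  # e.g. '<input.txt' begins an element
    if nxt == "/":
        return "end tag or bogus comment"
    if nxt == "!":
        return "comment, doctype or CDATA"
    if nxt == "?":
        return "bogus comment"
    return "literal text"  # the '<' is emitted as a plain character


if __name__ == "__main__":
    for sample in ["a < b", "<1>", "<$HOME>", "<input.txt", "<!-- x -->"]:
        pos = sample.index("<")
        print(repr(sample), "->", classify_less_than(sample, pos))
```

This is why something like 'a < b' tends to survive unescaped while 'cat <input.txt' does not: the first '<' is followed by a space, the second by a letter.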

I don't know how browser interpretation of various oddities in <pre> content is affected by the declared DOCTYPE or the HTML version the browser assumes, but I wouldn't count on all browsers behaving the same outside, perhaps, of HTML5 mode (which at least has specific rules for this). Of course if you're producing HTML with tools instead of writing it by hand, the tools should take care of this for you. That's the only reason that Wandering Thoughts has whatever HTML correctness it does; my DWikiText to HTML rendering code takes care of it all for me, <pre> blocks included.


Comments on this page:

From 78.58.206.110 at 2019-06-05 23:54:10:

Speaking of unescaped HTML, the tag disappeared from the post's title in my feed reader. I see you use <title type="html"> in the Atom feed, which means the feed generator should have escaped the contents twice to keep this from happening.

By cks at 2019-06-06 00:23:42:

As far as I can tell, the feed generator did escape the title twice. The raw text is:

<title type="html">The HTML &amp;lt;pre&gt; element doesn&#39;t do very much</title>

The leading < has been escaped once to be '&lt;', and then the & has been escaped a second time to be '&amp;'. The trailing > is not escaped in the first pass, but has been turned into '&gt;' in the second.
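The two passes can be sketched in Python with the standard library's html.escape and html.unescape. This is only an illustration of the double escaping idea, not the feed generator's actual code; the function name atom_html_title is made up, and html.escape is stricter than the output quoted above (it escapes '>' in the first pass as well, and writes the apostrophe as a hexadecimal character reference).

```python
from html import escape, unescape


def atom_html_title(title: str) -> str:
    """Return a <title type="html"> element for a plain-text title."""
    as_html = escape(title)   # pass 1: plain text -> HTML source
    as_xml = escape(as_html)  # pass 2: HTML source -> XML character data
    return '<title type="html">' + as_xml + "</title>"


def read_title(xml_text: str) -> str:
    """Undo both passes, the way a feed reader is expected to."""
    as_html = unescape(xml_text)  # what the XML parser hands back
    return unescape(as_html)      # what an HTML renderer would display


if __name__ == "__main__":
    title = "The HTML <pre> element doesn't do very much"
    print(atom_html_title(title))
    print(read_title(escape(escape(title))))
```

A reader that undoes only the XML layer and then strips anything that looks like a tag will lose the '<pre>', which matches the symptom reported above.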

(If you are using Liferea, this is an outstanding issue, issue #492. There is a patch that may fix it without additional side effects, but I have no idea if it's good.)

<pre> escaping HTML would add complexity for no benefit. Would it escape other pre tags as well? If so, how would you end the outer pre? If not, how would you escape JUST pre tags? With the broad availability of the HTMLElement.innerText property, escaping is trivial and explicit as is.

By Brendan Long at 2019-07-13 15:13:27:

Thanks for the useful test case. I just fixed this in FeedReader for GNOME:

https://github.com/jangernert/FeedReader/pull/923

Apparently we were seriously over-stripping HTML.

Written on 05 June 2019.

Last modified: Wed Jun 5 21:03:51 2019