Fun with control characters and the web

February 15, 2006

Yesterday, I broke WanderingThoughts' syndication feeds (and made the main page not validate). I did this by accidentally putting a ^D character into an entry; one of the editors I use makes this unfortunately easy to do by accident, and hard to spot.

Neither HTML 4.01 Transitional nor XML (which the Atom syndication format is a dialect of) allows control characters, apart from tabs, linefeeds, and carriage returns. A lot of things are forgiving of crappy HTML, but things that eat Atom feeds are usually more picky; for example, the LiveJournal version stopped updating entirely. (That's how I noticed it.)

(Technically this is not the full story for XML; there are a number of other invalid characters and a great big character set swamp. So far I have been madly ducking it.)
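To make the pickiness concrete, here is a minimal Python sketch, not DWiki's actual code. The character class follows the Char production of the XML 1.0 specification, and the check uses the strict parser in Python's standard library. The helper names find_invalid_xml_chars and parses_as_xml and the <content> wrapper element are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Everything outside the XML 1.0 'Char' production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_invalid_xml_chars(text):
    """Return (offset, character) pairs for characters XML 1.0 forbids."""
    return [(m.start(), m.group()) for m in INVALID_XML_CHARS.finditer(text)]


def parses_as_xml(document):
    """Return True if a strict XML parser accepts the document."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    entry = "An entry with a stray \x04 in it"
    # Report where the bad character sits in the entry text.
    print(find_invalid_xml_chars(entry))
    # A strict parser refuses the whole document because of that one character.
    print(parses_as_xml("<content>%s</content>" % entry))
```

A check like this only detects the problem; it says nothing about what to show readers instead.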

Clearly I'd like to avoid having this happen in the future, but there's a problem: since DWiki pages are edited through the filesystem, DWiki is in a position of having pages with bad characters thrust down its throat at page rendering time, so it has to do something with them. What's the best way to communicate the problem while still producing valid (and ideally useful) output?

This isn't entirely new, as DWikiText already has a number of ways for me to screw it up; for example I could put in an invalid macro. The basic principle DWiki uses is 'the rendering must go on': errors should do as little damage as possible, and never kill the entire page. Usually they produce literal text; of course the problem here is that the literal text is the problem.

(Aborting things to report errors is appropriate for situations when you're showing the error reports to the author. When you're showing them to random people, it makes far less sense.)

This pretty much leads to the answer: stray control characters should produce something like '{control character elided}' in both regular HTML and Atom feeds (it's technically challenging to make it be in bold or the like). This keeps things valid, doesn't hide the problem like just deleting the characters would, and doesn't totally destroy the page. Now I just have to code it. Efficiently, since DWikiText rendering is a hot path.
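A minimal sketch of that substitution in Python, under some assumptions: "stray control characters" is taken to mean everything below 0x20 except tab, linefeed, and carriage return, and a run of adjacent control characters is collapsed into a single marker. The names CONTROL_CHARS, MARKER, and elide_control_chars are hypothetical, not DWiki's.

```python
import re

# Assumption: only the C0 range minus tab (0x09), LF (0x0a) and CR (0x0d)
# counts as "stray". The trailing '+' collapses a run into one marker.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]+")

# The marker contains no '<', '>' or '&', so it needs no escaping of its
# own in either HTML or Atom output.
MARKER = "{control character elided}"


def elide_control_chars(text):
    """Replace each run of stray control characters with a visible marker."""
    return CONTROL_CHARS.sub(MARKER, text)


if __name__ == "__main__":
    print(elide_control_chars("An entry with a stray \x04 in it"))
```

The pattern is compiled once at import time, so clean text costs one regular expression scan per call; whether that is fast enough for a hot path would need measuring.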

(A lesser way would be to rewrite control characters to things like '^D', but this is a) somewhat more complicated and b) not as noticeable.)
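For comparison, a sketch of the caret-notation alternative, with the same assumption about which characters count as stray. The name caret_notation is hypothetical. Each control character becomes '^' followed by the character 0x40 above it, so 0x04 turns into '^D'.

```python
import re

# Assumption: the C0 range minus tab (0x09), LF (0x0a) and CR (0x0d).
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def caret_notation(text):
    """Rewrite each stray control character as ^@, ^A, ... ^_."""
    return CONTROL_CHARS.sub(
        lambda m: "^" + chr(ord(m.group()) + 0x40), text
    )


if __name__ == "__main__":
    print(caret_notation("An entry with a stray \x04 in it"))
```

The extra complication is the per-match callback, and the two-character result is easy to read past in running text.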


Comments on this page:

By DanielMartin at 2006-02-15 12:11:34:

In XML 1.0, there is absolutely no way to have a character with a value less than 32 aside from CR, LF, and TAB. Try any kind of &-encoding at all, put it in a CDATA section, etc. - it won't work. Text containing low-32 control characters cannot be shoveled into XML 1.0 no matter what you do. (Oddly enough, control characters in the range 128-159 are just fine, and you don't even have to &-encode them.)

I've always considered this a big gaping deficiency in XML 1.0 that makes it truly annoying to generate tools that translate from some legacy system into XML. This means that in addition to the regular < and & escaping you've got to do, you've also got to strip off any garbage characters which just might happen to be sitting in your legacy system. If you don't, your entire XML document is rejected, the tools on the other end throw nasty errors, and everything blows up. The tools on the other end treat a single character 28 (which we've seen very occasionally in some of our legacy data), or its equivalent "&#28;", as though the entire document were binary garbage. They really are that fragile. For something billed as an interchange format, this is a definite bug, not a feature.

Note that XML 1.1 is not so bizarrely anal about not allowing low-32 character data; it also treats the two control character ranges of latin1 identically. (That is, they are allowed as character data but must be &-encoded) However, despite it being recommended by w3c.org for almost 2 years now, I've seen almost no movement from the makers of XML tools towards it.

By cks at 2006-02-16 03:06:22:

Liferea actually had an interestingly novel reaction to the bad Atom feed. Instead of rejecting it outright, it seems to have dropped the rest of the elements in that entry, which left the entry without a title, but carried on with the rest of the entries in the feed. (I could tell, because it also told me that another entry had been updated.)

I suspect that Liferea uses a somewhat relaxed XML parser. Since there's apparently a lot of broken RSS feeds out there, this isn't too surprising.

By cks at 2006-02-16 03:17:04:

PS: that would be Liferea as in Liferea, the Linux feed reader.

(I realized after posting my comment that maybe not everyone remembers quite as much about the software I use and mention in passing once or twice as I do.)

Written on 15 February 2006.

Last modified: Wed Feb 15 03:15:45 2006