Wandering Thoughts archives

2006-02-15

Fun with control characters and the web

Yesterday, I broke WanderingThoughts' syndication feeds (and made the main page not validate). I did this by accidentally putting a ^D character into an entry; one of the editors I use makes this unfortunately easy to do and hard to spot.

Neither HTML 4.01 Transitional nor XML (which the Atom syndication format is built on) allows control characters, apart from tab, linefeed, and carriage return. A lot of things are forgiving of crappy HTML, but things that eat Atom feeds are usually pickier; for example, the LiveJournal version of the feed stopped updating entirely. (That's how I noticed it.)

(Technically this is not the full story for XML; there are a number of other invalid characters and a great big character set swamp. So far I have been madly ducking it.)
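As a rough illustration of what 'invalid' means here, the following is a minimal sketch of a checker based on the XML 1.0 'Char' production. It assumes Python 3 and already-decoded text, which sidesteps the character set swamp entirely; the function name is made up for this example and is not part of DWiki.

```python
import re

# Everything outside the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHAR = re.compile(
    r'[^\t\n\r\x20-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]'
)

def find_invalid_xml_chars(text):
    """Return (offset, character) pairs for characters XML 1.0 forbids.

    Hypothetical helper for illustration; 'text' is assumed to be a
    decoded str, not raw bytes.
    """
    return [(m.start(), m.group()) for m in _INVALID_XML_CHAR.finditer(text)]
```

Running something like this over an entry before saving it would have pointed straight at the stray ^D (character 0x04).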

Clearly I'd like to avoid having this happen in the future, but there's a problem: since DWiki pages are edited through the filesystem, DWiki is in a position of having pages with bad characters thrust down its throat at page rendering time, so it has to do something with them. What's the best way to communicate the problem while still producing valid (and ideally useful) output?

This isn't entirely new, as DWikiText already has a number of ways for me to screw it up; for example I could put in an invalid macro. The basic principle DWiki uses is 'the rendering must go on': errors should do as little damage as possible, and never kill the entire page. Usually they produce literal text; of course the problem here is that the literal text is the problem.

(Aborting things to report errors is appropriate for situations when you're showing the error reports to the author. When you're showing them to random people, it makes far less sense.)

This pretty much leads to the answer: stray control characters should produce something like '{control character elided}' in both regular HTML and Atom feeds (it's technically challenging to make it bold or the like). This keeps things valid, doesn't hide the problem like just deleting the characters would, and doesn't totally destroy the page. Now I just have to code it. Efficiently, since DWikiText rendering is a hot path.
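A minimal sketch of the idea, assuming Python 3 strings and treating only the C0 control characters other than tab, linefeed, and carriage return as 'stray'. The names are hypothetical and this is not DWiki's actual code. Collapsing a run of adjacent control characters into a single marker is a choice made for this sketch, not something the approach requires.

```python
import re

# C0 control characters except tab (0x09), linefeed (0x0a), and
# carriage return (0x0d). A run of them becomes one marker.
_STRAY_CONTROLS = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f]+')

# The marker has no characters that need escaping in HTML or XML.
MARKER = '{control character elided}'

def elide_controls(text):
    """Replace stray control characters with a visible marker.

    Hypothetical helper for illustration. The search() call is a
    cheap fast path for the common case of clean text.
    """
    if _STRAY_CONTROLS.search(text) is None:
        return text
    return _STRAY_CONTROLS.sub(MARKER, text)
```

With this, `elide_controls("oops\x04 here")` gives `'oops{control character elided} here'`, and clean text comes back unchanged. A single precompiled regular expression keeps the per-page cost low.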

(A lesser way would be to rewrite control characters to things like '^D', but this is a) somewhat more complicated and b) not as noticeable.)
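For comparison, here is a sketch of the caret-notation alternative under the same assumptions (Python 3 strings, hypothetical names, not DWiki's code). It uses the usual convention that a control character is displayed as '^' followed by the character 0x40 above it, so 0x04 becomes '^D'.

```python
import re

# Same set of stray characters: C0 controls except tab, LF, and CR.
_CONTROL = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f]')

def caret_notation(text):
    """Rewrite each stray control character as ^X.

    Hypothetical helper for illustration. XOR with 0x40 maps
    0x00-0x1f onto '@', 'A'-'Z', '[', '\\', ']', '^', '_', none of
    which need escaping in HTML or XML.
    """
    return _CONTROL.sub(lambda m: '^' + chr(ord(m.group()) ^ 0x40), text)
```

The extra complication is the per-character callback, and a two-character '^D' in the middle of a paragraph is easy to read past.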

web/CharacterProblems written at 03:15:45


