You should convert wikitext to HTML through an AST

August 13, 2013

Suppose that you are turning wikitext or some other form of structured markup into HTML. The straightforward and often easiest way to do this is to directly generate the HTML as you process the wikitext; when you encounter and parse a particular bit of markup, you immediately output the relevant HTML. Having done this and stubbed my toes very vigorously, I have a bit of advice: you should parse into an AST and then generate HTML from that AST. Yes, it's more code and it seems more indirect, but it has some significant advantages.

The first general advantage is that it decouples the process of parsing your wikitext from the process of generating HTML. Rather than being two sides of a single chunk of code, they now communicate through an API, the AST. The AST then gives you a vantage point to examine and verify each side of the process independently (and to evolve them separately). For example, if you're reworking the parsing code you can verify that its results haven't changed by comparing the ASTs it produces, instead of having to compare the output HTML.

(If you use automated tests, I expect that having an AST in the middle will make both parsing and HTML generation much easier to test. It should also make it much less annoying to evolve either side, because far fewer tests are likely to need changes when you change either parsing or HTML generation.)
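
As a minimal sketch of this split, here is a Python version for a made-up two-construct markup ('= Header =' lines and paragraphs separated by blank lines). The markup and the names Header, Paragraph, parse and to_html are all hypothetical illustrations, not the code or the wikitext dialect behind any real wiki.

```python
import html
from dataclasses import dataclass

# Hypothetical AST node types for a toy markup (an assumption for this sketch).
@dataclass
class Header:
    level: int
    text: str

@dataclass
class Paragraph:
    text: str

def parse(source):
    """Turn the toy markup into a list of AST nodes. No HTML is produced here."""
    nodes = []
    for block in source.split("\n\n"):
        block = " ".join(block.split())  # collapse line breaks and extra spaces
        if not block:
            continue
        if block.startswith("="):
            # The number of leading '=' characters gives the header level.
            level = len(block) - len(block.lstrip("="))
            nodes.append(Header(level, block.strip("= ")))
        else:
            nodes.append(Paragraph(block))
    return nodes

def to_html(nodes):
    """Generate HTML from the AST alone; this side never sees the markup."""
    out = []
    for node in nodes:
        if isinstance(node, Header):
            out.append(f"<h{node.level}>{html.escape(node.text)}</h{node.level}>")
        else:
            out.append(f"<p>{html.escape(node.text)}</p>")
    return "\n".join(out)

tree = parse("= Title =\n\nSome text\nwith a <tag>.")
print(tree)
print(to_html(tree))
```

With this shape, a parser test can compare parse()'s result against a list of expected nodes, and an HTML generation test can feed to_html() a hand-built list of nodes, so neither kind of test depends on the other side.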

The second general advantage is that once you have an AST you don't have to output just HTML. For instance (as I mentioned once before) you can output a different wikitext dialect, giving you a fully reliable way of doing wikitext format conversions. Decide that some part of your markup should be different? Now you can fix that. Or you could transition to a significantly different format (eg, to Markdown or MediaWiki from your own custom format) without giving your users and yourself heartburn. All of these options are simply an AST walker away.

(The Go project shows the power of being able to make this sort of change automatically and reliably with its 'go fix' tool, which has been used for a number of language and library transitions. My impression is that the existence of go fix makes the Go people more willing to make such changes.)
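
To make 'an AST walker away' concrete, here is a small Python sketch that walks a hand-built AST and emits Markdown instead of HTML. The Header and Paragraph node types and the to_markdown function are hypothetical names invented for this illustration, and a real converter would also have to escape characters that are special in Markdown.

```python
from dataclasses import dataclass

# Hypothetical AST node types (an assumption for this sketch).
@dataclass
class Header:
    level: int
    text: str

@dataclass
class Paragraph:
    text: str

def to_markdown(nodes):
    """Walk the AST and emit Markdown; no knowledge of the original markup needed."""
    out = []
    for node in nodes:
        if isinstance(node, Header):
            out.append("#" * node.level + " " + node.text)
        elif isinstance(node, Paragraph):
            out.append(node.text)
        else:
            # Fail loudly so a conversion never silently drops content.
            raise TypeError(f"unknown node type: {node!r}")
    return "\n\n".join(out) + "\n"

tree = [
    Header(1, "Title"),
    Paragraph("Some text."),
    Header(2, "Details"),
    Paragraph("More text."),
]
print(to_markdown(tree))
```

Raising on an unknown node type is a deliberate choice in this sketch: for a format conversion you want to find out about constructs the walker does not handle, not lose them quietly.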

A smaller advantage of an AST is that it gives you structured information. As I've found out the hard way, a large monolithic blob of HTML is not necessarily what you want. Even when you want HTML (as opposed to metadata) it can be very useful to be able to get things like 'the first paragraph' or 'the text of every top-level section header'. Generating HTML from an AST also lets you defer certain rendering decisions until very late in the process; this can let you cache more (or cache things more easily).
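
Queries like these become short functions over the AST. The following Python sketch assumes a flat list of hypothetical Header and Paragraph nodes (names invented here for illustration); a real wikitext AST would be nested and have many more node types.

```python
from dataclasses import dataclass

# Hypothetical AST node types (an assumption for this sketch).
@dataclass
class Header:
    level: int
    text: str

@dataclass
class Paragraph:
    text: str

def first_paragraph(nodes):
    """Return the text of the first paragraph, or None if there isn't one."""
    for node in nodes:
        if isinstance(node, Paragraph):
            return node.text
    return None

def top_level_headers(nodes):
    """Return the text of every level 1 header, in document order."""
    return [n.text for n in nodes if isinstance(n, Header) and n.level == 1]

tree = [
    Header(1, "Introduction"),
    Paragraph("An opening paragraph."),
    Header(2, "A subsection"),
    Paragraph("More text."),
    Header(1, "Conclusion"),
]
print(first_paragraph(tree))
print(top_level_headers(tree))
```

Getting the same information out of already-generated HTML would mean re-parsing that HTML or scattering special cases through the generation code.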

Another AST advantage is simply that it will almost certainly push you to write a relatively systematic parser for your wikitext. A real parser matters because it is easier to understand (and to reason about) than code that interleaves ad hoc parsing with generating output.

(This was inspired by the comment left on my earlier entry about my mistake. My newly revised code still falls well short of producing an AST, but I've realized that if I were writing a new parser from scratch, I would definitely use an AST as the intermediate form.)

