How DWiki parses its wikitext (part 1)

October 11, 2013

I have a long-standing view that parsing wikitext is both hard and not discussed very much; if for some reason you have to write a wikitext parser, there are very few good examples to easily crib from. So I feel like describing at least some stuff about how DWiki parses its wikitext dialect.

(Most of current DWiki wikitext parsing is not my work; the core code was generously improved and rewritten by Daniel Martin many years ago.)

The first big thing is that DWiki effectively has two separate parsers. One parser handles embedded formatting in running text (things like fonts, links, and so on) and the other handles all of the line oriented, block level structures like paragraphs, headers, blockquotes, lists, etc. What makes it work is that the block level parser doesn't parse running text immediately for multi-line things like paragraphs; instead it does what I'll call 'paragraph fusion', where all of the lines of raw text are gathered up until the end of the paragraph and only then parsed together. Splitting up the parsing job this way significantly simplifies both sides and makes it much easier to handle text formatting that spans multiple physical lines (since you're processing all of the lines at once).
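
(As a rough illustration of the shape of this, here is a minimal made-up sketch of paragraph fusion; it is not DWiki's actual code. The parse_inline function stands in for the running text parser, and the only block structure handled is blank-line separated paragraphs.)

    def parse_blocks(lines, parse_inline):
        # Accumulate raw paragraph lines and only hand the fused text to
        # the inline parser when the paragraph ends, so that inline
        # markup can span physical lines.
        html = []
        para = []
        for line in lines:
            if line.strip() == "":
                if para:
                    html.append("<p>%s</p>" % parse_inline(" ".join(para)))
                    para = []
            else:
                para.append(line.strip())
        if para:
            html.append("<p>%s</p>" % parse_inline(" ".join(para)))
        return "\n".join(html)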

(If I was rewriting DWiki's parsing code from scratch I would completely separate the parser code for running text from the block level parser code.)

Both the block level parser and the running text parser are complicated beasts, but the broad form of the running text parser is easier to describe. It's a highly optimized (by Daniel Martin) version of the standard regular expression based substitution process, where you walk through the source string, find each thing that needs special handling (either to change it into some sort of HTML markup or to quote it), and process it. There are a bunch of ad hoc complications that are handled in ad hoc code on the fly, which I've become convinced is the best approach. One thing worth noting is that this step-at-a-time process means that the text parser can be invoked recursively in order to handle the text of a link (which is allowed to have wikitext markup).
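
(To give the general flavour, here is a minimal sketch of this sort of regex-driven substitution, including the recursion for link text. The markup and the names here are invented for illustration; this is not DWiki's real dialect or its actual code.)

    import re
    from html import escape

    # '*bold*' and '[url text]' are made-up markup for this sketch.
    TOKEN = re.compile(r"\*(?P<bold>[^*]+)\*"
                       r"|\[(?P<url>\S+)\s+(?P<text>[^\]]+)\]")

    def parse_inline(text):
        out = []
        pos = 0
        for m in TOKEN.finditer(text):
            # Plain text between pieces of markup gets HTML-quoted.
            out.append(escape(text[pos:m.start()]))
            if m.group("bold") is not None:
                out.append("<b>%s</b>" % parse_inline(m.group("bold")))
            else:
                # Link text may itself contain markup, so recurse.
                out.append('<a href="%s">%s</a>'
                           % (escape(m.group("url"), quote=True),
                              parse_inline(m.group("text"))))
            pos = m.end()
        out.append(escape(text[pos:]))
        return "".join(out)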

(The need for ad hoc things in wikitext parsing really calls for a separate entry, but I'm convinced that a strict and regular wikitext dialect is not 'humane'.)

The running text parser is also the single biggest place where a faster language than Python would improve the code. The current code attempts to do as much as possible in regular expressions for the traditional Python performance reason (regular expressions run in C), and this creates much of its complexity. A straight reimplementation in, say, Go could probably dispense with a lot of this in favour of simpler lexing.
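
(By 'simpler lexing' I mean roughly something like scanning the text a character at a time and emitting tokens, rather than packing all of the recognition work into regular expressions. A minimal sketch, with a made-up set of special characters:)

    # Invented special characters, just for illustration.
    SPECIALS = {"*": "star", "[": "lbrack", "]": "rbrack"}

    def lex_inline(text):
        # Scan character by character, emitting (kind, value) tokens for
        # runs of plain text and for special characters.
        tokens = []
        start = 0
        for pos, ch in enumerate(text):
            kind = SPECIALS.get(ch)
            if kind is not None:
                if start < pos:
                    tokens.append(("text", text[start:pos]))
                tokens.append((kind, ch))
                start = pos + 1
        if start < len(text):
            tokens.append(("text", text[start:]))
        return tokens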

PS: this is going to be an irregular series, because to write entries I have to actually understand chunks of DWiki's wikitext rendering code and this takes serious work at this point. If I'm a smart person I'll write all sorts of comments in the source code during this process, in addition to these entries.
