How DWiki parses its wikitext (part 1)

October 11, 2013

I have a long-standing view that parsing wikitext is both hard and not discussed very much; if for some reason you have to write a wikitext parser, there are very few good examples to easily crib from. So I feel like describing at least some stuff about how DWiki parses its wikitext dialect.

(Most of current DWiki wikitext parsing is not my work; the core code was generously improved and rewritten by Daniel Martin many years ago.)

The first big thing is that DWiki effectively has two separate parsers. One parser handles embedded formatting in running text (things like fonts, links, and so on) and the other handles all of the line-oriented block-level structures like paragraphs, headers, blockquotes, lists, etc. What makes this work is that the block-level parser doesn't immediately parse the running text of multi-line things like paragraphs; instead it does what I'll call 'paragraph fusion', where all of the lines of raw text are gathered up until the end of the paragraph and only then parsed together. Splitting up the parsing job this way significantly simplifies both sides and makes it much easier to handle text formatting that spans multiple physical lines (since you're processing all of the lines at once).

(If I was rewriting DWiki's parsing code from scratch I would completely separate the parser code for running text from the block level parser code.)
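To make 'paragraph fusion' concrete, here is a minimal sketch of a block-level pass in that style. The function names and the markup rules (a blank line ends a paragraph, '=' starts a header) are illustrative assumptions for the example, not DWiki's actual code or dialect.

	# Minimal sketch of a block-level pass with 'paragraph fusion': lines are
	# only classified and accumulated here; the fused paragraph text is handed
	# to a separate running-text parser in one piece.
	def parse_blocks(lines, parse_inline):
	    html = []
	    paragraph = []      # raw lines of the paragraph currently being fused

	    def flush():
	        # Parse the whole fused paragraph at once, so inline markup that
	        # spans physical lines is no harder than markup on a single line.
	        if paragraph:
	            html.append("<p>%s</p>" % parse_inline(" ".join(paragraph)))
	            del paragraph[:]

	    for line in lines:
	        if not line.strip():
	            flush()                  # blank line: end of the paragraph
	        elif line.startswith("="):
	            flush()                  # headers are one-line blocks
	            html.append("<h1>%s</h1>" % parse_inline(line.strip("= \n")))
	        else:
	            paragraph.append(line.strip())
	    flush()
	    return "\n".join(html)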

Both the block level parser and the running text parser are complicated beasts, but the broad form of the running text parser is easier to describe. It's a highly optimized (by Daniel Martin) version of the standard regular expression based substitution process where you walk through the source string, find each thing that needs special handling (either to change it into some sort of HTML markup or to quote it), and process it as you go. There are a bunch of ad hoc complications that are handled in ad hoc code on the fly, which I've become convinced is the best approach. One thing worth noting is that this step-at-a-time process means that the text parser can be invoked recursively in order to handle the text of a link (which is allowed to have wikitext markup).
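As an illustration of that kind of substitution walk (including the recursion for link text), here is a small sketch in the same spirit. The markup it handles (*bold* and [[target|text]] links) and all of the names are made up for the example; DWiki's real dialect and code are considerably more involved.

	import re

	# One combined pattern for everything 'special' in running text: bold,
	# links, and characters that must be HTML-quoted.
	_TOKEN = re.compile(r"\*(?P<bold>[^*]+)\*"
	                    r"|\[\[(?P<target>[^|\]]+)\|(?P<ltext>[^\]]+)\]\]"
	                    r"|(?P<quote>[<>&])")

	_QUOTES = {"<": "&lt;", ">": "&gt;", "&": "&amp;"}

	def parse_inline(text):
	    out, pos = [], 0
	    for m in _TOKEN.finditer(text):
	        out.append(text[pos:m.start()])     # plain text before the match
	        if m.group("bold") is not None:
	            out.append("<b>%s</b>" % parse_inline(m.group("bold")))
	        elif m.group("target") is not None:
	            # Link text may itself contain markup, so recurse on it.
	            out.append('<a href="%s">%s</a>'
	                       % (m.group("target"), parse_inline(m.group("ltext"))))
	        else:
	            out.append(_QUOTES[m.group("quote")])   # HTML-quote specials
	        pos = m.end()
	    out.append(text[pos:])                  # trailing plain text
	    return "".join(out)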

(The need for ad hoc things in wikitext parsing really calls for a separate entry, but I'm convinced that a strict and regular wikitext dialect is not 'humane'.)

The running text parser is also the single biggest place where a faster language than Python would improve the code. The current code attempts to do as much as possible in regular expressions for the traditional Python performance reason (pushing as much work as possible into the C-level regexp engine), and this creates much of its complexity in various ways. A straight reimplementation in eg Go could probably dispense with a lot of this in favour of simpler lexing.
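For contrast, this is roughly what 'simpler lexing' looks like as a character-at-a-time scanner (sketched in Python here purely for illustration, with an invented token vocabulary). This sort of loop is cheap in a compiled language, but in CPython it runs in interpreted bytecode instead of the C regexp engine, which is the pressure that pushes the current code towards regular expressions.

	# A deliberately crude character-at-a-time lexer: emit runs of plain text
	# and single 'special' characters that might start markup.
	def lex(text):
	    tokens, start = [], 0
	    for pos, ch in enumerate(text):
	        if ch in "*[]<>&":
	            if start < pos:
	                tokens.append(("text", text[start:pos]))
	            tokens.append(("special", ch))
	            start = pos + 1
	    if start < len(text):
	        tokens.append(("text", text[start:]))
	    return tokens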

PS: this is going to be an irregular series because to write entries I have to actually understand chunks of DWiki's wikitext rendering code and this takes serious work at this point. If I'm a smart person I'll write all sorts of comments in the source code during this process in addition to these entries.


Comments on this page:

(The need for ad hoc things in wikitext parsing really calls for a separate entry, but I'm convinced that a strict and regular wikitext dialect is not 'humane'.)

I think it could be nice for humans to write and still not look TeX-like. It may not be the most advanced dialect (tables in tables?), but on the other hand, a well-formed tree structure to work with is worth the simplification.

Here's the markup syntax for my wiki engine, and here is the grammar definition and the (unfortunately hand-written) lexer.

I believe my syntax is close enough to typical wiki to be convenient, but I'm biased as its creator, of course.

By cks at 2013-10-16 12:19:11:

I think that your wiki engine's markup syntax is (deliberately) simplified to a level that removes many (although not all) of the pain points of parsing wikitext. I wrote a longer description of the general issue in WikitextParsingPains.

(It's also not clear to me how your parser handles errors. Error handling is a crucial and hard issue for a general-use wiki engine.)

As for how much people will miss the various omitted features, I think it depends on what the wiki is for. I would miss basically all of them for use here, for example.

