Some pain points of parsing wikitext (and simplifications that avoid them)
I've talked before in general about things that make parsing a general wikitext hard or at least inobvious but I've never given specific examples. Today I want to do so and as an associated issue talk about what simplifications and shortcuts you can take to make it easier.
First, let's talk about errors. Specifically, that in parsing general
wikitext there aren't errors; you've always got to
keep on producing output. In a wikitext dialect where
* is the
emphasis delimiter, consider the text '
This is *an error'. In a
conventional parser the result of this is an error that boils down
*'. In a generally useful wikitext parser, this
must either add an implicit terminating
* or treat the visible
* as just a literal character. What complicates this is that
many conventional parsing techniques generate errors only well after
the actual error point; for example, the literal error here might be
'unexpected end of text'. The good news here is that error recovery is
actually a well studied thing so there are probably techniques that can
be adopted even if you can't use a standard parser generator with its
standard terrible error handling.
The two possible simplifications here are to either ignore the possibility of errors entirely or to declare that it's okay for errors to make the wiki page unreadable (yes, these have the same end result). The more restricted you make who can edit the wiki, the easier it will be to get away with this.
In a general wikitext, text style changes are a pain because they
can nest and thus overlap. It's valid to have bold italics but
*~bold italics* bold~' either doesn't work or requires you to do
peculiar things to synthetically generate a properly nested result. I'm
not sure how many parser generators will allow you to write a recursive
general production rule for this and how many will require you to write
a whole series of restrictive sub-rules for 'text but no <whatever>
markers'. Relatedly, text style changes can nest inside other inline
elements like the text for links. As a result, sorting out how to handle
the following is going to be interesting:
This is *italic [[(with an *italic* link) http://somewhere]] text*.
The simplification is to disallow all nested style changes. Plain text can be set in only one style and all link text and other embedded text doesn't allow style changes at all.
(Related to this is text formatting that actually has the effect of turning off formatting characters to quote what it encloses. The traditional parsing approach pushes such quoting into the lexer but this leaves you with your text markup split between the lexer and the parser proper.)
The big pain point with block level parsing are nesting constructs, especially mixed line-level nesting constructs. For example:
A paragraph. > A nested paragraph > > A further nested paragraph, > > with running multi-line text. > * A list. > * With a nested one. > but this is continued. > * another list entry. > > == And a nested header > (Yes, people do this.) Really.
In a way, the important thing here is the nested paragraph with running multi-line text. Parsing this wikitext must reassemble the paragraph as a single entity despite there being non-paragraph tokens between the end of one line and the start of its text on the next line. The traditional solution (as used in things like Python) is to have the lexer swallow the line level tokens and generate synthetic tokens for changes in the level and type of nesting. The pain point is that putting this in your lexer is embedding a significant amount of your actual grammar and parsing in the lexer instead of the parser.
The simplification is to disallow complex nesting and have at most nesting that can be easily and relatively generically handled in the lexer (often this means whitespace based only, perhaps just for nested lists). You still have a certain amount of the grammar in the lexer, but at least it's a simple chunk.
(This also neglects any whitespace based parsing you might want to do for things like writing tables in a natural looking plain text style.)
There is also a general simplification possible for many of these pain points: you can do only limited parsing through your formal grammar and then post-process the results after the basic parse has finished. For example you could deal with many or all of the text formatting issues by having your lexing and parsing process only insert marker nodes for potential style changes and so on. The postprocessing pass would then turn these into actual style AST nodes or demote them to plain text as appropriate (or swallow them as redundant, as in the case of italics inside link text inside text that was already italic).
(This approach is unsuitable to single pass 'straight to HTML' wikitext parsing, but you should be parsing to an AST anyways.)
Note that it's certainly possible to create wikitext dialects with all of these simplifications and then to parse them with relatively robust general techniques. But these dialects are clearly limited compared to a fuller wikitext and it's my strong belief that the more general wikitext is going to be better for users outside of limited domains.
(Disclaimer: it's quite possible that I'm missing various general parsing techniques that make some or all of these pain points go away (I have a reasonable but not completely deep exposure to parsing techniques, especially modern ones). If so I'd love to hear about them.)
How DWiki parses its wikitext (part 1)
I have a long-standing view that parsing wikitext is both hard and not discussed very much; if for some reason you have to write a wikitext parser, there are very few good examples to easily crib from. So I feel like describing at least some stuff about how DWiki parses its wikitext dialect.
(Most of current DWiki wikitext parsing is not my work; the core code was generously improved and rewritten by Daniel Martin many years ago.)
The first big thing is that DWiki actually effectively has two separate parsers. One parser handles embedded formatting in running text (things like fonts, links, and so on) and the other one handles all of the line oriented block level structures like paragraphs, headers, blockquotes, lists, etc. What makes it work is that the block level parser doesn't parse running text immediately for multi-line things like paragraphs; instead it does what I'll call 'paragraph fusion', where all of the lines of raw text are gathered up until the end of the paragraph and only then parsed together. Splitting up the parsing job this way significantly simplifies both sides and makes it much simpler to handle text formatting that spans across multiple physical lines (since you're processing all of the lines at once).
(If I was rewriting DWiki's parsing code from scratch I would completely separate the parser code for running text from the block level parser code.)
Both the block level parser and the running text parser are complicated beasts but the broad form of the running text parser is easier to describe. It's a highly optimized (by Daniel Martin) version of the standard regular expression based substitution process where you walk through the source string, find everything that needs you to do something special (either to change it into some sort of HTML markup or to quote it) and process them. There are a bunch of ad hoc complications that are handled in ad hoc code on the fly, which I've become convinced is the best approach. One thing worth noting is that this step at a time process means that the text parser can be invoked recursively in order to handle the text of a link (which is allowed to have wikitext markup).
(The need for ad hoc things in wikitext parsing really calls for a separate entry, but I'm convinced that a strict and regular wikitext dialect is not 'humane'.)
The running text parser is also the single biggest place where a faster language than Python would improve the code. The current code attempts to do as much as possible in regular expressions for the traditional Python reason and this creates much of its complexity in various ways. A straight reimplementation in eg Go could probably dispense with a lot of this in favour of simpler lexing.
PS: this is going to be an irregular series because to write entries I have to actually understand chunks of DWiki's wikitext rendering code and this takes serious work at this point. If I'm a smart person I'll write all sorts of comments in the source code during this process in addition to these entries.