Some pain points of parsing wikitext (and simplifications that avoid them)

October 11, 2013

I've talked before in general about things that make parsing a general wikitext hard or at least inobvious but I've never given specific examples. Today I want to do so and as an associated issue talk about what simplifications and shortcuts you can take to make it easier.

First, let's talk about errors. Specifically, that in parsing general wikitext there aren't errors; you've always got to keep on producing output. In a wikitext dialect where * is the emphasis delimiter, consider the text 'This is *an error'. In a conventional parser the result of this is an error that boils down to 'unterminated *'. In a generally useful wikitext parser, this must either add an implicit terminating * or treat the visible * as just a literal character. What complicates this is that many conventional parsing techniques generate errors only well after the actual error point; for example, the literal error here might be 'unexpected end of text'. The good news here is that error recovery is actually a well studied thing so there are probably techniques that can be adopted even if you can't use a standard parser generator with its standard terrible error handling.

The two possible simplifications here are to either ignore the possibility of errors entirely or to declare that it's okay for errors to make the wiki page unreadable (yes, these have the same end result). The more restricted you make who can edit the wiki, the easier it will be to get away with this.

In a general wikitext, text style changes are a pain because they can nest and thus overlap. It's valid to have bold italics but '*~bold italics* bold~' either doesn't work or requires you to do peculiar things to synthetically generate a properly nested result. I'm not sure how many parser generators will allow you to write a recursive general production rule for this and how many will require you to write a whole series of restrictive sub-rules for 'text but no <whatever> markers'. Relatedly, text style changes can nest inside other inline elements like the text for links. As a result, sorting out how to handle the following is going to be interesting:

This is *italic [[(with an *italic* link) http://somewhere]] text*.

The simplification is to disallow all nested style changes. Plain text can be set in only one style and all link text and other embedded text doesn't allow style changes at all.

(Related to this is text formatting that actually has the effect of turning off formatting characters to quote what it encloses. The traditional parsing approach pushes such quoting into the lexer but this leaves you with your text markup split between the lexer and the parser proper.)

The big pain point with block level parsing are nesting constructs, especially mixed line-level nesting constructs. For example:

A paragraph.
> A nested paragraph
> > A further nested paragraph,
> > with running multi-line text.
> * A list.
>   * With a nested one.
>   but this is continued.
> * another list entry.
> == And a nested header
> (Yes, people do this.)

In a way, the important thing here is the nested paragraph with running multi-line text. Parsing this wikitext must reassemble the paragraph as a single entity despite there being non-paragraph tokens between the end of one line and the start of its text on the next line. The traditional solution (as used in things like Python) is to have the lexer swallow the line level tokens and generate synthetic tokens for changes in the level and type of nesting. The pain point is that putting this in your lexer is embedding a significant amount of your actual grammar and parsing in the lexer instead of the parser.

The simplification is to disallow complex nesting and have at most nesting that can be easily and relatively generically handled in the lexer (often this means whitespace based only, perhaps just for nested lists). You still have a certain amount of the grammar in the lexer, but at least it's a simple chunk.

(This also neglects any whitespace based parsing you might want to do for things like writing tables in a natural looking plain text style.)

There is also a general simplification possible for many of these pain points: you can do only limited parsing through your formal grammar and then post-process the results after the basic parse has finished. For example you could deal with many or all of the text formatting issues by having your lexing and parsing process only insert marker nodes for potential style changes and so on. The postprocessing pass would then turn these into actual style AST nodes or demote them to plain text as appropriate (or swallow them as redundant, as in the case of italics inside link text inside text that was already italic).

(This approach is unsuitable to single pass 'straight to HTML' wikitext parsing, but you should be parsing to an AST anyways.)

Note that it's certainly possible to create wikitext dialects with all of these simplifications and then to parse them with relatively robust general techniques. But these dialects are clearly limited compared to a fuller wikitext and it's my strong belief that the more general wikitext is going to be better for users outside of limited domains.

(Disclaimer: it's quite possible that I'm missing various general parsing techniques that make some or all of these pain points go away (I have a reasonable but not completely deep exposure to parsing techniques, especially modern ones). If so I'd love to hear about them.)

Written on 11 October 2013.
« How DWiki parses its wikitext (part 1)
Revisiting some bits of ZFS's ZIL with separate log devices »

Page tools: View Source, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Fri Oct 11 23:15:40 2013
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.