(Not) parsing wikitext

November 12, 2011

I have a confession: I don't know how to parse wikitext. Oh, I know how to process it (using regular expressions and ad-hoc code), but I don't know how to write a formal parser for it, something based on parsing theory that gets all of the advantages of real parsers. Parsers matter because they're much easier to write, maintain, and reason about than ad-hoc code; ad-hoc processing code is ultimately just a big pile of mud. Complex, fragile mud.

I've been thinking about this for some time, and I've more or less concluded that there are three aspects of wikitext that make it hard to handle with traditional parsing techniques and tools. The first issue is that wikitext is a mixture of stream-oriented and line-oriented content. All of the parsing techniques that I learned in my now long-ago compilers course were built around stream-oriented parsing, where you simply have a running stream of tokens and the most line-oriented thing you have is that end of line is a delimiter. This is fine for parsing running text, but pretty much all wikitexts also have an entire second layer of line-by-line markup to signal things like headings, lists, quoted (or indented) text, preformatted text, and so on. You need to parse this second layer as well as the running text, and you generally can't parse each line's running text independently, because markup constructs in running text can span physical lines.
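To make the two-layer structure concrete, here's a minimal Python sketch of the line-level layer. The markup is hypothetical ('=' starts a heading, '*' a list item), not DWikiText's or anyone else's actual rules:

```python
def classify_line(line):
    """First layer: decide a line's block type from how it starts.
    Hypothetical markup: '=' heading, '*' list item, anything else
    is running text."""
    if line.startswith('='):
        return ('heading', line.lstrip('=').strip())
    if line.startswith('*'):
        return ('list-item', line[1:].strip())
    return ('paragraph', line)

def parse(text):
    """Classify each line, merging consecutive paragraph lines,
    since running text can span physical lines."""
    blocks = []
    for line in text.splitlines():
        kind, content = classify_line(line)
        if kind == 'paragraph' and blocks and blocks[-1][0] == 'paragraph':
            # Running text continues across the physical line break.
            blocks[-1] = ('paragraph', blocks[-1][1] + ' ' + content)
        else:
            blocks.append((kind, content))
    return blocks
```

Even this toy version has to merge consecutive paragraph lines, because the second (inline) layer can't treat each physical line independently.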

(Even Python is parsed with a straightforward stream-oriented parser under the hood; I wrote about it here and here. Unfortunately, the technique that Python uses can't be extended to parsing wikitext.)

I'm sure that there are parsing techniques for dealing with line-oriented languages (my impression is that a number of early pre-Algol languages were significantly line-oriented), but I didn't learn any such techniques, and I think they're now out of favour because everyone sensibly shifted to stream-oriented language designs decades ago.

The second big issue is that unlike a traditional language, wikitext doesn't have any errors. In a traditional parser, a parse production either succeeds or results in an error. In wikitext there are no errors, and something like an incomplete external link is simply raw text; '[[some text' is not an error, it is '[[some text'. The presence of errors allows parsers to avoid huge amounts of backtracking (or, equivalently, huge amounts of lookahead). The dominant forms of parsing today have very little lookahead, because people found that this was both possible and efficient for computer languages (and then people carefully fiddled with languages so that they didn't need lookahead). More advanced approaches like parsing expression grammars can tolerate more lookahead, but my impression is that even a PEG-based grammar struggles with wikitext's potentially massive backtracking.
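As a sketch of what 'no errors' forces on you, here's a small Python fragment for a hypothetical '[[...]]' link syntax. When the speculative link parse fails, the parser has to back up and re-emit the '[[' as ordinary characters:

```python
def parse_inline(text):
    """Inline parser where nothing is an error: a '[[' that never
    closes is emitted as plain text instead of being rejected."""
    out = []
    i = 0
    while i < len(text):
        if text.startswith('[[', i):
            end = text.find(']]', i + 2)
            if end != -1:
                out.append(('link', text[i + 2:end]))
                i = end + 2
                continue
            # No closing ']]' ahead: the speculative parse fails, so we
            # backtrack and treat '[[' as two ordinary characters.
        out.append(('text', text[i]))
        i += 1
    return out
```

Here the scan for ']]' is the unbounded lookahead: before committing to a link token, the parser may have to look arbitrarily far ahead, which is exactly what error-reporting grammars get to avoid.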

The third issue is that the proper meaning of a lot of wikitext is highly context dependent. For example, consider whitespace at the start of a line. In DWikiText the standard meaning of this is preformatted text, but if we're in a list context it instead means a continued list entry. Similar issues bedevil parsing the various font-change characters in running text, especially if you want to insist on proper nesting. I think that some of this can be handled naturally by things like parsing expression grammars, but I'm not sure that all of it can be.
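A minimal sketch of this context dependence, following the DWikiText rule described above (the function name and API here are mine for illustration, not DWiki's actual code):

```python
def classify_indented_line(line, in_list):
    """Leading whitespace means different things in different contexts:
    preformatted text normally, a continued list entry inside a list."""
    if line.startswith((' ', '\t')):
        return 'list-continuation' if in_list else 'preformatted'
    return 'normal'
```

The point is that the line alone doesn't determine its meaning; the parser has to carry the surrounding context along and consult it.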

(I read a paper from sweble.org that admitted they didn't handle determining what was and wasn't a font change in the parser itself, but instead deferred all of that to a post-processing pass over the resulting AST. Maybe it's still possible to do this in clear, systematic code instead of gradually devolving back to ad-hoc processing.)

Comments on this page:

From at 2011-11-12 04:20:35:

you might want to try Text::WikiText (http://search.cpan.org/dist/wikitext-perl/lib/Text/WikiText.pm); it is not Python but Perl, but it should do what you need to do without reinventing another wheel.

From at 2011-11-12 13:14:07:

I think parsing Markdown or wiki-markup "ad hoc" is the right thing to do and isn't so terrible. Especially if you've written them a few times so you have some idea what the trouble areas are.

You can think of it in terms of formal parser theory like lex/yacc; it's just that your tokens are generally only one character, so you might as well skip lex. And the parse structure is so free-form (anything nested inside anything else) that it's simpler to track things directly without recursion, which also allows silly wrong nesting like turning on italics, text, turning on bold, text, turning off italics, text, turning off bold. So it makes more sense to just parse character by character and toggle various states depending on what character you get and what other states you're in. Just don't try to use regexes.
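(A minimal Python sketch of this toggle-the-states approach, with '*' for bold and '_' for italics as assumed markup; note how mis-nested input renders without any error:)

```python
def render(text):
    """Character-by-character renderer that toggles font states as it
    goes, so mis-nested markup like '*a _b* c_' 'just works' instead
    of being rejected. '*' toggles bold, '_' toggles italics."""
    state = {'*': False, '_': False}
    tags = {'*': ('<b>', '</b>'), '_': ('<i>', '</i>')}
    out = []
    for ch in text:
        if ch in state:
            open_tag, close_tag = tags[ch]
            # Emit open or close depending on the current state, then flip it.
            out.append(close_tag if state[ch] else open_tag)
            state[ch] = not state[ch]
        else:
            out.append(ch)
    return ''.join(out)
```

render("*a _b* c_") happily produces "<b>a <i>b</b> c</i>": improperly nested, but never an error.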

For example here's one of my markdown-esque parsers in C: http://pastebin.com/qCGHtqvg

-- nothings (at some point many months ago my password stopped working)
