(Not) parsing wikitext
I have a confession: I don't know how to parse wikitext. Oh, I know how to process it (with regular expressions and ad-hoc code), but I don't know how to write a formal parser for it, something grounded in parsing theory so that it has all of the advantages of real parsers. Real parsers matter because they're much easier to write, maintain, and reason about than ad-hoc code; ad-hoc processing code is ultimately just a big pile of mud. Complex, fragile mud.
I've been thinking about this for some time, and I've more or less concluded that there are three aspects of wikitext that make it hard to handle with traditional parsing techniques and tools. The first issue is that wikitext is a mixture of stream-oriented and line-oriented content. All of the parsing techniques that I learned in my now long-ago compilers course were built around stream-oriented parsing, where you simply have a running stream of tokens and the most line-oriented thing you have is that end of line is a delimiter. This is fine for parsing running text, but pretty much every wikitext dialect also has an entire second layer of line-by-line markup to signal things like headings, lists, quoted (or indented) text, preformatted text, and so on. You need to parse this second layer as well as the running text, and you generally can't parse each line's running text independently because markup constructs in running text can span physical lines.
(Even Python is parsed with a straightforward stream-oriented parser under the hood; I wrote about it here and here. Unfortunately the technique that Python uses can't be extended to parsing wikitext.)
I'm sure that there are parsing techniques for dealing with line-oriented languages (my impression is that a number of early pre-Algol languages were significantly line oriented), but I didn't learn any such techniques and I think they're now out of favour because everyone sensibly shifted to stream-oriented language designs decades ago.
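To make the two-layer problem concrete, here's a toy sketch in Python. The line types and the '*emphasis*' inline syntax are invented for illustration, not any real wikitext dialect; the point is that line classification and inline parsing are separate layers, and the inline layer has to run over whole blocks because its markup can cross physical lines.

```python
# Layer one classifies physical lines; layer two parses running text.
import re

def classify(line):
    # Line-oriented layer: what kind of block does this line belong to?
    # (These line types are made up for the sketch.)
    if line.startswith('= '):
        return ('heading', line[2:].strip())
    if line.startswith('* '):
        return ('list-item', line[2:])
    return ('text', line)

def parse_inline(text):
    # Stream-oriented layer: split running text on *emphasis* markers.
    # This runs over a whole block's joined text, because inline markup
    # can span the physical lines that layer one already classified.
    return re.split(r'(\*[^*]+\*)', text)

def parse(source):
    blocks = []
    para = []
    for line in source.splitlines():
        kind, rest = classify(line)
        if kind == 'text':
            para.append(rest)  # accumulate a multi-line paragraph
        else:
            if para:
                blocks.append(('para', parse_inline(' '.join(para))))
                para = []
            blocks.append((kind, parse_inline(rest)))
    if para:
        blocks.append(('para', parse_inline(' '.join(para))))
    return blocks
```

Note that the '*bold' on one line and 'text*' on the next only become a single emphasis span because the paragraph is joined before the inline layer runs; classifying and parsing each line on its own would get this wrong.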
The second big issue is that unlike a traditional language, wikitext
doesn't have any errors. In a traditional parser, a parse production
either succeeds or results in an error. In wikitext, there are no errors
and something like an incomplete external link is simply raw text;
'[[some text' is not an error, it is simply the text '[[some text'. The presence
of errors allows parsers to avoid huge amounts of backtracking (or
equivalently, a huge amount of lookahead). The dominant forms of
parsing today have very little lookahead because people found that
that was both possible and efficient for computer languages (and then
people carefully fiddled with languages so that they didn't need
lookahead). More advanced approaches like parsing expression grammars can tolerate
more lookahead, but my impression is that even a PEG-based grammar
struggles with wikitext's potentially massive backtracking.
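Here's a toy Python sketch of the "no errors" property for the '[[' case above: an attempted parse of a '[[...]]' link that never finds its closing ']]' falls back to treating the '[[' as ordinary text. In a grammar-driven parser this fallback is exactly what forces backtracking or long lookahead; the hand-rolled scanner below just rescans from the '[[' instead.

```python
def tokenize(text):
    tokens = []
    i = 0
    while i < len(text):
        if text.startswith('[[', i):
            end = text.find(']]', i + 2)
            if end != -1:
                # Complete link: consume through the closing ']]'.
                tokens.append(('link', text[i + 2:end]))
                i = end + 2
                continue
            # Incomplete link: not an error, just the literal text '[['.
            tokens.append(('text', '[['))
            i += 2
            continue
        # Plain text up to the next possible '[[' (or end of input).
        nxt = text.find('[[', i)
        if nxt == -1:
            nxt = len(text)
        tokens.append(('text', text[i:nxt]))
        i = nxt
    return tokens
```

The `text.find(']]', ...)` call is doing unbounded lookahead, which is the thing a conventional LL or LR parser is built to avoid; that's the tension in a nutshell.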
The third issue is that the proper meaning of a lot of wikitext is highly context dependent. For example, consider whitespace at the start of a line. In DWikiText, the standard meaning of this is preformatted text, but if we're in a list context it instead means a continued list entry. Similar issues can bedevil parsing various font-change characters in running text, especially if you want to insist on proper nesting. I think that some of this can be handled naturally by things like parsing expression grammars, but I'm not sure that all of it can be.
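The leading-whitespace example can be sketched as a stateful line classifier in Python. The two-way context split below (in a list or not) is a deliberate simplification for illustration; real dialects have more contexts, and it's this state that a pure context-free grammar has trouble expressing.

```python
def classify_lines(lines):
    # The same leading whitespace means preformatted text at top level
    # but a continuation line inside a list: identical markup, different
    # meaning depending on the current context.
    kinds = []
    in_list = False
    for line in lines:
        if line.startswith('* '):
            in_list = True
            kinds.append('list-item')
        elif line.startswith(' '):
            kinds.append('list-continuation' if in_list else 'preformatted')
        else:
            in_list = False
            kinds.append('text')
    return kinds
```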
(I read a paper from sweble.org that admitted that they didn't determine what was and wasn't a font change in the parser itself, but instead deferred all of that to a post-processing pass over the resulting AST. Maybe it's still possible to do this in clear, systematic code instead of gradually devolving back to ad-hoc processing.)
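One way that deferred approach can stay systematic is sketched below, in Python. This is my guess at the shape of such a pass, not what sweble.org actually does: the parser emits every '' as a neutral marker token, and a post-processing pass over the token stream pairs markers up, demoting any unpaired marker back to literal text.

```python
def resolve_markers(tokens):
    # Pair up neutral ('marker', "''") tokens into italic on/off pairs;
    # an unpaired marker turns out to have been plain text all along.
    # The token shapes here are invented for the sketch.
    out = []
    open_marker = None  # index in `out` of an unmatched opening marker
    for tok in tokens:
        if tok == ('marker', "''"):
            if open_marker is None:
                out.append(tok)
                open_marker = len(out) - 1
            else:
                # Found the close: rewrite the pair as italics on/off.
                out[open_marker] = ('italic-on', "''")
                out.append(('italic-off', "''"))
                open_marker = None
        else:
            out.append(tok)
    if open_marker is not None:
        # No closing marker ever arrived: it was just text after all.
        out[open_marker] = ('text', "''")
    return out
```

The appeal of this structure is that the "no errors" fallback lives in one small, explicit pass instead of being smeared through the grammar.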