Wandering Thoughts archives

2015-09-27

Wikitext not having errors creates a backtracking problem

In the past I've written about some of the pain points of parsing wikitext and called out how there are no conventional parsing errors in running wikitext, just things that turn out to be plain text instead of complete wikitext markup. Some of the consequences of this may not be obvious; in fact they weren't obvious to me until I tried to make an overly ambitious change to DWiki's markup-to-HTML conversion code.

The obvious problem that 'no errors' creates is that you have to either accept implicitly closing off incomplete markup or do lookahead to verify that you actually have a complete entity, or both. If your markup denotes links as '[[ ... ]]', you probably want to look ahead for a ']]' before you start processing a '[[' as a link. Unfortunately, doing lookahead correctly is quite hard if your wikitext permits various sorts of nested constructs. Consider DWikiText, which also has '(( ... ))' to quote uninterpreted text in a monospaced font, and then consider parsing the following:

This example [[looks like ((it has an ending ]])) but it doesn't.

Purely textual lookahead for a ']]' gets fooled here, because the ']]' it finds is really inside the '(( ... ))' literal text. So let's assume we're going to get fooled sooner or later and handle this better: rather than relying on fallible lookahead, if we reach the end of a paragraph with an unclosed entity, we go back to the start of the entity and turn it into plain text.
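(To make this concrete, here is a toy sketch in Python of what purely textual lookahead does here. This is illustrative code using the hypothetical markup above, not DWiki's actual parser.)

```python
def has_closing_link(text, start):
    # Naive textual lookahead: just scan forward for a ']]' somewhere
    # after the '[['.  This ignores any '(( ... ))' literal regions.
    return text.find(']]', start + 2) != -1

para = "This example [[looks like ((it has an ending ]])) but it doesn't."
start = para.index('[[')
# The ']]' it finds is inside the '(( ... ))' literal and shouldn't
# count, but the naive check can't tell.
print(has_closing_link(para, start))  # prints True, which is wrong here
```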

Unfortunately this has problems too, because something retroactively becoming plain text may change the meaning of other text after that point. Consider this contrived example:

Lorem ipsum ((this *should be emphasis* because the '((' isn't closed and thus is plain text.

If you start out parsing the (( as real markup, the *'s are plain text. But once the (( becomes just plain text, they should be creating italics for emphasis. To really retroactively change the (( to plain text, you may need to backtrack all text processing since that point and redo it. And backtracking is something conventional parsing technology is generally not designed for; in fact, it usually avoids backtracking like the plague (along with aggressive lookahead).
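(Here is a rough sketch of this rewind-and-reparse idea, again in Python with a hypothetical two-construct markup rather than DWiki's real code. When the renderer discovers a '((' with no closing '))', it restarts the whole parse with that '((' demoted to plain text, so the *emphasis* after it gets reinterpreted.)

```python
def render(text, plain_starts=frozenset()):
    # One rendering pass.  Returns (html, None) on success, or
    # (None, offset) if the '((' at offset turned out to be unclosed.
    out = []
    i = 0
    emph = False
    while i < len(text):
        if text.startswith('((', i) and i not in plain_starts:
            end = text.find('))', i + 2)
            if end == -1:
                return None, i          # unclosed literal: caller must reparse
            out.append('<code>' + text[i+2:end] + '</code>')
            i = end + 2
        elif text[i] == '*':
            out.append('</em>' if emph else '<em>')
            emph = not emph
            i += 1
        else:
            out.append(text[i])
            i += 1
    return ''.join(out), None

def render_with_backtracking(text):
    plain = set()
    while True:
        html, failed_at = render(text, frozenset(plain))
        if html is not None:
            return html
        plain.add(failed_at)            # demote the offending '((', redo everything

print(render_with_backtracking(
    "Lorem ipsum ((this *should be emphasis* because it is unclosed."))
# prints: Lorem ipsum ((this <em>should be emphasis</em> because it is unclosed.
```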

(I think the lookahead situation gets somewhat better if you look ahead in the token stream instead of in plain text, but it's still not great. You're basically parsing ahead of your actual parse, and you'd better keep both in sync. Backtracking your actual parsing is probably better.)

All of this has caused me to feel that parsing running wikitext in a single pass is not the best way to do it. Instead I have a multi-pass approach in mind (and have for some time), although I'm not entirely convinced it's right either. I probably won't know unless (and until) I actually implement it, which is probably unlikely.

(An alternate approach would be to simply have backtracking in a conventional recursive descent parser: every time you hit a 'parse error', the construct being parsed would turn its start token into plain text and parsing would continue from there. Unfortunately this feels like it could be vulnerable to pathological behavior, which is a real concern for a parser that may be handed user-controlled input in the form of eg comments.)
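(To see why the pathological behavior worries me, here is a toy measurement under the same hypothetical markup. Each unclosed '((' forces a restart, so n unclosed markers mean n+1 full passes over the text, which is quadratic work overall. This is a sketch, not a claim about any real parser.)

```python
def count_reparses(text):
    # Count how many full passes a restart-on-error parser makes over
    # the text before it succeeds.
    passes = 0
    plain = set()
    while True:
        passes += 1
        i = 0
        ok = True
        while i < len(text):
            if text.startswith('((', i) and i not in plain:
                end = text.find('))', i + 2)
                if end == -1:
                    plain.add(i)        # demote this '((' and start over
                    ok = False
                    break
                i = end + 2
            else:
                i += 1
        if ok:
            return passes

# 50 unclosed '((' markers force 51 passes over the text.
print(count_reparses('(( ' * 50))  # prints 51
```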

PS: How I stubbed my toe on this issue was basically trying to do this sort of 'convert back to plain text' for general unclosed font changes in DWikiText. When I did this outside of a limited context, it blew up in my face.

programming/WikitextNoErrorsBacktracking written at 02:10:30

