Wandering Thoughts archives

2014-11-02

A drawback in how DWiki parses its wikitext

In my initial installment on how DWiki parses its wikitext, I said that one important thing about DWiki is that it has two separate parsers:

[...] One parser handles embedded formatting in running text (things like fonts, links, and so on) and the other one handles all of the line oriented block level structures like paragraphs, headers, blockquotes, lists, etc. What makes it work is that the block level parser doesn't parse running text immediately for multi-line things like paragraphs; [...]

This sounds great and in general it works perfectly fine, but it turns out to impose one restriction on your wiki dialect: it can't support block level constructs that require looking ahead into running text. For the separation to work right, every block level construct has to be recognizable before you start parsing running text, which means that they all have to be marked at the start of the line.
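As a rough illustration of what "recognized at the start of the line" means, here is a minimal Python sketch of a block level classifier that never looks past a line's leading characters. The marker table and the name classify_line are made up for the example; they are not DWiki's actual syntax or code.

```python
# Hypothetical line-start markers, for illustration only; this is not
# DWiki's real syntax table.
BLOCK_PREFIXES = [
    ("= ", "header"),
    ("* ", "list item"),
    ("> ", "blockquote"),
]


def classify_line(line):
    """Return (block_kind, rest), looking only at the start of the line.

    `rest` is the untouched running text; it is handed to the running
    text parser later and is never inspected here.
    """
    if not line.strip():
        return ("blank", "")
    for prefix, kind in BLOCK_PREFIXES:
        if line.startswith(prefix):
            return (kind, line[len(prefix):])
    # Anything without a recognized marker is ordinary paragraph text.
    return ("paragraph", line)
```

Because the decision depends only on the prefix, nothing inside the running text (a ':' in a URL, say) can change which block a line belongs to.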

This doesn't sound like a particularly onerous restriction on your wikitext dialect, but it actually causes DWiki heartburn in one spot. In my wikitext dialect, definition lists are written as:

- first the <dt> text: And post-colon is the <dd> text.

This looks like a perfectly natural way to write a definition list entry, but written this way it requires the block level parser to look ahead into the line to find the ':' that separates the <dt> text from the <dd> text. Now suppose that you want to have a link to an outside website in the <dt> text, which of course is going to contain a ':' in the URL. Oops. Similar issues come up if you just want a ':' in the <dt> text for some reason. As a result, DWiki's parsing of definition lists basically disallows a lot of stuff in the <dt> text, which has led me to not use them very much.

(The other problem with this definition is that it restricts the <dt> text to a single line.)
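The colon problem is easy to reproduce in a small Python sketch. This assumes the block level parser simply splits at the first ':' on the line, which is a guess at the simplest possible approach and not necessarily what DWiki does; split_definition is a hypothetical name.

```python
def split_definition(line):
    """Naively split '- term: definition' at the first colon.

    Returns (dt_text, dd_text), or None if the line is not a
    definition list entry. This is an assumed, simplified rule.
    """
    if not line.startswith("- "):
        return None
    body = line[2:]
    dt, sep, dd = body.partition(":")
    if not sep:
        return None
    return (dt.strip(), dd.strip())


# The simple case works as intended:
#   split_definition("- term: the definition")
#   gives ("term", "the definition")
# A URL in the <dt> text is cut at its own colon:
#   split_definition("- http://example.com/: a site")
#   gives ("http", "//example.com/: a site")
```

The second call shows the failure: the split lands inside the URL because the block level code has no idea that the ':' it found is part of a link.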

I think that this may also cause problems for natural looking tables. Most of the ways of writing natural tables are going to rely on interior whitespace to create visible columns and thus demonstrate that this is a table. Looking ahead in what would otherwise be running text to spot runs of whitespace is less dangerous than trying to find a character in a line, but it still breaks this pure separation.

(I didn't think of this issue when I wrote my first entry with its enthusiastic praise; sadly it's a corner case that's easy to forget about most of the time.)

Sidebar: DWiki's table parsing also sort of breaks the rule

DWiki doesn't need to look ahead in running text to know that it's processing a table, but it does have to peek into the running text to find column dividers. This is at least impure, but I think it's less annoying than the definition list case; in practice the column dividers haven't naturally occurred in my table text so far. Still, it's not an easy problem and I'd like a better solution.

(One approach is to be able to tell the running text parser to stop if it runs into a certain character sequence in unquoted text. I think that this works best if you have an incremental parser for running text that can be fed input, parse it as much as possible, and then suspend itself to wait for more.)
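A simplified, non-incremental Python sketch of the "stop at a certain character sequence in unquoted text" idea might look like the following. It assumes that "quoted" text means anything inside a [[...]] span; that delimiter choice, and the name scan_until, are assumptions made for the example rather than DWiki's behavior.

```python
def scan_until(text, stop, quote_open="[[", quote_close="]]"):
    """Scan running text, stopping at the first `stop` outside quoted spans.

    Returns (before, after). `after` is None if `stop` never appears in
    unquoted text. The [[...]] quoting convention is an assumption.
    """
    i = 0
    while i < len(text):
        if text.startswith(quote_open, i):
            end = text.find(quote_close, i + len(quote_open))
            if end == -1:
                break  # unterminated quote: treat the rest as plain text
            i = end + len(quote_close)  # skip the whole quoted span
        elif text.startswith(stop, i):
            return (text[:i], text[i + len(stop):])
        else:
            i += 1
    return (text, None)


# A ':' inside the quoted link is skipped; the one after it is the stop:
#   scan_until("[[http://example.com/]]: a site", ":")
#   gives ("[[http://example.com/]]", " a site")
```

The same function could look for a table column divider by passing a different stop sequence. A truly incremental parser would also need to suspend and resume across lines, which this sketch does not attempt.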

programming/DWikiParsing02 written at 01:32:01


