Why parsers matter
October 22, 2012
One reaction to my problem with parsing wikitext is to ask why it matters. So what if I don't know how to properly parse WikiText; clearly lots of people (myself included) have written code to process wikitext dialects and the code works, even if it may not be the most elegant code and we may not have a nice formal model for it. If regexp based bashing works, just use regexp based bashing.
My answer is that parsing is ultimately here to tell us how to easily write code that handles some (computer) language. Things like lexers, recursive descent parsers, and so on reduce potentially hairy problems to known and readily implemented patterns; once you know the general tricks, the details for your particular language are usually just vaguely tedious. This has been a huge boon for our practical ability to process all sorts of computer 'languages'.
(In the process, parser theory has often delivered parsers that are efficient as well as easily written.)
This isn't just an issue of getting something that works, because you can get there with bashing things together repeatedly. There's two additional issues: getting your parser reliable and evolving it over time. By reliability, I don't just mean not crashing; for parsers, 'reliability' means that it correctly parses everything it should and doesn't accept anything it shouldn't. Reliability feeds into evolution over time, because ad hoc parsers are often fragile (as well as hard to work with); changes in one place (to change one aspect of what the parser handles) can have unintended consequences for what the parser accepts in general.
(Consider a really complicated regular expression, such as some of the monster ones that people use to recognize all or almost all valid forms of email addresses. These are so complex that it's very difficult to see how they work, what exactly they recognize, and how to change them to add another form of address without everything exploding.)
In short and as I put it succinctly in ParsingWikitext, parsers matter because they're much easier to write, to maintain, and to reason about than ad-hoc code; ad-hoc processing code is ultimately just a big pile of complex, fragile mud.
(Note that the difficulty of fully reasoning about an ad hoc parser is a large part of why it's hard to get them reliable and to change them over time. When you can't really fully understand the parser and what it will do, you have problems. And complex ad hoc parsing is almost always quite hard to understand.)
Every ad hoc parser that I have to write is a weak point in my code. Not only was it hard work to create, but it's something that's hard to reason about, hard to completely understand, and hard to change. All of those make me nervous and that's ultimately why I want to find how to parse these things properly.
* * *
Atom feeds are available; see the bottom of most pages.