Why I use separate lexers in my recursive descent parsers

May 14, 2023

I recently read Laurence Tratt's "Why Split Lexing and Parsing Into Two Separate Phases?", which answers the question its title asks. In the introduction, Tratt offhandedly remarks:

[...] Since some parsing approaches such as recursive descent parsing unify these phases, why do lex and yacc split them apart?

As someone who's written a number of recursive descent (RD) parsers over the years, I had a reaction on the Fediverse. You see, when I write recursive descent parsers, I always write a separate lexer.

The reason is pretty straightforward: I find it simpler to have a separate lexer. In the kind of little languages that I wind up writing RD parsers for, the lexer is generally simple and straightforward; it's often implemented with either a few regular expressions or some (sometimes nested) state machines, and it works by breaking the input text into tokens. The parser then gets the simplification of dealing only with tokens, without having to concern itself with exactly how they're represented in the text.
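To make this concrete, here's a minimal sketch in Python of the sort of regex-based lexer I mean. The token names and the little language's syntax are invented for illustration, and a real lexer would want more token types and better error reporting:

    import re
    from typing import Iterator, NamedTuple

    class Token(NamedTuple):
        kind: str
        text: str
        pos: int

    # One alternative per token type; exactly one named group matches.
    TOKEN_RE = re.compile(r"""
          (?P<NUMBER> \d+ )
        | (?P<IDENT>  [A-Za-z_]\w* )
        | (?P<OP>     [+\-*/()=] )
        | (?P<WS>     \s+ )
    """, re.VERBOSE)

    def lex(text: str) -> Iterator[Token]:
        pos = 0
        while pos < len(text):
            m = TOKEN_RE.match(text, pos)
            if m is None:
                raise ValueError("bad character at %d: %r" % (pos, text[pos]))
            if m.lastgroup != "WS":   # drop whitespace, keep real tokens
                yield Token(m.lastgroup, m.group(), pos)
            pos = m.end()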

(I have a general process for writing a recursive descent parser that goes well with this separation, but that in no way fits into the margins of this entry; here's the short version.)

One slightly subtle advantage of the split is that it's easier to test a lexer on its own, and in turn this tends to make it somewhat easier to test the parser. A fairly simple set of tests (for example) can give me pretty good lexer coverage and protect me against making stupid mistakes. Then testing and debugging the parser becomes much simpler because I don't have to worry that what looks like a parsing mistake is actually a lexing mistake.
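As an illustration, tests for the sketch lexer above can be little more than a table of inputs and the token kinds I expect back (these specific cases are made up):

    import unittest

    class LexerTests(unittest.TestCase):
        def test_token_kinds(self):
            cases = [
                ("a = 10", ["IDENT", "OP", "NUMBER"]),
                ("(x+y2)", ["OP", "IDENT", "OP", "IDENT", "OP"]),
            ]
            for text, kinds in cases:
                with self.subTest(text=text):
                    self.assertEqual([t.kind for t in lex(text)], kinds)

        def test_bad_character(self):
            # anything outside the token regexps should be rejected
            with self.assertRaises(ValueError):
                list(lex("a @ b"))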

(Typically I cheat by writing parser tests with the assumption that the lexer is working correctly, so I can get the token input for the parser by lexing (test) text. This is impure but much simpler than hand-crafting token sequences.)
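In code terms, the cheat looks something like this, where parse() stands in for a hypothetical parser that consumes tokens:

    # The impure shortcut: get the parser's token input by lexing test
    # text, relying on the lexer already being known-good. parse() is a
    # hypothetical parser that consumes an iterable of Tokens.
    def parse_text(text):
        return parse(lex(text))

    # The pure alternative is tedious and fragile by comparison:
    #   parse([Token("IDENT", "a", 0), Token("OP", "=", 2),
    #          Token("NUMBER", "10", 4)])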

Although this feels like the obvious, simplest approach to me, it turns out to be far from universal. Obviously there's Tratt's experience (and Tratt has much more of it than I do), and my Fediverse poll said that over half the people who write RD parsers use integrated lexing at least some of the time, with some of them always doing it.
