Some patterns for easy to parse Domain Specific Languages in Python
Suppose that you're creating a small DSL and a Python program that needs to parse it (something that I've wound up doing several times) and you want to make this as easy on yourself as possible. Python doesn't currently have a really good solution for easy to write and use parsers, so the easiest way to go is to write one by hand and in the process design your DSL so that it's as easy to parse as possible. My experiences in this have led me to a number of design patterns for both the parser and for the DSL.
First, organize the larger structures of your DSL so that it can be parsed in what is effectively a pipeline of steps. Two things help this a great deal: allowing comments on any source line and having a generic line continuation convention, one independent of the actual DSL statement you're dealing with. Given both of these you can create an initial line-reader function that reads raw lines, strips comments and blank lines, and folds continued lines into single source lines. You then feed each single source line up the pipeline to your real parser.
The easiest DSL to parse in general is something that has no line to line state, where an entire directive or command or whatever is entirely contained on one (logical) line. This is where line continuations may come in handy. Unfortunately this generally isn't doable if you need multi-statement conditional logic or the like; try to avoid that if you can. If you can't, my opinion is that the easiest parser structure to write by hand is a recursive descent parser.
(It follows that declarative DSLs are often easier to deal with than procedural ones.)
The easiest sort of lines to parse in Python have two properties: they
use whitespace separators between words (tokens) and they are parsed
left to right. If you're dealing with a DSL that has no line to line
state, this leads to a very simple parsing approach; you
the line, look at the first word, and dispatch to a particular command
handler routine (giving it the remaining words). If you need slightly
more sophisticated argument handling, you can still preserve most of
this by having the 'command' word be whitespace separated from the rest
of the line; you can then use '
.split(None, 1)', look at the first
word, and hand the remainder of the line to the particular command
handler. Even if you have to use a recursive descent parser, whitespace
separated words remain the simplest thing to lex.
(Some people will say that regexp based parsing is easier. My view is that this is only true if you have a relatively simple regexp that everything adheres to.)
When doing this, it's very convenient to make the equivalent of variable
setting require an explicit keyword of some sort, even if it is just
set x = y'; this allows you to directly dispatch to the assignment
and/or expression handler instead of having to intuit that an otherwise
unrecognized keyword followed by an
= is a variable assignment. This
is more verbose than otherwise, but if you want a very simple parser you
need to optimize for the parser.
(And generally with a small DSL you will not be writing lots in the DSL anyways unless something odd is up.)
Note that you do not need a full recursive descent parser (and perhaps a lexer) for everything just to have expressions in your DSL. If you can isolate expressions off in some easily picked out contexts, you can parse most everything with the simple approach and only resort to RD parsing for actual expressions. I have done this and it worked fine.
Sometimes you really need interior whitespace in 'words' but you want to
still be able to use
.split(). What I tend to do is temporarily adopt
another separator character or character sequence; I've often used '
' (space colon space) or '
| ' for this, as in:
bad-robots Yahoo! Slurp | Ask Jeeves/Teoma | msnbot/
This also illustrates the 'command then rest of line' pattern.
(Sadly, there are probably other useful patterns that I'm just not thinking of at the moment.)