Some patterns for easy to parse Domain Specific Languages in Python

January 28, 2013

Suppose that you're creating a small DSL and a Python program that needs to parse it (something that I've wound up doing several times) and you want to make this as easy on yourself as possible. Python doesn't currently have a really good solution for easy to write and use parsers, so the easiest way to go is to write one by hand and in the process design your DSL so that it's as easy to parse as possible. My experiences in this have led me to a number of design patterns for both the parser and for the DSL.

First, organize the larger structures of your DSL so that it can be parsed in what is effectively a pipeline of steps. Two things help this a great deal: allowing comments on any source line and having a generic line continuation convention, one independent of the actual DSL statement you're dealing with. Given both of these you can create an initial line-reader function that reads raw lines, strips comments and blank lines, and folds continued lines into single source lines. You then feed each single source line up the pipeline to your real parser.

(I wrote up my approach to this here and here.)

The easiest DSL to parse in general is something that has no line to line state, where an entire directive or command or whatever is entirely contained on one (logical) line. This is where line continuations may come in handy. Unfortunately this generally isn't doable if you need multi-statement conditional logic or the like; try to avoid that if you can. If you can't, my opinion is that the easiest parser structure to write by hand is a recursive descent parser.

(It follows that declarative DSLs are often easier to deal with than procedural ones.)

The easiest sort of lines to parse in Python have two properties: they use whitespace separators between words (tokens) and they are parsed left to right. If you're dealing with a DSL that has no line to line state, this leads to a very simple parsing approach; you .split() the line, look at the first word, and dispatch to a particular command handler routine (giving it the remaining words). If you need slightly more sophisticated argument handling, you can still preserve most of this by having the 'command' word be whitespace separated from the rest of the line; you can then use '.split(None, 1)', look at the first word, and hand the remainder of the line to the particular command handler. Even if you have to use a recursive descent parser, whitespace separated words remain the simplest thing to lex.

(Some people will say that regexp based parsing is easier. My view is that this is only true if you have a relatively simple regexp that everything adheres to.)

When doing this, it's very convenient to make the equivalent of variable setting require an explicit keyword of some sort, even if it is just 'set x = y'; this allows you to directly dispatch to the assignment and/or expression handler instead of having to intuit that an otherwise unrecognized keyword followed by an = is a variable assignment. This is more verbose than otherwise, but if you want a very simple parser you need to optimize for the parser.

(And generally with a small DSL you will not be writing lots in the DSL anyways unless something odd is up.)

Note that you do not need a full recursive descent parser (and perhaps a lexer) for everything just to have expressions in your DSL. If you can isolate expressions off in some easily picked out contexts, you can parse most everything with the simple approach and only resort to RD parsing for actual expressions. I have done this and it worked fine.

Sometimes you really need interior whitespace in 'words' but you want to still be able to use .split(). What I tend to do is temporarily adopt another separator character or character sequence; I've often used ' : ' (space colon space) or ' | ' for this, as in:

bad-robots Yahoo! Slurp | Ask Jeeves/Teoma | msnbot/

This also illustrates the 'command then rest of line' pattern.

(Sadly, there are probably other useful patterns that I'm just not thinking of at the moment.)

Comments on this page:

From at 2013-01-29 11:48:59:

Python doesn't currently have a really good solution for easy to write and use parsers

Oh? Hard to believe for me. I assume you know PLY already. What's wrong with it? Except maybe for the fact that it seems to require being installed for using the parser, while yacc-like tools typically generate standalone parsing engine.


From at 2013-01-30 17:12:43:

Rather than just .split() you could use shlex.split() which solves the whitespace problem using a syntax that should be familiar to most people (POSIX shell).

By cks at 2013-02-07 18:26:42:

Department of belated replies: the honest answer is that I've never looked at PLY very much. Part of this is that it both has to be installed at runtime and generates the parser et al at runtime, and part of it is just my feeling that a full blown Yacc-level parser is overkill for simple languages (which may be a mistake).

I should really play around with it someday so I can put together an informed opinion on it.

Written on 28 January 2013.
« How the modern web 2.0 social web irritates me by hiding discussions
Thinking about FreeBSD versus Illumos for our ZFS fileservers »

Page tools: View Source, View Normal, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Mon Jan 28 22:52:11 2013
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.