Wandering Thoughts archives

2013-01-31

Why you should support specifying comparison keys (and functions)

I've recently been taking another pass at optimizing DWiki's code a bit. In the process of profiling things, I noticed some code that was doing linear searches through an ordered list (for reasons that made sense at the time). What my code was doing just cried out for a better bisection-based approach, and it looked like I was in luck; not only does Python have a bisect module in the standard library but CPython actually has a C-level implementation for the best efficiency you could ask for.

Unfortunately I can't use it. You see, the people who created this module hard-coded everything to use direct comparisons, so the code is full of expressions like 'if x < a[mid]: ...', and what I want to bisect through is not comparable this way. Like many people, I have a list where each element is itself a tuple and I want to search through the list based on one field of those tuples.
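To make the problem concrete, here is a sketch (with made-up data) of what goes wrong when you bisect a list of tuples that is ordered on a field other than the first:

```python
import bisect

# Made-up data, ordered by the timestamp in field 1; the names in
# field 0 are deliberately not in sorted order.
entries = [("c.html", 100), ("a.html", 250), ("b.html", 900)]

# bisect compares whole tuples, which orders by name first, so this
# gives index 0 instead of the 1 that a timestamp search should.
idx = bisect.bisect_left(entries, ("", 250))
```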

The bisect module has two failures here. The direct failure is that it missed a fundamental rule of general Python code: any time you write a searching or sorting routine, you should support specifying the key's field. Ideally you'd support a general comparison function as well. The .sort() method on lists is a good example (although Python 3 drops the comparison function).
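For contrast, a quick illustration of .sort()'s key argument on exactly this sort of data (the file names and numbers are made up):

```python
from operator import itemgetter

# Sort a list of tuples on the second field instead of by direct
# tuple comparison.
pairs = [("b.html", 2), ("a.html", 3), ("c.html", 1)]
pairs.sort(key=itemgetter(1))
# pairs is now [("c.html", 1), ("b.html", 2), ("a.html", 3)]
```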

The indirect failure is that this code has misunderstood what the fundamental data structures of Python are. The bisect module would sort of make sense if the fundamental data structure was an object; then your bisectable lists would be made up of objects, your objects would have comparisons defined on them, and everything would be fine. But this is not the case. The fundamental data structures in Python are the primitive types: lists, tuples, and dictionaries. Almost no one bothers to make objects where one of these three will do. Any generic code that deals with ordered comparisons (searching, sorting, inserting in order in some data structure like a B-tree, and so on) should expect to be routinely handed these primitive types and to have people want to order things based on a particular field in them; failing to handle this very common situation is a failure in your code.

(Well, this is half the reason to support specifying the key's field. The other half is that it's perfectly reasonable to want to order the same objects in several different ways depending on what you're doing with them at the moment.)

Sidebar: getting around the bisect module with a club

Suppose that you are very irritated with the bisect module's limitation here and are willing to commit evil hacks. Then here is basic code for something that I will call a virtual Schwartzian transformation:

class VirtualList(object):
    # Present lst as a read-only sequence of lst[i][key] values, so
    # that bisect's direct comparisons only ever see the key field.
    def __init__(self, lst, key):
        self.lst = lst
        self.key = key

    def __len__(self):
        return len(self.lst)

    def __getitem__(self, idx):
        return self.lst[idx][self.key]

Take your list of whatever-it-is you have, make a virtual list with 'vlst = VirtualList(lst, KEY)', and then bisect vlst; the index position you get back is valid for lst. KEY does not have to be a number; it works perfectly fine to have a list of dictionaries (or more complex objects). This only really makes sense for bisection, binary searching, and so on, where you can have a large list but should only ever look at a few elements from it.
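As a concrete sketch (repeating the class so the snippet stands alone, with made-up data), bisecting on the second field of each tuple looks like this:

```python
import bisect

# Repeating the class from above so the snippet stands alone.
class VirtualList(object):
    def __init__(self, lst, key):
        self.lst = lst
        self.key = key

    def __len__(self):
        return len(self.lst)

    def __getitem__(self, idx):
        return self.lst[idx][self.key]

# Made-up data, ordered by the timestamp in field 1.
entries = [("c.html", 100), ("a.html", 250), ("b.html", 900)]
vlst = VirtualList(entries, 1)
idx = bisect.bisect_left(vlst, 250)
# idx is 1, and entries[idx] is ("a.html", 250)
```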

You can extend this approach to support custom comparison functions by using a dynamic class and materializing instances on demand, but I'll leave that as an exercise for the sufficiently interested and perverse reader.

AllowComparisonKeys written at 02:19:02

2013-01-28

Some patterns for easy to parse Domain Specific Languages in Python

Suppose that you're creating a small DSL and a Python program that needs to parse it (something that I've wound up doing several times) and you want to make this as easy on yourself as possible. Python doesn't currently have a really good solution for easy to write and use parsers, so the easiest way to go is to write one by hand and in the process design your DSL so that it's as easy to parse as possible. My experiences in this have led me to a number of design patterns for both the parser and for the DSL.

First, organize the larger structures of your DSL so that it can be parsed in what is effectively a pipeline of steps. Two things help this a great deal: allowing comments on any source line and having a generic line continuation convention, one independent of the actual DSL statement you're dealing with. Given both of these you can create an initial line-reader function that reads raw lines, strips comments and blank lines, and folds continued lines into single source lines. You then feed each single source line up the pipeline to your real parser.
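A minimal sketch of such a line-reader, assuming '#' comments and backslash continuations (neither convention is dictated by anything; pick your own):

```python
def logical_lines(lines):
    # Strip '#' comments and blank lines, and fold lines ending in a
    # backslash into the following line.
    pending = ""
    for raw in lines:
        line = raw.split("#", 1)[0].rstrip()
        if not line:
            continue
        if line.endswith("\\"):
            pending += line[:-1].rstrip() + " "
            continue
        if pending:
            line = line.lstrip()
        yield pending + line
        pending = ""
```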

(I wrote up my approach to this here and here.)

The easiest DSL to parse in general is something that has no line to line state, where an entire directive or command or whatever is entirely contained on one (logical) line. This is where line continuations may come in handy. Unfortunately this generally isn't doable if you need multi-statement conditional logic or the like; try to avoid that if you can. If you can't, my opinion is that the easiest parser structure to write by hand is a recursive descent parser.

(It follows that declarative DSLs are often easier to deal with than procedural ones.)

The easiest sort of lines to parse in Python have two properties: they use whitespace separators between words (tokens) and they are parsed left to right. If you're dealing with a DSL that has no line to line state, this leads to a very simple parsing approach; you .split() the line, look at the first word, and dispatch to a particular command handler routine (giving it the remaining words). If you need slightly more sophisticated argument handling, you can still preserve most of this by having the 'command' word be whitespace separated from the rest of the line; you can then use '.split(None, 1)', look at the first word, and hand the remainder of the line to the particular command handler. Even if you have to use a recursive descent parser, whitespace separated words remain the simplest thing to lex.
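A minimal sketch of this dispatch pattern, with a couple of hypothetical commands for illustration:

```python
def parse_line(line, handlers):
    # Split the logical line into whitespace-separated words, look at
    # the first word, and dispatch to that command's handler.
    words = line.split()
    cmd, args = words[0], words[1:]
    try:
        handler = handlers[cmd]
    except KeyError:
        raise ValueError("unknown command: %s" % cmd)
    return handler(args)

# Hypothetical commands, each just returning what it was given:
handlers = {
    "allow": lambda args: ("allow", args),
    "deny":  lambda args: ("deny", args),
}
```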

(Some people will say that regexp based parsing is easier. My view is that this is only true if you have a relatively simple regexp that everything adheres to.)

When doing this, it's very convenient to make the equivalent of variable setting require an explicit keyword of some sort, even if it is just 'set x = y'; this allows you to directly dispatch to the assignment and/or expression handler instead of having to intuit that an otherwise unrecognized keyword followed by an = is a variable assignment. This is more verbose than otherwise, but if you want a very simple parser you need to optimize for the parser.

(And generally with a small DSL you will not be writing lots in the DSL anyways unless something odd is up.)
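A sketch of handling a 'set' statement this way, assuming the dispatcher has already split off the 'set' keyword and hands us the rest of the line:

```python
def parse_set(rest, variables):
    # 'set NAME = VALUE': the 'set' keyword has already been consumed,
    # so we only have to pick the assignment itself apart. The value
    # is everything after the '=', interior whitespace included.
    name, eq, value = rest.split(None, 2)
    if eq != "=":
        raise ValueError("malformed set statement: %s" % rest)
    variables[name] = value
    return variables
```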

Note that you do not need a full recursive descent parser (and perhaps a lexer) for everything just to have expressions in your DSL. If you can isolate expressions off in some easily picked out contexts, you can parse most everything with the simple approach and only resort to RD parsing for actual expressions. I have done this and it worked fine.
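For illustration, here is a tiny recursive descent expression parser; the grammar (just '+' and '*' over integers, with parentheses) is made up for the example:

```python
import re

def tokenize(s):
    # Integers, operators, and parentheses; everything else is ignored
    # in this sketch.
    return re.findall(r"\d+|[+*()]", s)

def parse_expr(toks, i=0):
    # expr := term ('+' term)*
    val, i = parse_term(toks, i)
    while i < len(toks) and toks[i] == "+":
        rhs, i = parse_term(toks, i + 1)
        val += rhs
    return val, i

def parse_term(toks, i):
    # term := atom ('*' atom)*
    val, i = parse_atom(toks, i)
    while i < len(toks) and toks[i] == "*":
        rhs, i = parse_atom(toks, i + 1)
        val *= rhs
    return val, i

def parse_atom(toks, i):
    # atom := NUMBER | '(' expr ')'
    if toks[i] == "(":
        val, i = parse_expr(toks, i + 1)
        return val, i + 1   # skip the ')'
    return int(toks[i]), i + 1
```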

Sometimes you really need interior whitespace in 'words' but you want to still be able to use .split(). What I tend to do is temporarily adopt another separator character or character sequence; I've often used ' : ' (space colon space) or ' | ' for this, as in:

bad-robots Yahoo! Slurp | Ask Jeeves/Teoma | msnbot/

This also illustrates the 'command then rest of line' pattern.
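A sketch of splitting such a line, assuming the 'bad-robots' command word has already been peeled off by the dispatcher:

```python
def split_on(rest, sep=" | "):
    # Split the rest of the line on a multi-character separator so
    # that entries like 'Yahoo! Slurp' keep their interior whitespace.
    return [w.strip() for w in rest.split(sep)]
```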

(Sadly, there are probably other useful patterns that I'm just not thinking of at the moment.)

EasyDSLParsers written at 22:52:11

