A fun little regular expression bug

August 6, 2006

The problem with regular expressions is the same problem as computer programming in general: the computer will faithfully do exactly what you told it to do, regardless of whether or not this was what you actually wanted.

(The other problem with regular expressions is that they are a crappy programming language, not in power (they've got plenty of power) but in terms of being able to read and write them. (Okay, one of the other problems with regexps. There are more.))

DWiki had an interesting bug recently that perfectly illustrates this. As part of parsing DWikiText, I need to recognize things that look like this:

- foo bar: more text

The simple regular expression to match this and to group the 'foo bar' and the 'more text' bits is:

^-\s+(.+):\s+(.+)

However, this matches too much if you have a 'blah: more' in the 'more text' bit, because regular expressions are greedy. Fine, no problem:

^-\s+([^:]+):\s+(.+)

And that's what DWiki had for a long time. But the other day I discovered that this has a bug; it doesn't match the following:

- [[foo http://bar]]: baz.

This is because the regular expression as written requires that the 'foo bar' portion not merely have no ': ' bits in it, but that it have no colons in it at all. (If it does have a ':' but no following space, the group ends but the following ': ' required by the regexp isn't there.)

My quick on the spot solution was to club in an exception:

^-\s+(([^:]|:(?!\s))+):\s+(.+)

(This says that the first group is allowed to contain colons, as long as they are not followed by a space. Yes, I could probably write that as ':\S' instead. I was in a hurry; I count it lucky that I wrote '\s' instead of just ' '.)

I did it this way because I was very much in a quick fix, patch the existing regexp mode at the time. In retrospect, the whole thing might be better written using a non-greedy regexp match:

^-\s+(.+?):\s+(.+)

Of course, it would be nice if I could be confidant that this performs as well as the other version, but as I've found out before I can't be unless I measure it.

(Disclaimer: the regexps are slightly simplified from the real ones that DWiki uses, because the real ones are encrusted with some internal concerns that obfuscate them a bit.)

Written on 06 August 2006.
« Weekly spam summary on August 5th, 2006
A problem in Python's implementation of closures »

Page tools: View Source, Add Comment.
Search:
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Sun Aug 6 13:00:21 2006
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.