A fun little regular expression bug
The problem with regular expressions is the same problem as computer programming in general: the computer will faithfully do exactly what you told it to do, regardless of whether or not this was what you actually wanted.
(The other problem with regular expressions is that they are a crappy programming language, not in power (they've got plenty of power) but in terms of being able to read and write them. (Okay, one of the other problems with regexps. There are more.))
DWiki had an interesting bug recently that perfectly illustrates this. As part of parsing DWikiText, I need to recognize things that look like this:
- foo bar: more text
The simple regular expression to match this and to group the 'foo bar' and the 'more text' bits is:
^-\s+(.+):\s+(.+)
However, this matches too much if you have a 'blah: more' in the 'more text' bit, because regular expressions are greedy. Fine, no problem:
^-\s+([^:]+):\s+(.+)
And that's what DWiki had for a long time. But the other day I discovered that this has a bug; it doesn't match the following:
- [[foo http://bar]]: baz.
This is because the regular expression as written requires that the 'foo bar' portion not merely have no ': ' bits in it, but that it have no colons in it at all. (If it does have a ':' but no following space, the group ends but the following ': ' required by the regexp isn't there.)
My quick on the spot solution was to club in an exception:
^-\s+(([^:]|:(?!\s))+):\s+(.+)
(This says that the first group is allowed to contain colons, as long as they are not followed by a space. Yes, I could probably write that as ':\S' instead. I was in a hurry; I count it lucky that I wrote '\s' instead of just ' '.)
I did it this way because I was very much in a quick fix, patch the existing regexp mode at the time. In retrospect, the whole thing might be better written using a non-greedy regexp match:
^-\s+(.+?):\s+(.+)
Of course, it would be nice if I could be confidant that this performs as well as the other version, but as I've found out before I can't be unless I measure it.
(Disclaimer: the regexps are slightly simplified from the real ones that DWiki uses, because the real ones are encrusted with some internal concerns that obfuscate them a bit.)
|
|