Some regular expression performance surprises

February 17, 2006

Part of the fun of performance tuning is all of the unexpected things you find out. Since I spent some of yesterday doing work in the guts of DWiki's DWikiText to HTML conversion, what I've been finding is some regular expression performance surprises.

My big surprise was the relative performance of '[A-Z]', '[A-Z][A-Z]*', and '[A-Z]+' for scanning text where they didn't match (the real character class is more complex). I expected them to be about the same speed; instead, the version using + is at least twice as slow.

Hopefully everyone reading this knows that the Perl regexp engine (which is more or less what's used in Python) checks '|' alternatives strictly left to right, one after the other. Among other things, this means that it is really worth your while to put common alternatives first.

Update: Daniel Martin has pointed out that Python doesn't actually use the Perl regexp engine, just the Perl regexp syntax and semantics. The CPython _sre regexp engine was independantly written by Secret Labs AB. My (bad) error.

But what I suspect is less well known is something I stumbled over some years ago, when looking at the performance of our Usenet anti-spam filters: constant alternatives are significantly slower than separate regular expressions (for non-matching searches, but I suspect that this holds for matching searches too).

Given:

re1 = re.compile("FOO BAR")
re2 = re.compile("BAZ BLORP")
re3 = re.compile("(?:FOO BAR|BAZ BLORP)")

It appears that current Pythons run 're1.search(tstr) or re2.search(tstr)' two or more times faster than 're3.search(tstr)'. The speed advantage grows as the size of the test string increases; using '(...)' instead of '(?:...)' performs no worse.

The results flip if the matches are not constant; '(?:[A-Z] FOO BAR|[A-Z] BAZ BLORF)' is faster than checking '[A-Z] FOO BAR' and '[A-Z] BAZ BLORF' separately. This suggests, not too surprisingly, that the regexp engine has highly tuned constant string matching code that is not triggered by plain string alternatives.

(And '[A-Z] (?:FOO BAR|BAZ BLORP)' is, unsurprisingly, faster still. Also, you can't try to cleverly convert the non-constant versions to constant ones by making the prefix a positive lookbehind assertion, eg '(?<=[A-Z] )FOO BAR'; in fact, you lose significant performance.)

This leads to the obvious moral, which is also the first rule of performance tuning: measure, don't guess.

(I'm still kind of bummed about the constant strings alternative performance drop, but I suppose if I was really annoyed I should be contributing patches to fix it.)

Written on 17 February 2006.
« An interesting IDE to SATA migration problem
Stupid web spider tricks »

Page tools: View Source, Add Comment.
Search:
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Fri Feb 17 02:40:48 2006
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.