python/RegexpPerformanceSurprises written at 02:40:48; Add Comment
Some regular expression performance surprises
Part of the fun of performance tuning is all of the unexpected things you find out. Since I spent some of yesterday doing work in the guts of DWiki's DWikiText to HTML conversion, what I've been finding is some regular expression performance surprises.
My big surprise was the relative performance of '
Hopefully everyone reading this knows that the Perl regexp engine
(which is more or less what's used in Python) checks '
Update: Daniel Martin has pointed out that Python doesn't actually
use the Perl regexp engine, just the Perl regexp syntax and semantics.
But what I suspect is less well known is something I stumbled over some years ago, when looking at the performance of our Usenet anti-spam filters: constant alternatives are significantly slower than separate regular expressions (for non-matching searches, but I suspect that this holds for matching searches too).
re1 = re.compile("FOO BAR") re2 = re.compile("BAZ BLORP") re3 = re.compile("(?:FOO BAR|BAZ BLORP)")
It appears that current Pythons run '
The results flip if the matches are not constant; '
This leads to the obvious moral, which is also the first rule of performance tuning: measure, don't guess.
(I'm still kind of bummed about the constant strings alternative performance drop, but I suppose if I was really annoyed I should be contributing patches to fix it.)
* * *
Atom feeds are available; see the bottom of most pages.