Some regular expression performance surprises
Part of the fun of performance tuning is all of the unexpected things you find out. Since I spent some of yesterday doing work in the guts of DWiki's DWikiText to HTML conversion, what I've been finding is some regular expression performance surprises.
My big surprise was the relative performance of '[A-Z]
',
'[A-Z][A-Z]*
', and '[A-Z]+
' for scanning text where they didn't
match (the real character class is more complex). I expected them to be
about the same speed; instead, the version using +
is at least twice
as slow.
Hopefully everyone reading this knows that the Perl regexp engine
(which is more or less what's used in Python) checks '|
' alternatives
strictly left to right, one after the other. Among other things, this
means that it is really worth your while to put common alternatives
first.
Update: Daniel Martin has pointed out that Python doesn't actually
use the Perl regexp engine, just the Perl regexp syntax and semantics.
The CPython _sre
regexp engine was independantly written by
Secret Labs AB. My (bad) error.
But what I suspect is less well known is something I stumbled over some years ago, when looking at the performance of our Usenet anti-spam filters: constant alternatives are significantly slower than separate regular expressions (for non-matching searches, but I suspect that this holds for matching searches too).
Given:
re1 = re.compile("FOO BAR") re2 = re.compile("BAZ BLORP") re3 = re.compile("(?:FOO BAR|BAZ BLORP)")
It appears that current Pythons run 're1.search(tstr) or
re2.search(tstr)
' two or more times faster than 're3.search(tstr)
'.
The speed advantage grows as the size of the test string increases;
using '(...)
' instead of '(?:...)
' performs no worse.
The results flip if the matches are not constant; '(?:[A-Z] FOO
BAR|[A-Z] BAZ BLORF)
' is faster than checking '[A-Z] FOO BAR
' and
'[A-Z] BAZ BLORF
' separately. This suggests, not too surprisingly,
that the regexp engine has highly tuned constant string matching code
that is not triggered by plain string alternatives.
(And '[A-Z] (?:FOO BAR|BAZ BLORP)
' is, unsurprisingly, faster still.
Also, you can't try to cleverly convert the non-constant versions to
constant ones by making the prefix a positive lookbehind assertion, eg
'(?<=[A-Z] )FOO BAR
'; in fact, you lose significant performance.)
This leads to the obvious moral, which is also the first rule of performance tuning: measure, don't guess.
(I'm still kind of bummed about the constant strings alternative performance drop, but I suppose if I was really annoyed I should be contributing patches to fix it.)
Comments on this page:
|
|