Some regular expression performance surprises

February 17, 2006

Part of the fun of performance tuning is all of the unexpected things you find out. Since I spent some of yesterday working in the guts of DWiki's DWikiText to HTML conversion, what I've been finding are some regular expression performance surprises.

My big surprise was the relative performance of '[A-Z]', '[A-Z][A-Z]*', and '[A-Z]+' for scanning text where they didn't match (the real character class is more complex). I expected them to be about the same speed; instead, the version using + is at least twice as slow.
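
A quick sketch of how to see this for yourself (the real DWiki character class is more complex, but plain '[A-Z]' shows the same effect; the text and iteration counts here are arbitrary):

```python
import re
import timeit

# Text that none of the patterns will ever match.
text = "a" * 10000

for pat in ("[A-Z]", "[A-Z][A-Z]*", "[A-Z]+"):
    rx = re.compile(pat)
    # Time repeated non-matching searches over the same text.
    secs = timeit.timeit(lambda: rx.search(text), number=1000)
    print("%-12s %.3fs" % (pat, secs))
```

All three patterns match exactly the same strings; only the scanning speed differs.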

Hopefully everyone reading this knows that the Perl regexp engine (which is more or less what's used in Python) checks '|' alternatives strictly left to right, one after the other. Among other things, this means that it is really worth your while to put common alternatives first.
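
The left-to-right rule is easy to demonstrate:

```python
import re

# '|' alternatives are tried strictly left to right; the first branch
# that matches wins, even if a later branch would match more text.
m = re.match("a|ab", "ab")
print(m.group())  # -> 'a'
```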

Update: Daniel Martin has pointed out that Python doesn't actually use the Perl regexp engine, just the Perl regexp syntax and semantics. The CPython _sre regexp engine was independently written by Secret Labs AB. My (bad) error.

But what I suspect is less well known is something I stumbled over some years ago, when looking at the performance of our Usenet anti-spam filters: constant alternatives are significantly slower than separate regular expressions (for non-matching searches, but I suspect that this holds for matching searches too).

Given:

re1 = re.compile("FOO BAR")
re2 = re.compile("BAZ BLORP")
re3 = re.compile("(?:FOO BAR|BAZ BLORP)")

It appears that current Pythons run 're1.search(tstr) or re2.search(tstr)' two or more times faster than 're3.search(tstr)'. The speed advantage grows as the size of the test string increases; using '(...)' instead of '(?:...)' performs no worse.
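
My actual test harness isn't shown here, but a minimal sketch of the measurement might look like this (the test string and repeat count are arbitrary):

```python
import re
import timeit

re1 = re.compile("FOO BAR")
re2 = re.compile("BAZ BLORP")
re3 = re.compile("(?:FOO BAR|BAZ BLORP)")

tstr = "a" * 20000  # a non-matching test string

t_separate = timeit.timeit(lambda: re1.search(tstr) or re2.search(tstr),
                           number=200)
t_combined = timeit.timeit(lambda: re3.search(tstr), number=200)
print("separate: %.3fs  combined: %.3fs" % (t_separate, t_combined))
```

The two approaches are interchangeable as far as results go; only the speed differs.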

The results flip if the matches are not constant; '(?:[A-Z] FOO BAR|[A-Z] BAZ BLORP)' is faster than checking '[A-Z] FOO BAR' and '[A-Z] BAZ BLORP' separately. This suggests, not too surprisingly, that the regexp engine has highly tuned constant string matching code that is not triggered by plain string alternatives.

(And '[A-Z] (?:FOO BAR|BAZ BLORP)' is, unsurprisingly, faster still. Also, you can't try to cleverly convert the non-constant versions to constant ones by making the prefix a positive lookbehind assertion, eg '(?<=[A-Z] )FOO BAR'; in fact, you lose significant performance.)
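
To be clear about what the lookbehind version does (the sample string here is made up): it finds the same occurrences, but the prefix is excluded from the matched text, and despite looking like a constant-string search it is significantly slower.

```python
import re

plain = re.compile("[A-Z] FOO BAR")
behind = re.compile("(?<=[A-Z] )FOO BAR")

s = "some text Q FOO BAR and more"
# Both find the same occurrence, but the lookbehind version excludes
# the one-character prefix from the match itself.
print(plain.search(s).group())   # -> 'Q FOO BAR'
print(behind.search(s).group())  # -> 'FOO BAR'
```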

This leads to the obvious moral, which is also the first rule of performance tuning: measure, don't guess.

(I'm still kind of bummed about the constant strings alternative performance drop, but I suppose if I were really annoyed I would be contributing patches to fix it.)


Comments on this page:

By DanielMartin at 2006-02-17 07:12:21:

You should be explicit about what Python version you're using; many people will assume you mean Python 2.4, and I suspect that you don't.

Incidentally, my testing with Python 2.4 showed that regexp.match(matchstr) was about the same for the patterns xx* and x+ both when it matched and when it didn't, but that on regexp.search(matchstr) without a match at the beginning of matchstr, the pattern x+ was much slower than xx*.

Also, a minor quibble:

Hopefully everyone reading this knows that the Perl regexp engine (which is more or less what's used in Python) checks '|' alternatives strictly left to right, one after the other.

You don't really mean that. You mean to say:

Hopefully everyone reading this knows that the Perl regexp grammar (which is more or less what's used in Python) specifies that '|' alternatives match strictly left to right, in order.

The Python engine likely has nothing to do with the perl engine, though there may be some amount of borrowing back-and-forth in matching algorithms. Among other things, the python engine - in 2.4.1 at least - parses the regular expression entirely in python, and uses C only for the matching code. Also, the speed differences are staggering:

$ python retimetest.py
x+.search on a string with a match in the middle
time 9.353
xx*.search on a string with a match in the middle
time 3.906
x+.search on a string with a match at the beginning
time 3.495
xx*.search on a string with a match at the beginning
time 3.525
x+.search on a string without match
time 14.641
xx*.search on a string without match
time 1.852

$ perl retimetest.pl
x+.search on a string with a match in the middle
time 1.158567
xx*.search on a string with a match in the middle
time 1.241344
x+.search on a string with a match at the beginning
time 0.980984
xx*.search on a string with a match at the beginning
time 1.050801
x+.search on a string without a match
time 0.960931
xx*.search on a string without a match
time 0.970954

That's timing a million matches each. Tests done on a reasonably up-to-date cygwin install with python 2.4.1 and perl 5.8.7. The strings used were 'a' * 100 + 'x' * 20 + 'a' * 100, 'x' * 20 + 'a' * 100, and 'a' * 200, respectively.
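
(My retimetest.py isn't reproduced above; a minimal reconstruction using those strings might look like the following. The names and structure are guesses, and I've dropped the repeat count well below a million so it finishes quickly.)

```python
import re
import time

CASES = [
    ("a string with a match in the middle",
     "a" * 100 + "x" * 20 + "a" * 100),
    ("a string with a match at the beginning",
     "x" * 20 + "a" * 100),
    ("a string without a match",
     "a" * 200),
]

def bench(pattern, text, n=100000):
    # The original tests timed a million searches; n is reduced here.
    rx = re.compile(pattern)
    t0 = time.time()
    for _ in range(n):
        rx.search(text)
    return time.time() - t0

for label, text in CASES:
    for pat in ("x+", "xx*"):
        print("%s.search on %s" % (pat, label))
        print("time %.3f" % bench(pat, text))
```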

I'll note that my intuition about how regular expressions should perform closely matches the perl results, but then I'm a perl programmer. Right-on about the moral, though. Always benchmark, profile, and otherwise measure performance - don't assume.

By cks at 2006-02-17 16:19:30:

The results hold for Python 2.3.3 (Fedora Core 2) and Python 2.4.2 on both a FreeBSD machine and Fedora Core 4. I didn't try to put in relative performance figures because I suspect that they're also getting affected by the different sorts of CPUs involved.

I'm not surprised that regexp .match() is fast; it shouldn't have to look at too many characters, since it's implicitly anchored. I am disappointed that the Python regexp engine is so much slower than the Perl one.

(I tried to figure out what was going on with +, but the engine made my head hurt. It looks like + is made into a very general operation that has a fair amount of overhead.)

Also, it looks like the constant strings and | performance issue is there in Perl as well as Python. If anything it seems to be stronger, perhaps because Perl has better optimizations for simple constant string matches, although I'm not entirely convinced I trust my test program right now.

By DanielMartin at 2006-02-18 08:14:44:

(I tried to figure out what was going on with +, but the engine made my head hurt. It looks like + is made into a very general operation that has a fair amount of overhead.)

Indeed, it's fairly clear from the parser that ?, +, *, {n}, and {n,m} are all compiled into the same structure. (Either a MAX_REPEAT pattern or a MIN_REPEAT pattern, depending on whether the match is greedy) What happens with that structure is not so clear, since the matching engine is written in what can only be described charitably as "obscure C". Actually, it might not be so bad if I were to sit down with a programmer who knew the engine and could be given a tour of the code, but that's not happening.

Also, note that the pattern x{1}x* performs exactly like the pattern x+ in my time tests. (and much slower than xx*) This points to the conclusion that there is some huge overhead in the MAX_REPEAT token when it is searching for a match.
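
(The semantic equivalence of the three spellings is easy to check, even though the compiled forms perform so differently; the test string here is arbitrary.)

```python
import re

s = "a" * 200 + "xxxx" + "a" * 10
# All three patterns match the same four-character run of x's.
spans = [re.search(p, s).span() for p in ("x+", "x{1}x*", "xx*")]
print(spans)
```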

