More on regular expression performance

February 20, 2006

My problem with benchmarking is that it's tedious (and finicky). Good benchmarking mostly consists of twiddling my thumbs while my tests run, often yet again after some minor tweak or improvement or additional case. I can kill days running tests and staring at numbers and not really get anywhere (and sometimes I've had to).

All that said, after my entry on some regular expression performance surprises, I got curious enough to do more detailed micro-benchmarking. With Daniel Martin's assistance I was able to check Perl as well as Python, and we turned up some surprises.

What I looked at was various versions of constant or near-constant alternates; the '(FOO BAR|BAZ BLORP)' case from my earlier entry. I'm primarily interested in the 'miss' case, as before, because it's typical for the sort of text scanning I do.

Things I've found:

  • Perl's regexp engine is significantly faster than Python's in most cases. This was especially apparent for simple constant text matches, where Perl was five to ten times faster (depending on the test machine).

  • a constant alternative with a prefix, ie 'FOO (BAR|BAZ)', runs much faster than just a constant alternative, even when the constant alternative itself has a constant prefix. This is especially so in Perl, because:

  • Perl is catastrophically bad at constant text alternatives; they run at least 90 times slower than the prefix alternate case, and significantly slower than in Python. This really surprised me; I had remembered the situation being bad, but not that bad.

  • Adding a constant text lookahead assertion to the constant alternative case (ie, '(?=FOO)(FOO BAR|FOO MORK)') doesn't make a huge improvement in Perl (at most halving the catastrophically bad time), and makes Python slower.

  • In Python, prefix alternates run as fast as plain text and faster than two separate regexps. In Perl it's a couple of times slower than plain text and a couple of regexps were still faster.

  • In Python, adding more constant alternatives seems to be costless, especially if they have a common prefix. (Python turns out to hoist common constant prefixes out of the alternates during regexp compilation.)
  • In Perl, each additional alternative exacts a noticeable penalty (although not a large one compared to the overall costs), more or less the same whether or not they have a common prefix.

  • Python is catastrophically worse at a pattern like 'F+OO (BAR|MORK)', managing to run up to 40 times slower than Perl.

Python compiles regexps in Python code, so you can look at this and examine several steps of the result. The most interesting thing to look at is probably the output of sre_parse.parse("<re string>"), which shows how the parser has broken things down and optimized them. (Apparently some additional optimization happens in the compiler too.)

I am a bit peeved at the performance differences between the Perl and Python regexp engines. It used to be that if a Python program was too slow, I could immediately conclude a Perl (or Ruby or etc) version would also be too slow, which simplified life. (Indeed, I could usually conclude that any regexp based version would be too slow, because I figured everyone had highly optimized regexp engines.)

Oh well. As I said myself, measure, don't guess.

(And thank you, Daniel Martin, for writing the Perl version of the benchmark tests.)

Sidebar: The experimental setup details

The test environments:

  1. Athlon XP 2100+, Fedora Core 2, Python 2.3.3, perl 5.8.3
  2. Intel Pentium 4 2.6GHz, Fedora Core 4, Python 2.4.2, perl 5.8.6

I've put up the rebench.py and rebench.pl test programs, if anyone wants to test themselves.


Comments on this page:

By DanielMartin at 2006-02-21 07:55:57:

The only thing I'd add to this is the actual tables of results themselves. It's one thing to say that a regexp of (A|B) performs poorly; it's another thing entirely to stare the two-orders-of-magnitude different numbers in the face.

Unfortunately, I probably won't have time to make a pretty html page out of the results since work is about to take over my life for the next week, but I think that having the numbers presented well would help.

By cks at 2006-02-23 13:45:56:

I thought about trying to put some sort of table of results in, but couldn't come up with a layout and a set of data to include that I really liked when I was writing the entry. Perhaps it really needs graphs, in which case DWiki needs to grow inline images or something.

My other concern is that some of the effects seem to be CPU and/or version dependant, and I felt nervous about throwing in exact numbers without either lots of qualifications (and possibly extra ones) or figuring out what was the important variable. (That's why the entry is loaded with qualifications like 'at least' or very large and vague ranges.)

From 84.58.50.182 at 2006-02-26 20:33:56:

The poor performance of alternations has been addressed to a certain extent in perl 5.9.3. Id really like to see the perl benchmarks redone using blead perl, as it would give a good idea of the speedup that can be expected in perl 5.10. It should be considerable (proportional to the number of alternations). There is also room for more improvement there (aho-corasick matching), but the hairyness of the engine got to me, and I havent got back into it to complete it. Also, you should add pure fail cases to your test suites, for short, medium, long and very long strings. I think perl will do even better in those cases than the others. With blead performing much better. -- demerphq

By cks at 2006-02-27 03:38:15:

I think I wasn't explicit enough in the entry (and the first entry): most of the measurements and ratios were for the pure miss / fail case. (Pure fail cases are important for spam scanners, among other things.)

The current rebench testbeds use strings about 22K long, for various fuzzy reasons including that I believe it's roughly typical for what I work with. (Also, I had to pick some length, because I wasn't interested enough to run a huge cross-matrix and analyze the results.)

From 67.169.137.35 at 2007-02-01 15:53:38:

Ok I'm coming across this long after you wrote it- but for some explanation of what's going on (and incidently, why grep's regular expression does not choke on this) - see http://swtch.com/~rsc/regexp/regexp1.html

Written on 20 February 2006.
« Irony in a Referer spammer
An annoyance with $PATH on Red Hat »

Page tools: View Source, View Normal, Add Comment.
Search:
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Mon Feb 20 01:34:31 2006
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.