My problem with benchmarking is that it's tedious (and finicky). Good
benchmarking mostly consists of twiddling my thumbs while my tests run,
often yet again after some minor tweak or improvement or additional
case. I can kill days running tests and staring at numbers and not really
get anywhere (and sometimes I've had to).
All that said, after my entry on some regular expression performance
surprises, I got curious enough
to do more detailed micro-benchmarking. With Daniel Martin's assistance I was able to check Perl as
well as Python, and we turned up some surprises.
What I looked at was various versions of constant or near-constant
alternates; the '
(FOO BAR|BAZ BLORP)' case from my earlier entry. I'm primarily interested in the 'miss'
case, as before, because it's typical for the sort of text scanning I
Things I've found:
- Perl's regexp engine is significantly faster than Python's in most
cases. This was especially apparent for simple constant text matches,
where Perl was five to ten times faster (depending on the test
- a constant alternative with a prefix, ie '
FOO (BAR|BAZ)', runs
much faster than just a constant alternative, even when the constant
alternative itself has a constant prefix. This is especially so in
- Perl is catastrophically bad at constant text alternatives;
they run at least 90 times slower than the prefix alternate case,
and significantly slower than in Python. This really surprised me;
I had remembered the situation being bad, but not that bad.
- Adding a constant text lookahead assertion to the constant
alternative case (ie, '
(?=FOO)(FOO BAR|FOO MORK)') doesn't make a
huge improvement in Perl (at most halving the catastrophically bad
time), and makes Python slower.
- In Python, prefix alternates run as fast as plain text and faster
than two separate regexps. In Perl it's a couple of times slower
than plain text and a couple of regexps were still faster.
- In Python, adding more constant alternatives seems to be costless,
especially if they have a common prefix. (Python turns out to
hoist common constant prefixes out of the alternates during regexp
- In Perl, each additional alternative exacts a noticeable penalty
(although not a large one compared to the overall costs), more or
less the same whether or not they have a common prefix.
- Python is catastrophically worse at a pattern like
F+OO (BAR|MORK)', managing to run up to 40 times slower than
Python compiles regexps in Python code, so you can look at this and
examine several steps of the result. The most interesting thing to
look at is probably the output of
which shows how the parser has broken things down and optimized them.
(Apparently some additional optimization happens in the compiler too.)
I am a bit peeved at the performance differences between the Perl and
Python regexp engines. It used to be that if a Python program was too
slow, I could immediately conclude a Perl (or Ruby or etc) version
would also be too slow, which simplified life. (Indeed, I could usually
conclude that any regexp based version would be too slow, because I
figured everyone had highly optimized regexp engines.)
Oh well. As I said myself, measure, don't guess.
(And thank you, Daniel Martin, for writing the Perl version
of the benchmark tests.)
Sidebar: The experimental setup details
The test environments:
- Athlon XP 2100+, Fedora Core 2, Python 2.3.3, perl 5.8.3
- Intel Pentium 4 2.6GHz, Fedora Core 4, Python 2.4.2, perl 5.8.6
I've put up the rebench.py and
rebench.pl test programs, if
anyone wants to test themselves.