More on regular expression performance
My problem with benchmarking is that it's tedious (and finicky). Good benchmarking mostly consists of twiddling my thumbs while my tests run, often yet again after some minor tweak or improvement or additional case. I can kill days running tests and staring at numbers and not really get anywhere (and sometimes I've had to).
All that said, after my entry on some regular expression performance surprises, I got curious enough to do more detailed micro-benchmarking. With Daniel Martin's assistance I was able to check Perl as well as Python, and we turned up some surprises.
What I looked at was various versions of constant or near-constant
alternates; the '
(FOO BAR|BAZ BLORP)' case from my earlier entry. I'm primarily interested in the 'miss'
case, as before, because it's typical for the sort of text scanning I
Things I've found:
- Perl's regexp engine is significantly faster than Python's in most
cases. This was especially apparent for simple constant text matches,
where Perl was five to ten times faster (depending on the test
- a constant alternative with a prefix, ie '
FOO (BAR|BAZ)', runs much faster than just a constant alternative, even when the constant alternative itself has a constant prefix. This is especially so in Perl, because:
- Perl is catastrophically bad at constant text alternatives;
they run at least 90 times slower than the prefix alternate case,
and significantly slower than in Python. This really surprised me;
I had remembered the situation being bad, but not that bad.
- Adding a constant text lookahead assertion to the constant
alternative case (ie, '
(?=FOO)(FOO BAR|FOO MORK)') doesn't make a huge improvement in Perl (at most halving the catastrophically bad time), and makes Python slower.
- In Python, prefix alternates run as fast as plain text and faster
than two separate regexps. In Perl it's a couple of times slower
than plain text and a couple of regexps were still faster.
- In Python, adding more constant alternatives seems to be costless, especially if they have a common prefix. (Python turns out to hoist common constant prefixes out of the alternates during regexp compilation.)
- In Perl, each additional alternative exacts a noticeable penalty
(although not a large one compared to the overall costs), more or
less the same whether or not they have a common prefix.
- Python is catastrophically worse at a pattern like
F+OO (BAR|MORK)', managing to run up to 40 times slower than Perl.
Python compiles regexps in Python code, so you can look at this and
examine several steps of the result. The most interesting thing to
look at is probably the output of
which shows how the parser has broken things down and optimized them.
(Apparently some additional optimization happens in the compiler too.)
I am a bit peeved at the performance differences between the Perl and Python regexp engines. It used to be that if a Python program was too slow, I could immediately conclude a Perl (or Ruby or etc) version would also be too slow, which simplified life. (Indeed, I could usually conclude that any regexp based version would be too slow, because I figured everyone had highly optimized regexp engines.)
Oh well. As I said myself, measure, don't guess.
(And thank you, Daniel Martin, for writing the Perl version of the benchmark tests.)
Sidebar: The experimental setup details
The test environments:
- Athlon XP 2100+, Fedora Core 2, Python 2.3.3, perl 5.8.3
- Intel Pentium 4 2.6GHz, Fedora Core 4, Python 2.4.2, perl 5.8.6