2013-07-23
When Python regexp alternation is faster than separate regexps
In Eli Bendersky's Regex-based lexical analysis in Python and Javascript he wrote:
Shortly after I started using it, it was suggested that combining all the regexes into a single regex with alternation (the | regex syntax) and using named groups to know which one matched would make the lexer faster. [...]
This optimization makes the lexer more than twice as fast! [...]
At first I was going to write an entry explicitly noting that Bendersky had demonstrated that you should use regex alternation instead of multiple regexps. Then I reread my old entries on regexp performance and even reran my test program to see if Python 2 (or 3) had changed the picture and now I need to write a more complicated entry.
I will give you the quick summary: if you are using .match()
or some
other form of explicitly anchored search, using regexp alternation
is faster than separate regular expressions. If you are using an
unrestricted .search()
the situation is much murkier; it really
depends on how many regexps you have and possibly what you're searching
for. If you have a decent number of alternates it's probably faster to
use real regexp alternation. If you have only two or three it's quite
possible that using separate regexps will be a win.
Eli Bendersky's lexer is faster with regexp alternation for both
reasons; it uses .match()
and has a number of alternatives.
Current Python 2 and 3 regexp performance for .search()
appears
basically unchanged from my earlier entries (although Python 3.3.0 does
noticeably worse than Python 2 on the same machine). In particular the
number of regexp alternatives you use continues to have very little to
no effect on the performance of the resulting regular expression. There
is a performance penalty for .search()
with regexp alternation, even
if the first alternate matches (and even if it matches at the front of
the string).
(It appears that you pay a small penalty for .match()
if you use
regexp alternation but this penalty is utterly dwarfed by the penalty
of repeatedly calling .match()
for separate regexps. If you need
alternatives with .match()
you should use regexp alternation.)
PS: my test program continues to be available if you want to run it in your own Python environment; see the link in this entry. I have to say that preserving test programs is a great idea that I should do much more often.
(Test programs are often ugly quick hacks. But being able to rerun the exact same tests much later can be extremely useful, as is being able to see exactly what the tests were.)