Wandering Thoughts archives


When Python regexp alternation is faster than separate regexps

In Eli Bendersky's Regex-based lexical analysis in Python and Javascript he wrote:

Shortly after I started using it, it was suggested that combining all the regexes into a single regex with alternation (the | regex syntax) and using named groups to know which one matched would make the lexer faster. [...]

This optimization makes the lexer more than twice as fast! [...]

At first I was going to write an entry explicitly noting that Bendersky had demonstrated that you should use regex alternation instead of multiple regexps. Then I reread my old entries on regexp performance and even reran my test program to see if Python 2 (or 3) had changed the picture and now I need to write a more complicated entry.

I will give you the quick summary: if you are using .match() or some other form of explicitly anchored search, using regexp alternation is faster than separate regular expressions. If you are using an unrestricted .search() the situation is much murkier; it really depends on how many regexps you have and possibly what you're searching for. If you have a decent number of alternates it's probably faster to use real regexp alternation. If you have only two or three it's quite possible that using separate regexps will be a win.

Eli Bendersky's lexer is faster with regexp alternation for both reasons; it uses .match() and has a number of alternatives.

Current Python 2 and 3 regexp performance for .search() appears basically unchanged from my earlier entries (although Python 3.3.0 does noticeably worse than Python 2 on the same machine). In particular the number of regexp alternatives you use continues to have very little to no effect on the performance of the resulting regular expression. There is a performance penalty for .search() with regexp alternation, even if the first alternate matches (and even if it matches at the front of the string).

(It appears that you pay a small penalty for .match() if you use regexp alternation but this penalty is utterly dwarfed by the penalty of repeatedly calling .match() for separate regexps. If you need alternatives with .match() you should use regexp alternation.)

PS: my test program continues to be available if you want to run it in your own Python environment; see the link in this entry. I have to say that preserving test programs is a great idea that I should do much more often.

(Test programs are often ugly quick hacks. But being able to rerun the exact same tests much later can be extremely useful, as is being able to see exactly what the tests were.)

python/RegexpAlternationWhen written at 00:38:12; Add Comment

Page tools: See As Normal.
Login: Password:
Atom Syndication: Recent Pages, Recent Comments.

This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.