Regular expression performance and performance folklore

September 15, 2013

I've spent a certain amount of time looking into the performance of the regular expression engines in both (some version(s) of) Perl and Python and reading things like Russ Cox's series on regular expressions. Like a lot of other people I've also read a certain amount of general wisdom on regular expression performance out on the Internet.

Here is what my direct experience has convinced me: I have no real idea how a regular expression engine is going to perform in some situation unless I measure it. The performance of regexp engines is only vaguely predictable and to do it well you need to know an uncommon amount about the internals of their implementations. Regexp engines that use very similar or identical RE languages can perform very differently on common things you do (the poster child here is the behavior of Perl versus Python on RE alteration, '|').

All of this makes me very wary of airy Internet pronouncements about how to get fast regular expressions in any particular language (it should be clear that a language-independent prediction is either immediately obvious or completely laughable). These bits of advice may be well researched and thus actually correct or they may be what the author thinks should be the case, and it's often hard to tell which is which.

(Even when they're correct, regular expression engines can and do get changed over time. Hopefully the changes improve performance without any regressions, but you never know.)

I came very close to writing such an entry myself until I came to my senses. That's the other thing that this has convinced me of: I cannot possible write anything about this subject without actually measuring it. Per my aside in a recent entry I should also then archive and publish my measuring program.

(By the way, the other minefield for regexp engine performance is whether you are using them on Unicode data or 'plain bytes'. Many tests, mine included, have traditionally been run on plain bytes; these days Unicode performance is much more interesting and relevant, especially as some environments turn all normal strings into Unicode.)

Written on 15 September 2013.
« Identities, trust, and work
The pain (or annoyance) of deploying a simple WSGI thing »

Page tools: View Source, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Sun Sep 15 23:28:16 2013
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.