Regular expression performance and performance folklore
I've spent a certain amount of time looking into the performance of the regular expression engines in both (some version(s) of) Perl and Python and reading things like Russ Cox's series on regular expressions. Like a lot of other people I've also read a certain amount of general wisdom on regular expression performance out on the Internet.
Here is what my direct experience has convinced me: I have no real
idea how a regular expression engine is going to perform in some
situation unless I measure it. The performance of regexp engines
is only vaguely predictable and to do it well you need to know an
uncommon amount about the internals of their implementations. Regexp
engines that use very similar or identical RE languages can perform
very differently on common things you do (the poster child here is the
behavior of Perl versus Python on RE alteration, '
All of this makes me very wary of airy Internet pronouncements about how to get fast regular expressions in any particular language (it should be clear that a language-independent prediction is either immediately obvious or completely laughable). These bits of advice may be well researched and thus actually correct or they may be what the author thinks should be the case, and it's often hard to tell which is which.
(Even when they're correct, regular expression engines can and do get changed over time. Hopefully the changes improve performance without any regressions, but you never know.)
I came very close to writing such an entry myself until I came to my senses. That's the other thing that this has convinced me of: I cannot possible write anything about this subject without actually measuring it. Per my aside in a recent entry I should also then archive and publish my measuring program.
(By the way, the other minefield for regexp engine performance is whether you are using them on Unicode data or 'plain bytes'. Many tests, mine included, have traditionally been run on plain bytes; these days Unicode performance is much more interesting and relevant, especially as some environments turn all normal strings into Unicode.)