== Some regular expression performance surprises Part of the fun of performance tuning is all of the unexpected things you find out. Since I spent some of yesterday [[doing work ../web/CharacterProblems]] in the guts of DWiki's DWikiText to HTML conversion, what I've been finding is some regular expression performance surprises. My big surprise was the relative performance of '(([A-Z]))', '(([A-Z][A-Z]*))', and '(([A-Z]+))' for scanning text where they didn't match (the real character class is more complex). I expected them to be about the same speed; instead, the version using _+_ is at least twice as slow. Hopefully everyone reading this knows that the Perl regexp engine (which is more or less what's used in Python) checks '_|_' alternatives strictly left to right, one after the other. Among other things, this means that it is really worth your while to put common alternatives first. ~~Update~~: Daniel Martin has pointed out that Python doesn't actually use the Perl regexp engine, just the Perl regexp syntax and semantics. The CPython ((_sre)) regexp engine was independantly written by [[Secret Labs AB http://www.pythonware.com/]]. My (bad) error. But what I suspect is less well known is something I stumbled over some years ago, when looking at the performance of our Usenet anti-spam filters: ~~constant alternatives are significantly slower than separate regular expressions~~ (for non-matching searches, but I suspect that this holds for matching searches too). Given: > re1 = re.compile("FOO BAR") > re2 = re.compile("BAZ BLORP") > re3 = re.compile("(?:FOO BAR|BAZ BLORP)") It appears that current Pythons run '_re1.search(tstr) or re2.search(tstr)_' two or more times faster than '_re3.search(tstr)_'. The speed advantage grows as the size of the test string increases; using '_(...)_' instead of '_(?:...)_' performs no worse. The results flip if the matches are not constant; '_(?:[A-Z] FOO BAR|[A-Z] BAZ BLORF)_' is faster than checking '_[A-Z] FOO BAR_' and '_[A-Z] BAZ BLORF_' separately. This suggests, not too surprisingly, that the regexp engine has highly tuned constant string matching code that is not triggered by plain string alternatives. (And '_[A-Z] (?:FOO BAR|BAZ BLORP)_' is, unsurprisingly, faster still. Also, you can't try to cleverly convert the non-constant versions to constant ones by making the prefix a positive lookbehind assertion, eg '_(?<=[A-Z] )FOO BAR_'; in fact, you lose significant performance.) This leads to the obvious moral, which is also the first rule of performance tuning: ~~measure, don't guess~~. (I'm still kind of bummed about the constant strings alternative performance drop, but I suppose if I was really annoyed I should be contributing patches to fix it.)