A bit on the performance of lexers in Python

July 19, 2013

This all starts with Eli Bendersky's Hand-written lexer in Javascript compared to the regex-based ones (via) where he writes in part:

I was expecting the runtime [of the hand written lexer] to be much closer to the single-regex version; in fact I was expecting it to be a bit slower (because most of the regex engine work is done at a lower level). But it turned out to be much faster, more than 2.5x.

In the comments Caleb Spare pointed to Rob Pike's Regular expressions in lexing and parsing which reiterates the arguments for simple lexers that don't use regular expressions. Despite all of this, regular expression based lexers are extremely common in the Python world.

Good lexing and even parsing algorithms are both extremely efficient and very well known (the problem has been studied almost since the start of computer science). A good high performance lexer generally looks at each input character only once and runs a relatively short amount of focused code per character to tokenize the input stream. A good regular expression engine can avoid backtracking but is almost invariably going to run more complex code (and often use more memory) to examine each character. As covered in Russ Cox's series on regular expressions, garden variety regular expression engines in Python, Perl, and several other languages aren't even that efficient and do backtrack (sometimes extensively).

So why don't people use hand-written lexers in Python? Because most Python implementations are slow. Your hand-written lexer may examine each character only once with simple code but that efficiency advantage is utterly crushed by the speed difference between the C-based regular expression engine and the slowness of interpreting your Python code. Hand written JavaScript lexers are comparatively fast because modern JavaScript interpreters have devoted a lot of effort to translating straightforward JavaScript code into efficient native code. Since the lexer actually is straightforward, a JIT engine is generally going to do quite well on it.

(Given this I wouldn't be surprised if a hand-written Python lexer that was run under PyPy was quite fast, either competitive with or even faster than a Python regex-based one. Assembling a test case and doing the benchmarking work is left as an exercise.)

Written on 19 July 2013.
« Thinking about the security issues in HTTP 403 versus 404 errors
A fun problem: monitoring randomness reduces it »

Page tools: View Source, Add Comment.
Search:
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Fri Jul 19 20:31:00 2013
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.