Wandering Thoughts archives

2013-07-31

A Python code structure problem: exception handling with flags

Here's a Python (2.x) code structure puzzle that I don't have a good answer for yet, except that maybe the answer is that my overall design is a bad fit for what I'm doing. To start with, suppose that you have a multi-level, multi-step process for handling lines from an input file. Any number of things can go wrong during the processing; when something does, you need to bubble this information up to the top level but keep going so you can process the next line (if only so you can report all of the errors in the file in one pass). The obvious fit for this is to have errors communicated by raising exceptions, which are then trapped at the top level.

Now let's suppose there are several different sorts of errors and you want to treat some of them specially based on command line flags. For example, normally all errors are fatal and produce error messages, but some can be suppressed entirely with a flag (they just cause the record to be silently skipped) and some can be turned into warnings. How do you structure this in the code?

My first version looked something like this:

try:
   data = crunch(line)
   ....
except A, e:
   report(e)
   commit = False
except B, e:
   report(e)
   if not options.skipempty:
      commit = False
except C, e:
   if not options.skipdups:
      report(e)
      commit = False

All of the duplication here made me unhappy because it obscured the actual logic and made it easy for the handling of one exception to drift out of sync with the others. I can aggregate everything with 'except (A, B, C), e:', but then the question is how to write the single exception handler so that it's both clean and does everything necessary; so far I've thought of two approaches. The first approach is to use isinstance() on e to tell what sort of exception we have and then write out the conditions in ifs, except that trying to do that makes for ugly long conditions.

(I started to write out the example above and basically exploded in irritation when I got to the commit logic, which I decided was a bad sign. It also looked like the result would be very hard to read, which means that errors would be easy to introduce.)
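For illustration, a combined isinstance()-based handler might look something like the following sketch (with invented A/B/C exception classes and option names, and the modern 'as e' syntax); the commit decision is where the conditions balloon, since it has to enumerate every exception/flag combination:

```python
# Invented exception classes standing in for the real ones.
class A(Exception): pass
class B(Exception): pass
class C(Exception): pass

def handle(e, options, report):
    """Handle one trapped exception; return False if the run
    should not be committed."""
    # Report unless this is a C error being skipped entirely.
    if not (isinstance(e, C) and options.skipdups):
        report(e)
    # The commit decision enumerates every exception/flag
    # combination, which is what makes this grow ugly.
    if (isinstance(e, A)
            or (isinstance(e, B) and not options.skipempty)
            or (isinstance(e, C) and not options.skipdups)):
        return False
    return True
```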

The second solution I've come up with is to add attributes to each exception class; call them report and nocommit. Then at the start of the code we do:

if options.skipempty:
   B.nocommit = False
if options.skipdups:
   C.report = False
   C.nocommit = False

In the main code we do:

try:
   ....
except (A, B, C), e:
   if e.report:
      report(e)
   if e.nocommit:
      commit = False

This avoids both duplication and lack of clarity at the expense of, well, kind of being a hack.

(You can also code a variant of this where report and nocommit are functions that are passed the options object; this puts all of the 'what turns this off' logic into the exceptions themselves instead of reaching into the exception classes to (re)set attributes. That might be somewhat cleaner although it's more verbose.)
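Put together, the attribute approach might look like this runnable sketch (with an invented ProcessError base class, crunch(), and option names, and Python 3's 'as' syntax in place of the 2.x form used above):

```python
class ProcessError(Exception):
    # Class-level flags consulted by the single handler.
    report = True     # should this error produce a message?
    nocommit = True   # should this error veto the final commit?

class A(ProcessError): pass   # always fatal
class B(ProcessError): pass   # e.g. an empty record
class C(ProcessError): pass   # e.g. a duplicate record

def configure(options):
    # The 'hack': reach into the exception classes and flip
    # their flags once, based on the command line options.
    if options.skipempty:
        B.nocommit = False
    if options.skipdups:
        C.report = False
        C.nocommit = False

def process(lines, crunch, report):
    commit = True
    results = []
    for line in lines:
        try:
            results.append(crunch(line))
        except (A, B, C) as e:
            if e.report:
                report(e)
            if e.nocommit:
                commit = False
    return results, commit
```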

Given that all of the options have drawbacks I feel that there ought to be a clever Python code design trick that I'm missing here.

ExceptionHandlingWithFlags written at 00:38:13

2013-07-23

When Python regexp alternation is faster than separate regexps

In Eli Bendersky's Regex-based lexical analysis in Python and Javascript he wrote:

Shortly after I started using it, it was suggested that combining all the regexes into a single regex with alternation (the | regex syntax) and using named groups to know which one matched would make the lexer faster. [...]

This optimization makes the lexer more than twice as fast! [...]

At first I was going to write an entry explicitly noting that Bendersky had demonstrated that you should use regex alternation instead of multiple regexps. Then I reread my old entries on regexp performance and even reran my test program to see if Python 2 (or 3) had changed the picture and now I need to write a more complicated entry.

I will give you the quick summary: if you are using .match() or some other form of explicitly anchored search, using regexp alternation is faster than separate regular expressions. If you are using an unrestricted .search() the situation is much murkier; it really depends on how many regexps you have and possibly what you're searching for. If you have a decent number of alternates it's probably faster to use real regexp alternation. If you have only two or three it's quite possible that using separate regexps will be a win.

Eli Bendersky's lexer is faster with regexp alternation for both reasons: it uses .match() and has a number of alternatives.
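For concreteness, the combined-alternation technique with named groups looks roughly like the following sketch (with an invented token set, not Bendersky's actual lexer); m.lastgroup tells you which alternative matched:

```python
import re

# Each alternative gets a named group so we can tell which one
# matched from a single combined .match().
TOKEN_RE = re.compile(r"""
    (?P<NUMBER>\d+)
  | (?P<NAME>[A-Za-z_]\w*)
  | (?P<OP>[-+*/=])
  | (?P<SKIP>\s+)
""", re.VERBOSE)

def tokenize(text):
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError("bad character at %d" % pos)
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())
        pos = m.end()
```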

Current Python 2 and 3 regexp performance for .search() appears basically unchanged from my earlier entries (although Python 3.3.0 does noticeably worse than Python 2 on the same machine). In particular the number of regexp alternatives you use continues to have very little to no effect on the performance of the resulting regular expression. There is a performance penalty for .search() with regexp alternation, even if the first alternate matches (and even if it matches at the front of the string).

(It appears that you pay a small penalty for .match() if you use regexp alternation but this penalty is utterly dwarfed by the penalty of repeatedly calling .match() for separate regexps. If you need alternatives with .match() you should use regexp alternation.)
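If you want to measure this yourself, a minimal comparison looks something like this sketch (invented word list; the absolute numbers will vary with your machine and Python version):

```python
import re
import timeit

# Separate anchored regexps versus one combined alternation.
words = ["foo", "bar", "baz", "quux"]
separate = [re.compile(w) for w in words]
combined = re.compile("|".join(words))
text = "quux is the last alternate"

def match_separate(s):
    # Try each regexp in turn, as a multi-regexp lexer would.
    for r in separate:
        m = r.match(s)
        if m:
            return m.group()
    return None

def match_combined(s):
    # One .match() call against the combined alternation.
    m = combined.match(s)
    return m.group() if m else None

# Both find the same token; timing shows the cost of the
# repeated .match() calls in the separate-regexps version.
t_sep = timeit.timeit(lambda: match_separate(text), number=100000)
t_comb = timeit.timeit(lambda: match_combined(text), number=100000)
```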

PS: my test program continues to be available if you want to run it in your own Python environment; see the link in this entry. I have to say that preserving test programs is a great idea that I should do much more often.

(Test programs are often ugly quick hacks. But being able to rerun the exact same tests much later can be extremely useful, as is being able to see exactly what the tests were.)

RegexpAlternationWhen written at 00:38:12

2013-07-19

A bit on the performance of lexers in Python

This all starts with Eli Bendersky's Hand-written lexer in Javascript compared to the regex-based ones (via) where he writes in part:

I was expecting the runtime [of the hand written lexer] to be much closer to the single-regex version; in fact I was expecting it to be a bit slower (because most of the regex engine work is done at a lower level). But it turned out to be much faster, more than 2.5x.

In the comments Caleb Spare pointed to Rob Pike's Regular expressions in lexing and parsing which reiterates the arguments for simple lexers that don't use regular expressions. Despite all of this, regular expression based lexers are extremely common in the Python world.

Good lexing and even parsing algorithms are both extremely efficient and very well known (the problem has been studied almost since the start of computer science). A good high performance lexer generally looks at each input character only once and runs a relatively short amount of focused code per character to tokenize the input stream. A good regular expression engine can avoid backtracking but is almost invariably going to run more complex code (and often use more memory) to examine each character. As covered in Russ Cox's series on regular expressions, garden variety regular expression engines in Python, Perl, and several other languages aren't even that efficient and do backtrack (sometimes extensively).

So why don't people use hand-written lexers in Python? Because most Python implementations are slow. Your hand-written lexer may examine each character only once with simple code, but that efficiency advantage is utterly crushed by the speed difference between the C-based regular expression engine and the slowness of interpreting your Python code. Hand-written JavaScript lexers are comparatively fast because modern JavaScript engines have devoted a lot of effort to translating straightforward JavaScript code into efficient native code. Since the lexer actually is straightforward, a JIT engine is generally going to do quite well on it.
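As an illustration of how straightforward such a lexer is (and thus how much its speed depends on the interpreter running it), here is a minimal hand-written one for an invented token set: one pass, one look at each character, a small amount of work per character:

```python
def lex(text):
    """A minimal hand-written lexer for a toy token set."""
    tokens = []
    i, n = 0, len(text)
    while i < n:
        c = text[i]
        if c.isspace():
            i += 1
        elif c.isdigit():
            # Consume a run of digits as one NUMBER token.
            j = i
            while j < n and text[j].isdigit():
                j += 1
            tokens.append(("NUMBER", text[i:j]))
            i = j
        elif c.isalpha() or c == "_":
            # Consume an identifier as one NAME token.
            j = i
            while j < n and (text[j].isalnum() or text[j] == "_"):
                j += 1
            tokens.append(("NAME", text[i:j]))
            i = j
        elif c in "+-*/=":
            tokens.append(("OP", c))
            i += 1
        else:
            raise ValueError("bad character %r at %d" % (c, i))
    return tokens
```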

(Given this I wouldn't be surprised if a hand-written Python lexer that was run under PyPy was quite fast, either competitive with or even faster than a Python regex-based one. Assembling a test case and doing the benchmarking work is left as an exercise.)

PythonLexerPerformance written at 20:31:00



This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.