Wandering Thoughts archives

2014-10-31

A drawback to handling errors via exceptions

Recently I discovered an interesting and long-standing bug in DWiki. DWiki is essentially a mature program, so this one was uncovered through the common mechanism of someone supplying invalid input, in this case a specific sort of invalid URL. DWiki creates time-based views of this blog through synthetic parts of URLs that end in, for example, '.../2014/10/' for entries from October 2014. Someone came along and requested a URL that looked like '.../2014/99/', and DWiki promptly hit an uncaught Python exception (well, technically it was caught and logged by my general error code).

(A mature program usually doesn't have bugs handling valid input, even uncommon valid input. But the many forms of invalid input are often much less well tested.)

To be specific, it promptly coughed up:

calendar.IllegalMonthError: bad month number 99; must be 1-12

Down in the depths of the code that handles the per-month view, I was calling calendar.monthrange() to determine how many days a given month has, and it was throwing an exception because '99' is, of course, not a valid month of the year. The exception escaped because I wasn't doing anything in my code either to catch it or to stop invalid months from getting that far.

The standard advantage of handling errors via exceptions definitely applied here. Even though I had totally overlooked this error possibility, the error did not get quietly ignored and go on to corrupt further program state; instead I got smacked over the nose with the existence of this bug so I could find it and fix it. But it also exposes a drawback of handling errors with exceptions, which is that it makes it easier to overlook the possibility of errors because that possibility isn't explicit.

The calendar module doesn't document what exceptions it raises, either in general or in the documentation for monthrange() specifically (where it would be easy to spot while reading about the function). Because an exception is effectively an implicit extra return 'value' from a function, it's easy to overlook the possibility that you'll actually get one; in Python, there's nothing there to rub your nose in it and make you think about it. And so I never even thought about what happened if monthrange() was handed invalid input, partly because of the usual silent assumption that the code would only be called with valid input, because of course DWiki doesn't generate date range URLs with bad months in them.
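To make the failure mode concrete, here is a minimal sketch of handling the out-of-range month explicitly; days_in_month() is an illustrative helper name, not anything from DWiki, but the exception it catches is the one the calendar module actually raises:

```python
import calendar

def days_in_month(year, month):
    """Return the number of days in the given month, or None for bad input."""
    try:
        # monthrange() returns (weekday_of_first_day, number_of_days).
        return calendar.monthrange(year, month)[1]
    except calendar.IllegalMonthError:
        # Raised for months outside 1-12; it's a subclass of ValueError.
        return None

print(days_in_month(2014, 10))  # 31
print(days_in_month(2014, 99))  # None
```

The alternative, of course, is to validate the month before it ever reaches monthrange(); either way, the what-if question has to be asked somewhere.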

Explicit error returns may require a bunch of inconvenient work to handle them individually instead of letting you aggregate exception handling together, but the mere presence of an explicit error return in a method's or function's signature serves as a reminder that yes, the function can fail, and so you need to handle that. Exceptions for errors are more convenient and safer for at least casual programming, but they do mean you need to ask yourself what-if questions on a regular basis (here, 'what if the month is out of range?').

(It turns out I've run into this general issue before, although that time the documentation had a prominent notice that I just ignored. The general issue of error handling with exceptions versus explicit returns is on my mind these days because I've been doing a bunch of coding in Go, which has explicit error returns.)

ExceptionsOverlookProblem written at 01:00:38

2014-10-28

My current somewhat tangled feelings on operator.attrgetter

In a comment on my recent entry on sort comparison functions, Peter Donis asked a good question:

Is there a reason you're not using operator.attrgetter for the key functions? It's faster than a lambda.

One answer is that until now I hadn't heard of operator.attrgetter. Now that I have, it's something I'll probably consider in the future.

But another answer is embedded in the reason Peter Donis gave for using it. Using operator.attrgetter is clearly a speed optimization, but speed isn't always the important thing. Sometimes, even often, the most important thing to optimize is clarity. Right now, for me attrgetter is less clear than the lambda approach because I've just learned about it; switching to it would probably be a premature optimization for speed at the cost of clarity.

In general, well, 'attrgetter' is a clear enough thing that I suspect I'll never be confused about what 'lst.sort(key=operator.attrgetter("field"))' does, even if I forget about it and then reread some code that uses it; it's just pretty obvious from context and the name itself. There's a visceral bit of me that doesn't like it as much as the lambda approach because I don't think it reads as well, though. It's also more black magic than lambda, since lambda is a general language construct and attrgetter is a magic module function.
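For what it's worth, the two spellings are interchangeable as sort keys. A small sketch, using a made-up Entry record (the names here are illustrative, not from DWiki):

```python
import operator
from collections import namedtuple

# A hypothetical record type just for illustration.
Entry = namedtuple('Entry', ['title', 'timestamp'])

entries = [Entry('b', 30), Entry('a', 10), Entry('c', 20)]

# The two spellings sort identically; only the key callable differs.
by_lambda = sorted(entries, key=lambda e: e.timestamp)
by_getter = sorted(entries, key=operator.attrgetter('timestamp'))
assert by_lambda == by_getter
```

(attrgetter also accepts multiple attribute names and dotted paths, which is where it starts doing things a simple lambda needs more typing for.)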

(And as a petty thing it has less natural white space. I like white space since it makes things more readable.)

On the whole this doesn't leave me inclined to switch to using attrgetter for anything except performance-sensitive code (which these sort()s aren't so far). Maybe this is the wrong decision, and if the Python community as a whole adopts attrgetter as the standard and usual way to do .sort() key access, it will certainly become the wrong decision. At that point I hope I'll notice and switch myself.

(This is in a sense an uncomfortable legacy of CPython's historical performance issues with Python code. Attrgetter is clearly a performance hack in general; if lambda were just as fast, I'd argue that you should clearly use lambda because it's a general language feature instead of a narrowly specialized one.)

AttrgetterVsLamba written at 00:11:20

2014-10-23

The clarity drawback of allowing comparison functions for sorting

I've written before about my unhappiness that Python 3 dropped support for using a comparison function. Well, let me take that back a bit, because I've come around to the idea that there are some real drawbacks to supporting a comparison function here. Not drawbacks in performance (which are comparatively unimportant here) but drawbacks in code clarity.

DWiki's code is sufficiently old that it uses only .sort() cmp functions simply because, well, that's what I had (or at least that's what I was used to). As a result, in two widely scattered spots in different functions its code base contains the following lines:

def func1(...):
    ....
    dl.sort(lambda x,y: cmp(y.timestamp, x.timestamp))
    ....

def func2(...):
    ....
    coms.sort(lambda x,y: cmp(x.time, y.time))
    ....

Apart from the field name, did you see the difference there? I didn't today while I was doing some modernization in DWiki's codebase and converted both of these to the '.sort(key=lambda x: x.FIELD)' form. The difference is that the first is a reverse sort, not a forward sort, because it flips x and y in the cmp().

(This code predates .sort() having a reverse= argument or at least my general awareness and use of it.)

And that's the drawback of allowing or using a sort comparison function: it's not as clear as directly saying what you mean. Small things in the comparison function can have big impacts and they're easy to overlook. By contrast, my intentions and what's going on are clearly spelled out when these things are rewritten into the modern form:

    dl.sort(key=lambda x: x.timestamp, reverse=True)
    coms.sort(key=lambda x: x.time)

Anyone, a future me included, is much less likely to miss the difference in sort order when reading (or skimming) this code.

I now feel that in practice you want to avoid using a comparison function as much as possible even if one exists for exactly this reason. Try very hard to directly say what you mean instead of hiding it inside your cmp function unless there's no way out. A direct corollary of this is that sorting interfaces should try to let you directly express as much as possible instead of forcing you to resort to tricks.

(Note that there are some cases where you must use a comparison function in some form (see especially the second comment).)

PS: I still disagree with Python 3 about removing the cmp argument entirely. It hasn't removed the ability to have custom sort functions; it's just forced you to write a lot more code to enable them and the result is probably even less efficient than before.
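Python 3's escape hatch for a genuine comparison function is functools.cmp_to_key, which wraps every comparison call in a key object. A sketch of converting an old DWiki-style call, with the builtin cmp() that Python 3 also removed spelled out by hand:

```python
from functools import cmp_to_key

def cmp(x, y):
    # Python 3 dropped the builtin cmp(); this is the usual replacement.
    return (x > y) - (x < y)

dl = [3, 1, 2]
# Old style was dl.sort(lambda x, y: cmp(y, x)) -- a reverse sort in disguise.
dl.sort(key=cmp_to_key(lambda x, y: cmp(y, x)))
print(dl)  # [3, 2, 1]
```

This works, but it adds a Python-level object and a method call per comparison, which is the 'less efficient than before' part.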

SortCmpFunctionClarityIssue written at 00:14:32

2014-10-20

Revisiting Python's string concatenation optimization

Back in Python 2.4, CPython introduced an optimization for string concatenation that was designed to reduce memory churn in this operation and I got curious enough about this to examine it in some detail. Python 2.4 is a long time ago and I recently was prompted to wonder what had changed since then, if anything, in both Python 2 and Python 3.

To quickly summarize my earlier entry, CPython only optimizes string concatenations by attempting to grow the left side in place instead of making a new string and copying everything. It can only do this if the left side string only has (or clearly will have) a reference count of one, because otherwise it's breaking the promise that strings are immutable. Generally this requires code of the form 'avar = avar + ...' or 'avar += ...'.
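The reference-count condition means the exact spelling of the code matters. A minimal sketch of an eligible and an ineligible case, based on my reading above (the optimization is invisible either way; strings stay immutable as far as your code can tell):

```python
avar = 'x' * 1000
avar += 'y'         # eligible: 'avar' is the only reference to the string

avar = 'x' * 1000
other = avar        # a second reference to the same string object
avar += 'y'         # not eligible: growing in place would mutate 'other'
assert other == 'x' * 1000
```

The second case has to allocate a new string and copy, because the promise that 'other' never changes must be kept.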

As of Python 2.7.8, things have changed only slightly. In particular concatenation of Unicode strings is still not optimized; this remains a byte string only optimization. For byte strings there are two cases. Strings under somewhat less than 512 bytes can sometimes be grown in place by a few bytes, depending on their exact sizes. Strings over that can be grown if the system realloc() can find empty space after them.

(As a trivial case, CPython also optimizes concatenating an empty string to something by just returning the other string with its reference count increased.)

In Python 3, things are more complicated but the good news is that this optimization does work on Unicode strings. Python 3.3+ has a complex implementation of (Unicode) strings, but it does attempt to do in-place resizing on them under appropriate circumstances. The first complication is that internally Python 3 has a hierarchy of Unicode string storage and you can't do an in-place concatenation of a more complex sort of Unicode string into a less complex one. Once you have compatible strings in this sense, in terms of byte sizes the relevant sizes are the same as for Python 2.7.8; Unicode string objects that are less than 512 bytes can sometimes be grown by a few bytes while ones larger than that are at the mercy of the system realloc(). However, how many bytes a Unicode string takes up depends on what sort of string storage it is using, which I think mostly depends on how big your Unicode characters are (see this section of the Python 3.3 release notes and PEP 393 for the gory details).

So my overall conclusion remains as before; this optimization is chancy and should not be counted on. If you are doing repeated concatenation you're almost certainly better off using .join() on a list; if you think you have a situation that's otherwise, you should benchmark it.
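As a concrete version of that advice, here is the repeated-concatenation loop next to the .join() form. The results are identical; only the performance differs (which you'd measure with timeit, not assume):

```python
def concat_loop(parts):
    # Repeated 'result += s' may or may not grow in place, depending on
    # refcounts and what the allocator can do; the worst case is O(n^2).
    result = ''
    for s in parts:
        result += s
    return result

def concat_join(parts):
    # ''.join() computes the total size once and copies once: O(n).
    return ''.join(parts)

parts = ['chunk%d' % i for i in range(1000)]
assert concat_loop(parts) == concat_join(parts)
```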

(In Python 3, the place to start is PyUnicode_Append() in Objects/unicodeobject.c. You'll probably also want to read Include/unicodeobject.h and PEP 393 to understand this, and then see Objects/obmalloc.c for the small object allocator.)

Sidebar: What the funny 512 byte breakpoint is about

Current versions of CPython 2 and 3 allocate 'small' objects using an internal allocator that I think is basically a slab allocator. This allocator is used for all overall objects that are 512 bytes or less and it rounds object size up to the next 8-byte boundary. This means that if you ask for, say, a 41-byte object you actually get one that can hold up to 48 bytes and thus can be 'grown' in place up to this size.
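The rounding rule itself is simple enough to sketch. This reflects my description of the allocator's behavior above, not any API CPython exposes:

```python
def rounded_size(n):
    # Round a requested size up to the next 8-byte boundary, as CPython's
    # small-object allocator does for requests of 512 bytes or less.
    return (n + 7) & ~7

print(rounded_size(41))  # 48 -- a 41-byte object can 'grow' to 48 in place
print(rounded_size(48))  # 48 -- already on a boundary, no slack at all
```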

ExaminingStringConcatOptII written at 00:37:10

