2014-10-31
A drawback to handling errors via exceptions
Recently I discovered an interesting and long-standing bug in DWiki. DWiki is essentially a mature program, so this one was uncovered through the common mechanism of someone feeding it invalid input, in this case a specific sort of invalid URL. DWiki creates time-based views of this blog through synthetic parts of the URLs that end in things like, for example, '.../2014/10/' for entries from October 2014. Someone came along and requested a URL that looked like '.../2014/99/', and DWiki promptly hit an uncaught Python exception (well, technically it was caught and logged by my general error code).
(A mature program usually doesn't have bugs handling valid input, even uncommon valid input. But the many forms of invalid input are often much less well tested.)
To be specific, it promptly coughed up:
calendar.IllegalMonthError: bad month number 99; must be 1-12
Down in the depths of the code that handled a per-month view I was
calling calendar.monthrange() to determine how many days a given month
has, which was throwing an exception because '99' is of course not a
valid month of the year. The exception escaped because I wasn't doing
anything in my code to either catch it or not let invalid months get
that far in the code.
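The failure is easy to reproduce by hand in an interactive session:

    >>> import calendar
    >>> calendar.monthrange(2014, 99)
    Traceback (most recent call last):
      ...
    calendar.IllegalMonthError: bad month number 99; must be 1-12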
The standard advantage of handling errors via exceptions definitely applied here. Even though I had totally overlooked this error possibility, the error did not get quietly ignored and go on to corrupt further program state; instead I got smacked over the nose with the existence of this bug so I could find it and fix it. But it also exposes a drawback of handling errors with exceptions, which is that it makes it easier to overlook the possibility of errors because that possibility isn't explicit.
The calendar module doesn't document what exceptions it raises, either in general or specifically in the documentation for monthrange() (where it would be easy to spot while reading about the function). Because an exception is
effectively an implicit extra return 'value' from functions, it's
easy to overlook the possibility that you'll actually get an exception;
in Python, there's nothing there to rub your nose in it and make you
think about it. And so I never even thought about what happened if
monthrange() was handed invalid input, in part because of the
usual silent assumption that the
code would only be called with valid input because of course DWiki
doesn't generate date range URLs with bad months in them.
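With hindsight the guard is simple. Here's a minimal sketch of the sort of check I could add; the helper function and its name are invented for illustration, not DWiki's actual code:

    import calendar

    def days_in_month(year, month):
        # Reject bad months up front instead of letting
        # calendar.IllegalMonthError escape from deep inside the code.
        if not 1 <= month <= 12:
            return None
        return calendar.monthrange(year, month)[1]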
Explicit error returns may require a bunch of inconvenient work to handle them individually instead of letting you aggregate exception handling together, but the mere presence of an explicit error return in a method's or function's signature serves as a reminder that yes, the function can fail, and so you need to handle that failure. Exceptions for errors are more convenient and safer for at least casual programming, but they do mean you need to ask yourself what-if questions on a regular basis (here, 'what if the month is out of range?').
(It turns out I've run into this general issue before, although that time the documentation had a prominent notice that I just ignored. The general issue of error handling with exceptions versus explicit returns is on my mind these days because I've been doing a bunch of coding in Go, which has explicit error returns.)
2014-10-28
My current somewhat tangled feelings on operator.attrgetter
In a comment on my recent entry on sort comparison functions, Peter Donis asked a good question:
Is there a reason you're not using operator.attrgetter for the key functions? It's faster than a lambda.
One answer is that until now I hadn't heard of operator.attrgetter.
Now that I have, it's something I'll probably consider in the future.
But another answer is embedded in the reason Peter Donis gave for
using it. Using operator.attrgetter is clearly a speed optimization,
but speed isn't always the important thing. Sometimes, even often,
the most important thing to optimize is clarity. Right now, for me
attrgetter is less clear than the lambda approach because I've
just learned about it; switching to it would probably be a premature
optimization for speed at the cost of clarity.
In general, well, 'attrgetter' is a clear enough thing
that I suspect I'll never be confused about what
'lst.sort(key=operator.attrgetter("field"))' does, even if I forget
about it and then reread some code that uses it; it's just pretty
obvious from context and the name itself. There's a visceral bit of
me that doesn't like it as much as the lambda approach because I
don't think it reads as well, though. It's also more black magic than
lambda, since lambda is a general language construct and attrgetter
is a magic module function.
(And as a petty thing it has less natural white space. I like white space since it makes things more readable.)
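For concreteness, here are the two forms side by side in a self-contained sketch (the Entry type and its timestamp field are invented for illustration):

    import operator
    from collections import namedtuple

    Entry = namedtuple("Entry", "timestamp name")
    entries = [Entry(3, "c"), Entry(1, "a"), Entry(2, "b")]

    # These two sorts are equivalent; attrgetter does the attribute
    # lookup in C, which is why it's usually somewhat faster.
    entries.sort(key=operator.attrgetter("timestamp"))
    entries.sort(key=lambda e: e.timestamp)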
On the whole this doesn't leave me inclined to switch to using
attrgetter for anything except performance sensitive code (which these
sort()s aren't so far). Maybe this is the wrong decision, and if the
Python community as a whole adopts attrgetter as the standard and usual
way to do .sort() key access it certainly will become a wrong decision.
At that point I hope I'll notice and switch myself.
(This is in a sense an uncomfortable legacy of CPython's historical performance issues with Python code. Attrgetter is clearly a performance hack in general; if lambda were just as fast, I'd argue that you should clearly use lambda because it's a general language feature instead of a narrowly specialized one.)
2014-10-23
The clarity drawback of allowing comparison functions for sorting
I've written before about my unhappiness that Python 3 dropped support for using a comparison function. Well, let me take that back a bit, because I've come around to the idea that there are some real drawbacks to supporting a comparison function here. Not drawbacks in performance (which are comparatively unimportant here) but drawbacks in code clarity.
DWiki's code is sufficiently old that it uses only .sort() cmp
functions simply because, well, that's what I had (or at least
that's what I was used to). As a result, in two widely scattered
spots in different functions its code base contains the following
lines:
def func1(...):
    ....
    dl.sort(lambda x,y: cmp(y.timestamp, x.timestamp))
    ....

def func2(...):
    ....
    coms.sort(lambda x,y: cmp(x.time, y.time))
    ....
Apart from the field name, did you see the difference there? I didn't
today while I was doing some modernization in DWiki's codebase and
converted both of these to the '.sort(key=lambda x: x.FIELD)'
form. The difference is that the first is a reverse sort, not a
forward sort, because it flips x and y in the cmp().
(This code predates .sort() having a reverse= argument or at least
my general awareness and use of it.)
And that's the drawback of allowing or using a sort comparison function: it's not as clear as directly saying what you mean. Small things in the comparison function can have big impacts and they're easy to overlook. By contrast, my intentions and what's going on are clearly spelled out when these things are rewritten into the modern form:
dl.sort(key=lambda x: x.timestamp, reverse=True)
coms.sort(key=lambda x: x.time)
Anyone, a future me included, is much less likely to miss the difference in sort order when reading (or skimming) this code.
I now feel that in practice you want to avoid using a comparison
function as much as possible even if one exists for exactly this
reason. Try very hard to directly say what you mean instead of
hiding it inside your cmp function unless there's no way out.
A direct corollary of this is that sorting interfaces should
try to let you directly express as much as possible instead of
forcing you to resort to tricks.
(Note that there are some cases where you must use a comparison function in some form (see especially the second comment).)
PS: I still disagree with Python 3 about removing the cmp argument entirely. It hasn't removed the ability to have custom sort functions; it's just forced you to write a lot more code to enable them and the result is probably even less efficient than before.
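For the record, the Python 3 escape hatch is functools.cmp_to_key, which wraps a comparison function up in a key-making class. A sketch of the old reverse timestamp sort done that way, reusing the dl list from the example above:

    from functools import cmp_to_key

    def by_reverse_timestamp(x, y):
        # Emulate Python 2's cmp(), which Python 3 no longer has.
        return (y.timestamp > x.timestamp) - (y.timestamp < x.timestamp)

    dl.sort(key=cmp_to_key(by_reverse_timestamp))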
2014-10-20
Revisiting Python's string concatenation optimization
Back in Python 2.4, CPython introduced an optimization for string concatenation that was designed to reduce memory churn in this operation and I got curious enough about this to examine it in some detail. Python 2.4 is a long time ago and I recently was prompted to wonder what had changed since then, if anything, in both Python 2 and Python 3.
To quickly summarize my earlier entry,
CPython only optimizes string concatenations by attempting to grow
the left side in place instead of making a new string and copying
everything. It can only do this if the left side string only has
(or clearly will have) a reference count of one, because otherwise
it's breaking the promise that strings are immutable. Generally
this requires code of the form 'avar = avar + ...' or 'avar +=
...'.
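To make the eligible forms concrete, here's a small illustration (avar and bvar are invented names, and 'may' is all the promise you get):

    avar = "start"
    avar = avar + "more"   # eligible: the result is stored back into avar
    avar += "more"         # eligible: the augmented form of the same thing
    bvar = avar + "more"   # not eligible: avar keeps its own reference,
                           # so the result must be a brand new string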
As of Python 2.7.8, things have changed only slightly. In particular
concatenation of Unicode strings is still not optimized; this
remains a byte-string-only optimization. For byte strings there are two cases. Strings under a threshold somewhat less than 512 bytes can sometimes be grown in place by a few bytes, depending on their exact sizes; strings over that can be grown only if the system realloc() can find empty space after them.
(As a trivial note, CPython also optimizes concatenating an empty string to something by just returning the other string with its reference count increased.)
In Python 3, things are more complicated but the good news is that
this optimization does work on Unicode strings. Python 3.3+ has a
complex implementation of (Unicode) strings, but it does attempt
to do in-place resizing on them under appropriate circumstances.
The first complication is that internally Python 3 has a hierarchy
of Unicode string storage and you can't do an in-place concatenation
of a more complex sort of Unicode string into a less complex one.
Once you have compatible strings in this sense, in terms of byte
sizes the relevant sizes are the same as for Python 2.7.8; Unicode
string objects that are less than 512 bytes can sometimes be grown
by a few bytes while ones larger than that are at the mercy of the
system realloc(). However, how many bytes a Unicode string takes
up depends on what sort of string storage it is using, which I think
mostly depends on how big your Unicode characters are (see this
section of the Python 3.3 release notes and PEP 393 for the gory details).
So my overall conclusion remains as before; this optimization is
chancy and should not be counted on. If you are doing repeated
concatenation you're almost certainly better off using .join()
on a list; if you think you have a situation that's otherwise, you
should benchmark it.
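As a rough illustration of that advice, here's a small benchmark sketch; treat the sizes and repeat counts as arbitrary choices to adapt, not a definitive measurement:

    import timeit

    def concat(parts):
        s = ""
        for p in parts:
            s += p          # may or may not grow s in place
        return s

    def joined(parts):
        return "".join(parts)   # one allocation and one copy pass

    parts = ["x" * 10] * 10000
    print(timeit.timeit(lambda: concat(parts), number=100))
    print(timeit.timeit(lambda: joined(parts), number=100))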
(In Python 3, the place to start is PyUnicode_Append() in
Objects/unicodeobject.c. You'll probably also want to read
Include/unicodeobject.h and PEP 393 to understand this, and
then see Objects/obmalloc.c for the small object allocator.)
Sidebar: What the funny 512 byte breakpoint is about
Current versions of CPython 2 and 3 allocate 'small' objects using an internal allocator that I think is basically a slab allocator. This allocator is used for all objects that are 512 bytes or less overall, and it rounds object sizes up to the next 8-byte boundary. This means that if you ask for, say, a 41-byte object you actually get one that can hold up to 48 bytes and thus can be 'grown' in place up to that size.
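The rounding rule itself is simple arithmetic; here's a tiny illustration (this is just the math, not CPython's actual code):

    def rounded_alloc(nbytes):
        # Round a small request up to the next multiple of 8, the way
        # the small object allocator's size classes work.
        return (nbytes + 7) & ~7

    print(rounded_alloc(41))   # 48: a 41-byte object can 'grow' to 48 bytes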