2014-11-12
A wish: setting Python 3 to do no implicit Unicode conversions
In light of the lurking Unicode conversion issues in my DWiki port to Python 3, one of the things I've realized I would like in Python 3 is some way to turn off all of the implicit conversions to and from Unicode that Python 3 currently does when it talks to the outside world.
The goal here is the obvious one: since any implicit conversion is a place where I need to consider how to handle errors, character encodings, and so on, making them either raise errors or produce bytestrings would allow me to find them all (and to force me to handle things explicitly). Right now many implicit conversions can sail quietly past because they're only having to deal with valid input or simple output, only to blow up in my face later.
(Yes, in a greenfield project you would be paying close attention to all places where you deal with the outside world. Except of course for the ones that you overlook because you don't think about them and they just work. DWiki is not in any way a greenfield project and in Python 2 it arrogantly doesn't use Unicode at all.)
It's possible that you can fake this by setting your (Unix) character encoding to either an existing encoding that is going to blow up on utf-8 input and output (including plain ASCII) or to a new Python encoding that always errors out. However this gets me down into the swamps of default Python encodings and how to change them, which I'm not sure I want to venture into. I'd like either an officially supported feature or an easy hack. I suspect that I'm dreaming on the former.
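For what it's worth, the 'new Python encoding that always errors out' idea can at least be sketched. This is a minimal illustration with a made-up codec name; actually getting Python to use it for its implicit conversions (rather than the locale encoding) is the hard part:

import codecs

def _refuse(input, errors="strict"):
    # Stateless encoder/decoder that refuses to convert anything.
    raise UnicodeError("implicit bytes/str conversion attempted")

def _search(name):
    # Hypothetical codec name; codecs.lookup() hands us the requested name.
    if name == "no-implicit":
        return codecs.CodecInfo(name="no-implicit", encode=_refuse, decode=_refuse)
    return None

codecs.register(_search)

# Explicit use errors out as desired:
#   "abc".encode("no-implicit")  -> UnicodeError
# Making the *implicit* conversions pick it up (eg for stdin/stdout or file
# IO defaults) is exactly the swamp of default encodings mentioned above.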
(I suspect that there are currently places in Python 3 that both always perform a conversion and don't provide an API to set the character encoding for the conversion. Such places are an obvious problem for an official 'conversion always produces errors' setting.)
2014-11-10
What it took to get DWiki running under Python 3
For quixotic reasons I recently decided to see how far I could get with porting DWiki (the code behind this blog) to Python 3 before I ran out of either patience or enthusiasm. I've gotten much further than I expected; at this point I'm far enough that it can handle this entire site when running under Python's builtin basic HTTP server, rendering the HTML exactly the same as the Python 2 version does.
Getting this far basically took three steps. The largest step was
updating the code to modern Python 2,
because Python 3 doesn't accept various bits of old syntax. After
I'd done that, I ran 2to3 over the codebase to do a bunch of
mechanical substitutions, mostly rewriting print statements
and standard modules that had gotten reorganized in the transition.
The final necessary step was some Unicode conversion and mangling
(and with it reading some files in binary mode).
All of this sounds great, but the reality is that DWiki is only limping along under Python 3 and this is exactly because of the Unicode issue. Closely related to this is that I have not revised my WSGI code for any changes in the Python 3 version of WSGI (I'm sure there must be some, just because of character encoding issues). Doing a real Python 3 port of DWiki would require dealing with this, which means going through everywhere that DWiki talks to the outside world (for file IO, for logging, and for reading and replying to HTTP requests), figuring out where the conversion boundary is between Unicode and bytestrings, what character encoding I need to use and how to recognize this, and finally what to do about encoding and decoding errors. Complicating this is that some of these encoding boundaries are further upstream than you might think. Two closely related cases I've run into so far are that DWiki computes the ETag and Content-Length for the HTTP reply itself, and for obvious reasons both of these must be calculated against the encoded bytestring version of the content body instead of its original Unicode version. This happens relatively far inside my code, not right at the boundary between WSGI and me.
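To make the ETag and Content-Length point concrete, here is a minimal sketch of the shape of the problem (the function and its details are made up for illustration, not DWiki's actual code):

import hashlib

def finish_reply(body_str, encoding="utf-8"):
    # Encode exactly once, then derive the headers from the bytes, since
    # both Content-Length and the ETag describe what goes on the wire.
    body_bytes = body_str.encode(encoding)
    headers = [
        ("Content-Type", "text/html; charset=" + encoding),
        ("Content-Length", str(len(body_bytes))),
        ("ETag", '"%s"' % hashlib.md5(body_bytes).hexdigest()),
    ]
    return headers, [body_bytes]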
(Another interesting case is encoding URLs that have non-ASCII characters in them, for example from a page with a name that has Unicode characters in it. Such URLs can get encoded both in HTML and in the headers of redirects, and need to be decoded at some point on the way in, where I probably need to %-decode to a bytestring and then decode that bytestring to a Unicode string.)
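A hedged sketch of how that might look with the standard library (the helper names are invented):

from urllib.parse import quote, unquote_to_bytes

def decode_path(raw_path, encoding="utf-8"):
    # Incoming: %-decode to a bytestring first, then decode those bytes.
    # A UnicodeDecodeError here means the client sent a badly encoded URL.
    return unquote_to_bytes(raw_path).decode(encoding)

def encode_path(name, encoding="utf-8"):
    # Outgoing: encode the Unicode page name to bytes, then %-quote them.
    return quote(name.encode(encoding), safe="/")

# decode_path("/2014/caf%C3%A9")  -> "/2014/café"
# encode_path("/2014/café")       -> "/2014/caf%C3%A9"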
Handling encoding and decoding errors is a real concern of mine
for a production quality version of DWiki in Python 3. The problem
is that most input these days is well behaved, so you can go quite
a while before someone sends you illegal UTF-8 in headers, URLs,
or POST bodies (or for that matter sends you something in another
character set). This handily disguises failures to handle encoding
and decoding problems, since things work almost all the time. And
Python 3 has a lot of places with implicit conversions.
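What explicit handling at the boundary might look like, as a rough sketch (BadRequest and the helper are hypothetical, not anything DWiki currently has):

class BadRequest(Exception):
    # Hypothetical 'turn this into an HTTP 400 reply' exception.
    pass

def decode_or_400(raw_bytes, what, encoding="utf-8"):
    # Decode client-supplied bytes (header values, URL paths, POST bodies),
    # turning bad input into a client error instead of letting a
    # UnicodeDecodeError escape from deep inside the request handling.
    try:
        return raw_bytes.decode(encoding)
    except UnicodeDecodeError:
        raise BadRequest("invalid %s in %s" % (encoding, what))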
That these Unicode issues exist doesn't surprise me. Rather the reverse; dealing with Unicode has always been the thing that I thought would be hardest about any DWiki port to Python 3. I am pleasantly surprised by how few code changes were required to get to this point, as I was expecting many more code changes (and for them to be much more difficult to make, I think because at some point I'd got the impression that 2to3 wasn't very well regarded).
Given the depths of the Unicode swamps here, I'm not sure that I'll go much further with a Python 3 version of DWiki than I already have. But, as mentioned, it is both nice and surprising to me that I could get this far with this little effort. The basics of porting to Python 3 are clearly a lot less work than I was afraid of.
2014-11-07
Porting to Python 3 by updating to modern Python 2
For quixotic reasons I decided to take a shot at porting DWiki to Python 3 just to see how difficult and annoying it would be and how far I could get. One of the surprising things about the process has been that a great deal of porting to Python 3 has been less about porting the code and more about modernizing it to current Python 2 standards.
DWiki is what is now a pretty old codebase (as you might guess) and even when it was new it wasn't written with the latest Python idioms for various reasons, including that I started with Python back in the Python 1.5 era. As a result it contained a number of long obsolete idioms that are very much not supported in Python 3 and had to be changed. Once the dust settled it turned out that modernizing these idioms was most (although not all) of what was needed to make DWiki at least start up under Python 3.
At this point you might be wondering just what ancient idioms I was still using. I'm glad you asked. DWiki was doing all of these:
- 'raise EXCEPTION, STR' instead of 'raise E(STR)'. I have no real excuse here; I'm sure this was considered obsolete even when I started writing DWiki.
- 'except CLS, VAR:' instead of 'except CLS as VAR:', which I think is at least less ancient than my raise usage.
- using comparison functions in .sort() instead of key=... and reverse=True. Switching made things clearer.
- dividing two integers with '/' and expecting the result to be an integer. In Python 3 this is a float instead, which caused an interesting bug when I used the result as an (integer) counter. Using '//' explicitly is better and is needed in Python 3 (all four idioms are sketched in old-versus-new form below).
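Here's that sketch, with made-up names; the old forms survive only as comments because they're no longer valid syntax:

class WikiErr(Exception):
    pass

def modernized(entries, npages, ncols):
    # Old: raise WikiErr, "no such page"
    err = WikiErr("no such page")

    # Old: except WikiErr, e:
    try:
        raise err
    except WikiErr as exc:
        pass

    # Old: entries.sort(lambda a, b: cmp(b.timestamp, a.timestamp))
    entries.sort(key=lambda e: e.timestamp, reverse=True)

    # Old: npages / ncols (silently truncating integer division in Python 2;
    # a float in Python 3, hence the explicit //)
    return npages // ncols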
I consider this modernization of the Python 2 codebase to be a good thing. Even if I never do anything with a Python 3 version of DWiki, updating to the current best practice idioms is an improvement of the code (especially since it's public and I'd like it to not be too embarrassing). I'm glad that trying out a Python 3 port has pushed me into doing this; it really has been overdue.
(Another gotcha that Python 3 exposed is that in at least one place
I was assuming that 'None > 0' was a valid comparison to make and
would be False. This works in Python 2 but it's not exactly a
good idea and fixing the code to explicitly check for None is a
good cleanup. Since this sort of stuff can only really be checked
dynamically there may be other spots that do this.)
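The fix itself is straightforward; a minimal illustration (the helper name is invented):

def over_threshold(count):
    # In Python 2, 'None > 0' quietly evaluated to False; in Python 3 it
    # raises TypeError, so a missing count has to be handled explicitly.
    if count is None:
        return False
    return count > 0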
2014-10-31
A drawback to handling errors via exceptions
Recently I discovered an interesting and long standing bug in DWiki. DWiki is essentially a mature program, so this one was uncovered through the common mechanism of someone using invalid input, in this case a specific sort of invalid URL. DWiki creates time-based views of this blog through synthetic parts of the URLs that end in things like, for example, '.../2014/10/' for entries from October 2014. Someone came along and requested a URL that looked like '.../2014/99/', and DWiki promptly hit an uncaught Python exception (well, technically it was caught and logged by my general error code).
(A mature program usually doesn't have bugs handling valid input, even uncommon valid input. But the many forms of invalid input are often much less well tested.)
To be specific, it promptly coughed up:
calendar.IllegalMonthError: bad month number 99; must be 1-12
Down in the depths of the code that handled a per-month view I was
calling calendar.monthrange() to determine how many days a given month
has, which was throwing an exception because '99' is of course not a
valid month of the year. The exception escaped because I wasn't doing
anything in my code to either catch it or not let invalid months get
that far in the code.
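The eventual fix is simply to validate the month before it gets anywhere near calendar. A sketch of the idea (the helper is made up, not DWiki's actual code):

import calendar

def days_in_month(year, month):
    # Reject impossible months up front instead of letting
    # calendar.IllegalMonthError escape from deep inside the view code.
    if not 1 <= month <= 12:
        return None
    return calendar.monthrange(year, month)[1]

# days_in_month(2014, 10) -> 31
# days_in_month(2014, 99) -> None, which the caller can turn into a 404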
The standard advantage of handling errors via exceptions definitely applied here. Even though I had totally overlooked this error possibility, the error did not get quietly ignored and go on to corrupt further program state; instead I got smacked over the nose with the existence of this bug so I could find it and fix it. But it also exposes a drawback of handling errors with exceptions, which is that it makes it easier to overlook the possibility of errors because that possibility isn't explicit.
The calendar module doesn't document
what exceptions it raises, either in general or especially in the
documentation for monthrange() in specific (where it would be easy
to spot while reading about the function). Because an exception is
effectively an implicit extra return 'value' from functions, it's
easy to overlook the possibility that you'll actually get an exception;
in Python, there's nothing there to rub your nose in it and make you
think about it. And so I never even thought about what happened if
monthrange() was handed invalid input, in part because of the
usual silent assumption that the
code would only be called with valid input because of course DWiki
doesn't generate date range URLs with bad months in them.
Explicit error returns may require a bunch of inconvenient work to handle them individually instead of letting you aggregate exception handling together, but the mere presence of an explicit error return in a method's or function's signature serves as a reminder that yes, the function can fail and so you need to handle it. Exceptions for errors are more convenient and safer for at least casual programming, but they do mean you need to ask yourself what-if questions on a regular basis (here, 'what if the month is out of range?').
(It turns out I've run into this general issue before, although that time the documentation had a prominent notice that I just ignored. The general issue of error handling with exceptions versus explicit returns is on my mind these days because I've been doing a bunch of coding in Go, which has explicit error returns.)
2014-10-28
My current somewhat tangled feelings on operator.attrgetter
In a comment on my recent entry on sort comparison functions, Peter Donis asked a good question:
Is there a reason you're not using operator.attrgetter for the key functions? It's faster than a lambda.
One answer is that until now I hadn't heard of operator.attrgetter.
Now that I have it's something I'll probably consider in the future.
But another answer is embedded in the reason Peter Donis gave for
using it. Using operator.attrgetter is clearly a speed optimization,
but speed isn't always the important thing. Sometimes, even often,
the most important thing to optimize is clarity. Right now, for me
attrgetter is less clear than the lambda approach because I've
just learned about it; switching to it would probably be a premature
optimization for speed at the cost of clarity.
In general, well, 'attrgetter' is a clear enough thing
that I suspect I'll never be confused about what
'lst.sort(key=operator.attrgetter("field"))' does, even if I forget
about it and then reread some code that uses it; it's just pretty
obvious from context and the name itself. There's a visceral bit of
me that doesn't like it as much as the lambda approach because I
don't think it reads as well, though. It's also more black magic than
lambda, since lambda is a general language construct and attrgetter
is a magic module function.
(And as a petty thing it has less natural white space. I like white space since it makes things more readable.)
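For concreteness, the two forms are interchangeable here (with a toy Entry type standing in for DWiki's actual objects):

import operator
from collections import namedtuple

Entry = namedtuple("Entry", ["name", "timestamp"])
entries = [Entry("b", 20), Entry("a", 10)]

# These two sorts do the same thing; the difference is purely in how the
# key function is spelled.
entries.sort(key=lambda e: e.timestamp)
entries.sort(key=operator.attrgetter("timestamp"))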
On the whole this doesn't leave me inclined to switch to using
attrgetter for anything except performance sensitive code (which these
sort()s aren't so far). Maybe this is the wrong decision, and if the
Python community as a whole adopts attrgetter as the standard and usual
way to do .sort() key access it certainly will become a wrong decision.
At that point I hope I'll notice and switch myself.
(This is in a sense an uncomfortable legacy of CPython's historical
performance issues with Python code. Attrgetter is clearly a performance
hack in general; if lambda were just as fast, I'd argue that you
should clearly use lambda because it's a general language feature
instead of a narrowly specialized one.)
2014-10-23
The clarity drawback of allowing comparison functions for sorting
I've written before about my unhappiness that Python 3 dropped support for using a comparison function. Well, let me take that back a bit, because I've come around to the idea that there are some real drawbacks to supporting a comparison function here. Not drawbacks in performance (which are comparatively unimportant here) but drawbacks in code clarity.
DWiki's code is sufficiently old that it uses only .sort() cmp
functions simply because, well, that's what I had (or at least
that's what I was used to). As a result, in two widely scattered
spots in different functions its code base contains the following
lines:
def func1(...):
    ....
    dl.sort(lambda x,y: cmp(y.timestamp, x.timestamp))
    ....

def func2(...):
    ....
    coms.sort(lambda x,y: cmp(x.time, y.time))
    ....
Apart from the field name, did you see the difference there? I didn't
today while I was doing some modernization in DWiki's codebase and
converted both of these to the '.sort(key=lambda x: x.FIELD)'
form. The difference is that the first is a reverse sort, not a
forward sort, because it flips x and y in the cmp().
(This code predates .sort() having a reverse= argument or at least
my general awareness and use of it.)
And that's the drawback of allowing or using a sort comparison function: it's not as clear as directly saying what you mean. Small things in the comparison function can have big impacts and they're easy to overlook. By contrast, my intentions and what's going on are clearly spelled out when these things are rewritten into the modern form:
dl.sort(key=lambda x: x.timestamp, reverse=True)
coms.sort(key=lambda x: x.time)
Anyone, a future me included, is much less likely to miss the difference in sort order when reading (or skimming) this code.
I now feel that in practice you want to avoid using a comparison
function as much as possible even if one exists for exactly this
reason. Try very hard to directly say what you mean instead of
hiding it inside your cmp function unless there's no way out.
A direct corollary of this is that sorting interfaces should
try to let you directly express as much as possible instead of
forcing you to resort to tricks.
(Note that there are some cases where you must use a comparison function in some form (see especially the second comment).)
PS: I still disagree with Python 3 about removing the cmp argument entirely. It hasn't removed the ability to have custom sort functions; it's just forced you to write a lot more code to enable them and the result is probably even less efficient than before.
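(The blessed escape hatch in Python 3 is functools.cmp_to_key(), which wraps a cmp-style function in a class whose instances do the comparisons. A quick sketch with stand-in data:)

import functools

def cmp(a, b):
    # Python 3 also dropped the builtin cmp(), so even that has to be recreated.
    return (a > b) - (a < b)

dl = [3, 1, 2]  # stand-in data
dl.sort(key=functools.cmp_to_key(lambda x, y: cmp(y, x)))
# Every comparison now goes through a wrapper object's special methods,
# which is where the 'probably even less efficient' part comes from.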
2014-10-20
Revisiting Python's string concatenation optimization
Back in Python 2.4, CPython introduced an optimization for string concatenation that was designed to reduce memory churn in this operation and I got curious enough about this to examine it in some detail. Python 2.4 is a long time ago and I recently was prompted to wonder what had changed since then, if anything, in both Python 2 and Python 3.
To quickly summarize my earlier entry,
CPython only optimizes string concatenations by attempting to grow
the left side in place instead of making a new string and copying
everything. It can only do this if the left side string only has
(or clearly will have) a reference count of one, because otherwise
it's breaking the promise that strings are immutable. Generally
this requires code of the form 'avar = avar + ...' or 'avar +=
...'.
As of Python 2.7.8, things have changed only slightly. In particular
concatenation of Unicode strings is still not optimized; this
remains a byte string only optimization. For byte strings there are two
cases. Strings under somewhat less than 512 bytes can sometimes be grown
in place by a few bytes, depending on their exact sizes. Strings over
that can be grown if the system realloc() can find empty space after
them.
(As a trivial case, CPython also optimizes concatenating an empty string to something by just returning the other string with its reference count increased.)
In Python 3, things are more complicated but the good news is that
this optimization does work on Unicode strings. Python 3.3+ has a
complex implementation of (Unicode) strings, but it does attempt
to do in-place resizing on them under appropriate circumstances.
The first complication is that internally Python 3 has a hierarchy
of Unicode string storage and you can't do an in-place concatenation
of a more complex sort of Unicode string into a less complex one.
Once you have compatible strings in this sense, in terms of byte
sizes the relevant sizes are the same as for Python 2.7.8; Unicode
string objects that are less than 512 bytes can sometimes be grown
by a few bytes while ones larger than that are at the mercy of the
system realloc(). However, how many bytes a Unicode string takes
up depends on what sort of string storage it is using, which I think
mostly depends on how big your Unicode characters are (see this
section of the Python 3.3 release notes and PEP 393 for the gory details).
So my overall conclusion remains as before; this optimization is
chancy and should not be counted on. If you are doing repeated
concatenation you're almost certainly better off using .join()
on a list; if you think you have a situation that's otherwise, you
should benchmark it.
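If you do want to check your particular case, timeit makes a quick comparison easy (the sizes here are arbitrary; results will vary with string sizes, Python version, and your platform's realloc()):

import timeit

setup = "pieces = ['x' * 50] * 1000"
concat = """
s = ''
for piece in pieces:
    s += piece
"""
joined = "s = ''.join(pieces)"

print(timeit.timeit(concat, setup=setup, number=1000))
print(timeit.timeit(joined, setup=setup, number=1000))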
(In Python 3, the place to start is PyUnicode_Append() in
Objects/unicodeobject.c. You'll probably also want to read
Include/unicodeobject.h and PEP 393 to understand this, and
then see Objects/obmalloc.c for the small object allocator.)
Sidebar: What the funny 512 byte breakpoint is about
Current versions of CPython 2 and 3 allocate 'small' objects using an internal allocator that I think is basically a slab allocator. This allocator is used for all overall objects that are 512 bytes or less and it rounds object size up to the next 8-byte boundary. This means that if you ask for, say, a 41-byte object you actually get one that can hold up to 48 bytes and thus can be 'grown' in place up to this size.
2014-09-27
DWiki, Python 3, Python, and me
A while back I tweeted:
Programming in #golang remains fun. I'm not sure if this is true for me for Python any more, but maybe I need the right project.
One of the problems for me with Python programming is that I kind of have a millstone and this millstone intersects badly with Python 3, which I kind of want to be using.
I have a number of Python projects, both work and personal. The stuff for work is not moving to Python 3, in significant part because most of our systems don't have good versions of Python 3 (or sometimes any versions of it). Most of my personal Python projects are inactive (eg), generally because I don't have much use for them any more. The personal project that is the exception is DWiki, the software behind Wandering Thoughts.
Unfortunately DWiki's source code is kind of a mess and as a result DWiki itself is sort of a millstone. DWiki has grown far larger than I initially imagined it ever would be and I didn't do a great job of designing it from the start (partly because I did not really understand many of the problems I was dealing with when I started writing it, which resulted in some core design missteps, and partly because it changed directions during development). The code has wound up being old and tangled and not very well commented. One of the consequences of this is that making any changes at all takes a surprising amount of work, partly just to recover my original understanding of the code, and as a result I need to have a lot of energy and enthusiasm to actually start trying to make any change.
(For instance, I've wanted to add entry tags to DWiki for a long time and I even have a strawman design in my head. What I haven't had so far is enough time and energy to propel me to dive into the code and get it done. And partly this is because the idea of working on the mess of DWiki's code just plain saps my enthusiasm.)
DWiki is currently a Python 2 program. I expect that moving it to Python 3 would take a fair amount of work and a lot of mucking around in the depths of its code (and then a bunch more work to make it use any relevant exciting Python 3 features). In fact the very idea of attempting the conversion is extremely daunting. But at the same time DWiki is the only Python program I'm likely to work on any time soon and the only one that is really important to carry forward to a Python 3 future (because it's the one program I expect to be running for a long time).
(Of course DWiki has no tests as such, especially unit tests. My approach for testing important changes is scripting to render all pages of CSpace in an old and a new code version and then compare the rendered output.)
So there my millstone sits, sapping my enthusiasm for dealing with it and by extension my enthusiasm for Python 3. I would be reasonably happy to magically have a Python 3 version of DWiki and I'm sure it would prompt me to dive into Python 3 in a fairly big way, but I can't see how I actually get to that future. Life would be different if I could see a way that Python 3 would be a really big win for DWiki (such as significantly speeding it up or allowing me to drastically simplify chunks of code), but I don't believe that (and I know that Python 3 will bring complications).
(Life would also be different if DWiki didn't work very well for some reason (or needed maintenance) and I clearly needed to do something to it. But the truth is it works pretty well as it is. It's just missing wishlist items, such as tags and running under Python 3.)
PS: on the other hand, if I keep thinking and talking about DWiki and Python 3, maybe I'll talk myself into trying a conversion just to see how difficult it really is. The idea has a certain perverse attraction.
Sidebar: Why a major rewrite is not the answer
At this point some people will urge me to gut major portions of the current code (or all of it) and rebuild it from scratch, better and cleaner and so on. The simple answer about this is that if I was going to redo DWiki from more or less scratch (which has a number of attractions), I don't see why I'd do it in Python 3 instead of in Go. Programming in Python 3 would likely be at least somewhat faster than in Go but I don't think it would be a massive difference, while the Go version would almost certainly run faster and it would clearly have a much simpler deployment story.
So why not go ahead and rewrite DWiki in Go? Because I don't want to do that much work, especially since DWiki works today and I don't think I'd gain any really major wins from a rewrite (I've pretty much scrubbed away all of DWiki's pain points for day to day usage already).
2014-09-22
Another side of my view of Python 3
I have been very down on Python 3 in the past. I remain sort of down on it, especially in the face of substantial non-current versions on the platforms I use and want to use, but there's another side of this that I should admit to: I kind of want to be using Python 3.
What this comes down to at its heart is that for all the nasty things I say about it, Python 3 is where the new and cool stuff is happening in Python. Python 3 is where all of the action is and I like that in general. Python 2 is dead, even if it's going to linger on for a good long while, and I can see the writing on the wall here.
(One part of that death is that increasingly, interesting new modules are only going to be Python 3 or are going to be Python 3 first and only Python 2 later and half-heartedly.)
And Python 3 is genuinely interesting. It has a bunch of new idioms to get used to, various challenges to overcome, all sorts of things to learn, and so on. All of these are things that generally excite me as a programmer and make it interesting to code stuff (learning is fun, provided I have a motivation).
Life would be a lot easier if I didn't feel this way. If I felt that Python 3 had totally failed as a language iteration, if I thought it had taken a terrible wrong turn that made it a bad idea, it would be easy to walk away from it entirely and ignore it. But it hasn't. While I dislike some of its choices and some of them are going to cause me pain, I do expect that the Python 3 changes are generally good ones (and so I want to explore them). Instead, I sort of yearn to program in Python 3.
So why haven't I? Certainly one reason is that I just haven't been writing new Python code lately (and beyond that I have real concerns about subjecting my co-workers to Python 3 for production code). But there's a multi-faceted reason beyond that, one that's going to take another entry to own up to.
(One aspect of the no new code issue is that another language has been competing for my affections and doing pretty well so far. That too is a complex issue.)
2014-08-23
Some notes on Python packaging stuff that wasn't obvious to me
A comment by Lars Kellogg-Stedman on
this entry of mine wound up with
me wanting to try out his lvcache utility, which is a Python program
that's packaged with a setup.py. Great, I thought, I know how to
install these things.
Well, no, not any more. While I wasn't looking, Python packaging systems have gotten absurdly complex and annoying (and yes, one of the problems is that there are more than one of them). My attempts to install lvcache (either privately or eventually system-wide in a sacrificial virtual machine) failed in various ways. In the process they left me very frustrated because I had very little understanding of what a modern Python setup does when. Since I now have somewhat more understanding I'm going to write up what I know.
Once upon a time there was just site-packages with .py files
and plain directories in it, and life was simple and good. If you
wanted to you could augment the standard site-packages by setting
$PYTHONPATH; the additional directories would be searched for
.py files and plain directories too. Modern Python has added some
wrinkles:
- .pth files list additional paths that will be used for importing things from (generally relative to the directory you find them in). These additional import paths are visible in sys.path, so if you're not sure if a .pth file is working you can start Python and check what sys.path reports.
- .pth files in standard locations are loaded automatically; this includes your personal 'user' directory (on Unix, generally $HOME/.local/lib/pythonX.Y/site-packages, ie what 'python setup.py install --user' et al will use). However, .pth files in directories that are merely on your $PYTHONPATH are not automatically loaded by Python and must be bootstrapped somehow; if you use easy_install --prefix, it will stick a site.py file to do this in the directory. (There are some really weird things that go on with .pth files. See Armin Ronacher.)
- .egg files are ZIP files, which Python can import code from directly. They contain metadata and a module directory with .py files and normally appear directly on sys.path (eg the .egg file is listed itself). You can inspect .egg file contents with 'unzip -v thing.egg'. Under some circumstances it's possible for the install process to build a .egg that doesn't contain any Python code (or contains incomplete Python code); if you're facing mysterious failures, you may need to check for this.
- .egg directories are unpacked versions of the ZIP versions above. I don't know when easy_install et al create directories versus files. Like the files they appear on sys.path directly. They can be inspected directly.
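A quick way to see whether any of this machinery actually took effect is to look at sys.path itself; a .pth file is nothing more exotic than a list of directories (plus the special case that lines starting with 'import' get executed, which is how the site.py bootstrapping trick works):

import sys
import pprint

# Every directory, .egg file and .pth entry that got picked up ends up
# here, so this is the first thing to check when an import fails.
pprint.pprint(sys.path)

# A .pth file itself just looks like, say:
#   /u/cks/lib/python/site-packages
#   ./somepackage-1.0-py2.7.egg
# (paths and names here are made up for illustration)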
Modern installers no longer just put files and module directories
in places. Instead, they make or obtain eggs and install the eggs.
The good news is that things like easy_install follow dependencies
(assuming that everyone has properly specified them, not always a
given). The bad news is that this is much less inspectable than the
old days.
(Okay, the other good news is that you can see which version of what you've installed by hand, instead of having a mess of stuff.)
In a properly functioning installed environment you should be able
to fire up an interactive Python session and do 'import <module>'
for every theoretically installed module. If this fails, either any
.pth files are not getting bootstrapped (which can be checked by
looking at sys.path), you don't have a module installed that you
think you should, or perhaps the module is empty or damaged.
I'm sure all of this is documented in one or more places in the official Python documentation, but it is sure not easy to find if it is (and I really don't think there's one place that puts it all together).
PS: if you're installing a local copy of a package's source you
want 'easy_install .' (in the source directory), likely with
--user or --prefix. At least some of the time, easy_install
will insist that you precreate the --prefix directory for it; it
will always insist that you add it to $PYTHONPATH.
(The current anarchy of Python packaging and install systems requires another rant but I am too exhausted for it right now.)