Wandering Thoughts archives

2014-01-03

What determines Python 2's remaining lifetime?

In light of recent things one can sensibly ask just how long people can keep using Python 2. At one level the answer is 'until there is some significant problem with Python 2 that the Python developers don't fix'. This could be either a security issue or something important that Python 2 and its core modules don't support (eg, at one point IPv6 would have been such a thing). At the moment it isn't clear how long the Python developers will keep fixing things in Python 2; the most recent release was 2.7.6, made November 10th. However, it has been asserted that the Python developers will provide full support until 2015 and probably security fixes afterwards.

But this is an incomplete view of the situation, because the Python developers themselves aren't the only people involved; there are also the OS distributions that package some version of Python. Many of these OS distributions actually depend on Python themselves for system tools. If the Python developers stop fixing Python 2, the OS distributors may well have no choice but to take over. So we can ask a related question: how close are OS distributions to moving away from Python 2?

I will skip to the summary; just as with last time the news is not good for Python 2 going away any time soon. Ubuntu will likely miss their migration target for 14.04 LTS, leaving them needing Python 2 on 14.04 LTS until 2019. Fedora won't transition for at least a year (ie Fedora 22), which means that Red Hat Enterprise Linux 7 will almost certainly ship at some point in 2014 with system tools depending on Python 2, leaving Red Hat supporting Python 2 until at least 2019 and likely later.

(FreeBSD, OmniOS, and so on are less interesting here because as far as I know none of them are promising long support periods the way RHEL and Ubuntu LTS are. However I believe that all of them are still shipping Python 2 as the default and I know that OmniOS has tools that depend on it.)

So my answer is that in practice it's highly likely that Python 2 will get important updates through at least 2020, whether this is from the Python developers themselves or from Linux distributions with long term support needs forking the code and doing their own fixes. Given what Alex Gaynor reports, some of this will likely be driven by customer demand, ie people will likely be deploying Python 2 systems on RHEL 7 and Ubuntu 14.04 LTS and will want any important Python fixes for them.

Python2Lifetime written at 02:53:15; Add Comment

2014-01-02

Python 3's core compatibility problem and decision

In light of the still ongoing issues with Python 3, a very interesting question is what makes moving code to Python 3 so difficult. After all, Python has made transitions before, even transitions with little or no backwards compatibility, and given that there is very little real difference between Python 2 and 3 you would normally expect people to have no real problems shifting code across. Things like changing from using print to using print() are not really a big deal.

In my view almost all of the Python 3 issues come down to one decision (or really one aspect of that decision): making all strings Unicode strings, with no real backwards path to working with bytes. Working with Unicode strings instead of byte blobs is a fundamental change to the structure of most programs. Moving many programs to Python 3 requires changing them into programs that fundamentally work in Unicode, and this by itself can require a whole host of changes throughout a program's data structures and interfaces, as well as forcing you to explicitly consider failure points that you didn't need to think about before. It is this design rethink that is the hard part of moving code, not mechanical changes like turning print into print().
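As a minimal sketch of what I mean (the filename here is purely hypothetical), the same casual line of code is a byte operation in Python 2 but an implicit decode in Python 3, complete with a new way to fail:

# Python 2: returns a str of raw bytes and never fails on encoding
data = open("mail.log", "r").read()

# Python 3: the same "r" mode decodes the bytes to a Unicode str and
# can die with a UnicodeDecodeError if the file isn't valid in your
# locale's encoding; to stay with bytes you must explicitly ask
data = open("mail.log", "rb").read()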

I think that this is also why there is a significant gulf between different people's experiences of working with Python 3. Some people have already been writing code that worked internally in Unicode, even in Python 2. This code is easy to move to Python 3 because it has already dealt with the major issue of that conversion; all that's left is more or less mechanical issues like print versus print(). Other people, with me as an example, have a great deal of code that is really character encoding indifferent (well, more or less) and as a result need to redesign its core for Python 3.

(I think that this also contributes to a certain amount of arrogance on the side of Python 3 boosters, in that they may feel that anyone who was 'sloppily' not working in Unicode already was doing it wrong and so is simply paying the price for their earlier bad habits. To put it one way, this is missing the real problem as usual, never mind the practical issues involved on Unix.)

Honesty requires me to admit that this is a somewhat theoretical view in that I haven't attempted to convert any of my code to Python 3. This is however the aspect of the conversion that I'm most worried about and that strikes me as what would cause me the most problems.

Sidebar: a major contributing factor

I feel that one major factor that makes the Unicode decision a bigger issue than it would otherwise be is that Python 3 doesn't just make literal strings Unicode; it makes a lot of routines that previously returned bytestrings return Unicode strings instead. Many of these routines naturally deal with byte blobs, and the forced conversion to Unicode thus creates new encoding-related failure points. It is this decision that forces design changes on programs that would otherwise be indifferent to the issues involved, because they only shove the byte blobs around without looking inside them.
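For example (a sketch, not an exhaustive list), all of these return bytestrings in Python 2 but Unicode strings in Python 3, decoded from whatever bytes the operating system actually had:

import os, sys

names = os.listdir(".")        # filenames are byte blobs on Unix
args = sys.argv[1:]            # so are command line arguments
home = os.environ.get("HOME")  # and environment variables

(In Python 3 the decoding normally uses the surrogateescape error handler, so the failure point often moves to when you later encode or print the resulting strings.)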

My impression is that Python 3 has gotten somewhat better about this since its initial release, in that many more things are willing to work with bytes or can be coerced into returning bytes if you find the right magic options. This still leaves you to go through your code to find all of the places where this is needed (and to hope you don't miss a rarely executed code path), and sometimes to revise code to account for what the new options really mean.

(For example, you now get bytes out of files by opening them in "rb" mode. However this mode has potentially important behavior differences from the Python 2 plain "r" mode; it does no newline conversion and is buffered differently on ttys.)

Python3CoreProblem written at 01:43:35; Add Comment

2013-12-31

Reversing my view on Python 3 for new general code: avoid it

Just about exactly two years ago I wrote Python3NewCode, in which I waved my hands a bunch and then said:

Ignoring [my handwaved issues] as ultimately unimportant, I don't think there's any reason not to write new, non-sysadmin code in Python 3.

I take all of that back. In retrospect I was being too nice to Python 3 back then and I was wrong to do so. Here is my new view: you should avoid Python 3 even for new code because there is very little to gain from writing in Python 3 and significant downsides to doing so.

(Part of those downsides is that the things that I so blithely handwaved away did not in fact go away and remain as real issues today, two years after I wrote that entry.)

The spark for this reassessment is twofold. First, I have not actually written any of my new Python code in Python 3 (for all sorts of reasons I'm not going to belabor). Second, Alex Gaynor recently wrote 'About Python 3' and this got me thinking about the whole situation and my feelings.

The big problem with Python 3, per Gaynor's article, is that the Python 3 ecosystem is a ghost town. Regardless of whether or not you have Python 3 available on any particular system, the reality is that almost no one is writing Python 3 code. The practical Python ecosystem, the one where people will answer your questions and develop interesting new modules and tools, is the Python 2 one. Useful deployment options are in practice Python 2 ones. If you choose to write in Python 2, you get to take advantage of all of this. If you write in Python 3, not so much. In exchange for giving up all of this you get very little; most people will find no killer, must-have feature in Python 3 to compensate for the risks and problems they take on by using it.

(There are some modules that are only available for Python 3. My impression is that for the most part they come from the core Python developers, precisely because all outside people who are developing modules understand that most of the actual Python programming happens in Python 2.)

Given the complete shambles of the Python 2 to Python 3 transition and the resulting uncertainty about what's going to happen in the longer term, I can't recommend starting even greenfield development in Python 3 unless you have some quite strong reason for it (ie, something important that you can do in Python 3 but not in Python 2). Certainly I reverse my previous position; there's no strong reason to write new code in Python 3 and some good reasons to avoid doing so. Python 2 is here and good today. Even today, Python 3 is still a 'someday maybe in the future' thing.

(At this point I'm not sure if a genuine Python 2 to Python 3 transition will ever happen. The really pessimistic future is that Python 2 shambles on as an increasingly creaky zombie for the next decade, Python 3 effectively fails and becomes irrelevant, and as a result Python itself is abandoned for other languages.)

Python3NewCodeII written at 01:47:12; Add Comment

Link: Alex Gaynor's 'About Python 3'

Alex Gaynor just wrote About Python 3, which is not a bright and happy assessment of the state of Python 3. He says many things that I agree with wholeheartedly, and he says them from a position of authority and with good writing. He also crystallizes a number of things for me, such as the following:

Since the time of the Python 3.1 it's been regularly said that the new features and additions the standard library would act as carrots to motivate people to upgrade. Don't get me wrong, Python 3.3 has some really cool stuff in it. But 99% of everybody can't actually use it, so when we tell them "that's better in Python 3", we're really telling them "Fuck You", because nothing is getting fixed for them.

Yes. This. Wholeheartedly this. Every Python 3 only feature, module, or improvement might as well be on the far side of the moon for all the chance I have of using it for anything meaningful.

And what he says at the end, too. Everything that the core Python developers are currently doing is completely irrelevant to what I do with Python and will probably be for at least five more years and perhaps as much as a decade. At this point we are living on different planets.

By the way, significant problems surfacing with Python 2 and not getting fixed would not get me to migrate to Python 3. I cannot migrate to Python 3 at this point because it is simply not present on platforms that I use. Very soon my best alternative to Python 2 will probably be Go, because at least I'll be able to compile static binaries for just about everything I care about and push them to the target machines.

(Using Go will suck compared to using Python for the problems that I use Python for, but it will suck less than building and installing my own version of Python 3.)

This is a drum that I have been banging on for some time so of course I'm happy to see it getting attention from people with influence, instead of just my lone voice in a corner. I'd like to think that people like Alex Gaynor speaking up will create actual change but I don't expect that to happen at this point. The core Python developers have to be very invested in their vision of Python 3 and its transition by now; a significant reversal would be very surprising because people almost never reverse such deeply held positions regardless of what they are.

GaynorAboutPython3 written at 01:11:42; Add Comment

2013-12-12

Some observations from playing with PyPy on DWiki

DWiki is the software behind Wandering Thoughts. It makes a convenient test platform for me to experiment with PyPy because it's probably the most CPU-intensive Python code I have anything to do with and also the potentially longest-running program I have, which turns out to be very important for PyPy performance. In the process of doing this today I've wound up with some observations.

(All of these are against PyPy 2.1.0 on a 64-bit Fedora 19 machine.)

My first discovery is that it can be surprisingly hard to make a reasonably optimized program descend into the sort of true CPU crunching that PyPy theoretically accelerates drastically. DWiki has significant amounts of caching that try to avoid (theoretically) expensive operations like turning DWikiText into HTML, and in normal operation these caches are hit all of the time. PyPy doesn't seem to be able to do anything too impressive with what's left.

(In reading PyPy performance documentation I see that I'm probably also getting hit by bad PyPy performance on cPickle, as DWiki's caches are pickle-based.)

When I bypassed some of this caching so that my Python code was doing a lot more work, I got confirmation of what I already sort of knew: PyPy required a lot of warmup before it performed well. And by 'performed well' I mean 'ran at least as fast as CPython'. In my code on a very low level operation (simply converting DWikiText to HTML, without any caches), PyPy needed hundreds of repeats of warmup before it crossed over to being faster than CPython. This general issue is common for tracing JITs, but I didn't expect it to be so large for PyPy. CPython has flat performance, of course. The good news is that on this low level task PyPy does eventually wind up faster than CPython (although it's hard to say how much faster; my test framework may over-specialize the generated code at present).
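Measuring this warmup is straightforward. Here's a minimal sketch of the sort of harness I mean, using a stand-in CPU-bound function instead of the real DWikiText to HTML conversion:

import time

def work():
    # stand-in for the real rendering operation; any CPU-heavy
    # pure Python function shows the same warmup curve under PyPy
    return sum(i * i for i in xrange(200000))

def time_reps(func, reps):
    # time each repetition separately so the JIT warmup shows up
    times = []
    for _ in xrange(reps):
        start = time.time()
        func()
        times.append(time.time() - start)
    return times

times = time_reps(work, 500)
print "mean of first 10 reps:", sum(times[:10]) / 10
print "mean of last 10 reps: ", sum(times[-10:]) / 10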

(This warmup issue has significant implications for preforking network servers. You likely need to have any given worker process handle quite a lot of requests before PyPy is at all worth it, and that raises concerns with slow memory leaks and so on.)

So far I have only talked about CPU usage and haven't mentioned memory usage. There's a good reason for that: for DWiki, PyPy's memory usage is through the roof. My test setup consistently has CPython at around 13 MB of active RAM (specifically RSS). PyPy doing the same thing takes anywhere from 70 MB to 130 MB depending on exactly what I'm testing. In many situations today this is a relative killer (especially again if you're dealing with a preforking network server, since PyPy memory usage seems to grow over time and that implies every child worker process will have its own copy).

My overall observation from this is unsurprising and unexciting, namely that PyPy is not a magic drop-in way of speeding up my code. It may work, it may not work, it may require code changes to work well (and right now the tools for finding what code to change are a bit lacking), and I won't know until I try. Unfortunately all of this uncertainty reduces my interest in PyPy.

(I have seen PyPy clearly beat CPython, so it can definitely happen.)

PyPyDWikiExperiments written at 02:04:17; Add Comment

2013-12-10

My current view of PyPy

In a comment on my entry about Go as a 'Python with performance' for me, I was asked about my views on using PyPy for this. I flailed around a bit in my reply there and since then I've been thinking about it more, so it's time to go on at more length.

The simple version is that today I think of PyPy as perhaps a way to make some Python programs go faster, but not as a way to write fast Python programs. If I have an existing Python program that fits what I think of as the PyPy profile (long-running, generally does basic operations, and I'm indifferent to memory usage) and I absolutely need it to go faster, I'd consider feeding it to PyPy to see what happens. If it speeds up without exploding the memory usage, I've won and I can stop. If that doesn't work, well, time for other measures. However, PyPy is too unpredictable for me to be able to write Python code that I can count on it speeding up dramatically, especially if I also want to control memory usage and so on.

There are other pragmatic issues with using it. For a start, the version of PyPy available to me through distribution packages varies widely from system to system here, and with that variance I can expect an equally large performance variance. The current version of PyPy is 2.2.1, while Fedora 19 has 2.1.0 and Ubuntu 12.04 LTS is back at 1.8. Beyond that, a number of interesting Python environments just don't work with PyPy; for example, I can't use PyPy to speed up parts of a Django app deployed through mod_wsgi (not that the app is likely to have a performance bottleneck anyway; that's just an illustration).

There are also two serious problems with PyPy today that make it far less interesting to me (at least as of the Fedora 19 version, 2.1.0). The first is what I alluded to above: PyPy has a significant warmup delay before it starts speeding up your program and thus doesn't really speed up short-running things. I'm pretty sure that if I had a Python program that ran in a second, PyPy wouldn't speed it up very much. The second is that PyPy quietly explodes on common Python idioms under some circumstances.

For an example that I have personally run into, consider:

data = open("/some/file", "r").read()

This is a not uncommon Python idiom for casually reading in a file. If you try it in a PyPy-run program in any sort of situation where you do it repeatedly, you'll probably wind up with a 'too many open files' error before too long. In straight CPython, the open file is garbage collected immediately at the end of the .read() because its reference count drops to zero, which closes the file descriptor; in PyPy it hangs around (presumably until a full garbage collection run) and with it the open file descriptor. Boom.

Yes, yes, you say, this is bad style. The reality is that this 'bad style' is common in Python, as are other cases where code assumes that dropped or out-of-scope objects will be immediately garbage collected. I don't want to spend my time troubleshooting mysterious problems in otherwise reliable long-running Python programs that only appear when I run them under PyPy. Not running them under PyPy is by far the easier solution, even if it costs me performance.
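For completeness, the version of the idiom that behaves the same under both CPython and PyPy is the explicit one:

# deterministic cleanup regardless of garbage collection strategy
with open("/some/file", "r") as fp:
    data = fp.read()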

(In my opinion non-deterministic garbage collection is actually a serious problem, but that's another entry.)

PyPyView written at 00:49:07; Add Comment

2013-11-27

The difference between CPython and Python

Sometimes when I'm writing about Python things, I talk about 'CPython' (as I did yesterday). This is insider jargon; CPython is the term of art that's generally used when we're specifically referring to the behavior of the main implementation of Python (which is written in C, hence the 'CPython' coinage). This is the implementation that gets most of the publicity and a starring role on python.org. CPython is the Python that is 'version 2.7.6' and 'version 3.3.3' (as of right now) and is what the core Python developers work on. But it's not the only implementation of Python that exists. Today the most prominent other implementation of Python is probably PyPy; other implementations include Jython (Python in the JVM) and IronPython (Python in the CLR).

CPython is the original version of Python and for a long time it was the only Python that existed. It's still the authoritative version that everyone else is expected to be compatible with, because there is no comprehensive language specification for 'Python the language'. This is pretty common with all sorts of languages these days, which are generally implemented first and standardized later if at all.

(Among other reasons for this, writing a comprehensive language specification is a lot of work and then it is even more work to keep updating it as you change the language. And you don't really know if your specification was comprehensive enough until some crazy person attempts a second implementation purely from the specification without looking at how your language implementation behaves. If their implementation is fully compatible, your specification was a good one.)

I (and others) talk about CPython when we're talking about things that are specific to how CPython is implemented, that are specifically documented as implementation dependent, or that are simply likely to be implementation dependent rather than slavishly copied by everyone who does a Python implementation. For obvious reasons, pretty much all of the low level details of how CPython works fall into this general category, eg other Python implementations are unlikely to copy CPython's bytecode architecture. Where the boundary is between low level behavior and high level behavior is an interesting and sometimes debatable question (as is what is likely to wind up being implementation dependent).
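A classic small illustration of what I mean by implementation dependent behavior is object identity for integers. CPython happens to cache small integers, so in a CPython 2.7 interactive session you get results like this (another implementation is free to answer differently, and you shouldn't write code that depends on either answer):

>>> a = 100
>>> b = 100
>>> a is b
True
>>> a = 100000
>>> b = 100000
>>> a is b
False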

(Note that all Python implementations have the 'Python 2 vs Python 3' issue, because the changes between Python 2 and Python 3 are general language changes.)

CPythonVsPython written at 01:59:43; Add Comment

2013-11-25

From CPython bytecode up to function objects (in brief)

Python bytecode is the low level heart of (C)Python; it's what the CPython interpreter actually processes in order to run your Python code. The dis module is the best source of information on examining bytecode and on the bytecodes themselves. But CPython doesn't just run bytecode in isolation. In practice bytecode is always part of some other object, partly because bytecode by itself is not self-contained; it relies on various other things for context.
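(To make the example below concrete, assume fred is a trivial Python 2 function; reconstructing from the bytecode, it was something like this:)

>>> def fred(a, b):
...     print a, b
...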

Bytecode by itself looks like this:

>>> fred.func_code.co_code
'|\x00\x00G|\x01\x00GHd\x00\x00S'

(That's authentic bytecode; you can feed it to dis.dis() to see what it means in isolation.)

I believe that Python bytecode is always found embedded in a code object. Code objects have two sorts of additional attributes: attributes that provide the necessary surrounding context the bytecode itself needs, and attributes that just carry information about the code that's useful for debugging. Examples of context attributes are co_consts, a tuple of constants used in the bytecode, and co_nlocals, the number of local variables the code uses. Examples of information attributes are co_filename, co_firstlineno, and even co_varnames (which tells you what local variable N is called). Note that the context attributes are absolutely essential; bytecode is not self-contained and cannot be run in isolation without them. Many bytecodes simply do things like, say, 'load constant 0'; if you don't know what constant 0 is, you're not going to get far with the bytecode. It is the code object that supplies this necessary context.
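With the fred defined above (my reconstruction, so treat the exact values as illustrative), the context attributes look like this:

>>> fred.func_code.co_consts
(None,)
>>> fred.func_code.co_nlocals
2
>>> fred.func_code.co_varnames
('a', 'b')

The 'd\x00\x00' near the end of the bytecode is LOAD_CONST 0, ie 'load constant 0'; it is co_consts that says constant 0 is None.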

Most code objects are embedded in function objects (as the func_code attribute). Function objects supply some additional context attributes that are specific to using a piece of code as a function, as well as another collection of information about the function (most prominently func_doc, the function's docstring if any). As it happens, all of the special function attributes are documented reasonably well in the official Python data model, along with code objects and much more.

(Because I just looked it up, the mysterious func_dict property is another name for a function's __dict__ attribute, which is used to allow you to add arbitrary properties to a function. See PEP 232. Note that functions don't actually have a dictionary object attached to func_dict until you look at it or otherwise need it.)

Function objects themselves are frequently found embedded in instance method objects, which are used for methods on classes (whether bound to an object that's an instance of the class or unbound). But that's as far up the stack as I want to go today and anyways, instance method objects only have three attributes and they're all pretty obvious.

(If you have a class A with a method function fred, A.fred is actually an (unbound) instance method object. The fred function itself is A.fred.im_func, or if you want, A.__dict__["fred"].)

Note that not all code objects are embedded in function objects. For example, if you call compile() what you get back is a bare code object. I suspect that module level code winds up as a code object before getting run by the interpreter, but I haven't looked at the interpreter source to see so don't quote me on that.
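Concretely, you can see and run such a bare code object in Python 2 like this:

>>> co = compile("a = 10 * 4", "<string>", "exec")
>>> type(co)
<type 'code'>
>>> exec co
>>> a
40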

(This entry was inspired by reading this introduction to the CPython interpreter (via Hacker News), which goes at things from the other direction.)

BytecodeToFunctions written at 23:11:13; Add Comment

2013-11-17

Sending and receiving file descriptors in Python

On some but not all modern Unix systems, file descriptors (the underlying operating system level thing behind open files, sockets, and so on) can be passed between cooperating processes using Unix domain sockets and special options to sendmsg() and recvmsg(). There are a number of uses for this under various circumstances; the one that I'm interested in is selectively offloading incoming network connections from one process to another one that is better suited to handle some particular connections.

In Python 3.3 and later, doing this is simple because it is directly supported by the socket module. The documentation even includes code examples for both sendmsg() and recvmsg(), which is handy because they don't exactly have the most Pythonic of interfaces; instead they're basically a thin cover over the system call data structures. If you are receiving file descriptors that are sockets, you're still left with the socket .fromfd() problem.
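For reference, the helpers wind up looking roughly like this; this is essentially the socket documentation's example, lightly condensed, so treat it as a sketch rather than battle-tested code:

import array
import socket

def send_fds(sock, msg, fds):
    # pack the file descriptors into SCM_RIGHTS ancillary data,
    # sent alongside at least one byte of ordinary data
    fda = array.array("i", fds)
    return sock.sendmsg([msg],
                        [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fda)])

def recv_fds(sock, msglen, maxfds):
    fds = array.array("i")
    msg, ancdata, flags, addr = sock.recvmsg(
        msglen, socket.CMSG_LEN(maxfds * fds.itemsize))
    for cmsg_level, cmsg_type, cmsg_data in ancdata:
        if (cmsg_level == socket.SOL_SOCKET and
                cmsg_type == socket.SCM_RIGHTS):
            # trim to a whole number of fds before unpacking
            trunc = len(cmsg_data) - (len(cmsg_data) % fds.itemsize)
            fds.frombytes(cmsg_data[:trunc])
    return msg, list(fds)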

(I was encouraged to report the socket fd problem as an actual Python bug, where it has quietly been neglected just as I expected.)

Unfortunately Python 2 has no support for this in the socket module, thereby creating yet another gratuitous Python 2 to Python 3 difference. Fortunately a number of people have written add-on modules to support it; the ones I found in a casual Internet search are python-passfd, python-fdsend, and sendmsg (which is notably lacking in documentation). Of these, python-fdsend seems to have the best API (and is packaged for Debian and Ubuntu); I expect it's what I'll use if (or when) I need this feature in my Python 2 code. Note that it doesn't solve the socket .fromfd() problem either.

If you're sending sockets to another process, remember that it is safe to call .close() on them afterwards but it is not safe to call .shutdown() on them; as I discovered, shutdown() is a global operation on a socket and applies to all file descriptors for it, including ones now held by other processes.

SendingFileDescriptors written at 02:01:49; Add Comment

2013-10-09

An interesting bug with module garbage collection in Python

In response to my entry on what happens when modules are destroyed, @eevee shared the issue that started it all:

@thatcks the confusion arose when the dev did `module_maker().some_function_returning_a_global()` and got None :)

In a subsequent exchange of tweets, we sorted out why this happens. What it boils down to is that a module is not the same as the module's namespace, and functions hold a reference only to the module's namespace, not to the module itself.

(Functions have a __module__ attribute, but this is a string, not a reference to the module itself.)

So here's what is going on. When this chunk of code runs, module_maker() loads and returns a module as an anonymous object, and then the interpreter uses that anonymous module object to look up the function. Since the function does not hold a reference to the module itself, the module object is unreferenced once the lookup has finished and is thus immediately garbage collected. This garbage collection destroys the contents of the module's namespace dictionary, but the dictionary itself is not garbage collected because the function holds a reference to it and the interpreter holds a reference to the function. Then the function's code runs and uses its reference to the dictionary to look up the (module) global, which finds the name but gets a None value for it.

(You would get even more comedy if the module function tried to call another module level function or create an instance of a module level class; this would produce mysterious "TypeError: 'NoneType' object is not callable" errors, since the appropriate name is now bound to None instead of a callable thing.)

The workaround is straightforward; you just have to store the module object in a local variable before looking up the function so that a reference to it persists over the function call and thus avoids it being garbage collected.
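Here is a sketch of both the bug and the workaround. The mod.py contents and module_maker() are hypothetical stand-ins for the real code; the important property is that nothing else (including sys.modules) winds up holding a reference to the module object:

# mod.py:
A_GLOBAL = "some value"

def some_function_returning_a_global():
    return A_GLOBAL

# elsewhere, in Python 2:
import imp, sys

def module_maker():
    # load mod.py, then drop the sys.modules reference so that
    # the module object really is anonymous
    mod = imp.load_module("mod", *imp.find_module("mod"))
    del sys.modules["mod"]
    return mod

# On affected versions this prints None; the module object is
# garbage collected right after the attribute lookup and its
# namespace dictionary is purged before the function runs.
print module_maker().some_function_returning_a_global()

# The workaround: keep the module alive across the call.
mod = module_maker()
print mod.some_function_returning_a_global()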

The good news is that this weird behavior did wind up being accepted as a Python bug; it's issue 18214 and is fixed in the forthcoming Python 3.4. Given the views of the Python developers, it will probably never be fixed in Python 2 and will thus leave people with years of having to work around it.

(It's hopefully obvious why this is a bug. Given that modules and module namespaces are separate things and that a module's namespace can outlive it for various reasons, a module being garbage collected should not result in its namespace dictionary getting trashed. This sort of systematic destruction of module namespaces should only happen when it's really necessary, namely during interpreter shutdown.)

ModuleGCBug written at 00:23:10; Add Comment

