My current view of PyPy

December 10, 2013

In a comment on my entry about Go as a 'Python with performance' for me, I was asked about my views on using PyPy for this. I flailed around a bit in my reply there and since then I've been thinking about it more, so it's time to go on at more length.

The simple version is that today I think of PyPy as perhaps a way to make some Python programs go faster, but not as a way to write fast Python programs. If I have an existing Python program that fits what I think of as the PyPy profile (long-running, generally does basic operations, and I'm indifferent to memory usage) and I absolutely needed it to go faster, I'd consider feeding it to PyPy to see what'd happen. If it speeds up without exploding the memory usage, I've won and I can stop. If that doesn't work, well, time for other measures. However, PyPy is too unpredictable for me to be able to write Python code that I can count on it speeding up dramatically, especially if I also want to control the memory usage and so on.

There are other pragmatic issues with using it. For a start, the version of PyPy available to me through distribution packages varies widely from system to system here, and with that variance I can expect an equally large performance variance. The current version of PyPy is 2.2.1, while Fedora 19 has 2.1.0 and Ubuntu 12.04 LTS is back at 1.8. Beyond that, a certain number of interesting Python environments just don't work with PyPy; for example, I can't use PyPy to speed up parts of a Django app deployed through mod_wsgi (not that the app is likely to have a performance bottleneck anyway; that's just an illustration).

There are also two serious problems with PyPy today that make it far less interesting for me (at least as of the Fedora 19 version of 2.1.0). The first is what I alluded to above; PyPy has a significant startup delay before it starts speeding up your program and thus doesn't really speed up short-running things. I'm pretty sure that if I had a Python program that ran in a second, PyPy wouldn't speed it up very much. The second is that PyPy quietly explodes on common Python idioms under some circumstances.

For an example that I have personally run into, consider:

data = open("/some/file", "r").read()

This is a not uncommon Python idiom to casually read in a file. If you try this in a PyPy-run program in any sort of situation where you do this repeatedly, you'll probably wind up with a 'too many open files' error before too long. In straight (C)Python the open file is immediately garbage collected at the end of the .read(); in PyPy, it seems to hang around (presumably for a full garbage collection run) and with it the open file descriptor. Boom.
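The deterministic fix, if you do need your code to behave the same under both runtimes, is to close the file explicitly rather than relying on the garbage collector. A minimal sketch (using a temporary file so it's runnable; the original example's "/some/file" is hypothetical):

```python
import os
import tempfile

# Create a throwaway file so the example is self-contained.
path = tempfile.mkstemp()[1]
with open(path, "w") as f:
    f.write("hello")

# A 'with' statement closes the file when the block exits, regardless
# of when (or whether) the garbage collector gets around to it. This
# behaves identically on CPython and PyPy: no leaked file descriptors.
with open(path, "r") as f:
    data = f.read()

assert f.closed  # guaranteed closed here, not "eventually"
os.remove(path)
```

Of course, the whole appeal of the one-liner idiom is that you don't have to write this; the `with` version costs you a line and a temporary name, which is exactly why so much existing code doesn't bother.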

Yes, yes, you say, this is bad style. The reality is that this 'bad style' is common in Python, as are other examples where code assumes that dropped or out-of-scope objects will be immediately garbage collected. I don't want to spend my time troubleshooting mysterious problems in otherwise reliable long-running Python programs that only appear when I run them under PyPy. Not running them under PyPy is by far the easier solution, even if it costs me performance.

(In my opinion non-deterministic garbage collection is actually a serious problem, but that's another entry.)
