Some notes from my brief experience with the Grumpy transpiler for Python
I've been keeping an eye on Google's Grumpy Python to Go transpiler more or less since it was introduced because it's always been my great white hope for speeding up my Python code more or less effortlessly (and I like Go). However, until recently I had never actually tried to do anything much with it because I didn't really have a problem that it looked like a good fit for. What changed is that I finally got hit by the startup overhead of small programs.
As mentioned in that entry, my initial attempts to use Grumpy weren't successful, because how to actually use Grumpy for anything beyond toys is basically not documented today. Because sometimes I'm stubborn, I kept banging my head against the wall for long enough until I hacked together how to bring up my program, which gave me the chance to get some real world results. Basically the process went like this:
- build Grumpy from source following their 'method 2' process (using the Fedora 25 system version of Go, not my own build, because Grumpy very much didn't work with the latter).
- have Grumpy translate my Python program into a module, which
was possible because I'd kept it
grumprunto not delete the Go source file it creates on the fly based on your input.
grumprunis in Python, which makes this reasonably easy.
grumpruna Python program that was '
import mymodule; mymodule.main()' and grab the Go source code it generated (now that it wasn't deleting said source code afterward). This gave me a Go program that I could build into a binary that I could keep and then run with command line arguments.
Unfortunately it turns out that this didn't do me any good. First, the
compiled binary of my Grumpy-transpiled Python code also took about
the same 0.05 of a second to start and run as my real Python code.
Second, my code immediately failed because Grumpy has not fully
set()s; in particular, it doesn't have the
.difference() method. This is not listed in their Missing
wiki page, but Grumpy is underdocumented in general.
(As a general note, Grumpy appears to be in a state of significant churn in how it operates and how you use it, which I suppose is not particularly surprising. You can find older articles on how to use Grumpy that clearly worked at the time but don't work any more.)
This whole experience has unfortunately left me much less interested in Grumpy. As it is today, Grumpy's clearly not ready for outside people to do anything with it, and even in the future it may well never be good at the kind of things I want it for. Building fast-starting and fast-running programs may not ever be a Grumpy priority. Grumpy is an interesting experiment and I wish Google the best of luck with it, but it clearly can't be my great hope for faster, lighter-weight Python programs.
My meta-view of Grumpy is that right now it feels like an internal Google (or Youtube) tool that Google just happens to be developing in a public repository for us to watch.
(In this particular case my fix was to hand-write a second version of the program in Go, which has been part irritating and part interesting. The Go version runs in essentially no time, as I wanted and hoped, so the slow startup of the Grumpy version is not intrinsic to either Go or the problem. My Go version will not be the canonical version of this program for local reasons, so I'll have to maintain it myself in sync with the official Python version for as long as I care enough to.)
Sidebar: Part of why Grumpy is probably slow (and awkward)
It's an interesting exercise to look at the Go code that
generates. It's not anything like Go code as you'd conventionally
write it; instead, it's much closer to CPython bytecode that has been turned into Go code. This
faithfully implements the semantics of (C)Python, which is explicitly
one of Grumpy's goals, but it means that Grumpy has a significant
amount of overhead over a true Go solution in many situations.
(The transpiler may lower some Python types and expressions to more pure Go code under some circumstances, but scanning the generated output for my Python program suggests that this is uncommon to rare in the kind of code I write.)
Grumpy codes various Python types in pure Go
code, but as I found with
set, some of their implementations are
incomplete. In fact, now that I look I can see that the only Go
code in the entire project appears to be in those types, which
generally correspond to things that are implemented in C in CPython.
Everything else is generated by the transpiling process.
I've been hit by the startup overhead of small programs in Python
I've written before about how I care about the resource usage and speed of short running programs. However, that care has basically been theoretical. I knew this was an issue in general and it worried me because we have short running Python programs, but it didn't impact me directly and our systems didn't seem to be suffering as a result of it. Even DWiki running as a CGI was merely kind of embarrassing.
Today, I turned a hacky personal shell script into a better done
production ready version that I rewrote in Python. This worked fine
and everything was great right up to the point where I discovered
that I had made this script a critical path in invoking
dmenu on my office workstation, which is something
that I do a lot (partly because I have a very convenient key
binding for it). The new Python version
is not slow as such, but it is slower, and it turns out that I am
very sensitive to even moderate startup delays with
because I type ahead, expecting
dmenu to appear essentially
instantly). With the old shell script version, this part of
startup took around one to two hundredths of a second; with the new
Python version, things now takes around a quarter of a second, which
is enough lag to be perceptible and for my type-ahead to go awry.
(This assumes that my machine is unloaded, which is not always the
case. Active CPU load, such as installing Ubuntu in a test VM, can
make this worse. My
dmenu setup actually runs this program five
times to extract various information, so each individual run is
taking about five hundredths of a second.)
Profiling and measuring short running Python programs is a bit
challenging and I've wound up resorting to fairly crude tricks (such
as just exiting from the program at strategic points). These tricks
strongly suggest that almost all of the extra time is going simply
to starting Python, with a significant amount of it spent importing
the standard library modules I use (and all of the things that they
import in turn). Simply getting to the quite early point where I
parse_args ArgumentParser method consumes almost all of the
time on my desktop. My own code contributes relatively little to
the slower execution (although not nothing), which unfortunately
means that there's basically no point in trying to optimize it.
(On the one hand, this saves me time. On the other hand, optimizing Python code can be interesting.)
My inelegant workaround for now is to cache the information my program is producing, so I only have to run the program (and take the quarter second delay) when its configuration file changes; this seems to work okay and it's as least as fast as the old shell script version. I'm hopeful that I won't run into any other places where I'm using this program in a latency sensitive situation (and anyway, such situations are likely to have less latency since I'm probably only running it once).
In the longer run it would be nice to have some relatively general solution to pre-translate Python programs into some faster to start form. For my purposes with short running programs it's okay if the translated result has somewhat less efficient code, as long as it starts very fast and thus finishes fast for programs that only run briefly. The sort of obvious candidate is Google's grumpy project; unfortunately, I can't figure out how to make it convert and build programs instead of Python modules, although it's clearly possible somehow.
PS: The new version of the program is written in Python instead of shell because a non-hacky version of the job it's doing is more complicated than is sensible to implement in a shell script (it involves reading a configuration file, among other issues). It's written in Python instead of Go for multiple reasons, including that we've decided to standardize on only using a few languages for our tools and Go currently isn't one of them (I've mentioned this in a comment on this entry).
Why I care about Apache's mod_wsgi so much
I made a strong claim yesterday in an aside: I said that Apache with mod_wsgi is the easiest and most seamless way of running a Python WSGI app, and thus it was a pity that it doesn't support using PyPy for this. As I have restarted it here this claim is a bit too strong, so I have to start by watering it down. Apache with mod_wsgi is definitely the easiest and most seamless way to run a Python WSGI app in a shared (web) environment, where you have out a general purpose web server that handles a variety of URLs and services. It may also be your best option if the only thing the web server is doing is running your WSGI application, but I don't have any experience with such environments.
(I focus on shared web environments because none of my WSGI apps are likely to ever be so big and so heavily used that I need to devote an entire web server to them.)
Apache is a good choice as a general purpose web server in the first place, and once you have Apache, mod_wsgi makes deploying a WSGI application pretty straightforward. Generally all you need is a couple of lines of Apache configuration, and you can even arrange to have your WSGI application run under another Unix UID if you want (speaking as a sysadmin, that's a great thing; I would like as few things as possible to run as the web server UID). There's no need to run, configure, and manage another daemon, or to coordinate configuration changes between your WSGI daemon and your web server. Do you want to reload your app's code? Touch a file and it happens, you're done. And all of this lives seamlessly alongside everything else in the web server's configuration, including other WSGI apps also being handled through mod_wsgi.
As far as I know, every other option for getting a WSGI app up and running is more complicated, sometimes fearsomely so. I would like an even simpler option, but until such a thing arrives, mod_wsgi is as close as I can get (and it works well even in unusual situations).
I care about WSGI in general because it's the broadly right way to deploy a Python web app. The easier and simpler it is to deploy a WSGI app, the less likely I am to just write my initial simple version of something as a CGI and then get sucked into very peculiar lashups.
If you're going to use PyPy, I think you need servers
I have a long-standing interest in PyPy for the straightforward reason that it certainly would be nice to get a nice performance increase for my Python code basically for free, and I do have some code that is at least somewhat CPU-intensive. Also, to be honest, the idea and technology of PyPy is really neat and so I would like it to work out.
Back a few years ago when I did some experiments, one of the drawbacks of PyPy for my sort of interests was that it took a substantial amount of execution time to warm up and start performing better than CPython. I just gave the latest PyPy release a quick spin (using this standalone package for Linux (via)), and while it's faster than previous versions it still has that warm-up requirement, which is neither unexpected nor surprising (and in fact the PyPy FAQ explicitly talks about this). But this raises a question; if I want to use PyPy to speed up my Python code, what would it take?
If PyPy only helps on long running code, then that means I need to run things as servers instead of one-shot programs. This is doable; almost anything can be recast as a server if you try hard enough (and perhaps write the client in another, lighter weight language). However it's not enough to just have, say, a preforking server where the actual worker processes only do a bit of work and then die off, because that doesn't get you the long running code that PyPy needs. Instead you need either long running worker processes or threads within a single server process, and given Python's GIL you probably want the former.
(And yes, PyPy still has a GIL.)
A straightforward preforking server is going to duplicate a lot of warm-up work in each worker process, because the main server process doesn't do very much work on its own before it starts worker processes. I can imagine hacks to deal with this, such as having the server go through a bunch of synthetic requests before it starts forking off workers to handle real ones. This might have the useful side effect of reducing the overall memory overhead of PyPy by sharing more JIT data between worker processes. It does require you to generate synthetic requests, which is easy for me in one environment but not so much so for another.
There is one obvious server environment that's an entirely natural fit for running Python code easily, and would in fact easily handle DWiki (the code behind this blog). That is Apache with mod_wsgi, which transparently runs your Python WSGI app in some server processes. Unfortunately, as far as I know mod_wsgi doesn't support PyPy and I don't think there are any plans to change that.
(There are other ways to run WSGI apps using PyPy, but none of them are as easy and seamless as Apache with mod_wsgi and thus all of them are less interesting to me.)
Python's complicating concept of a callable
Through Planet Python I recently wound up reading this article on Python decorators. It has the commendable and generally achieved goal of a clear, easily followed explanation of decorators, starting out by talking about how functions can return other functions and then defining decorators as:
A decorator is a function (such as
foobarin the above example) that takes a function object as an argument, and returns a function object as a return value.
Some experienced Python people are now jumping up and and down to say that this definition is not complete and thus not correct. To be complete and correct, you have to change function to callable. These people are correct, but at the same time this correctness creates a hassle in this sort of explanation.
In Python, it's pretty easy to understand what a function is. We have
an intuitive view that pretty much matches reality; if you write '
foobar(...):' (in module scope), you have a function. It's not so easy
to inventory and understand all of the things in Python that can be
callables. Can you do it? I'm not sure that I can:
- functions, including lambdas
- objects if they're instances of a class with a
- bound methods of an object (eg
- methods of classes in general, especially class methods and static methods
(I don't think you can make modules callable, and in any case it would be perverse.)
Being technically correct in this sort of explanation exacts an annoying price. Rather than say 'function', you must say 'callable' and then probably try to give some sort of brief and useful explanation of just what a callable is beyond a function (which should cover at least classes and callable objects, which are the two common cases for things like decorators). This is inevitably going to complicate your writing and put at least a speed bump in the way for readers.
The generality of Python accepting callables for many things is important, but it does drag a relatively complicated concept into any number of otherwise relatively straightforward explanations of things. I don't have any particular way to square the circle here; even web specific hacks like writing function as an <ABBR> element with its own pseudo-footnote seem kind of ugly.
(You could try to separate the idea of 'many things can be callable' from the specifics of just what they are beyond functions. I'm not sure that would work well, but I'm probably too close to the issue; it's quite possible that people who don't already know about callables would be happy with that.)
Why we're not running the current version of Django
We have a small web app that uses Django (or is based on Django, depending on your perspective). As I write this, the latest version of Django is 1.11.2, and the 1.11 series of releases started back in April. We're running Django 1.10.7 (I'm actually surprised we're that current) and we're probably going to keep on doing that for a while.
(Now that I'm looking at this, I see that Django 1.10 will get security fixes through December (per here), so we'll probably stick with it until the fall. Summer is when new graduate students show up, which means that it's when the app is most important and most heavily used.)
On the one hand, you might ask what's the problem with this. On the other hand, you might ask why we haven't updated. As it turns out, both questions wind up in the same place. The reason I don't like being behind on significant Django version updates is that the further behind we get, the more work we have to do all at once to catch up with Django changes (not all of which are clear and well documented, and some of which are annoying). And the reason we're behind is exactly that Django keeps changing APIs from version to version (including implicit APIs).
The net result is that Django version updates are not drop in things. Generally each of them is a not insignificant amount of work and time. At a minimum I have to read a chunk of the release notes very carefully (important things are often mentioned in passing, if at all), frequently I need code changes, and sometimes Django deprecates an API in a way that leaves me doing a bunch of research and programming to figure out what to replace it with (my notes say that this is the case for 1.11). Our small web app otherwise needs basically no attention, so Django version updates are a real pain point.
At the same time, I've found out the hard way that if I start delaying this pain for multiple version updates, it gets much worse. For a start, I fast-forward through any opportunity to get deprecation warnings; instead a feature or an API can go straight from working (in our current version) to turned off. I also have to absorb and understand all of the changes and updates across multiple versions at once, rather than getting to space them out bit by bit. So on the whole it goes better to go through every Django version.
I don't expect this to ever change. I don't think the Django developers have any sort of policy about how easy it should be to move from, say, one Django long term support release to another. And in general I don't expect them to ever get such a policy that would significantly slow them down. The Django developers clearly like the freedom to change their APIs relatively fast, and the overall Django community seems to be fine with it. We're outliers.
(This issue is one reason to not stick with the version of Django that's supplied by Ubuntu. Even Ubuntu 16.04 only packages Django 1.8. I really don't want to think about what it would be like to jump forward four years in Django versions when we do our typical 'every second Ubuntu LTS release' update of the web server where our Django app runs. Even Ubuntu 14.04 to Ubuntu 16.04 is a jump from Django 1.6 to Django 1.8 all at once.)
The Python Gilectomy project's performance problem
After my recent entries on (C)Python's Global Interpreter Lock (GIL), Kevin Lyda asked me on Twitter if I'd read the latest update on Progress on the Gilectomy. For those of us who either haven't heard of this before or have forgotten about it, the 'Gilectomy' is Larry Hastings' ongoing project to remove the GIL from CPython. As summarized in the LWN article, his goals are:
[Larry Hastings] wants to be able to run existing multi-threaded Python programs on multiple cores. He wants to break as little of the existing C API as possible. And he will have achieved his goal if those programs run faster than they do with CPython and the GIL—as measured by wall time.
With a certain amount of hand waving, these are worthwhile goals for multi-threaded Python code and I don't think anyone would object. The project is also technically neat and interesting, although it appears to be more interesting from an engineering perspective (taking existing research and applying it to Python) than from a research perspective (coming up with new GC approaches). There's nothing wrong with this, of course; computer science stands on the shoulders of plenty of giants, so it's to our benefit to look down on a regular basis. If Larry Hastings can deliver his goals in a modified version of CPython that is solidly implemented, he'll have done something pretty impressive.
However, I agree with Guido van Rossum's view (as reported in LWN's 2016 Gilectomy article) that this should not become part of regular CPython unless existing single-threaded Python programs still run at full speed, as fast as they do now. This may seem harsh, given that a successful Gilectomy would definitely speed up multi-threaded programs, but here is where theory runs up against the reality of backwards compatibility.
The reality is that most current Python programs are single-threaded programs or ones that are at most using threading for what the GIL makes it good for. This is a chicken and egg issue, of course; the GIL made Python only good for this, so this is mostly how people wrote Python programs. You have to be at least somewhat perverse to write multi-threaded code knowing that multi-threading isn't really helping you; unsurprisingly, not many people are perverse and so probably not much of this code exists. If a new version of CPython sped up this uncommon 'perverse' code at the expense of slowing down common single-threaded code, most people would probably have their Python code slow down and would consider the new version a bad change.
I further believe that this holds not just for wall clock time but also for CPU usage. A version of CPython that required extra cores to keep existing programs running just as fast in wall clock time is a version that has slowed down in practice, because not all systems actually have those extra cores sitting idle, waiting to be used.
(There are also pragmatic issues with the CPython API for C extensions. For a start, you had better make it completely impossible for existing C extensions to create multi-threaded races if they are just recompiled for the new CPython. This may require deliberately dropping certain existing APIs because they cannot be made multi-thread safe and forcing extensions to be updated to new ones, which will inevitably result in a bunch of existing extensions never getting updated and users of them never updating either.)
On the whole I'm not optimistic for a Gilectomy to arrive before a hypothetical 'Python 4' that can break many things, including user expectations (although I'd love to be surprised by a Gilectomy that keeps single-threaded performance fully intact). I further suspect that Python's developers don't have much of an appetite for going through a Python 3 experience all over again, at least not any time soon. It's possible that a Gilectomy'd CPython could be maintained in parallel to the current regular CPython, which would at least give users a choice.
(Years ago I wrote an entry in praise of the GIL and I still basically stand by those views even in this increasingly multi-core world. Shared memory multi-threaded programming is still a bad programming model, even if CPython guarantees that your program will never dump core. But at this point it's probably too late to add a different model to multi-threaded (C)Python.)
Exploiting Python's Global Interpreter Lock for atomic operations is fun
Yesterday I wrote a sober and respectable article on how using the GIL safely has various tricky traps and you probably shouldn't try to do it. This is indeed the sensible, pragmatic approach to exploiting what the GIL protects for threaded Python code; you probably don't need that much performance and so you should use some form of explicit locking instead of trying to be clever.
The honest truth is that I've exploited the GIL this way before and I'll probably write more Python code that does it in the future. This isn't because it's necessarily a good idea; it's because the GIL is fun. Like many other complicated parts of CPython, the GIL is simply neat all by itself (at least if you're the kind of person who likes to peer behind the curtain and know how things work). And in general it's an important part of threaded CPython, with a lot of complexity that is important to understand if you want to write well-performing threaded Python code. You can't avoid learning something about the GIL's technicalities if you care about this area.
Once you have learned enough about the GIL, you have a Turing tarpit style puzzle of figuring out how to map your problem onto Python operations that are what I'll call GIL atomic, things that will never have races because the GIL is held across them. This sort of thing is like catnip for me and many programmers, especially since it requires deep domain knowledge. It make us feel smart, we're solving tough problems creatively and with real benefits (after all, this is the high performance option), and it means all of our domain knowledge gets put to nominally good use. It's simply fun to make CPython do what I want here, without the overhead and typing and work of setting up mutexes or managing locks or similar measures. It's magic; you do ordinary operations and things just work because you designed everything right.
(Then I can make myself feel even more clever by writing a long comment about how and why what I'm doing is perfectly safe.)
So that's my awkward confession about exploiting the GIL. It would be completely sensible to avoid doing so, but I'm probably not going to. Either I'll justify it on the grounds that this is a personal project and I'm allowed to have fun, or I'll justify it on the grounds of performance, but the truth is that I want to do it because it's neat and fun.
(If it was boring and tedious and took more work to arrange to exploit the GIL this way, you can bet that I'd be using mutexes and locks all the time.)
Safely using Python's Global Interpreter Lock is quite tricky and subtle
Recently I ran across Grok the GIL: How to write fast and thread-safe Python (via), by A. Jesse Jiryu Davis. In the original version of the article, A. Jesse wrote:
Even though the line
lst.sort()takes several steps, the
sortcall itself is a single bytecode, and thus there is no opportunity for the thread to have the GIL seized from it during the call. We could conclude that we don't need to lock around
In comments on that article, Ben Darnell pointed out that this is not necessarily true. The fundamental issue is that Python objects are allowed to customize how they are compared for sorting, hashed for things like dictionary insertions, and so on. These operations necessarily require running Python code, and the moment you start running Python code the GIL may be seized from you.
The version of the article on A. Jesse's personal site has been updated to note some qualifications:
[...] so long as we haven’t provided a Python callback for the
keyparameter, or added any objects with custom
If you guessed that these qualifications are not quite complete,
you would be correct; they're missing the rich comparison operations,
which may be defined instead of
__cmp__ (and in Python 3, they're
the only way to customize comparisons and ordering).
This may seem like nit-picking and in a sense it is. But what I
really want to illustrate here is just how tricky it is to safely
use the GIL without locking. In CPython there is a huge number of
special cases where Python code can start running in what appear
to be simple atomic C-level operations like
these special cases requires a great deal of up to date Python
knowledge (so you know all the various ways things can be customized),
then finding out if they're going to affect any particular operation
you want to do may require deep dives into the source code you're
dealing with. This is especially the case if the code uses things
like metaclasses or class decorators.
In practice things get worse, because what you've looked at is a
single point in time of your codebase; you only know that it's safe
today. Nothing fundamentally prevents someone coming along later
to helpfully add some rich comparison operators to a class that
invalidate your assumptions that
lst.sort() will be an atomic
operation as far as the GIL is concerned and so you don't need to
do any locking around it.
(If everyone involved is lucky, the class comments document that other code depends on the class not having rich comparison operators or whatever else you need it to not have.)
So what I've wound up feeling is that GIL safety is generally too complicated and tricky to use. Or perhaps it would be better to say that it's too fragile, since there are a vast number of ways to accidentally destroy it without realizing what you've done. If you actively want to use GIL safety to avoid explicit locking, it's probably going to be one of the trickier portions of your code and you should be very careful to document everything about it and to keep it as simple as possible (for example, using only primitive C-level types even if this requires contortions).
It's unfortunate that the GIL is this way, but it is (at least for now in CPython and thus probably for the future).
(In theory CPython could be augmented so that operations like
lst.sort() explicitly held the GIL for their entire duration so
that people wouldn't get surprised this way. But I suspect that the
CPython developers want people to use explicit locking, mutexes,
and so on, and not rely on hard to explain GIL guarantees that
constrain their implementation choices.)
What I mostly care about for speeding up our Python programs
There are any number of efforts and technologies around these days that try to speed up Python, starting with the obvious PyPy and going on to things like Cython and grumpy. Every so often I think about trying to apply one of them to the Python code I deal with, and after doing this a few times (and even making some old experiments with PyPy) I've come to a view of what's important to me in this area.
What I've come around to caring about most is reducing the overall resource usage of short running programs that mostly use the Python standard library and additional pure-Python modules. By 'resource usage' I mean a combination of both CPU usage and memory usage; in our situation it's not exactly great if I make a program run twice as fast but use four times as much memory. In fact for some programs I probably care more about memory usage than CPU, because in practice our Python-based milter system probably spends most of its time waiting for our commercial anti-spam system to digest the email message and give it a verdict.
(Meanwhile, our attachment logger program is probably very close to being CPU bound. Yes, it has to read things off disk, but in most cases those files have just been written to disk so they're going to be in the OS's disk cache.)
I'm also interested in making DWiki (the code behind Wandering Thoughts) faster, but again I actually want it to be less resource-intensive on the systems it runs on, which includes its memory usage too. And while DWiki can run in a somewhat long-running mode, most of the time it runs as a short-lived CGI that just serves a single request. DWiki's long-running daemon mode also has some features that might make it play badly with PyPy, for example that it's a preforking network server and thus that PyPy is probably going to wind up doing a lot of duplicate JIT translation.
I think that all of this biases me towards up-front approaches like Cython and grumpy over on the fly ones such as PyPy. Up-front translation is probably going to work better for short running programs (partly because I pay the translation overhead only once, and in advance), and the results are at least reasonably testable; I can build a translated version and see in advance whether the result is probably worth it. I think this is a pity because PyPy is likely to be both the easiest to use and the most powerful accelerator, but it's not really aimed at my usage case.
(PyPy's choice here is perfectly sensible; bigger, long-running programs that are actively CPU intensive for significant periods of time are where there's the most payoff for speeding things up.)
PS: With all of this said, if I was serious here I would build the latest version of PyPy by hand and actually test it. My last look and the views I formed back then were enough years ago that I'm sure PyPy has changed significantly since then.