Code stability in my one Django web application
We have one Django web application, a system that automates much of the handling of our new Unix account requests. It was started in early 2011 (using Django 1.2) and I did a retrospective at the end of 2014 where I called it a faithful web app, one that had just kept on quietly working without problems. That's continued through to today; the app needs no routine attention, although every so often I tweak it to better handle an obscure situation.
One of the interesting aspects of that quiet stability is the relative stability of the application's Python code over those nearly six years so far. There are web frameworks where in six years you'd need to significantly rework and restructure your code to deal with changing APIs and approaches. For us, Django hasn't been one of them. Although we're not quite current on Django versions, we're not that far back, yet much of the code is basically the same (or literally the same) as it started out all those years ago. I'm pretty sure that almost all of our model and view code is untouched over that time, and I think a lot of our templates are untouched or changed only in minor ways.
However, this is not a complete picture of code churn in our app, because there have been Django changes over that time in areas such as routing, command argument processing, template processing, and project structure. These changes have forced code changes in the areas of our app that deal with such things (and the change in project structure eventually forced a massive renaming of files when we went to Django 1.9). While this sounds kind of bad, I've wound up considering all of them to be relatively peripheral. In a way, all of the code involved is plumbing and glue. None of it really touches the heart of our web application, which (for us) lives mostly in the models and views and somewhat in the core logic of the templates. Django has been very good about keeping that core code from needing any substantive changes. We still validate form submissions and generate views and process model data in basically the same way we did in 2011, and all of that is what I think of as the hard stuff.
(Although I haven't measured, I also think it's most of the app's code by line count.)
This code stability is one reason why Django upgrades have been somewhat painful but not deeply painful. If we'd needed major code restructuring, well, I'd probably have done it eventually because we might have had no choice, but we'd have likely updated Django versions more sporadically than we have so far.
PS: Although Django is going from version 1.11 to version 2.0 in the next release, the Django people say that this shouldn't be any more of an upgrade than usual. And speaking of that, I should get working on updating us to 1.11, since security updates for 1.10 will end soon (if they haven't already).
collections.defaultdict is good for your memory usage
There is a classical pattern in code that uses entries in dictionaries to accumulate data. In the simplest form, it looks like this:
    e = dct.get(ky, None)
    if e is None:
        e = []
        dct[ky] = e
    # now we work on e without
    # caring if it's new or old
There is an obvious variation of this that gets rid of the whole bureaucracy involving the if:

    e = dct.setdefault(ky, [])
    # work on e
On the surface, this looks very much like what you get with collections.defaultdict. At this level you might reasonably think that defaultdict is just a convenience, giving you a slightly shorter and nicer way to write this code so you don't have to do either the if dance or use setdefault() instead of just doing a simple dct[ky]. However, there's an important way that both defaultdict and the if-based version are better than the setdefault() approach. To see it, let's change what the individual elements are:

    e = dct.setdefault(ky, ExpensiveItem())
    ....
When I write things this way, the problem may jump out right away. The issue with this version is that we always create a new ExpensiveItem object regardless of whether ky is already in dct. If ky is not in dct, we use the new object and all is good, but if there already is one, we throw away the new object we created. If we're dealing with a lot of keys that already exist, this is a lot of objects being created and then immediately thrown away. Both the if-based version and defaultdict avoid this problem because they only create a new object if and when they actually need it, and a defaultdict version is just as short as the setdefault() one.
(The other subtle advantage of defaultdict is that you specify the default item only once, when you create the dictionary, instead of having to duplicate it in every section of code where you need to do this update-or-add pattern.)
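A quick sketch of the defaultdict version of this pattern, using list as an illustrative element type:

```python
from collections import defaultdict

# defaultdict only calls its factory (here, list) when the key is
# actually missing, so no throwaway objects are created for keys
# that already exist.
dct = defaultdict(list)
dct["ky"].append(1)   # "ky" is missing: list() is called once
dct["ky"].append(2)   # "ky" exists: no new list is created
print(dct["ky"])
```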
On the one hand, this advantage of defaultdict feels obvious once I write it out like this. On the other hand, Python doesn't really encourage people to think about how often objects are created and other aspects of memory churn. Also, even if you know about the issue (as I generally do), it's tempting to go with the setdefault() version instead of the if version just because it's shorter and you probably aren't dealing with enough objects for this to matter. collections.defaultdict lets you have your cake and eat it too; you get short code and memory efficiency.
I still like Python and often reach for it by default
Various local events recently made me think a bit about the future of Python at work. We're in a situation where a number of our existing tools will likely get drastically revised or entirely thrown away and replaced, and that raises local issues with Python 3 as well as questions of whether I should argue for changing our list of standard languages. I have some technical views on the answer, but thinking through this has made me realize something on a more personal level. Namely, I still like Python and it's my go-to default language for a number of things.
I'm probably always going to be a little bit grumpy about the whole transition toward Python 3, but that in no way erases the good parts of Python. Despite the baggage around it, Python 3 has its own good side and I remain reasonably enthused about it. Writing modest little programs in Python has never been a burden; the hard parts are never from Python, they're from figuring out things like data representation, and that's the same challenge in any language. In the meantime, Python's various good attributes make it pretty plastic and easily molded as I'm shaping and re-shaping my code as I figure out more of how I want to do things.
(In other words, experimenting with my code is generally reasonably easy. When I may completely change how I approach a problem between my first draft and my second attempt, this is quite handy.)
Also, Python makes it very easy to do string-bashing and to combine it with basic Unix things. This describes a lot of what I do, which means that Python is a low-overhead way of writing something that is much like a shell script but that's more structured, better organized, and expresses its logic more clearly and directly (because it's not caught up in the Turing tarpit of Bourne shell).
(This sort of 'better shell script' need comes up surprisingly often.)
My tentative conclusion about what this means for me is that I should embrace Python 3, specifically for new work. Despite potential qualms for some things, new programs that I write should be in Python 3 unless there's a strong reason they can't be (such as having to run on a platform with an inadequate or missing Python 3). The nominal end of life for Python 2 is not all that far off, and if I'm continuing with Python in general (and I am), then I should be carrying around as little Python 2 code as possible.
Some thoughts on having both Python 2 and 3 programs
Earlier, I wrote about my qualms about using Python 3 in (work) projects in light of the extra burden it might put on my co-workers if they had to work on the code. One possible answer here is that it's possible both to use Python 3 features in Python 2 and to write code that naturally runs unmodified under both versions (as I did without explicitly trying to). This is true, but there's a catch and that catch matters in this situation.
The compatibility between Python 2 and Python 3 is not symmetric. If you write natural Python 3 code, it can often run under Python 2, sometimes with the help of __future__ imports. However, if you write natural Python 2 code, it will not run under Python 3 unless your code completely avoids at least print as a statement.
Since there are Python 3 features that are simply not available in Python 2 even with __future__ imports, a Python 3 programmer can still wind up blowing up a Python 2 program. But as someone who's now written both Python 2 and Python 3 code (including some that wound up being valid Python 2 code too), my feeling is that you have to go at least a bit out of your way in straightforward code to wind up doing this. By contrast, it's very easy for a Python 2 programmer to use Python 2 only things in code, partly because one of them (print as a statement rather than a function) is a deeply ingrained habit.
So if you have part-time Python 3 programmers and some Python 2 programs, you'll probably be fine (and you can increase the odds by putting __future__ imports into the Python 2 programs in advance, so they're fully ready for Python 3 idioms like print() as a function). If you have part-time Python 2 programmers and some Python 3 programs, you're probably going to have to keep an eye on things; people may get surprises every so often. Unfortunately there's nothing you can really do to make the Python 3 code able to deal with Python 2 idioms.
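Such a prepared Python 2 program might start like this (a generic sketch, not any specific program of ours):

```python
# With this future import, print is a function under Python 2 as
# well, so the same call works identically under both versions.
from __future__ import print_function

greeting = "hello, world"
print(greeting)
```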
(In the long run it seems clear that everyone is going to have to learn about Python 3, but that's another issue and problem. I suspect that many places are implicitly deferring it until they have no choice. I look forward to an increasing number of 'what to know about Python 3 for Python 2 programmers' articles as we approach 2020 and the theoretical end of Python 2 support.)
My potential qualms about using Python 3 in projects
I recently wrote about why I didn't use the attrs module; the short version is that it would have forced my co-workers to learn about it in order to work on my code. Talking about this brings up a potentially awkward issue, namely Python 3. Just like the attrs module, working with Python 3 code involves learning some new things and dealing with some additional concerns. In light of this, is using Python 3 in code for work something that's justified?
This issue is relevant to me because I actually have Python 3 code these days. For one program, I had a concrete and useful reason to use Python 3 and doing so has probably had real benefits for our handling of incoming email. But for other code I've simply written it in Python 3 because I'm still kind of enthused about it and everyone (still) does say it's the right thing to do. And there's no chance that we'll be able to forget about Python 2, since almost all of our existing Python code uses Python 2 and isn't going to change.
However, my tentative view is that using Python 3 is a very different situation than the attrs module. To put it one way, it's quite possible to work with Python 3 without noticing. At a superficial level and for straightforward code, about the only difference between Python 3 and Python 2 is 'print("foo")' versus 'print "foo"'. Although I've said nasty things about Python 3's automatic string conversions in the past, they do have the useful property that things basically just work in a properly formed UTF-8 environment, and most of the time that's what we have.
(Yes, this isn't robust against nasty input, and some tools are exposed to that. But many of our tools only process configuration files that we've created ourselves, which means that any problems are our own fault.)
Given that you can do a great deal of work on an existing piece of Python code without caring whether it's Python 2 or Python 3, the cost of using Python 3 instead of Python 2 is much lower than, for example, the cost of using the attrs module. Code that uses attrs is basically magic if you don't know attrs; code in Python 3 is just a tiny bit odd looking and it may blow up somewhat mysteriously if you do one of two innocent-seeming things.
(The two things are adding a print statement and mixing byte strings with regular strings.)
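As an illustration of the sort of mysterious blow-up involved (a hypothetical example, not necessarily the exact cases this entry has in mind): mixing byte strings and regular strings is harmless in Python 2 but fails in Python 3.

```python
# In Python 2 both literals are byte strings, so this concatenation
# just works; in Python 3 str and bytes refuse to mix and it raises
# a TypeError.
try:
    result = "abc" + b"def"
except TypeError:
    result = None
print(result)
```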
In situations where using Python 3 allows some clear benefit, such as using a better version of an existing module, I think using Python 3 is pretty easily defensible; the cost is very likely to be low and there is a real gain. In situations where I've just used Python 3 because I thought it was neat and it's the future, well, at least the costs are very low (and I can argue that this code is ready for a hypothetical future where Python 2 isn't supported any more and we want to migrate away from it).
Sidebar: Sometimes the same code works in both Pythons
I wrote my latest Python code as a Python 3 program from the start. Somewhat to my surprise, it runs unmodified under Python 2.7.12 even though I made no attempt to make it do so. Some of this is simply luck, because it turns out that I was only using print() with a single argument. In Python 2, 'print("fred")' is seen as 'print ("fred")', which is just 'print "fred"', which works fine. Had I tried to print multiple arguments, things would have exploded.
(I have only single-argument print()s because I habitually format my output with % if I'm printing out multiple things. There are times when I'll deviate from this, but it's not common.)
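That habit looks like this; because the %-formatted string is a single argument, the print() call parses the same way under both Pythons:

```python
# One %-formatted string as the single print() argument; under
# Python 2 this parses as the print statement printing one value.
user, count = "fred", 3
line = "%s has %d pending requests" % (user, count)
print(line)
```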
Why I didn't use the attrs module in a recent Python project
I've been hearing buzz about the attrs Python module for a while (for example). I was recently writing a Python program where I had some structures, and using attrs to define the classes involved would have made the code shorter and more obvious. At first I was all fired up to finally use attrs, but then I took a step back and reluctantly decided that doing so would be the wrong choice.
You see, this was code for work, and while my co-workers can work in Python, they're not Python people in the way that I am. They're certainly not up on the latest Python things and developments; to them, Python is a tool and they're happy to let it be if they don't need to immerse themselves in it. Naturally, they don't know anything about attrs.
If I used attrs, the code would be a bit shorter (and it'd be neat to actually use it), but my co-workers would have to learn at least something about attrs before they could understand my code to diagnose problems, make changes, or otherwise work on it. Using straightforward structure-style classes is boring, but it's not that much more code, and it's code that's using a well established idiom that pretty much everyone is already familiar with.
Given this situation, I did the responsible thing and decided that my desire to play around with attrs was in no way a sufficient justification for inflicting another Python module to learn on my co-workers. Boring straightforward code has its advantages.
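To sketch the comparison (with hypothetical field names), here is the boring structure-style class alongside the attrs version it would have replaced:

```python
# The familiar, boring structure-style class idiom:
class Account(object):
    def __init__(self, login, name, sponsor):
        self.login = login
        self.name = name
        self.sponsor = sponsor

# The attrs version is shorter but requires knowing the third-party
# attrs module (shown as a comment so this sketch stays standalone):
#
#   import attr
#
#   @attr.s
#   class Account(object):
#       login = attr.ib()
#       name = attr.ib()
#       sponsor = attr.ib()
```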
I can think of two things that would change this calculation. The first is if I needed more than just simple structure-style classes, so that attrs was saving me a significant chunk of code and making the code that remained much clearer. If I come out clearly ahead with attrs even after adding explanatory comments for my co-workers (or future me), then attrs is much more likely to be a win overall instead of just an indulgence.
(I think that the amount of usage and the size of the codebase matters too, but for us our codebases are small since we're just writing system utility programs and so on in Python.)
The second is if attrs usage becomes relatively widespread, so that my co-workers may well be encountering it in other people's Python code that we have to deal with, in online documentation, and so on. Then attrs would add relatively little learning overhead and might even have become the normal idiom. This is part of why I feel much more free to use modules in the standard library than third-party modules; the former are, well, 'standard' in at least some sense.
(Mind you, these days I'm sufficiently out of touch with the Python world that I'm not sure how I'd find out if attrs was a big, common thing. Perhaps if Django started using and recommending it.)
Some notes from my brief experience with the Grumpy transpiler for Python
I've been keeping an eye on Google's Grumpy Python to Go transpiler more or less since it was introduced because it's always been my great white hope for speeding up my Python code more or less effortlessly (and I like Go). However, until recently I had never actually tried to do anything much with it because I didn't really have a problem that it looked like a good fit for. What changed is that I finally got hit by the startup overhead of small programs.
As mentioned in that entry, my initial attempts to use Grumpy weren't successful, because how to actually use Grumpy for anything beyond toys is basically not documented today. Because sometimes I'm stubborn, I kept banging my head against the wall until I hacked together a way to build my program, which gave me the chance to get some real world results. Basically the process went like this:
- build Grumpy from source following their 'method 2' process (using the Fedora 25 system version of Go, not my own build, because Grumpy very much didn't work with the latter).
- have Grumpy translate my Python program into a module, which was possible because I'd kept it importable.
- modify grumprun to not delete the Go source file it creates on the fly based on your input. grumprun is in Python, which makes this reasonably easy.
- feed grumprun a Python program that was 'import mymodule; mymodule.main()' and grab the Go source code it generated (now that it wasn't deleting said source code afterward). This gave me a Go program that I could build into a binary that I could keep and then run with command line arguments.
Unfortunately it turns out that this didn't do me any good. First, the compiled binary of my Grumpy-transpiled Python code also took about the same 0.05 of a second to start and run as my real Python code. Second, my code immediately failed because Grumpy has not fully implemented set()s; in particular, it doesn't have the .difference() method. This is not listed in their wiki page of missing features, but Grumpy is underdocumented in general.
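For contrast, here is the method in question behaving normally under CPython (a trivial illustration, not the actual program's code):

```python
# set.difference() returns the elements of one set that are not in
# another; the '-' operator is the equivalent spelling.
wanted = set(["a", "b", "c"])
present = set(["b"])
missing = wanted.difference(present)
print(sorted(missing))
```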
(As a general note, Grumpy appears to be in a state of significant churn in how it operates and how you use it, which I suppose is not particularly surprising. You can find older articles on how to use Grumpy that clearly worked at the time but don't work any more.)
This whole experience has unfortunately left me much less interested in Grumpy. As it is today, Grumpy's clearly not ready for outside people to do anything with it, and even in the future it may well never be good at the kind of things I want it for. Building fast-starting and fast-running programs may not ever be a Grumpy priority. Grumpy is an interesting experiment and I wish Google the best of luck with it, but it clearly can't be my great hope for faster, lighter-weight Python programs.
My meta-view of Grumpy is that right now it feels like an internal Google (or Youtube) tool that Google just happens to be developing in a public repository for us to watch.
(In this particular case my fix was to hand-write a second version of the program in Go, which has been part irritating and part interesting. The Go version runs in essentially no time, as I wanted and hoped, so the slow startup of the Grumpy version is not intrinsic to either Go or the problem. My Go version will not be the canonical version of this program for local reasons, so I'll have to maintain it myself in sync with the official Python version for as long as I care enough to.)
Sidebar: Part of why Grumpy is probably slow (and awkward)
It's an interesting exercise to look at the Go code that Grumpy generates. It's not anything like Go code as you'd conventionally write it; instead, it's much closer to CPython bytecode that has been turned into Go code. This faithfully implements the semantics of (C)Python, which is explicitly one of Grumpy's goals, but it means that Grumpy has a significant amount of overhead over a true Go solution in many situations.
(The transpiler may lower some Python types and expressions to more pure Go code under some circumstances, but scanning the generated output for my Python program suggests that this is uncommon to rare in the kind of code I write.)
Grumpy implements various Python types in pure Go code, but as I found with set, some of those implementations are incomplete. In fact, now that I look, I can see that the only hand-written Go code in the entire project appears to be in those types, which generally correspond to things that are implemented in C in CPython. Everything else is generated by the transpiling process.
I've been hit by the startup overhead of small programs in Python
I've written before about how I care about the resource usage and speed of short running programs. However, that care has basically been theoretical. I knew this was an issue in general and it worried me because we have short running Python programs, but it didn't impact me directly and our systems didn't seem to be suffering as a result of it. Even DWiki running as a CGI was merely kind of embarrassing.
Today, I turned a hacky personal shell script into a better done
production ready version that I rewrote in Python. This worked fine
and everything was great right up to the point where I discovered
that I had made this script a critical path in invoking
dmenu on my office workstation, which is something
that I do a lot (partly because I have a very convenient key
binding for it). The new Python version
is not slow as such, but it is slower, and it turns out that I am very sensitive to even moderate startup delays with dmenu (because I type ahead, expecting dmenu to appear essentially instantly). With the old shell script version, this part of startup took around one to two hundredths of a second; with the new Python version, it now takes around a quarter of a second, which is enough lag to be perceptible and for my type-ahead to go awry.
(This assumes that my machine is unloaded, which is not always the
case. Active CPU load, such as installing Ubuntu in a test VM, can
make this worse. My
dmenu setup actually runs this program five
times to extract various information, so each individual run is
taking about five hundredths of a second.)
Profiling and measuring short running Python programs is a bit challenging and I've wound up resorting to fairly crude tricks (such as just exiting from the program at strategic points). These tricks strongly suggest that almost all of the extra time is going simply to starting Python, with a significant amount of it spent importing the standard library modules I use (and all of the things that they import in turn). Simply getting to the quite early point where I call ArgumentParser's parse_args method consumes almost all of the time on my desktop. My own code contributes relatively little to the slower execution (although not nothing), which unfortunately means that there's basically no point in trying to optimize it.
(On the one hand, this saves me time. On the other hand, optimizing Python code can be interesting.)
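One crude measurement of this sort can be sketched as follows (a minimal stand-in, not the real program): record a timestamp before anything else, then see how long it takes to get through the standard library imports and argument parsing.

```python
import time
_start = time.time()

# Importing argparse after the timestamp lets its import cost (and
# everything it pulls in) show up in the measurement.
import argparse

parser = argparse.ArgumentParser(description="demo")
parser.add_argument("-v", action="store_true")
args = parser.parse_args([])

elapsed = time.time() - _start
print("reached parse_args after %.4f seconds" % elapsed)
```

One could also just call sys.exit() at a point like this and time the whole run from the shell.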
My inelegant workaround for now is to cache the information my program is producing, so I only have to run the program (and take the quarter second delay) when its configuration file changes; this seems to work okay and it's as least as fast as the old shell script version. I'm hopeful that I won't run into any other places where I'm using this program in a latency sensitive situation (and anyway, such situations are likely to have less latency since I'm probably only running it once).
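The caching idea can be sketched roughly like this (the file names and the generate step are hypothetical stand-ins for what the real program does):

```python
import os

def generate(conf):
    # Stand-in for the actual (slow to start) work of the program.
    with open(conf) as f:
        return f.read().upper()

def cached_output(conf, cache):
    # Regenerate only when the configuration file is newer than the
    # cache file (or the cache doesn't exist yet); otherwise take
    # the fast path and just read the cache.
    if (not os.path.exists(cache)
            or os.path.getmtime(conf) > os.path.getmtime(cache)):
        data = generate(conf)
        with open(cache, "w") as f:
            f.write(data)
    with open(cache) as f:
        return f.read()
```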
In the longer run it would be nice to have some relatively general solution to pre-translate Python programs into some faster to start form. For my purposes with short running programs it's okay if the translated result has somewhat less efficient code, as long as it starts very fast and thus finishes fast for programs that only run briefly. The sort of obvious candidate is Google's grumpy project; unfortunately, I can't figure out how to make it convert and build programs instead of Python modules, although it's clearly possible somehow.
PS: The new version of the program is written in Python instead of shell because a non-hacky version of the job it's doing is more complicated than is sensible to implement in a shell script (it involves reading a configuration file, among other issues). It's written in Python instead of Go for multiple reasons, including that we've decided to standardize on only using a few languages for our tools and Go currently isn't one of them (I've mentioned this in a comment on this entry).
Why I care about Apache's mod_wsgi so much
I made a strong claim yesterday in an aside: I said that Apache with mod_wsgi is the easiest and most seamless way of running a Python WSGI app, and thus it was a pity that it doesn't support using PyPy for this. As I have restated it here this claim is a bit too strong, so I have to start by watering it down. Apache with mod_wsgi is definitely the easiest and most seamless way to run a Python WSGI app in a shared (web) environment, where you have a general purpose web server that handles a variety of URLs and services. It may also be your best option if the only thing the web server is doing is running your WSGI application, but I don't have any experience with such environments.
(I focus on shared web environments because none of my WSGI apps are likely to ever be so big and so heavily used that I need to devote an entire web server to them.)
Apache is a good choice as a general purpose web server in the first place, and once you have Apache, mod_wsgi makes deploying a WSGI application pretty straightforward. Generally all you need is a couple of lines of Apache configuration, and you can even arrange to have your WSGI application run under another Unix UID if you want (speaking as a sysadmin, that's a great thing; I would like as few things as possible to run as the web server UID). There's no need to run, configure, and manage another daemon, or to coordinate configuration changes between your WSGI daemon and your web server. Do you want to reload your app's code? Touch a file and it happens, you're done. And all of this lives seamlessly alongside everything else in the web server's configuration, including other WSGI apps also being handled through mod_wsgi.
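As a rough sketch of what those couple of lines look like (the path, process options, and UID here are made up for illustration, not a drop-in configuration):

```apache
# Run the WSGI app in its own daemon processes under a separate
# Unix UID, and map a URL to it.
WSGIDaemonProcess myapp user=myappuser processes=2 threads=5
WSGIScriptAlias /myapp /var/www/myapp/app.wsgi process-group=myapp
```

In daemon mode, touching the .wsgi file is what triggers the code reload mentioned above.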
As far as I know, every other option for getting a WSGI app up and running is more complicated, sometimes fearsomely so. I would like an even simpler option, but until such a thing arrives, mod_wsgi is as close as I can get (and it works well even in unusual situations).
I care about WSGI in general because it's the broadly right way to deploy a Python web app. The easier and simpler it is to deploy a WSGI app, the less likely I am to just write my initial simple version of something as a CGI and then get sucked into very peculiar lashups.
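On the application side, what mod_wsgi deploys is just a module with a WSGI callable named 'application'; a minimal sketch:

```python
# mod_wsgi imports the .wsgi file and calls 'application' for each
# request, per the WSGI interface.
def application(environ, start_response):
    body = b"Hello from WSGI\n"
    start_response("200 OK",
                   [("Content-Type", "text/plain"),
                    ("Content-Length", str(len(body)))])
    return [body]
```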
If you're going to use PyPy, I think you need servers
I have a long-standing interest in PyPy for the straightforward reason that it certainly would be nice to get a nice performance increase for my Python code basically for free, and I do have some code that is at least somewhat CPU-intensive. Also, to be honest, the idea and technology of PyPy is really neat and so I would like it to work out.
Back a few years ago when I did some experiments, one of the drawbacks of PyPy for my sort of interests was that it took a substantial amount of execution time to warm up and start performing better than CPython. I just gave the latest PyPy release a quick spin (using this standalone package for Linux (via)), and while it's faster than previous versions it still has that warm-up requirement, which is neither unexpected nor surprising (and in fact the PyPy FAQ explicitly talks about this). But this raises a question; if I want to use PyPy to speed up my Python code, what would it take?
If PyPy only helps on long running code, then that means I need to run things as servers instead of one-shot programs. This is doable; almost anything can be recast as a server if you try hard enough (and perhaps write the client in another, lighter weight language). However it's not enough to just have, say, a preforking server where the actual worker processes only do a bit of work and then die off, because that doesn't get you the long running code that PyPy needs. Instead you need either long running worker processes or threads within a single server process, and given Python's GIL you probably want the former.
(And yes, PyPy still has a GIL.)
A straightforward preforking server is going to duplicate a lot of warm-up work in each worker process, because the main server process doesn't do very much work on its own before it starts worker processes. I can imagine hacks to deal with this, such as having the server go through a bunch of synthetic requests before it starts forking off workers to handle real ones. This might have the useful side effect of reducing the overall memory overhead of PyPy by sharing more JIT data between worker processes. It does require you to generate synthetic requests, which is easy for me in one environment but not so much so for another.
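The synthetic-request warm-up hack can be sketched like this (hypothetical names throughout; 'handle' stands in for whatever the server's per-request work is):

```python
import os

def warm_up(handle, synthetic_requests):
    # Run synthetic requests in the main process before forking, so
    # the JIT compiles the hot paths once; forked workers then
    # inherit (and share) that warmed-up state.
    for req in synthetic_requests:
        handle(req)

def start_workers(handle, nworkers, worker_loop):
    # Plain preforking: each child runs a long-lived worker loop.
    pids = []
    for _ in range(nworkers):
        pid = os.fork()
        if pid == 0:
            worker_loop(handle)
            os._exit(0)
        pids.append(pid)
    return pids
```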
There is one obvious server environment that's an entirely natural fit for running Python code easily, and would in fact easily handle DWiki (the code behind this blog). That is Apache with mod_wsgi, which transparently runs your Python WSGI app in some server processes. Unfortunately, as far as I know mod_wsgi doesn't support PyPy and I don't think there are any plans to change that.
(There are other ways to run WSGI apps using PyPy, but none of them are as easy and seamless as Apache with mod_wsgi and thus all of them are less interesting to me.)