Wandering Thoughts

2017-11-27

Code stability in my one Django web application

We have one Django web application, a system that automates much of the handling of our new Unix account requests. It was started in early 2011 (using Django 1.2), and I did a retrospective at the end of 2014 where I called it a faithful web app, one that had just kept on quietly working without problems. That's continued through to today; the app needs no routine attention, although every so often I tweak it to better handle an obscure situation.

One of the interesting aspects of that quiet stability is the relative stability of the application's Python code over those nearly six years so far. There are web frameworks where in six years you'd need to significantly rework and restructure your code to deal with changing APIs and approaches. For us, Django hasn't been one of them. Although we're not quite current on Django versions, we're not that far back, yet much of the code is basically the same (or literally the same) as when it started out all those years ago. I'm pretty sure that almost all of our model and view code is untouched over that time, and I think a lot of our templates are untouched or only minimally changed.

However, this is not a complete picture of code churn in our app, because there have been Django changes over that time in areas such as routing, command argument processing, template processing, and project structure. These changes have forced code changes in the areas of our app that deal with such things (and the change in project structure eventually forced a massive renaming of files when we went to Django 1.9). While this sounds kind of bad, I've wound up considering all of them to be relatively peripheral. In a way, all of the code involved is plumbing and glue. None of it really touches the heart of our web application, which (for us) lives mostly in the models and views and somewhat in the core logic of the templates. Django has been very good about keeping that core code from needing any substantive changes. We still validate form submissions and generate views and process model data in basically the same way we did in 2011, and all of that is what I think of as the hard stuff.
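As an illustration of that plumbing-level churn, Django's URL routing moved from the old patterns() helper to a plain list of url() calls (patterns() was deprecated in Django 1.8 and removed in 1.10). A sketch, with a made-up view name:

```python
# Old style (early Django 1.x), since removed:
#
#   from django.conf.urls import patterns, url
#   urlpatterns = patterns('',
#       url(r'^requests/$', views.request_list),
#   )
#
# Newer style, just a plain list:
#
#   from django.conf.urls import url
#   urlpatterns = [
#       url(r'^requests/$', views.request_list),
#   ]
```

Mechanical changes like this touch every urls.py file, but they don't touch what the views themselves do.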

(Although I haven't measured, I think also it's most of the app's code by line count.)

This code stability is one reason why Django upgrades have been somewhat painful but not deeply painful. If we'd needed major code restructuring, well, I'd probably have done it eventually because we might have had no choice, but we'd have likely updated Django versions more sporadically than we have so far.

PS: Although Django is going from version 1.11 to version 2.0 in the next release, the Django people say that this shouldn't be any more of an upgrade than usual. And speaking of that, I should get working on updating us to 1.11, since security updates for 1.10 will end soon (if they haven't already).

DjangoAppCodeStability written at 23:13:02

2017-11-05

How collections.defaultdict is good for your memory usage

There is a classic pattern in code that uses entries in dictionaries to accumulate data. In its simplest form, it looks like this:

e = dct.get(ky, None)
if e is None:
    e = []
    dct[ky] = e

# now we work on e without
# caring if it's new or old

There is an obvious variation of this that gets rid of the whole bureaucracy involving the if:

e = dct.setdefault(ky, [])
# work on e

On the surface, this looks very much like what you get with collections.defaultdict. At this level you might reasonably think that defaultdict is just a convenience, giving you a slightly shorter and nicer way to write this code so you don't have to do either the if or use .setdefault() instead of just doing a simple dct[ky]. However, there's an important way that both defaultdict and the if-based version are better than the .setdefault() version.
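For reference, the defaultdict spelling of the same pattern is just as short as the .setdefault() version:

```python
from collections import defaultdict

# the default factory (here, list) is named once, at creation time
dct = defaultdict(list)

ky = "some-key"
e = dct[ky]        # creates and stores a new list only if ky is missing
e.append("data")
# work on e as before, without caring if it's new or old
```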

To see it, let's change what the individual elements are:

e = dct.setdefault(ky, ExpensiveItem())
....

When I write things this way, the problem may jump out right away. The issue with this version is that we always create a new ExpensiveItem object regardless of whether ky is already in dct. If ky is not in dct, we use the new object and all is good, but if there already is one, we throw away the new object we just created. If we're dealing with a lot of keys that already exist, that's a lot of objects being created and then immediately thrown away. Both the if-based version and defaultdict avoid this problem because they only create a new object if and when they actually need it, and a defaultdict version is just as short as the .setdefault() version.
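A quick way to see the difference is to count constructions with a stand-in for ExpensiveItem (a made-up class for illustration):

```python
from collections import defaultdict

class ExpensiveItem(object):
    created = 0
    def __init__(self):
        ExpensiveItem.created += 1

# .setdefault() constructs a new object on every call, needed or not
d1 = {}
for ky in ["a", "a", "a"]:
    d1.setdefault(ky, ExpensiveItem())
print(ExpensiveItem.created)   # 3 constructions for one stored object

# defaultdict only constructs when the key is actually missing
ExpensiveItem.created = 0
d2 = defaultdict(ExpensiveItem)
for ky in ["a", "a", "a"]:
    d2[ky]
print(ExpensiveItem.created)   # 1 construction
```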

(The other subtle advantage of defaultdict is that you specify the default item only once, when you create the dictionary, instead of having to duplicate it in every section of code where you need to do this update-or-add pattern.)

On the one hand, this advantage of defaultdict feels obvious once I write it out like this. On the other hand, Python doesn't really encourage people to think about how often objects are created and other aspects of memory churn. Also, even if you know about the issue (as I generally do), it's tempting to go with the setdefault() version instead of the if version just because it's shorter and you probably aren't dealing with enough objects for this to matter. Using collections.defaultdict lets you have your cake and eat it too; you get short code and memory efficiency.

DefaultdictAndMemoryChurn written at 23:56:04

2017-10-18

I still like Python and often reach for it by default

Various local events recently made me think a bit about the future of Python at work. We're in a situation where a number of our existing tools will likely get drastically revised or entirely thrown away and replaced, and that raises local issues with Python 3 as well as questions of whether I should argue for changing our list of standard languages. I have some technical views on the answer, but thinking through this has made me realize something on a more personal level. Namely, I still like Python and it's my go-to default language for a number of things.

I'm probably always going to be a little bit grumpy about the whole transition toward Python 3, but that in no way erases the good parts of Python. Despite the baggage around it, Python 3 has its own good side and I remain reasonably enthused about it. Writing modest little programs in Python has never been a burden; the hard parts are never from Python, they're from figuring out things like data representation, and that's the same challenge in any language. In the meantime, Python's various good attributes make it pretty plastic and easily molded as I'm shaping and re-shaping my code as I figure out more of how I want to do things.

(In other words, experimenting with my code is generally reasonably easy. When I may completely change how I approach a problem between my first draft and my second attempt, this is quite handy.)

Also, Python makes it very easy to do string-bashing and to combine it with basic Unix things. This describes a lot of what I do, which means that Python is a low-overhead way of writing something that is much like a shell script but that's more structured, better organized, and expresses its logic more clearly and directly (because it's not caught up in the Turing tarpit of Bourne shell).

(This sort of 'better shell script' need comes up surprisingly often.)

My tentative conclusion about what this means for me is that I should embrace Python 3; specifically, I should embrace it for new work. Despite potential qualms for some things, new programs that I write should be in Python 3 unless there's a strong reason they can't be (such as having to run on a platform with an inadequate or missing Python 3). The nominal end of life for Python 2 is not all that far off, and if I'm continuing with Python in general (and I am), then I should be carrying around as little Python 2 code as possible.

IStillLikePython written at 02:58:38

2017-10-03

Some thoughts on having both Python 2 and 3 programs

Earlier, I wrote about my qualms about using Python 3 in (work) projects in light of the extra burden it might put on my co-workers if they had to work on the code. One possible answer here is that it's possible both to use Python 3 features in Python 2 and to write code that naturally runs unmodified under both versions (as I did without explicitly trying to). This is true, but there's a catch and that catch matters in this situation.

The compatibility between Python 2 and Python 3 is not symmetric. If you write natural Python 3 code, it can often run under Python 2, sometimes with __future__ imports. However, if you write natural Python 2 code, it will not run under Python 3 unless your code avoids, at a minimum, print statements and mixed tabs and spaces in indentation. A Python 3 programmer who knows very little about Python 2 and who simply writes natural code can produce a program that runs unaltered under Python 2, and can probably modify a Python 2 program without having it blow up in their face. But a Python 2 programmer who tries to work on a Python 3 program is quite possibly going to have things explode. They could get lucky, but all it takes is one print statement and Python 3 is complaining. This is true even if the original Python 3 code is careful to be Python 2 compatible (it uses appropriate __future__ imports and so on).

Since there are Python 3 features that are simply not available in Python 2 even with __future__ imports, a Python 3 programmer can still wind up blowing up a Python 2 program. But as someone who's now written both Python 2 and Python 3 code (including some that wound up being valid Python 2 code too), my feeling is that you have to go at least a bit out of your way in straightforward code to wind up doing this. By contrast, it's very easy for a Python 2 programmer to use Python 2-only things in code, partly because one of them (print statements) is a long-standing standard Python 2 idiom. A Python 2 programmer is relatively unlikely to produce code that also runs on Python 3 unless they explicitly try to (which requires a number of things, including awareness that there is even a Python 3).

So if you have part-time Python 3 programmers and some Python 2 programs, you'll probably be fine (and you can increase the odds by putting __future__ imports into the Python 2 programs in advance, so they're fully ready for Python 3 idioms like print() as a function). If you have part-time Python 2 programmers and some Python 3 programs, you're probably going to have to keep an eye on things; people may get surprises every so often. Unfortunately there's nothing you can really do to make the Python 3 code able to deal with Python 2 idioms like print statements.
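A sketch of what that advance preparation looks like in a Python 2 file:

```python
# With these imports at the top, a Python 2 file accepts Python 3 idioms,
# so a part-time Python 3 programmer can add print() calls without breakage.
from __future__ import print_function, division

print("count:", 3)   # multi-argument print() now works under Python 2 too
print(7 / 2)         # prints 3.5 under both versions, not 3
```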

(In the long run it seems clear that everyone is going to have to learn about Python 3, but that's another issue and problem. I suspect that many places are implicitly deferring it until they have no choice. I look forward to an increasing number of 'what to know about Python 3 for Python 2 programmers' articles as we approach 2020 and the theoretical end of Python 2 support.)

MixingPython2And3Programs written at 00:19:38

2017-09-21

My potential qualms about using Python 3 in projects

I wrote recently about why I didn't use the attrs module recently; the short version is that it would have forced my co-workers to learn about it in order to work on my code. Talking about this brings up a potentially awkward issue, namely Python 3. Just like the attrs module, working with Python 3 code involves learning some new things and dealing with some additional concerns. In light of this, is using Python 3 in code for work something that's justified?

This issue is relevant to me because I actually have Python 3 code these days. For one program, I had a concrete and useful reason to use Python 3 and doing so has probably had real benefits for our handling of incoming email. But for other code I've simply written it in Python 3 because I'm still kind of enthused about it and everyone (still) does say it's the right thing to do. And there's no chance that we'll be able to forget about Python 2, since almost all of our existing Python code uses Python 2 and isn't going to change.

However, my tentative view is that using Python 3 is a very different situation than the attrs module. To put it one way, it's quite possible to work with Python 3 without noticing. At a superficial level and for straightforward code, about the only difference between Python 3 and Python 2 is print("foo") versus print "foo". Although I've said nasty things about Python 3's automatic string conversions in the past, they do have the useful property that things basically just work in a properly formed UTF-8 environment, and most of the time that's what we have for sysadmin tools.

(Yes, this isn't robust against nasty input, and some tools are exposed to that. But many of our tools only process configuration files that we've created ourselves, which means that any problems are our own fault.)

Given that you can do a great deal of work on an existing piece of Python code without caring whether it's Python 2 or Python 3, the cost of using Python 3 instead of Python 2 is much lower than, for example, the cost of using the attrs module. Code that uses attrs is basically magic if you don't know attrs; code in Python 3 is just a tiny bit odd looking and it may blow up somewhat mysteriously if you do one of two innocent-seeming things.

(The two things are adding a print statement and using tabs in the indentation of a new or changed line. In theory the latter might not happen; in practice, most Python 3 code will be indented with spaces.)

In situations where using Python 3 allows some clear benefit, such as using a better version of an existing module, I think using Python 3 is pretty easily defensible; the cost is very likely to be low and there is a real gain. In situations where I've just used Python 3 because I thought it was neat and it's the future, well, at least the costs are very low (and I can argue that this code is ready for a hypothetical future where Python 2 isn't supported any more and we want to migrate away from it).

Sidebar: Sometimes the same code works in both Pythons

I wrote my latest Python code as a Python 3 program from the start. Somewhat to my surprise, it runs unmodified under Python 2.7.12 even though I made no attempt to make it do so. Some of this is simply luck, because it turns out that I was only ever invoking print() with a single argument. In Python 2, print("fred") is seen as 'print ("fred")', which is just 'print "fred"', which works fine. Had I tried to print() multiple arguments, things would have exploded.

(I have only single-argument print()s because I habitually format my output with % if I'm printing out multiple things. There are times when I'll deviate from this, but it's not common.)
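Sketched out, the lucky parse looks like this:

```python
# Python 2 parses print("fred") as the print statement applied to a
# parenthesized expression, so a single-argument print() is valid in both.
print("fred")

# Sticking to %-formatting keeps print() single-argument:
print("%s: %d" % ("requests", 42))

# print("a", "b") is where the versions diverge: Python 3 prints 'a b',
# while Python 2 prints the tuple ('a', 'b').
```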

Python3LearningQualms written at 01:35:57

2017-09-17

Why I didn't use the attrs module in a recent Python project

I've been hearing buzz about the attrs Python module for a while (for example). I was recently writing a Python program where I had some structures and using attrs to define the classes involved would have made the code shorter and more obvious. At first I was all fired up to finally use attrs, but then I took a step back and reluctantly decided that doing so would be the wrong choice.

You see, this was code for work, and while my co-workers can work in Python, they're not Python people in the way that I am. They're certainly not up on the latest Python things and developments; to them, Python is a tool and they're happy to let it be if they don't need to immerse themselves in it. Naturally, they don't know anything about the attrs module.

If I used attrs, the code would be a bit shorter (and it'd be neat to actually use it), but my co-workers would have to learn at least something about attrs before they could understand my code to diagnose problems, make changes, or otherwise work on it. Using straightforward structure-style classes is boring, but it's not that much more code, and it's code that uses a well-established idiom that pretty much everyone already knows.
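The boring structure-style class in question looks something like this (the class and field names are invented for illustration; the commented-out version assumes the third-party attrs package):

```python
class AccountRequest(object):
    # a plain structure-style class: more typing, but no new module to learn
    def __init__(self, login, sponsor, created):
        self.login = login
        self.sponsor = sponsor
        self.created = created

    def __repr__(self):
        return "AccountRequest(%r, %r, %r)" % (
            self.login, self.sponsor, self.created)

# The rough attrs equivalent, which also generates __repr__, __eq__,
# and so on for you:
#
#   import attr
#
#   @attr.s
#   class AccountRequest(object):
#       login = attr.ib()
#       sponsor = attr.ib()
#       created = attr.ib()
```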

Given this situation, I did the responsible thing and decided that my desire to play around with attrs was in no way a sufficient justification for inflicting another Python module to learn on my co-workers. Boring straightforward code has its advantages.

I can think of two things that would change this calculation. The first is if I needed more than just simple structure-style classes, so that attrs was saving me a significant chunk of code and making the code that remained much clearer. If I come out clearly ahead with attrs even after adding explanatory comments for my co-workers (or future me), then attrs is much more likely to be a win overall instead of just an indulgence.

(I think that the amount of usage and the size of the codebase matters too, but for us our codebases are small since we're just writing system utility programs and so on in Python.)

The second is if attrs usage becomes relatively widespread, so that my co-workers may well be encountering it in other people's Python code that we have to deal with, in online documentation, and so on. Then using attrs would add relatively little learning overhead and might even have become the normal idiom. This is part of why I feel much more free to use modules in the standard library than third-party modules; the former are, well, 'standard' in at least some sense.

(Mind you, these days I'm sufficiently out of touch with the Python world that I'm not sure how I'd find out if attrs was a big, common thing. Perhaps if Django started using and recommending it.)

AttrsLearningProblem written at 01:45:54

2017-08-11

Some notes from my brief experience with the Grumpy transpiler for Python

I've been keeping an eye on Google's Grumpy Python to Go transpiler more or less since it was introduced because it's always been my great white hope for speeding up my Python code more or less effortlessly (and I like Go). However, until recently I had never actually tried to do anything much with it because I didn't really have a problem that it looked like a good fit for. What changed is that I finally got hit by the startup overhead of small programs.

As mentioned in that entry, my initial attempts to use Grumpy weren't successful, because how to actually use Grumpy for anything beyond toys is basically not documented today. Because sometimes I'm stubborn, I kept banging my head against the wall until I hacked together a way to bring up my program, which gave me the chance to get some real world results. Basically the process went like this:

  • build Grumpy from source following their 'method 2' process (using the Fedora 25 system version of Go, not my own build, because Grumpy very much didn't work with the latter).
  • have Grumpy translate my Python program into a module, which was possible because I'd kept it importable.
  • hack grumprun to not delete the Go source file it creates on the fly based on your input. grumprun is in Python, which makes this reasonably easy.
  • feed grumprun a Python program that was 'import mymodule; mymodule.main()' and grab the Go source code it generated (now that it wasn't deleting said source code afterward). This gave me a Go program that I could build into a binary that I could keep and then run with command line arguments.

Unfortunately it turns out that this didn't do me any good. First, the compiled binary of my Grumpy-transpiled Python code also took about the same 0.05 of a second to start and run as my real Python code. Second, my code immediately failed because Grumpy has not fully implemented Python set()s; in particular, it doesn't have the .difference() method. This is not listed in their Missing features wiki page, but Grumpy is underdocumented in general.
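The missing method is an entirely ordinary CPython operation, which is part of why it was a surprise:

```python
# set.difference() is routine CPython; this is the sort of call that
# blew up under Grumpy's incomplete set implementation.
known = {"a", "b", "c"}
seen = {"b", "c"}
print(known.difference(seen))   # {'a'}
print(known - seen)             # the operator spelling of the same thing
```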

(As a general note, Grumpy appears to be in a state of significant churn in how it operates and how you use it, which I suppose is not particularly surprising. You can find older articles on how to use Grumpy that clearly worked at the time but don't work any more.)

This whole experience has unfortunately left me much less interested in Grumpy. As it is today, Grumpy's clearly not ready for outside people to do anything with it, and even in the future it may well never be good at the kind of things I want it for. Building fast-starting and fast-running programs may not ever be a Grumpy priority. Grumpy is an interesting experiment and I wish Google the best of luck with it, but it clearly can't be my great hope for faster, lighter-weight Python programs.

My meta-view of Grumpy is that right now it feels like an internal Google (or Youtube) tool that Google just happens to be developing in a public repository for us to watch.

(In this particular case my fix was to hand-write a second version of the program in Go, which has been part irritating and part interesting. The Go version runs in essentially no time, as I wanted and hoped, so the slow startup of the Grumpy version is not intrinsic to either Go or the problem. My Go version will not be the canonical version of this program for local reasons, so I'll have to maintain it myself in sync with the official Python version for as long as I care enough to.)

Sidebar: Part of why Grumpy is probably slow (and awkward)

It's an interesting exercise to look at the Go code that grumpc generates. It's not anything like Go code as you'd conventionally write it; instead, it's much closer to CPython bytecode that has been turned into Go code. This faithfully implements the semantics of (C)Python, which is explicitly one of Grumpy's goals, but it means that Grumpy has a significant amount of overhead over a true Go solution in many situations.

(The transpiler may lower some Python types and expressions to more pure Go code under some circumstances, but scanning the generated output for my Python program suggests that this is uncommon to rare in the kind of code I write.)

Grumpy implements various Python types in pure Go code, but as I found with set, some of those implementations are incomplete. In fact, now that I look, I can see that the only Go code in the entire project appears to be in those types, which generally correspond to things that are implemented in C in CPython. Everything else is generated by the transpiling process.

GrumpyBriefExperience written at 02:36:07

2017-08-05

I've been hit by the startup overhead of small programs in Python

I've written before about how I care about the resource usage and speed of short running programs. However, that care has basically been theoretical. I knew this was an issue in general and it worried me because we have short running Python programs, but it didn't impact me directly and our systems didn't seem to be suffering as a result of it. Even DWiki running as a CGI was merely kind of embarrassing.

Today, I turned a hacky personal shell script into a better done, production-ready version that I rewrote in Python. This worked fine and everything was great right up to the point where I discovered that I had made this script a critical path in invoking dmenu on my office workstation, which is something that I do a lot (partly because I have a very convenient key binding for it). The new Python version is not slow as such, but it is slower, and it turns out that I am very sensitive to even moderate startup delays with dmenu (partly because I type ahead, expecting dmenu to appear essentially instantly). With the old shell script version, this part of dmenu startup took around one to two hundredths of a second; with the new Python version, things now take around a quarter of a second, which is enough lag to be perceptible and for my type-ahead to go awry.

(This assumes that my machine is unloaded, which is not always the case. Active CPU load, such as installing Ubuntu in a test VM, can make this worse. My dmenu setup actually runs this program five times to extract various information, so each individual run is taking about five hundredths of a second.)

Profiling and measuring short running Python programs is a bit challenging and I've wound up resorting to fairly crude tricks (such as just exiting from the program at strategic points). These tricks strongly suggest that almost all of the extra time is going simply to starting Python, with a significant amount of it spent importing the standard library modules I use (and all of the things that they import in turn). Simply getting to the quite early point where I call argparse's parse_args ArgumentParser method consumes almost all of the time on my desktop. My own code contributes relatively little to the slower execution (although not nothing), which unfortunately means that there's basically no point in trying to optimize it.
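One crude way to measure this is to time a bare child interpreter against one that does some standard library imports (the exact numbers will vary by machine and load):

```python
import subprocess
import sys
import time

def time_startup(code):
    # run a child interpreter and time the whole start-to-exit cycle
    start = time.time()
    subprocess.check_call([sys.executable, "-c", code])
    return time.time() - start

bare = time_startup("pass")
with_imports = time_startup("import argparse, os, re")
print("bare: %.3fs  with imports: %.3fs" % (bare, with_imports))
```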

(On the one hand, this saves me time. On the other hand, optimizing Python code can be interesting.)

My inelegant workaround for now is to cache the information my program is producing, so I only have to run the program (and take the quarter second delay) when its configuration file changes; this seems to work okay and it's at least as fast as the old shell script version. I'm hopeful that I won't run into any other places where I'm using this program in a latency sensitive situation (and anyway, such situations are likely to have less latency since I'm probably only running it once).

In the longer run it would be nice to have some relatively general solution to pre-translate Python programs into some faster to start form. For my purposes with short running programs it's okay if the translated result has somewhat less efficient code, as long as it starts very fast and thus finishes fast for programs that only run briefly. The sort of obvious candidate is Google's grumpy project; unfortunately, I can't figure out how to make it convert and build programs instead of Python modules, although it's clearly possible somehow.

(My impression is that both grumpy and Cython have wound up focused on converting modules, not programs. Like PyPy, they may also be focusing on longer running CPU-intensive code.)

PS: The new version of the program is written in Python instead of shell because a non-hacky version of the job it's doing is more complicated than is sensible to implement in a shell script (it involves reading a configuration file, among other issues). It's written in Python instead of Go for multiple reasons, including that we've decided to standardize on only using a few languages for our tools and Go currently isn't one of them (I've mentioned this in a comment on this entry).

StartupOverheadProblem written at 00:59:49

2017-07-26

Why I care about Apache's mod_wsgi so much

I made a strong claim yesterday in an aside: I said that Apache with mod_wsgi is the easiest and most seamless way of running a Python WSGI app, and thus it was a pity that it doesn't support using PyPy for this. As I have restated it here, this claim is a bit too strong, so I have to start by watering it down. Apache with mod_wsgi is definitely the easiest and most seamless way to run a Python WSGI app in a shared (web) environment, where you have a general purpose web server that handles a variety of URLs and services. It may also be your best option if the only thing the web server is doing is running your WSGI application, but I don't have any experience with such environments.

(I focus on shared web environments because none of my WSGI apps are likely to ever be so big and so heavily used that I need to devote an entire web server to them.)

Apache is a good choice as a general purpose web server in the first place, and once you have Apache, mod_wsgi makes deploying a WSGI application pretty straightforward. Generally all you need is a couple of lines of Apache configuration, and you can even arrange to have your WSGI application run under another Unix UID if you want (speaking as a sysadmin, that's a great thing; I would like as few things as possible to run as the web server UID). There's no need to run, configure, and manage another daemon, or to coordinate configuration changes between your WSGI daemon and your web server. Do you want to reload your app's code? Touch a file and it happens, you're done. And all of this lives seamlessly alongside everything else in the web server's configuration, including other WSGI apps also being handled through mod_wsgi.
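A sketch of what 'a couple of lines' looks like; the paths, application name, and Unix user here are all made up:

```apache
WSGIScriptAlias /accounts /srv/apps/accounts/app.wsgi
WSGIDaemonProcess accounts user=accounts-app processes=2 threads=5
WSGIProcessGroup accounts
```

In daemon mode like this, touching app.wsgi is what triggers the code reload, and the user= setting is what runs the app under its own UID instead of the web server's.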

As far as I know, every other option for getting a WSGI app up and running is more complicated, sometimes fearsomely so. I would like an even simpler option, but until such a thing arrives, mod_wsgi is as close as I can get (and it works well even in unusual situations).

I care about WSGI in general because it's the broadly right way to deploy a Python web app. The easier and simpler it is to deploy a WSGI app, the less likely I am to just write my initial simple version of something as a CGI and then get sucked into very peculiar lashups.

WhyApacheModWsgiMatters written at 01:14:20

2017-07-25

If you're going to use PyPy, I think you need servers

I have a long-standing interest in PyPy for the straightforward reason that it certainly would be nice to get a nice performance increase for my Python code basically for free, and I do have some code that is at least somewhat CPU-intensive. Also, to be honest, the idea and technology of PyPy is really neat and so I would like it to work out.

Back a few years ago when I did some experiments, one of the drawbacks of PyPy for my sort of interests was that it took a substantial amount of execution time to warm up and start performing better than CPython. I just gave the latest PyPy release a quick spin (using this standalone package for Linux (via)), and while it's faster than previous versions it still has that warm-up requirement, which is neither unexpected nor surprising (and in fact the PyPy FAQ explicitly talks about this). But this raises a question; if I want to use PyPy to speed up my Python code, what would it take?

If PyPy only helps on long running code, then that means I need to run things as servers instead of one-shot programs. This is doable; almost anything can be recast as a server if you try hard enough (and perhaps write the client in another, lighter weight language). However it's not enough to just have, say, a preforking server where the actual worker processes only do a bit of work and then die off, because that doesn't get you the long running code that PyPy needs. Instead you need either long running worker processes or threads within a single server process, and given Python's GIL you probably want the former.

(And yes, PyPy still has a GIL.)

A straightforward preforking server is going to duplicate a lot of warm-up work in each worker process, because the main server process doesn't do very much work on its own before it starts worker processes. I can imagine hacks to deal with this, such as having the server go through a bunch of synthetic requests before it starts forking off workers to handle real ones. This might have the useful side effect of reducing the overall memory overhead of PyPy by sharing more JIT data between worker processes. It does require you to generate synthetic requests, which is easy for me in one environment but not so much so for another.
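A sketch of that synthetic-request hack, with a made-up request handler standing in for the real work:

```python
import os

def handle(req):
    # stand-in for real request processing
    return "ok:%s" % req

def warm_up(count=1000):
    # run synthetic requests in the parent so the hot code paths are
    # already JIT-compiled before any worker is forked
    for i in range(count):
        handle("synthetic-%d" % i)

def serve_forever():
    pass   # placeholder for the real per-worker serving loop

def start_workers(nworkers=4):
    warm_up()
    for _ in range(nworkers):
        pid = os.fork()
        if pid == 0:
            # child: starts life with the parent's warmed-up JIT state,
            # which fork() may also let workers share in memory
            serve_forever()
            os._exit(0)
```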

There is one obvious server environment that's an entirely natural fit for running Python code easily, and would in fact easily handle DWiki (the code behind this blog). That is Apache with mod_wsgi, which transparently runs your Python WSGI app in some server processes. Unfortunately, as far as I know mod_wsgi doesn't support PyPy and I don't think there are any plans to change that.

(There are other ways to run WSGI apps using PyPy, but none of them are as easy and seamless as Apache with mod_wsgi and thus all of them are less interesting to me.)

PyPyWantsServers written at 00:06:05
