Minimizing object churn to optimize Python code
One of the interesting things about writing code in high level garbage collected languages is how your optimization concerns can change. One optimization in all GC'd languages, Python included, is minimizing the creation of new objects and maximizing the reuse of old ones; the fewer new objects you create, the fewer cycles garbage collection eats, the less memory gets fragmented, and you often even use less memory overall.
Minimizing object churn is behind several well-known Python (anti-)patterns, such as 'avoid repeated string concatenation to build a string'.
(The simple approach to accumulating one big string from a bunch of function calls is to concatenate each value onto the growing result string as you get it. But this churns objects (and copies data) at every concatenation; the right pattern is to put all the function results in a list and then use "".join(reslst) at the end.)
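A minimal sketch of the two approaches (the function and argument names here are made up for illustration):

```python
def build_report_churny(items):
    # Anti-pattern: every += creates a brand new string object and
    # copies all of the data accumulated so far.
    out = ""
    for item in items:
        out += str(item) + "\n"
    return out

def build_report(items):
    # Better: collect the pieces in a list and join once at the end,
    # creating only one final string object.
    reslst = [str(item) + "\n" for item in items]
    return "".join(reslst)
```

Both produce the same result; the difference is purely in how many intermediate string objects get created and thrown away along the way.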
A lot of object churn avoidance can be summed up as leave objects alone if at all possible. This is especially important for large immutable objects like strings; any time you fiddle with a string you wind up with a new object. This means that you want to push fiddling with objects as close to their creation as possible, because this maximizes the chance for sharing the same object.
When you write code, look at it with an eye to how many objects it churns through, and how necessary those are. Some techniques include:
- if you always modify constants in some way, pre-compute the modified version and store it too.
- batch up things and do them all at once ("".join() is an example of this).
- use things like list comprehensions and string .join(); being built into the interpreter, they can optimize in ways not possible for your Python code.
- it's better to return an existing constant string than the 'same' string (same value) that you parsed from another string. (This generalizes to all objects.)
- if you have to lowercase strings that are usually lower-cased (or upper-cased, or trimmed of surrounding whitespace, and so on) to start with, it may be worthwhile to do something like 'nstr = ostr.lower()' followed by 'if nstr == ostr: nstr = ostr'. This doesn't reduce object churn much, but it will save memory and it minimizes fragmentation by destroying any unnecessary nstr almost immediately. (This too can be generalized beyond strings.)
- consider caches; one of their side effects is reducing object churn by returning references to existing objects. Of course, the devil is in the details, including cache aging.
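The first technique, pre-computing a modified constant, can be sketched like this (the prefix-tagging function is a hypothetical example, not from the original):

```python
PREFIX = "warning"

def tag_message_churny(msg):
    # Rebuilds the bracketed prefix (several intermediate strings)
    # on every single call.
    return "[" + PREFIX + "] " + msg

# Compute the modified version once at module load and store it too;
# every call then reuses the same prefix object.
PREFIX_TAG = "[" + PREFIX + "] "

def tag_message(msg):
    # Only one new string per call now.
    return PREFIX_TAG + msg
```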
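The lower-casing trick above, written out as a small function (the name normalized() is mine, for illustration):

```python
def normalized(ostr):
    # Lower-case ostr, but if it was already lower-case, hand back
    # the original object; the freshly created (and now unreferenced)
    # copy can then be reclaimed almost immediately.
    nstr = ostr.lower()
    if nstr == ostr:
        nstr = ostr
    return nstr
```

When most inputs are already lower-case, callers mostly get back the object they passed in instead of a short-lived duplicate.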
Sidebar: saving string memory with intern()

Python lets you 'intern' strings via the built-in intern() function, which maps duplicate strings to the same object. This makes it the easy way to eliminate memory wasted on duplicate strings in your program, at the cost of a hash lookup in intern() for every string that you want to de-duplicate. (Really, the intern() documentation explains this better than I do.)
You really want to be using Python 2.3 or later before you use this much, because before Python 2.3, any interned strings were immortal. (The interning code held a reference to them so they never got garbage collected.)
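(In Python 3 the function moved into the sys module as sys.intern(). A quick sketch of the effect:)

```python
import sys

# Two equal strings built at runtime are normally distinct objects.
a = "".join(["sta", "tus"])
b = "".join(["stat", "us"])
assert a == b and a is not b

# Interning maps both to a single shared object, so the duplicate
# copy can be dropped.
ia, ib = sys.intern(a), sys.intern(b)
assert ia is ib
```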