Minimizing object churn to optimize Python code

November 5, 2005

One of the interesting things about writing code in high level garbage collected languages is how your optimization concerns can change. One optimization in all GC'd languages, Python included, is minimizing the creation of new objects and maximizing the reuse of old ones; the fewer new objects you create, the fewer cycles garbage collection eats, the less memory gets fragmented, and you often even use less memory overall.

Minimizing object churn is behind several well-known Python (anti-)patterns, such as 'avoid repeated string concatenation to build a string'.

(The simple approach to accumulating one big string from a bunch of function calls is to concatenate each value to the building result string as you get it. But this churns objects (and copies data) at every concatenation; the right pattern is to put all the function results in a list and then use "".join(reslst) at the end.)

A lot of object churn avoidance can be summed up as leave objects alone if at all possible. This is especially important for large immutable objects like strings; any time you fiddle with a string you wind up with a new object. This means that you want to push fiddling with objects as close to their creation as possible, because this maximizes the chance for sharing the same object.

When you write code, look at it with an eye to how many objects it churns through, and how necessary those are. Some techniques include:

  • if you always modify constants in some way, pre-compute the modified version and store it too.
  • batch up things and do them all at once ("".join() is an example of this).
  • use things like list comprehensions and string .join(); being built in to the interpreter, they can optimize in ways not possible for your Python code.
  • it's better to return an existing constant string than the 'same' string (same value) that you parsed from another string. (This generalizes to all objects.)
  • if you have to lowercase strings that are usually lower-cased (or upper-cased, or trimmed of surrounding whitespace, and so on) to start with, it make be worthwhile to do something like 'nstr = ostr.lower(); if nstr == ostr: nstr = ostr'. This doesn't reduce object churn much, but it will save memory and it minimizes fragmentation by destroying any unnecessary nstr's almost immediately. (This too can be generalized beyond strings.)
  • consider caches; one of their side effects is reducing object churn by returning references to existing objects. Of course, the devil is in the details, including cache aging.

Sidebar: saving string memory with intern

Python lets you 'intern' strings via the built-in intern() function, which maps duplicate strings into the same object. This makes it the easy way to eliminate memory wasted due to duplicate strings in your program, at the cost of a hash lookup in intern() for every string that you want to de-duplicate. (Really, the documentation explains this better than I do.)

You really want to be using Python 2.3 or later before you use this much, because before Python 2.3, any interned strings were immortal. (The interning code held a reference to them so they never got garbage collected.)

Written on 05 November 2005.
« Modifying Linux kernel configuration options the easy way
Weekly spam summary on November 5th, 2005 »

Page tools: View Source, Add Comment.
Search:
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Sat Nov 5 03:17:32 2005
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.