Examining Python's string concatenation optimization

November 7, 2005

In the comments on the other day's 'minimizing object churn' entry, it was noted that Python string concatenation had been optimized some in the recent 2.4 release of the CPython interpreter. The 2.4 release notes say (from here):

  • String concatenations in statements of the form s = s + "abc" and s += "abc" are now performed more efficiently in certain circumstances.

But what is 'more efficiently' and what are the 'certain circumstances'? What does it take to get the optimized behavior?

Here's the summary: counting on this optimization is unwise, as it turns out to depend on low-level details of system memory allocation that can vary from system to system and workload to workload. Also, this is only for plain (byte) strings, not for Unicode strings; as of Python 2.4.2, Unicode string concatenation remains un-optimized.
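Since the optimization can't be counted on, the portable way to build a string piece by piece remains collecting the parts in a list and joining once at the end. A minimal sketch:

```python
# Portable regardless of interpreter version or allocator behavior:
# accumulate the pieces in a list and join once at the end, instead
# of relying on the += concatenation optimization.
parts = []
for i in range(5):
    parts.append(str(i))
result = "".join(parts)  # "01234"
```

This does one allocation and one copy pass for the final string, no matter how realloc() behaves on your platform.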

The bits of string concatenation that theoretically can be optimized away are allocating a new string object, copying one of the old string objects into it, and freeing that old object. (No matter what, CPython has to copy the data of at least one of the source strings into the result.)

The new 2.4 optimization attempts to reuse the left side string object and enlarge it in place. For this to be possible, the object needs to have a reference count of one (and not be intern'd). To help create this situation, the CPython interpreter peeks ahead to see if the object has exactly two references and the next bytecode is an assignment to a local variable that currently points to the same object; if so, the variable gets zapped on the spot, dropping the reference count to one.
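The peek-ahead keys off the bytecode pattern of a concatenation immediately followed by a store back into the same local variable. You can see that pattern with the dis module (exact opcode names vary across CPython versions, so take this as an illustration, not a specification):

```python
import dis
import io

def f(s):
    # The concatenation compiles to an add opcode immediately followed
    # by a STORE_FAST back into the local 's'; that adjacent store is
    # the pattern the interpreter's peek-ahead looks for.
    s = s + "abc"
    return s

buf = io.StringIO()
dis.dis(f, file=buf)
print("STORE_FAST" in buf.getvalue())
```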

(Nit: this assignment reference dropping also happens for module-level code that refers to global variables.)

Even when the left side string object can be reused, we're not actually saving anything unless it can be enlarged in place. To be enlarged in place, the object needs to be at least about 235 bytes long (on 32-bit systems), and the C library realloc() needs to have enough free memory after it for the object to grow into.

(CPython uses an internal allocator that always grows things by copying for all allocations of 256 bytes or less; string objects have about 21 bytes of overhead on a normal 32-bit platform. For more details, see the CPython source. If you're interested in the gory details, start with the string_concatenate function in Python/ceval.c and work outwards.)

I suspect that the optimization is most likely to trigger when you are repeatedly growing a fairly large string by a small bit, as C libraries often use allocation strategies that leave reasonable amounts of space free after large objects. If your program mostly deals with smaller strings, you may not benefit much (if at all) from this optimization.
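You can sometimes observe the reuse directly by checking the object's identity before and after an in-place concatenation. Whether the identity stays the same depends entirely on the allocator, so treat this as an experiment, not a guarantee:

```python
s = "x" * 1000            # large enough to bypass the small-object allocator
before = id(s)
s += "abc"                # refcount is one here, so CPython may realloc in place
reused = (id(s) == before)
# 'reused' may come out True or False depending on whether realloc()
# could grow the block where it sits; the result string is correct
# either way.
print(len(s), reused)
```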

One consequence of all this is that none of the following will be optimized: in each case the assignment is of the wrong type for the interpreter's peek-ahead, so the left side reference (implicit in the case of '+=') cannot be reduced to a reference count of one:

def foo(self, alst, a):
  global bar; bar += a
  self.foo += a
  alst[0] += a

This isn't too surprising, since both self.foo and alst[0] aren't necessarily simple store operations once the dust settles. The global case could probably be optimized, but may be considered to be too uncommon to bother writing the code for.

However, 's = strfunc(...) + strfunc2(...)' and the like can get optimized, unless there's another reference to strfunc()'s return value hanging around somewhere. In fact a whole series of string concatenations of function return values can get optimized down. (Always assuming that the string sizes, the free memory, and so on work out.)
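For instance (with hypothetical strfunc/strfunc2 stand-ins), each intermediate result has a reference count of one by the time the next add runs, so the whole chain is a candidate for in-place growth, sizes and free memory permitting:

```python
def strfunc():
    # hypothetical helper: returns a fresh, fairly large string
    return "a" * 300

def strfunc2():
    # hypothetical helper: a small tail to append
    return "b" * 10

# strfunc()'s return value has no other references, so the interpreter
# is free to try enlarging it in place at each '+' in turn.
s = strfunc() + strfunc2() + strfunc2()
print(len(s))  # 320
```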

Sidebar: what about optimizing string prepending?

By string prepending, I mean 's = "abc" + s' (versus 's = s + "abc"'). CPython doesn't optimize this, because it only looks at the left side string object.

You can't significantly optimize string prepending in CPython, because you always have to move s's current contents up to make room at the start for the string you're sticking on the front. This means that all you could possibly save is the malloc() and free() overhead and the overhead of setting up the new string's object header. This is probably small relative to copying the existing string data, so not a very compelling optimization.

(I suspect that string prepending is also uncommon in Python code.)
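If you do find yourself repeatedly prepending, the usual workaround is to append the pieces instead and pay the reversal cost once at the end. A sketch:

```python
# Prepend "a", then "b", then "c" onto an existing string without
# copying the whole string on every prepend: append each piece,
# reverse once, and join once.
s = "tail"
pieces = []
for p in ["a", "b", "c"]:
    pieces.append(p)       # cheap amortized append
pieces.reverse()           # the last-prepended piece ends up first
s = "".join(pieces) + s    # "cbatail"
```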

Comments on this page:

From at 2005-11-07 18:38:55:

Note that the aliasing issue is unlikely to be a significant barrier to optimizing repeated concatenation; any code along the lines of

  foo = first_item
  bar = foo

  for z in some_list:
     foo = foo + z

  baz = foo

might not get an optimization on the first iteration, because foo is aliased; but after that foo becomes unaliased, and thus is optimizable. Thus if there would normally be N temps generated, the optimization may reduce it to only 1 temp generated.

For aliasing to prevent the optimization, you'd have to alias each temporary as it's generated. Possibly this might happen if you used a recursive algorithm rather than an iterative one, but it seems unlikely in the iterative case.

