An interesting excursion with Python strings and is

March 5, 2015

Let's start with the following surprising interactive example from @xlerb's tweet:

>>> "foo" is ("fo" + "o")
True
>>> "foo" is ("fo".__add__("o"))
False
>>> "foo" == ("fo".__add__("o"))
True

The last two cases aren't surprising at all; they demonstrate that equality is broader than mere object identity, which is what the is operator tests (as I described in my entry on Python's two versions of equality). The surprising case is the first one; why do the two sides of that comparison result in exactly the same object? There turn out to be two things going on here, both of them quite interesting.

The first thing going on is that CPython does constant folding on string concatenation as part of creating bytecode. This means that the '"fo" + "o"' turns into a literal "foo" in the actual bytecodes that are executed. On the surface, this is enough to explain the check succeeding in some contexts. To make life simpler while simultaneously going further down the rabbit hole, consider a function like the following:

def f():
  return "foo" is ("fo"+"o")
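The constant folding described above can be observed directly. The following is a minimal sketch, assuming CPython 3.4 or later (for dis.get_instructions); the "<example>" filename is just a placeholder, and the exact opcode names printed vary between CPython versions.

```python
import dis

# Compile just the expression so its bytecode can be inspected in isolation.
code = compile('"fo" + "o"', "<example>", "eval")

# If the concatenation was folded at compile time, the combined string
# appears among the code object's constants.
print(code.co_consts)

# List the opcode names; a folded expression should load a constant
# rather than perform an addition at run time.
opnames = [instr.opname for instr in dis.get_instructions(code)]
print(opnames)
```

What else appears in co_consts besides the folded string depends on the CPython version, so treat the printed tuple as something to explore rather than a fixed result.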

Compiled functions have (among other things) a table of strings and other constants used in the function. Given constant folding and an obvious optimization, you would expect "foo" to appear in this table exactly once. Well, actually, that's wrong; here's what func_code.co_consts is for this function in Python 2:

(None, 'foo', 'fo', 'o', 'foo')

(It's the same in Python 3, but now it's in __code__.co_consts.)
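To look at the constant table yourself, something like the following sketch should work on Python 3. It builds the function through exec() only so that the SyntaxWarning newer CPython versions emit for using is with a literal can be silenced; the names source and namespace are made up for the example. The tuple you see may differ from the one shown above, since the details of constant folding have changed between CPython releases.

```python
import warnings

# Same function as above, kept as source text so it can be compiled
# with warnings suppressed.
source = '''
def f():
    return "foo" is ("fo" + "o")
'''

namespace = {}
with warnings.catch_warnings():
    # Newer CPython versions warn about `is` applied to a literal.
    warnings.simplefilter("ignore")
    exec(source, namespace)
f = namespace["f"]

# The constants the compiled function refers to (func_code in Python 2).
consts = f.__code__.co_consts
print(consts)
```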

Given this we can sort of see what happened. Probably the bytecode was originally compiled without constant folding and then a later pass optimized the string concatenation away and added the folded version to co_consts, operating on the entirely rational assumption that it didn't duplicate anything already there. This would be a natural fit for a simple peephole optimizer, which is in fact exactly what we find in Python/peephole.c in the CPython 2 source code.

But how does this give us object identity? The answer has to be that CPython interns at least some of the literal strings used in Python code. In fact, if we check func_code.co_consts for our function up above, we can see that both "foo" strings are already the same object even though there are two entries in co_consts. The effect is actually fairly strong; for example, the same literal string used in two different modules can be interned to be the same object. I haven't been able to find the CPython code that actually does this, so I can't tell you what the exact conditions are.
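Interning can also be requested explicitly, which makes the idea easier to see in isolation. This is only a sketch of the general mechanism using sys.intern() from Python 3 (the builtin intern() in Python 2); it does not show the automatic interning of literals that CPython performs on its own. The variable names are made up for the example.

```python
import sys

# Build "foo" at run time so the compiler has no chance to fold it.
parts = ["fo", "o"]
runtime = "".join(parts)
literal = "foo"

# The two strings are equal, but whether they are the same object
# is an implementation detail.
print(runtime == literal)
print(runtime is literal)

# sys.intern() returns the canonical copy of a string, so interning
# two equal strings yields one and the same object.
same_object = sys.intern(runtime) is sys.intern(literal)
print(same_object)
```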

(Whether or not a literal string is interned appears to depend partly on whether or not it has spaces in it. This rabbit hole goes a long way down.)
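One way to poke at this is to compile the same literal twice, in separate code objects, and see whether the two results are the same object. This is an exploratory sketch; literal_from_fresh_code is a hypothetical helper, and the answers it prints depend on the CPython version, so no particular output is claimed here.

```python
def literal_from_fresh_code(text):
    """Compile a string literal in its own code object and return its value."""
    code = compile(repr(text), "<example>", "eval")
    return eval(code)

for text in ("foo", "foo bar"):
    a = literal_from_fresh_code(text)
    b = literal_from_fresh_code(text)
    # Identity here suggests the literal was interned; equality always holds.
    print(repr(text), a == b, a is b)
```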

PS: I believe that this means I was wrong about some things I said in my entry on instance dictionaries and attribute names, in that more things get interned than I thought back then. Or maybe CPython grew more string interning optimizations since then.

Written on 05 March 2015.

Last modified: Thu Mar 5 00:22:04 2015