2015-03-05
An interesting excursion with Python strings and is
Let's start with the following surprising interactive example from @xlerb's tweet:
>>> "foo" is ("fo" + "o")
True
>>> "foo" is ("fo".__add__("o"))
False
>>> "foo" == ("fo".__add__("o"))
True
The last two cases aren't surprising at all; they demonstrate that
equality is broader than mere object identity, which is what the is
operator tests (as I described in my entry on Python's two versions of
equality). The surprising case is the first
one; why do the two sides of that result in exactly the same object?
There turn out to be two things going on here, both of them quite
interesting.
The first thing going on is that CPython does constant folding on
string concatenation as part of creating bytecode. This means that
the '"fo" + "o"' turns into a literal "foo" in the actual
bytecodes that are executed. On the surface, this is enough to
explain the is check succeeding in some contexts. To make life simpler while
simultaneously going further down the rabbit hole, consider a
function like the following:
def f():
    return "foo" is ("fo"+"o")
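The folding itself can be checked by disassembling just the concatenation. What follows is a minimal sketch, assuming a CPython 3 interpreter; the exact opcode names vary between versions, so it only looks for the two names I know have been used for a run-time addition (BINARY_ADD in older releases, BINARY_OP from 3.11 on).

```python
import dis

# Compile just the concatenation so the comparison does not get in the way.
code = compile('"fo" + "o"', "<example>", "eval")

# Opcode names actually present in the compiled bytecode.
opnames = [ins.opname for ins in dis.get_instructions(code)]

# BINARY_ADD (older CPython 3) or BINARY_OP (3.11+) would mean the addition
# is still performed at run time; after folding, neither should be present.
has_runtime_add = any(name in ("BINARY_ADD", "BINARY_OP") for name in opnames)

print(opnames)
print("runtime addition left in bytecode:", has_runtime_add)
print("constants:", code.co_consts)
```

If the addition has been folded, the only thing left for the bytecode to do is load the already-built "foo" constant and return it.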
Compiled functions have (among other things) a table of strings and
other constants used in the function. Given constant folding and
an obvious optimization, you would expect "foo" to appear in this
table exactly once. Well, actually, that's wrong; here's what
func_code.co_consts is for this function in Python 2:
(None, 'foo', 'fo', 'o', 'foo')
(It's the same in Python 3, but now it's in __code__.co_consts.)
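To look at the constants table on whatever interpreter is at hand, something like the following should work. It is a sketch for Python 3 only; the function is compiled from a string purely so that the SyntaxWarning newer CPython versions give for "is" with a literal can be silenced. The contents of the tuple depend on the CPython version, and newer releases may well show fewer entries than the Python 2 output above.

```python
import warnings

SOURCE = '''
def f():
    return "foo" is ("fo" + "o")
'''

# Recent CPython versions emit a SyntaxWarning for "is" with a literal;
# silence it here because the comparison is the whole point of the example.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", SyntaxWarning)
    namespace = {}
    exec(compile(SOURCE, "<example>", "exec"), namespace)

f = namespace["f"]
consts = f.__code__.co_consts

print(consts)  # the function's constants table
print(f())     # result of the identity check itself
```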
Given this we can sort of see what happened. Probably the bytecode was
originally compiled without constant folding and then a later pass
optimized the string concatenation away and added the folded version to
co_consts, operating on the entirely rational assumption that it
didn't duplicate anything already there. This would be a natural fit
for a simple peephole optimizer, which is in fact exactly what we find
in Python/peephole.c in the CPython 2 source code.
But how does this give us object identity? The answer has to be
that CPython interns at least some of the literal strings used in
Python code. In fact, if we check func_code.co_consts for our
function up above, we can see that both "foo" strings are in fact
already the same object even though there are two entries in
co_consts. The effect is actually fairly strong; for example, the
same literal string in two different modules can be interned to be
the same object.
I haven't been able to find the CPython code that actually does this,
so I can't tell you what the exact conditions are.
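The cross-module effect can be approximated without creating real modules by compiling the same literal twice, separately. This is only a sketch: literal_from_fresh_module is a hypothetical helper name, the two compile() calls stand in for two modules, and whether the identity checks come out true is an implementation detail of the interpreter, which is why no expected output is given. The one thing the language does promise is that sys.intern() hands back a single canonical object for equal strings.

```python
import sys

def literal_from_fresh_module(source):
    """Compile and run source on its own, then return its global s."""
    namespace = {}
    exec(compile(source, "<standalone>", "exec"), namespace)
    return namespace["s"]

# The same literal, compiled twice with nothing shared between the two runs.
first = literal_from_fresh_module('s = "foo"')
second = literal_from_fresh_module('s = "foo"')
print("same object across compilations:", first is second)

# A string assembled at run time is equal to the literal, but whether it is
# the same object is up to the interpreter.
built = "".join(["fo", "o"])
print("run-time string is the literal:", built is first)

# sys.intern() returns the canonical copy from the interpreter's intern table.
canonical = sys.intern(built)
print("interned string is the literal:", canonical is first)
```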
(Whether or not a literal string is interned appears to depend partly on whether or not it has spaces in it. This rabbit hole goes a long way down.)
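One way to poke at this is to compile the same literal twice and compare the resulting objects for a few different spellings. This is a rough probe rather than a statement of the rules; probe is a hypothetical helper name, and the identity results are deliberately not listed here because they depend on the CPython version.

```python
def probe(literal_source):
    """Compile the same literal twice, separately; report (equal, identical)."""
    values = []
    for _ in range(2):
        namespace = {}
        exec(compile("s = " + literal_source, "<probe>", "exec"), namespace)
        values.append(namespace["s"])
    return values[0] == values[1], values[0] is values[1]

# Identifier-like strings versus strings with a space or other punctuation.
for src in ('"foo"', '"foo_bar"', '"foo bar"', '"foo-bar"'):
    equal, identical = probe(src)
    print(src, "equal:", equal, "identical:", identical)
```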
PS: I believe that this means I was wrong about some things I said in my entry on instance dictionaries and attribute names, in that more things get interned than I thought back then. Or maybe CPython grew more string interning optimizations since then.