2020-02-04
What 'is' translates to in CPython bytecode
The main implementation of Python, usually called CPython, translates
Python source code into bytecode before interpreting it. How this
translation happens can make some things fast, such as how local
variables are implemented. When I wrote in yesterday's entry that
having 'is' as a keyword can make it faster than if it were a built-in
function, because as a keyword it doesn't have to be looked up all the
time just in case you changed it, I wondered how CPython actually
translated 'a is b' to bytecode. The answer turns out to be somewhat
more interesting than I expected.
(Bytecode can be most conveniently inspected with the dis module, and
the module's documentation helpfully explains a fair bit about what
the disassembled representation means.)
Let's define a little function:
    def f(a):
        return a is 10
Now we can disassemble this with 'dis.dis(f.__code__)' and get:

      2           0 LOAD_FAST                0 (a)
                  2 LOAD_CONST               1 (10)
                  4 COMPARE_OP               8 (is)
                  6 RETURN_VALUE
CPython bytecodes can have an auxiliary value associated with them
(shown here as the rightmost column, along with their meaning for
the particular bytecode operation). Rather than have separate
bytecodes for different comparison operators, all comparisons are
implemented with a single bytecode, COMPARE_OP, that picks which
comparison to do based on the auxiliary value. The 'is' comparison
is just the same as any other; if we used 'return a > 10' in our
function, the only difference in the bytecode would be the auxiliary
value for COMPARE_OP (it would become 4 instead of 8).
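If you want to poke at this on whatever Python you have at hand, here
is a minimal sketch using dis.get_instructions(). It compares two
arguments with 'a is b' rather than 'a is 10', since recent CPython
versions warn about 'is' with a literal, and the helper name
comparison_ops is made up for illustration. The opcode names and
auxiliary values differ between CPython versions (CPython 3.9 and later
use a separate IS_OP bytecode for 'is' instead of COMPARE_OP), so the
sketch accepts either and asserts nothing about the specific numbers.

```python
import dis

def f(a, b):
    return a is b

def g(a, b):
    return a > b

def comparison_ops(func):
    """Return (opname, arg, argrepr) for each comparison instruction in func."""
    return [
        (ins.opname, ins.arg, ins.argrepr)
        for ins in dis.get_instructions(func)
        # COMPARE_OP covers 'is' on older CPython; IS_OP on 3.9 and later.
        if ins.opname in ("COMPARE_OP", "IS_OP")
    ]

is_ops = comparison_ops(f)
gt_ops = comparison_ops(g)
print("a is b:", is_ops)
print("a > b: ", gt_ops)
```

Either way, each function should contain exactly one comparison
instruction, with the operator selected by the instruction's argument
rather than by any name lookup.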
The next obvious question to ask is how 'is not' is implemented,
and the answer is that it's another comparison type. If we change
our function to use 'is not', the only change is this:

                  4 COMPARE_OP               9 (is not)
CPython has one last trick up its sleeve. If we write 'not a is 10',
CPython specifically recognizes this and rather than translating
it as a COMPARE_OP followed by a UNARY_NOT, translates it straight
into the 'is not' comparison. This isn't a general transformation,
for various reasons; 'return not a > 10' won't be similarly translated
to the bytecode equivalent of 'return a <= 10'.
(CPython does go the extra distance to translate 'not a is not 10'
into 'a is 10'. I'm a little bit surprised, since I wouldn't expect
people to write that very often.)
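Whether your particular CPython performs the 'not a is b' rewrite can
be checked by comparing the instruction streams of the two spellings.
This is a rough sketch; the function names are invented for the
example, and the result of the bytecode comparison is an implementation
detail that may well differ from version to version, so the sketch
prints it instead of assuming an answer. What is guaranteed is that the
two spellings always produce the same result.

```python
import dis

def not_is(a, b):
    return not a is b

def is_not(a, b):
    return a is not b

def ops(func):
    # (opname, arg) pairs; source positions are deliberately ignored.
    return [(ins.opname, ins.arg) for ins in dis.get_instructions(func)]

# True if this interpreter compiles both spellings to identical bytecode.
same_bytecode = ops(not_is) == ops(is_not)
print("same bytecode:", same_bytecode)

# Regardless of the bytecode, the two forms must agree on every input.
marker = object()
other = object()
print(not_is(marker, marker), is_not(marker, marker))
print(not_is(marker, other), is_not(marker, other))
```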
PS: One advantage of 'is' being a keyword is that it allows CPython
to do this transformation, since CPython always knows what 'is'
does here. It wouldn't be safe to transform a hypothetical
'not isidentity(a, 10)' in the same way, since what isidentity does
could always be changed by rebinding the name.
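To make the rebinding hazard concrete, here is a small sketch.
isidentity is the hypothetical function from the text, not a real
built-in, and the other names are likewise invented for the example.

```python
def isidentity(a, b):
    # Hypothetical stand-in for the 'is' keyword.
    return a is b

def differs_from_none(a):
    # The name isidentity is looked up every time this function runs.
    return not isidentity(a, None)

before = differs_from_none(None)   # uses the original isidentity

def never_identical(a, b):
    return False

isidentity = never_identical       # rebind the name after the fact
after = differs_from_none(None)    # same function, different answer

print(before, after)               # prints: False True
```

Since differs_from_none() gives a different answer after the rebinding
without its own code changing, a compiler could not have safely folded
the 'not' into the call ahead of time.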
The place of the 'is' syntax in Python
Over on Twitter, I said:
A Python cold take (given how long it's taken me to arrive at it): 'is' should not be a keyword, it should be a built-in function that you're discouraged from using unless you really know what you're doing. As a keyword it's too tempting.
Python has two versions of equality: '==', which is plain equality,
and 'is', which is object identity; 'a is b' is true if and only if
a and b refer to the same object. Since the distinction between names
and values is fundamental to Python, we definitely need a way of
testing this (for example, to explore a puzzling mistake I once made).
However, I'm not so sure it should be a language keyword.
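The difference between the two is easiest to see with mutable objects.
A minimal illustration, with variable names picked just for the
example:

```python
a = [1, 2, 3]
b = [1, 2, 3]   # a separate list that happens to have equal contents
c = a           # another name for the very same list object

print(a == b, a is b)   # True False
print(a == c, a is c)   # True True

# Because a and c are the same object, a change through one name
# is visible through the other; b is unaffected.
c.append(4)
print(a)                # [1, 2, 3, 4]
print(b)                # [1, 2, 3]
```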
The issue with 'is' as a language keyword is that it makes using
object identity temptingly easy; after all, there's a keyword for
it, part of the language syntax. It's as if you're supposed to use
it. The first problem with this is simply that object identity is
a relatively advanced Python concept, one that's a bit tricky to
get your head around. Python code that genuinely needs to use 'is'
instead of '==' is almost invariably doing something tricky, and
we should generally avoid inviting people to routinely write code
that at least looks like tricky code. The second problem is that
in practice object identity can be tricky because Python implementations
(especially CPython) can quietly make objects be the same thing
(and thus 'a is b' will be true) when you didn't expect them to
be. It's possible to write safe code that uses 'is', but you need
to know a fair bit about what you're doing; perfectly sensible
looking code can conceal subtle bugs.
(When Python will give you the same object for two apparently different things depends on the specific version of (C)Python and also sometimes the exact way that you created the objects. It can get quite weird and involved.)
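One way to watch this happen is with integers. The sketch below builds
pairs of equal ints at run time (from strings, so the compiler can't
share constants) and reports whether they turn out to be the same
object. The helper name same_object is made up for the example. On
CPython, small integers are typically cached and so come back
identical, while larger ones usually don't, but this is an
implementation detail and the output is deliberately not promised
here.

```python
def same_object(text):
    # Two separate conversions of the same text: always equal,
    # but whether they are one object is up to the implementation.
    x = int(text)
    y = int(text)
    return x == y, x is y

for text in ("5", "256", "257", "100000"):
    equal, identical = same_object(text)
    print(text, "equal:", equal, "identical:", identical)
```

Code that used 'is' to compare such numbers would work in some cases
and quietly fail in others, which is exactly the sort of subtle bug
that '==' avoids.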
There are at least two reasons I can think of to still have 'is' as a
keyword. The first is that as a keyword, what it does is guaranteed
by the language and is not subject to being modified by people who
play games with namespaces in the way that, say, isinstance() can
be changed. Changing what isinstance() does by defining your own
version is probably a terrible idea, but you can do it if you feel the
urge. Meanwhile, 'is' is beyond the reach of anything but bytecode
rewriting. The second is that because 'is' is part of the language and
isn't subject to being changed, it can be implemented in a way that
makes it faster than a built-in function. Built-in functions need to
go through a global name lookup every time they're used, just in case
the name has been rebound, while 'is' can just be done directly since
it's part of the language.
(Local variables are fast because they avoid this lookup.)
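The lookup difference shows up directly in the bytecode. As a rough
sketch (the function names and the global_lookups helper are invented
for illustration), we can list the names a function fetches with
LOAD_GLOBAL, which is the instruction CPython uses for names that are
neither local nor closed over:

```python
import dis

def uses_builtin(a):
    return isinstance(a, int)

def uses_keyword(a, b):
    return a is b

def global_lookups(func):
    """Names that func fetches with LOAD_GLOBAL when it runs."""
    return [
        ins.argval
        for ins in dis.get_instructions(func)
        if ins.opname == "LOAD_GLOBAL"
    ]

print("isinstance version:", global_lookups(uses_builtin))
print("'is' version:      ", global_lookups(uses_keyword))
```

The isinstance() version has to look up isinstance (and int) by name
on every call, while the 'is' version should report no global lookups
at all.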
PS: Of course by now all of this is entirely theoretical. It's
entirely too late for Python to drop 'is' as a keyword, and even
thinking about it is a bit silly. But I apparently twitch a bit
when I see 'is' casually used in code examples, and that's sort
of what inspired the tweet that led to this entry.