2011-04-21
Nailing down new-style classes and types in Python
Since I keep confusing myself, it's time to write this stuff down once and for all to make sure I have it straight (even if some or all of it is in the official documentation).
One writes Python code to define classes; it's right there in the language syntax, where you write 'class A(object): ...'. Defining a class creates a type object for that class, which is an instance of type; this C-level object holds necessary information about the class and how it's actually implemented. This type object is what is bound to the class name; if you define a class A, 'type(A)' will then report <type 'type'>.
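As a quick check of the above, here is a minimal sketch (the class name A comes from the text; the rest is only illustration). It compares with 'is' instead of printing reprs, because the exact repr text differs between Python versions.

```python
class A(object):
    pass

# The name A is bound to the type object created by the class statement.
print(type(A) is type)     # the class itself is an instance of type
print(type(int) is type)   # so is a built-in, C-level type
print(type(A()) is A)      # and an instance's type is its class
```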
Classes have a class inheritance hierarchy, which is ultimately rooted at object (including for C-level classes). However, strictly speaking there is no type hierarchy as far as I know; all types are simply instances of type (including type itself). Further, the type non-hierarchy is of course unrelated to the class hierarchy. This means that isinstance(A, type) is True but issubclass(A, type) is both False and the wrong question (unless you really do have a subclass of type somewhere in your code).
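To make the isinstance() versus issubclass() distinction concrete, here is a small sketch. Meta is a hypothetical subclass of type that I've added purely for contrast; it is the one case where asking issubclass(..., type) makes sense.

```python
class A(object):
    pass

class Meta(type):   # hypothetical subclass of type, only for contrast
    pass

print(isinstance(A, type))      # True: A is an instance of type
print(issubclass(A, type))      # False: A does not inherit from type
print(issubclass(A, object))    # True: the class hierarchy is rooted at object
print(issubclass(Meta, type))   # True: this one really is a subclass of type
print(isinstance(type, type))   # True: type is an instance of itself
```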
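(Among other things I believe that this means that 'type(type(obj))' is 'type' for any Python object whose class does not use a custom metaclass, since all objects have a type and all types are instances of type. If the class does have a metaclass M, 'type(type(obj))' is M instead, although M is itself still an instance of type.)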
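Here is a sketch of how type(type(obj)) behaves, including the one case I know of where it is not literally type: a class built by a metaclass. Meta and B are hypothetical names; calling Meta directly is just a way to create a class with a metaclass that works the same in Python 2 and Python 3.

```python
class A(object):
    pass

class Meta(type):
    pass

# Calling the metaclass directly builds a class whose type is Meta.
B = Meta("B", (object,), {})

print(type(type(A())) is type)   # True: ordinary class
print(type(type(B())) is Meta)   # True: the metaclass shows up here
print(type(Meta) is type)        # True: the metaclass is an instance of type
print(isinstance(B, type))       # True: B is still an instance of type
```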
The Python documentation sometimes talks about a 'type hierarchy'. What it means is either 'the conceptual hierarchy of various built-in types', such as the various forms of numbers, mutable sequences, and so on, or 'the class inheritance hierarchy of built-in types', since a few are subclasses of others and everyone is a subclass of object.
(Some languages really do have a hierarchy of all types, with real (abstract) types for things like 'all numeric types' or 'all mutable sequence types', but Python does not. You can see this by inspecting the __mro__ attribute on built in types to see the classes involved in their method resolution order; the MRO of a type like int is just itself and object. Only a few built in types are subclasses of other types.)
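A short sketch of the __mro__ inspection, using a handful of built in types that I picked arbitrarily. I've left out expected output because some entries differ between Python versions (for example, Python 2's str also has basestring in its MRO).

```python
# Print the method resolution order of a few built in types.
for t in (int, float, list, dict, str, bool):
    print(t.__name__, [c.__name__ for c in t.__mro__])

# int is a direct child of object; bool is one of the few built in
# types that subclasses another built in type.
print(int.__mro__ == (int, object))
print(bool.__mro__ == (bool, int, object))
```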
PS: yes, almost all of this is in the Python documentation or is implied by it. Writing it down anyways helps me get it straight in my own head.
PPS: I believe that technically it would be possible for a sufficiently perverse extension module to create a valid new style C-level class that was not a subclass of object. Don't do that, and if you did I expect that things would blow up sooner or later.
Sidebar: the real difference between classes and types
If you use repr() on user-defined classes and on built in types (eg 'repr(A)' and 'repr(str)'), you'll notice that it reports them differently. This is a bit odd once you think about it, since they are both instances of type and so are using the same repr() function, yet one reports it is a 'class' and the other reports it is a 'type'.
In CPython, the difference between the two is whether the C-level type instance structure is flagged as having been allocated on the heap or not. A heap-allocated type instance is a class as far as type.__repr__() is concerned; a statically allocated one is a type. All classes defined in Python are allocated on the heap, like all other Python-created objects, and so report as classes. Most 'types' defined in C-level extension modules are statically defined and so get called types, but I believe that with sufficient work you could create a C-level type that had a heap allocated type instance and was reported as a class.
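You can peek at the heap flag from Python. This sketch assumes that the flag is Py_TPFLAGS_HEAPTYPE with the value 1 << 9 (as defined in CPython's Include/object.h) and that it is visible through the __flags__ attribute of type objects; is_heap_type is a hypothetical helper name. Note that Python 3's repr() says 'class' for both kinds, so there only the flag shows the difference.

```python
# Assumed value of Py_TPFLAGS_HEAPTYPE from CPython's Include/object.h.
Py_TPFLAGS_HEAPTYPE = 1 << 9

def is_heap_type(cls):
    """Return True if cls's C-level type structure is flagged as heap allocated."""
    return bool(cls.__flags__ & Py_TPFLAGS_HEAPTYPE)

class A(object):
    pass

print(repr(A), is_heap_type(A))       # defined in Python: heap allocated
print(repr(str), is_heap_type(str))   # statically defined in C
```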
(It's easy enough to keep it from being garbage collected out from underneath your extension module; you just artificially increase its reference count.)
How CPython implements __slots__ (part 1): storage
At an abstract level, each instance of a conventional class has a __dict__ member that is a conventional Python dictionary, and instance attributes are created and manipulated by manipulating this dictionary; the dictionary key is the attribute name and the value is the attribute's value. __slots__ eliminates this dictionary and instead has a fixed list of attributes that instances of the class know about. All of this is in the documentation. What the documentation won't tell you is how the machine level storage for all of this actually works. That's what today's entry is about.
In CPython, class instances start out as a more or less opaque C
structure that is specific to the C-level type that your class inherits
from (we saw this before). However, the general
CPython type infrastructure for new-style classes reserves the right to
add some extra space on the end of your type's opaque blob for its own
purposes.
If your class has a __slots__, this code adds some extra space after the C structure blob to store what is effectively an array of pointers to Python objects. These entries are used to point to the values of each __slots__ attribute (if there is no value set, the corresponding entry is NULL and the CPython code reacts appropriately).
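Some of this is visible from Python. The following is a sketch, not a definitive test: Point is a hypothetical class, and I'm assuming that __basicsize__ reports the size of the fixed C-level instance structure and that struct.calcsize("P") matches the size of a PyObject pointer on your platform.

```python
import struct

class Point(object):
    __slots__ = ("x", "y")

pointer_size = struct.calcsize("P")

# Each slot should add one pointer's worth of space to the base structure.
extra = Point.__basicsize__ - object.__basicsize__
print(extra, 2 * pointer_size)

p = Point()
p.x = 1
print(hasattr(p, "__dict__"))    # no per-instance dictionary

# An unset slot is a NULL entry, which surfaces as AttributeError.
try:
    p.y
    unset_raises = False
except AttributeError:
    unset_raises = True
print(unset_raises)

# The knowledge of where each slot lives is attached to the class,
# as a descriptor per slot name.
print(type(Point.__dict__["x"]).__name__)
```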
While somewhat complicated, this approach minimizes the memory overhead for class instances. If you allocated the array of slot value pointers separately, you would have a second memory allocation and you'd need an extra pointer in the base object structure to point to the separate array. And because all instances of the class have exactly the same slots, you can put all information on the names of slots and how to access them on the class, instead of having to have it also attached on the instance.
If you have a class that both has a non-empty __slots__ and tries to inherit from certain built in types, you will get the error:
nonempty __slots__ not supported for subtype of '<type>'
The Python documentation mentions this but does not explain the details of what is going on, which have to do with this storage approach.
Most C-level types have a fixed size C structure; however, the type infrastructure has general support for types that have a fixed size header structure plus some number of (fixed size) items immediately after the header. Because the information on how to access slot values is attached to the class, not the instance, the CPython code requires that all slot value pointers have a constant offset from the start of the instance object. This requires that all instance objects for a type have the same fixed size, which is not the case for instances of 'base + items' C-level types. Hence the message you get here.
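Here is a sketch that provokes the error. It uses tuple as the 'base + items' type because tuple behaves this way in both Python 2 and Python 3 (which other built in types qualify varies by version), and it assumes that a non-zero __itemsize__ is what marks such a type. try_define is a hypothetical helper that builds the class through type() so the failure can be caught.

```python
def try_define(base, slots):
    """Try to create a subclass of base with the given __slots__.

    Returns the TypeError message on failure, or None on success.
    """
    try:
        type("Sub", (base,), {"__slots__": slots})
    except TypeError as exc:
        return str(exc)
    return None

print(object.__itemsize__, tuple.__itemsize__)   # fixed size versus base + items

print(try_define(object, ("a",)))   # None: fixed-size base is fine
print(try_define(tuple, ("a",)))    # the error message from the text
```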
You can still have an empty __slots__ even for 'base + items' types, because this doesn't require allocating any slot value pointers; it just turns off the creation of the __dict__ dictionary.
(Well, usually.)
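A minimal sketch of the empty __slots__ case, again using tuple as the 'base + items' type. Pair and LoosePair are hypothetical names.

```python
class Pair(tuple):
    __slots__ = ()     # empty: no slot pointers, and no __dict__ either

class LoosePair(tuple):
    pass               # no __slots__: instances get a __dict__

p = Pair((1, 2))
q = LoosePair((1, 2))

print(hasattr(p, "__dict__"))   # False
print(hasattr(q, "__dict__"))   # True

# Without a __dict__ there is nowhere to put a new attribute.
try:
    p.label = "first"
    rejected = False
except AttributeError:
    rejected = True
print(rejected)                 # True

q.label = "second"              # fine: stored in q.__dict__
```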
Sidebar: how __dict__ itself is (usually) implemented
One might innocently think that __dict__ would be implemented by having something like an ob_dict pointer in the basic Python C-level object structure. As it happens, CPython is both more clever and more sleazy than this. The storage for the pointer to the __dict__ dictionary is actually usually created through this same 'add things on the end of the type's blob' code, and the C structure for the type itself has a field that says what offset this pointer is to be found at. This saves a pointer when __slots__ turns off __dict__ and probably has other implementation advantages that I don't know about.
You might wonder how this works for base + items types. That's where the sleaze comes in: CPython has special magic support to make this work for the __dict__ offset. If I'm reading the code right, it switches to indexing the offset from the end of the object instead of the start.
(If you want the gory details, see _PyObject_GetDictPtr in Objects/object.c in the CPython source code.)
If you want to see some of this sausage's insides, look at the __dictoffset__ attribute of any new-style class. For bonus points, create a class that inherits from, say, str and then look at its __dictoffset__. Note that almost all built-in types will show a 0 for this value for reasons that do not fit into this sidebar.
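A sketch for poking at __dictoffset__ yourself. All of the class names are hypothetical. I have deliberately not written down expected values for the classes that have a __dict__, because the numbers (and even their sign) depend on the CPython version and platform; the only thing I'm confident of is that types without a __dict__ report 0.

```python
class Plain(object):
    pass

class Slotted(object):
    __slots__ = ("a",)

class MyStr(str):
    pass

class MyTuple(tuple):
    pass

# 0 means instances have no __dict__ pointer at all; anything else is
# version-dependent detail about where that pointer lives.
for cls in (object, int, Slotted, Plain, MyStr, MyTuple):
    print(cls.__name__, cls.__dictoffset__)
```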