2014-01-31
Why I now believe that duck typed metaclasses are impossible in CPython
As I mentioned in my entry on fake versus real metaclasses, I've wound up a bit obsessed with the question
of whether it's possible to create a fully functional metaclass
that doesn't inherit from type. Call this a 'duck typed metaclass'
or if you want to be cute, a 'duck typed type' (DTT). As a result
of that earlier entry and some additional
exploration I now believe that it's impossible.
Let's go back to MetaclassFakeVsReal for a moment and look at the
fake metaclass M2:
class M2(object):
    def __new__(self, name, bases, dct):
        print "M2", name
        return type(name, bases, dct)

class C2(object):
    __metaclass__ = M2

class C4(C2):
    pass
As we discovered, the problem is that C2 is not an instance of M2
and so (among other things) its subclass C4 will not invoke M2 when
it is being created. The real metaclass M1 avoided this problem by
instead using type.__new__() in its __new__ method. So why
not work around the problem by making M2 do so too, like this:
class M2(object):
    def __new__(self, name, bases, dct):
        print "M2", name
        return type.__new__(self, name, bases, dct)
Here's why:
TypeError: Error when calling the metaclass bases
type.__new__(M2): M2 is not a subtype of type
I believe that this is an old friend in a new
guise. Instances of M2 would normally be based on the C-level
structure for object (since it is a subclass of object), which
is not compatible with the C-level type structure that instances
of type and its subclasses need to use. So type says 'you cannot
do this' and walks away.
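This refusal is easy to reproduce directly. Here's a minimal sketch (written in Python 3 syntax so it runs standalone; the Python 2 behaviour is the same):

```python
class M2(object):
    def __new__(cls, name, bases, dct):
        # Ask type to build a type-shaped C struct for an M2 instance:
        return type.__new__(cls, name, bases, dct)

def try_make_class():
    try:
        M2("C2", (object,), {})
        return None
    except TypeError as e:
        return str(e)

error = try_make_class()
# CPython complains along the lines of
# "type.__new__(M2): M2 is not a subtype of type"
print(error)
```

The check is enforced in C, before any Python-level code of ours gets a chance to run.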
Given that we need C2 to be an instance of M2 so that things work
right for subclasses of C2 and we can't use type, we can try brute
force and fakery:
class M2(object):
    def __new__(self, name, bases, dct):
        print "M2", name
        r = super(M2, self).__new__()
        r.__dict__.update(dct)
        r.__bases__ = bases
        return r
This looks like it works in that C4 will now get created by M2.
However this is an illusion and I'll give you two examples of the
ensuing problems, each equally fatal.
Our first problem is creating instances of C2, ie the actual
objects that we will want to use in code. Instance creation is
fundamentally done by calling C2(), which means that M2 needs a
__call__ special method (so that C2, an instance of M2, becomes
callable). We'll try a version that delegates all of the work to type:
    def __call__(self, *args, **kwargs):
        print "M2 call", self, args, kwargs
        return type.__call__(self, *args, **kwargs)
Unsurprisingly but unfortunately this doesn't work:
TypeError: descriptor '__call__' requires a 'type' object but received a 'M2'
Okay, fine, we'll try more or less the same trick as before (which is now very dodgy, but ignore that for now):
    def __call__(self, *args, **kwargs):
        print "M2 call", self, args, kwargs
        r = super(M2, self).__new__(self)
        r.__init__(*args, **kwargs)
        return r
You can probably guess what's coming:
TypeError: object.__new__(X): X is not a type object (M2)
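The same guard can be seen in isolation: object.__new__ insists that its first argument actually be a class. A minimal Python 3 sketch:

```python
class NotAType(object):
    pass

instance = NotAType()

def try_new(thing):
    # object.__new__'s first argument must be a class, not an instance
    try:
        object.__new__(thing)
        return None
    except TypeError as e:
        return str(e)

error = try_new(instance)
# CPython says something like
# "object.__new__(X): X is not a type object (NotAType)"
print(error)
```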
We are now well and truly up the creek because classes are the only
thing in CPython that can have instances. Classes are instances of
type and as we've seen we can't create something that is both an
instance of M2 (so that M2 is a real metaclass instead of a fake
one) and an instance of type. Classes without instances are obviously
not actually functional.
The other problem is that despite how it appears C4 is not actually
a subclass of C2 because of course classes are the only thing
in CPython that can have subclasses. In specific, attribute lookups
on even C4 itself will not look at attributes on C2:
>>> C2.dog = 10
>>> C4.dog
AttributeError: 'M2' object has no attribute 'dog'
The __bases__ attribute that M2.__new__ glued on C4 (and C2)
is purely decorative. Again, looking attributes up through the chain of
bases (and the entire method resolution order)
is something that happens through code that is specific to instances of
type. I believe that much of it lives under the C-level function that
is type.__getattribute__, but some of it may be even more magically
intertwined into the guts of the CPython interpreter than that. And as
we've seen, we can't call type.__getattribute__ ourselves unless we
have something that is an instance of type.
Note that there are literally no attributes we can set on non-type
instances that will change this. On actual instances of type, things
like __bases__ and __mro__ are not actual attributes but are
instead essentially descriptors that look up and manipulate fields
in the C-level type struct. The actual code that does things like
attribute lookups uses the C-level struct fields directly, which is one
reason it requires genuine type instances; only genuine instances even
have those struct fields at the right places in memory.
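You can see a hint of this machinery from Python itself: on type, __bases__ and __mro__ live in type's own dictionary as descriptor objects rather than as stored values (Python 3 shown; Python 2 behaves the same way):

```python
# __bases__ and __mro__ are not ordinary attributes; they are
# descriptors on type that read fields out of the C-level type struct.
bases_desc = type.__dict__['__bases__']
mro_desc = type.__dict__['__mro__']

print(type(bases_desc).__name__)  # a descriptor type, not a tuple
print(type(mro_desc).__name__)
print(int.__bases__)              # read through the descriptor
```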
(Note that attribute inheritance in subclasses is far from the only
attribute lookup problem we have. Consider accessing C2.afunction
and what you'd get back.)
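To make the C2.afunction problem concrete: the descriptor protocol that turns functions into bound methods only fires for attributes found on an object's type, never for things sitting in its instance dictionary, which is exactly where our fake M2 put everything. A minimal Python 3 sketch:

```python
def afunction(self):
    return "called"

class FakeClass(object):
    pass

fake = FakeClass()
# Like M2.__new__'s r.__dict__.update(dct): stuff the function
# into the instance dictionary of a non-type object.
fake.__dict__['afunction'] = afunction

# Instance-dict lookup hands back the bare function; no binding
# to 'fake' ever happens.
print(fake.afunction is afunction)  # True
```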
Either problem is fatal, never mind both of them at once (and note
that our M2.__call__ is nowhere near a complete emulation of
what type.__call__ actually does). Thus as far as I can tell
there is absolutely no way to create a fully functional duck typed
metaclass in CPython. To do one you'd need access to the methods
and other machinery of type and type reserves that machinery
for things that are instances of type (for good reason).
I don't think that there's anything in general Python semantics that
requires this, so another Python implementation might allow or support
enough to enable duck typed metaclasses. What blocks us in CPython is
how CPython implements type, object, and various core functionality
such as creating instances and doing attribute lookups.
(I tried this with PyPy and it failed with a different set of errors
depending on which bits of type I was trying to use. I don't have
convenient access to any other Python implementations.)
2014-01-21
Fake versus real metaclasses and what a fully functional metaclass is
Lately I've become a little bit obsessed with the question of whether
you can create a fully functional metaclass that doesn't inherit
from type (partly this was sparked by an @eevee tweet, although
it's an issue I brushed against a while back).
It's not so much that I want to do this or think that it's sensible
as that I can't prove what the answer is either way and that bugs
me. But before I try to tackle the big issues I want to talk about
what I mean by 'fully functional metaclass'.
Let's start with some very simple metaclasses, one of which inherits
from type and one of which doesn't:
class M1(type):
    def __new__(self, name, bases, dct):
        print "M1", name
        return super(M1, self).__new__(self, name, bases, dct)

class M2(object):
    def __new__(self, name, bases, dct):
        print "M2", name
        return type(name, bases, dct)
class C1(object):
    __metaclass__ = M1

class C2(object):
    __metaclass__ = M2
M2 certainly looks like a metaclass despite not inheriting from type
(eg if you try this out you can see that it is triggered on the creation
of C2). But appearances are deceiving. M2 is not a fully functional
metaclass (and there are ways to demonstrate this). So let me show you
what's really going on:
>>> type(C1)
<class 'M1'>
>>> type(C2)
<type 'type'>
(We can get the same information by looking at each class's __class__
attribute.)
The type of a class with a metaclass is the metaclass while the
type of a class without a metaclass is type, and as we can see
from this, C2 doesn't actually have a metaclass. The reason for
this is that M2 created the actual class object for C2 by calling
type() directly, which does not give the newly created class a
metaclass (instead it becomes a direct instance of type). If all
you're interested in is changing a class as it's being created, this may not matter, or at least you may not
notice any side effects if you don't subclass your equivalent of
C2.
In this example M1 is what I call a fully functional metaclass and
M2 is not. It looks like one and partly acts like one, but that is an
illusion; at best it can do only one of the many things metaclasses
can do. A fully functional metaclass like M1 can do
all of them.
Now let's come back to a demonstration that M2 is not a real
metaclass. The most alarming way to demonstrate this is to subclass
both classes:
class C3(C1):
    pass

class C4(C2):
    pass
If you try this out you'll see that M1 is triggered when C3 is
created but M2 is not triggered when C4 is created.
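The same fake-versus-real split can be reproduced in Python 3 syntax, which is handy if you want to run it today; in Python 3, a metaclass that isn't a subtype of type is simply called as a callable, reproducing the Python 2 __metaclass__ behaviour shown above:

```python
created = []

class M1(type):
    def __new__(cls, name, bases, dct):
        created.append(("M1", name))
        return super().__new__(cls, name, bases, dct)

class M2(object):
    def __new__(cls, name, bases, dct):
        created.append(("M2", name))
        return type(name, bases, dct)

class C1(metaclass=M1):
    pass

class C2(metaclass=M2):
    pass

class C3(C1):   # real metaclass: inherited, so M1 runs again
    pass

class C4(C2):   # fake metaclass: type(C2) is type, so M2 never runs
    pass

print(created)      # [('M1', 'C1'), ('M2', 'C2'), ('M1', 'C3')]
print(type(C1) is M1)    # True
print(type(C2) is type)  # True
```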
This is very confusing because C4 (and C2 for that matter) has
a visible __metaclass__ attribute. It's just not meaningful
after the creation of C2, contrary to what some documentation
sometimes says. Note that this is sort of documented if you read
Customizing class creation
very carefully; see the section on precedence rules, which only
talks about looking at a __metaclass__ attribute in the actual
class dictionary, not the class dictionaries of any base classes.
Note that this means that general callables cannot be true
metaclasses. To create a true metaclass, one that will be inherited
by subclasses, you must arrange for the created classes to be
instances of you, and only classes can have instances. If you have a
__metaclass__ of, say, a function, it will be called only when
classes explicitly list it as their metaclass; it will not be called for
subclasses. This is going to surprise everyone except experts in Python
arcana, so don't do that even if you think you have a use for it.
(If you do want to customize only classes that explicitly specify a
__metaclass__ attribute, do this check in your __new__ function
by looking at the passed in dictionary. Then people who read the code of
your metaclass have a chance of working out what's going on.)
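Here's a sketch of that pattern, in Python 3 syntax so it runs as-is; in Python 2 the marker in the passed-in dictionary would be the __metaclass__ key itself (in Python 3 a __metaclass__ attribute in a class body is inert, which lets it double as the marker here):

```python
class OnlyExplicit(type):
    def __new__(cls, name, bases, dct):
        # Customize only classes whose own body carries the marker;
        # subclasses that merely inherit this metaclass are left alone.
        if '__metaclass__' in dct:
            dct['customized'] = True
        return super().__new__(cls, name, bases, dct)

class Explicit(metaclass=OnlyExplicit):
    __metaclass__ = OnlyExplicit   # explicit marker in the class body

class Inherited(Explicit):
    pass

print(Explicit.__dict__.get('customized'))   # True
print('customized' in Inherited.__dict__)    # False
```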
I will admit that Python 3 cleaned this up by removing the magic
__metaclass__ attribute. Now you can't be misled quite as
much by the presence of C2.__metaclass__ and the visibility of
C4.__metaclass__. To determine whether something really has a
metaclass in Python 3 you have no choice but to look at type(),
which is always honest.
2014-01-16
Link: Armin Ronacher's 'More About Unicode in Python 2 and 3'
Armin Ronacher's More About Unicode in Python 2 and 3 contains a lot of information about the subject from someone who works with this stuff and so is much better informed about it in practice than I am. A sample quote:
I will use this post to show that from the pure design of the language and standard library why Python 2 [is] the better language for dealing with text and bytes.
Since I have to maintain lots of code that deals exactly with the path between Unicode and bytes this regression from 2 to 3 has caused me lots of grief. Especially when I see slides by core Python maintainers about how I should trust them that 3.3 is better than 2.7 makes me more than angry.
I learned at least two surprising things from reading this. The first was that I hadn't previously realized that string formatting is not available for bytes in Python 3, only for Unicode strings. The second is that Mercurial has not and is not being ported to Python 3. As Ronacher notes, it turns out that these two issues are not unrelated.
For me, the lack of formatting for bytes adds another reason for not using Python 3 even for new code because it forces me into more Unicode conversion even if I know exactly what I'm doing with those unconverted bytes. Since I use Unix, with its large collection of non-Unicode byte APIs, there are times when this matters.
(For instance, it is perfectly sensible to manipulate Unix file paths as bytes without trying to convert them to Unicode. You can split them into path components, add prefixes and suffixes, and so on all without having to interpret the character sets of the file name components. In fact, in degenerate situations the file name components may be in different character sets, with a directory name in UTF-8 and file name inside a subdirectory in something else. At that point there is no way to properly decode the entire file path to meaningful Unicode. But I digress from Armin Ronacher's article.)
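Python 3's os.path does support this style of work if you stay in bytes throughout; a small illustration (the path here is a made-up example with deliberately mixed encodings):

```python
import os.path

# A Unix path whose directory component is UTF-8 but whose file name
# is Latin-1; as bytes it can be manipulated without ever decoding.
path = b"/srv/data/caf\xc3\xa9/r\xe9sum\xe9.txt"

head, tail = os.path.split(path)
print(head)                              # b'/srv/data/caf\xc3\xa9'
print(os.path.join(head, b"new-" + tail))
print(os.path.splitext(tail)[1])         # b'.txt'
```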
2014-01-06
The problem with compiling your own version of Python 3
I've previously mentioned in passing that I simply can't use Python 3 on some platforms because it's not there (or perhaps it's there only in an old and lacking version). As reported by Twirrim, in some places the popular answer to this issue is to say that I should just compile my own local version of Python 3 by hand (perhaps in a virtualenv). At this point most of the sysadmins in the audience are starting to get out of their chairs, but hold on for a moment; I want to make a general argument.
There is a spectrum of Python coding that ranges from big core systems that are very important down to casual utilities. For something that is already big and complex, the extra overhead of compiling a specific version of Python is small (you've probably already got complex installation and management procedures even if you've automated them) and can potentially increase the reliability of the result. Nor is the extra disk space for another copy of the Python interpreter likely to be a problem; even if the disk space used by your system doesn't dwarf it, your core system is important enough that the disk space doesn't matter. But all of this turns on its head for sufficiently little and casual utilities; because they're so small, building and maintaining a custom Python interpreter drastically increases the amount of effort required for them as a total system.
Somewhere on the line between big core systems and little casual utilities is an inflection point where the pragmatic costs of a custom Python interpreter exceed either or both of the benefits from the utilities and the benefits from using Python 3 instead of whatever version of Python 2 you already have. Once you hit it, 'install Python 3 to use this' ceases being a viable approach. Where exactly this inflection point is varies based on local conditions (including how you feel about various issues involved), but I argue that it always exists. So there are always going to be situations where you can't use your own version of Python 3 because the costs exceed the benefits.
(With that settled, the sysadmins can now come out of their chairs to argue about exactly where the inflection point is and point out any number of issues that push it much further up the scale of systems than one might expect.)
2014-01-03
What determines Python 2's remaining lifetime?
In light of recent things one can sensibly ask just how long people can keep using Python 2. At one level the answer is 'until there is some significant problem with Python 2 that the Python developers don't fix'. This could either be a security issue or something important that Python 2 and its core modules don't support (eg, at one point IPv6 would have been such a thing). At the moment it isn't clear how long the Python developers will be fixing things in Python 2; the most recent release was 2.7.6, made November 10th. However, it has been asserted that the Python developers will provide full support until 2015 and probably security fixes afterwards.
But this is an incomplete view of the situation because the Python developers themselves aren't the only people involved; there's also the OS distributions that package some version of Python. Many of these OS distributions actually depend on Python themselves for system tools. If the Python developers abandon fixing Python 2, the OS distributors may well have no choice but to take over. So we can ask a related question of how close OS distributions are to moving away from Python 2.
I will skip to the summary; just as with last time the news is not good for Python 2 going away any time soon. Ubuntu will likely miss their migration target for 14.04 LTS, leaving them needing Python 2 on 14.04 LTS until 2019. Fedora won't transition for at least a year (ie Fedora 22), which means that Red Hat Enterprise Linux 7 will almost certainly ship at some point in 2014 with system tools depending on Python 2, leaving Red Hat supporting Python 2 until at least 2019 and likely later.
(FreeBSD, OmniOS, and so on are less interesting here because as far as I know none of them are promising long support periods the way RHEL and Ubuntu LTS are. However I believe that all of them are still shipping Python 2 as the default and I know that OmniOS has tools that depend on it.)
So my answer is that in practice it's highly likely that Python 2 will get important updates through at least 2020, whether this is from the Python developers themselves or from Linux distributions with long term support needs forking the code and doing their own fixes. Given what Alex Gaynor reports, some of this will likely be driven by customer demand, ie people will likely be deploying Python 2 systems on RHEL 7 and Ubuntu 14.04 LTS and will want any important Python fixes for them.
2014-01-02
Python 3's core compatibility problem and decision
In light of the still ongoing issues with Python 3,
a very interesting question is what makes moving code to Python 3
so difficult. After all, Python has made transitions before, even
transitions with little or no backwards compatibility, and given
that there is very little real difference between Python 2 and 3, you would normally expect that people would have
no real problems shifting code across to Python 3. After all, things
like changing from using print to using print() are not really a
big deal.
In my view almost all of the Python 3 issues come down to one decision
(or really one aspect of that decision): making all strings Unicode
strings, with no real backwards path to working with bytes. Working
with Unicode strings instead of byte blobs is a fundamental change
to the structure of most programs. Moving many
programs to Python 3 requires changing them to be programs that
fundamentally work in Unicode and this itself can require a whole host
of changes throughout the program's data structures and interfaces, as
well as require you to explicitly consider failure points that you
didn't need to before. It is this design rethink
that is the hard part about moving code, not the mechanical parts of eg
changing print to print().
I think that this is also why there is a significant gulf between
different people's experiences of working with Python 3. Some people
have already been writing code that worked internally in Unicode,
even in Python 2. This code is easy to move to Python 3 because it
has already dealt with the major issue of that conversion; all
that's left is more or less mechanical issues like print versus
print(). Other people, with me as an example, have a great deal
of code that is really character encoding indifferent (well, more or less) and
as a result need to redesign its core for Python 3.
(I think that this also contributes to a certain amount of arrogance on the side of Python 3 boosters, in that they may feel that anyone who was 'sloppily' not working in Unicode already was doing it wrong and so is simply paying the price for their earlier bad habits. To put it one way, this is missing the real problem as usual, never mind the practical issues involved on Unix.)
Honesty requires me to admit that this is a somewhat theoretical view in that I haven't attempted to convert any of my code to Python 3. This is however the aspect of the conversion that I'm most worried about and that strikes me as what would cause me the most problems.
Sidebar: a major contributing factor
I feel that one major factor that makes the Unicode decision a bigger issue than it would otherwise be is that Python 3 doesn't just make literal strings into Unicode, it makes a lot of routines that previously returned bytestrings instead return Unicode strings. Many of these routines are naturally dealing with byte blobs and forced conversion to Unicode thus creates new encoding-related failure points. It is this decision that forces design changes on programs that would otherwise be indifferent to the issues involved because they only shove the byte blobs involved around without looking inside them.
My impression is that Python 3 has gotten somewhat better about this since its initial release in that many more things are willing to work with bytes or can be coerced to return bytes if you find the right magic options. This still leaves you to go through your code to find all of the places that this is needed (and to hope you don't miss a rarely executed code path), and sometimes revising code to account for what the new options really mean.
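One example of this 'bytes in, bytes out' convention: several os functions return bytes when given bytes arguments, so no decoding step exists to fail. A small sketch of how CPython 3 behaves on Unix:

```python
import os

# Passing a bytes path makes os.listdir return bytes names:
names = os.listdir(b".")
print(all(isinstance(n, bytes) for n in names))  # True

# With a str path you get str names instead (decoded per the
# filesystem encoding, which is where failures can creep in).
str_names = os.listdir(".")
print(all(isinstance(n, str) for n in str_names))  # True
```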
(For example, you now get bytes out of files by opening them in
"rb" mode. However this mode has potentially important behavior
differences from the Python 2 plain "r" mode; it does no newline
conversion and is buffered differently on ttys.)
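The newline-conversion difference is easy to see; a minimal Python 3 sketch using a throwaway temporary file:

```python
import os
import tempfile

# Write a file with Windows-style line endings.
fd, path = tempfile.mkstemp()
os.write(fd, b"one\r\ntwo\r\n")
os.close(fd)

with open(path, "r") as f:    # text mode: universal newline conversion
    text = f.read()
with open(path, "rb") as f:   # binary mode: bytes pass through untouched
    raw = f.read()
os.unlink(path)

print(repr(text))   # 'one\ntwo\n'  -- \r\n collapsed to \n
print(repr(raw))    # b'one\r\ntwo\r\n'
```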