2012-01-27
Why metaclasses work in Python
I've covered what you can do with metaclasses
and even, sort of, the low level details of how they work.
But I've never covered the high level view
of why metaclasses work, ie what overall Python features make them go
(partly because I am so immersed in Python arcana that much of that
stuff feels obvious to me, although I doubt it actually is).
To start with, in Python everything is an object and all objects
are an instance of something (yes, there are spots where this gets
recursive). This includes even things that you wouldn't normally think
of as objects, such as functions. Crucially, this includes classes:
classes are objects. Any time you have an object in Python, a lot
of its behavior is usually provided by whatever it is an instance of (to
avoid confusion, I'll call this the type of the object). Classes are no
exception to this; a lot of how classes behave is handled by their type,
even things like how a new object gets created when you call the class.
(For simplicity, I'm going to ignore old-style Python 1.x classes from
here onwards and assume that all classes are new-style Python 2 classes
that ultimately subclass object.)
To avoid a point of confusion: classes have ancestor ('base') classes
that they inherit from (or just object(), the root class). However,
classes are not instances of their base class; we can see why this
has to be when we note that a class can inherit from multiple base
classes. You can't be an instance of several different things at once.
So classes exist in a two-dimensional relationship; they inherit from
one or more base classes, and at the same time they are instances of
something that provides much of their 'class' behavior. The type
of classes (the thing that provides the 'class' behavior) is called
type().
(This two dimensional structure can get a bit weird.)
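To make the two dimensions concrete, here is a small sketch; the class and function names are made up for illustration. `__bases__` shows the inheritance side, while `type()` shows what the class object is an instance of.

```python
class Base1(object):
    pass

class Base2(object):
    pass

class Child(Base1, Base2):
    pass

def f():
    pass

# Inheritance dimension: the base classes that Child inherits from.
bases = Child.__bases__

# Instance dimension: what the class object itself is an instance of.
kind = type(Child)

print(bases == (Base1, Base2))    # True
print(kind is type)               # True
print(isinstance(Child, Base1))   # False: Child inherits from Base1, nothing more
print(type(f).__name__)           # function
```

Functions show up here only to illustrate that they too are instances of something.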
In some languages, the creation of classes is black magic that happens
deep in the interpreter and isn't something you can do inside the
language (even if the classes are visible as objects). Python has
instead chosen to expose the ability to create classes by hand; you
can do this by calling type() with the right arguments (and then
binding the class object to a name), just as you
create instances of normal classes by calling the class itself. As part
of creating classes yourself by hand, you can obviously manipulate
class creation; you can create a new class with whatever methods, base
classes, and so on you want.
(What's odd about type() is that despite it being a class, you can
call it with a single object to get the type of the object.)
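As a minimal sketch of creating a class by hand, the three-argument form of type() takes a name, a tuple of base classes, and a dictionary of attributes. The names `Handmade`, `describe` and `level` are invented for this example.

```python
def describe(self):
    return "I am a %s at level %d" % (type(self).__name__, self.level)

# type(name, bases, namespace) builds the class object. Binding it to a
# name is a separate step, done here with an ordinary assignment.
Handmade = type("Handmade", (object,), {"describe": describe, "level": 3})

obj = Handmade()
print(obj.describe())          # I am a Handmade at level 3

# The odd one-argument form just reports the type of an object.
print(type(obj) is Handmade)   # True
print(type(Handmade) is type)  # True
```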
Python is also an unusual language in another way; in Python, things
like defining functions and classes are themselves executable
statements. Python doesn't parse your program,
create all the functions and classes, and then start running your code;
instead it starts running your code and things like def and class
execute on the fly (as does import and so on). So it's natural to have
your code running as classes are being created.
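A small sketch of this on-the-fly execution: the body of a class statement is ordinary code that runs at the point where the statement is reached. The `log` list and the `Noisy` class are just illustrative names.

```python
log = []
log.append("before class")

class Noisy(object):
    # This line runs while the class statement is being executed.
    log.append("inside class body")
    # Ordinary control flow works in a class body too.
    if len(log) == 2:
        flavour = "made on the fly"
    else:
        flavour = "unexpected"

log.append("after class")

print(log)            # ['before class', 'inside class body', 'after class']
print(Noisy.flavour)  # made on the fly
```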
The combination of these two things means that Python can easily provide
a way to hook your own code into the process of creating the class
objects for classes that are written in straight Python, with 'class
X(object): ....'. Python is already running code in general when this
happens, and the mechanisms of creating classes by hand means it's
relatively easy for Python to hand you the bits of the class-to-be so
you can modify it and then have everything continue onwards to create a
new class. This is why metaclasses can change classes as they are being
created.
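Here is a hedged sketch of such a hook, written with Python 3's `metaclass=` keyword (in Python 2 you would set `__metaclass__ = AddGreeting` in the class body instead). The metaclass name and the attributes it adds are invented for the example; the point is only that `__new__` receives the name, bases and namespace of the class-to-be and can change them before the class exists.

```python
class AddGreeting(type):
    def __new__(meta, name, bases, namespace):
        # namespace holds the attributes gathered from the class body.
        # Add a method if the class did not define its own.
        namespace.setdefault("greet", lambda self: "hello from " + name)
        namespace["created_by"] = meta.__name__
        # Hand everything onwards to the normal class creation machinery.
        return type.__new__(meta, name, bases, namespace)

class X(object, metaclass=AddGreeting):
    pass

print(X().greet())     # hello from X
print(X.created_by)    # AddGreeting
print(type(X) is AddGreeting)  # True
```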
The other half of why metaclasses work is that Python allows classes to
be instances of something other than type(). Since classes get a lot
of their 'class' behavior through normal instance method inheritance
from type(), a class being an instance of something other than
type() lets the other thing intercept or change the normal as-a-class
behavior for that class (for example, what happens when you call the
class). This is why metaclasses can do things with a class after the
class has been created.
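As an illustration of the after-creation half, this sketch has a metaclass override `__call__`, which is what runs when you call the class to make an instance. `CountingMeta` and `instances_made` are hypothetical names; the class itself is created by calling the metaclass directly.

```python
class CountingMeta(type):
    def __init__(cls, name, bases, namespace):
        type.__init__(cls, name, bases, namespace)
        cls.instances_made = 0

    def __call__(cls, *args, **kwargs):
        # Intercept 'Counted(...)' before normal instance creation happens.
        cls.instances_made += 1
        return type.__call__(cls, *args, **kwargs)

# Calling the metaclass creates a class, just as calling type() would.
Counted = CountingMeta("Counted", (object,), {})

first = Counted()
second = Counted()
print(Counted.instances_made)      # 2
print(isinstance(first, Counted))  # True
```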
WhyMetaclassesWork written at 00:39:39
2012-01-16
Understanding isinstance() on Python classes
Suppose that you have:
class A(object):
    pass
class B(A):
    pass
As previously mentioned, the type of classes is
type, which is to say that class objects are instances of type:
>>> isinstance(A, type)
True
>>> isinstance(B, type)
True
Both A and B are clearly subclasses of object; A is a direct subclass
and B is indirectly a subclass through A. In fact every new-style Python
class is a subclass of object, since object is the root of the
class inheritance tree. However, class type is not the same as class
inheritance:
>>> issubclass(B, A)
True
>>> isinstance(B, A)
False
Although B is a subclass of A, it is not an instance of A; it is a
direct instance of type (we can see this with 'type(B)'). Now,
given that A and B are instances of type, one might expect that they
would not be instances of object since they merely inherit from it, as
B inherits from A:
>>> isinstance(A, object)
True
Well, how about that. We're wrong (well, I'm wrong, you may already
have known the correct answer). Here is why:
>>> issubclass(type, object)
True
A and B are instances of type and, like all other classes and types,
type is a subclass of object. So A and B are also instances of
object (at least in an abstract, Python level view of things), in
the same way that an instance of B would also be an instance of A.
I believe that this implies that 'isinstance(X, object)' is always
true for anything involved in the new-style Python object system. The
corollary is that this is an (almost) surefire test to see if the random
object you are dealing with is an old style class or an instance of one:
class C:
    pass
>>> issubclass(C, object)
False
>>> isinstance(C, object)
False
(This goes away in Python 3, where there are only new-style classes
and there is much rejoicing, along with people no longer having to
explicitly inherit from object for everything.)
PS: as originally noted by Peter Donis in a comment here, object is also an instance of type because
object is itself a class. type is an instance of itself in addition
to being a subclass of object. Try not to think about the recursion
too much.
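If you want to see the recursion for yourself, this short sketch collects the relevant checks in one place. All of them come out True.

```python
checks = {
    "isinstance(object, type)": isinstance(object, type),
    "isinstance(type, type)": isinstance(type, type),
    "issubclass(type, object)": issubclass(type, object),
    "isinstance(type, object)": isinstance(type, object),
    "type(type) is type": type(type) is type,
    "type(object) is type": type(object) is type,
}

for description, result in sorted(checks.items()):
    print("%-28s %s" % (description, result))
```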
(This isinstance() surprise is an easy thing to get wrong, which is
why I'm writing it down; I almost made this mistake in another entry I'm
working on.)
Sidebar: isinstance() and metaclasses
If A (or B) has a metaclass, it is an instance of the metaclass instead of a direct instance of type. In any
sane Python program, 'isinstance(A, type)' will continue to be True
because A's metaclass will itself be a subclass of type.
(I'm not even sure it's possible to create a working metaclass
that doesn't directly or indirectly subclass type, but I'm not going to bet against it.)
This implies that I was dead wrong when I said, back in ClassesAndTypes,
that 'type(type(obj))' would always be 'type' for any arbitrary
Python object, as Daniel Martin noted at the time and I never
acknowledged (my bad). In the presence of metaclasses, type(type(obj))
can be the metaclass instead of type itself. Since metaclasses can
themselves have metaclasses, there is no guarantee that any fixed
number of type() invocations will wind up at type.
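A sketch of how the chain of type() calls can stretch out. The names `MetaMeta`, `Meta` and `A` are invented; each class here is created by calling its metaclass directly so that the relationships are explicit.

```python
class MetaMeta(type):
    pass

# Meta is a metaclass (it subclasses type) whose own metaclass is MetaMeta.
Meta = MetaMeta("Meta", (type,), {})

# A is an ordinary class whose metaclass is Meta.
A = Meta("A", (object,), {})
obj = A()

print(type(obj) is A)                          # True
print(type(type(obj)) is Meta)                 # True, not type
print(type(type(type(obj))) is MetaMeta)       # True
print(type(type(type(type(obj)))) is type)     # True, finally
print(isinstance(A, type))                     # True, since Meta subclasses type
```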
ClassesAndIsinstance written at 22:32:55
2012-01-02
An example sort that needs a comparison function
In reaction to my entry on Python 3 dropping the comparison function
for sorting, some people may feel that a
sorting order that is neither simple field-based nor based on a computed
'distance' (the two cases easily handled by a key function) is
unrealistic. As it happens I can give you a great example of a sort
order that cannot be handled in any other way: software package versions
on Linux systems.
For simplicity (and because I know RPM best), I'm going to talk about
RPM-based version numbers. RPM version numbers have three components, an
epoch, a version, and a release, and ordering is based on comparing each
successive component in turn. The epoch is a simple numeric comparison
(higher epochs are more recent), but both the version and release can
have sub-components and each sub-component must be compared piecewise
using a relatively complex comparison for each piece (they can be all
digits, letters, or mixed letters and digits). Something with extra
sub-components is more recent than something without it, so version
1.6.1 is more recent than version 1.6. A full package version can look
like '1:2.4.6-4.fc16.cks.0'; '1:' denotes the epoch, the version is
'2.4.6', and the release is '4.fc16.cks.0'.
(Most RPM packages have an epoch of '0', which is
conventionally omitted when reporting package versions.)
In the presence of potential letter-based subcomponents and the complex
comparison rules, you can't compare these version numbers using simple
field-based rules, not even if you split sub-components up into tuples
and then compare a tuple-of-tuples (it's possible if all sub-components
are simple numbers). Nor can you compute some sort of single numerical
'distance' value for a particular version number, especially since
version numbers are sort of like the rational numbers in that you can
always add an essentially unlimited number of additional versions
between any two apparently adjacent versions. The only real operation
you have is a pure comparison, where you answer the question 'is X
a higher version than Y', and this comparison requires relatively
intricate code.
(Having said that, DanielMartin showed a nice way to transform things so
that a key-function based sort can be used for a comparison function
sort in comments on the earlier entry.)
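To give a feel for what a pure comparison looks like, here is a deliberately simplified sketch. It is not RPM's real algorithm; the rules (numeric pieces compare as integers, a numeric piece beats an alphabetic one, extra pieces win) are assumptions chosen to resemble the description above, and `compare_versions` is a made-up name. It also uses functools.cmp_to_key(), which the standard library provides (since Python 2.7 and 3.2) for adapting a comparison function to the key argument.

```python
import re
from functools import cmp_to_key

# Split a version string into runs of digits and runs of letters.
_SEGMENT = re.compile(r"\d+|[a-zA-Z]+")

def _cmp(a, b):
    return (a > b) - (a < b)

def compare_versions(x, y):
    """Return negative, zero or positive, like an old-style cmp function."""
    xs = _SEGMENT.findall(x)
    ys = _SEGMENT.findall(y)
    for a, b in zip(xs, ys):
        a_num, b_num = a.isdigit(), b.isdigit()
        if a_num and b_num:
            result = _cmp(int(a), int(b))
        elif a_num != b_num:
            # Assumed rule: a numeric piece is newer than an alphabetic one.
            result = 1 if a_num else -1
        else:
            result = _cmp(a, b)
        if result:
            return result
    # All shared pieces are equal; the version with extra pieces is newer.
    return _cmp(len(xs), len(ys))

versions = ["1.6.1", "1.10", "1.6", "1.6a", "1.6.1b"]
ordered = sorted(versions, key=cmp_to_key(compare_versions))
print(ordered)   # ['1.6', '1.6a', '1.6.1', '1.6.1b', '1.10']
```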
ExampleSortComparison written at 01:49:41
2011-12-30
Why I don't like Python 3 dropping the comparison function for sorting
One of the changes that Python 3 has made is that, to quote
the documentation:
builtin.sorted() and list.sort() no longer accept the cmp
argument providing a comparison function. Use the key argument
instead. [...]
I feel unreasonably annoyed about this change. At least on the surface
there's no obvious reason why; basically every comparison function I've
ever written just picks a specific field out, and that's
handled much better by the key argument. However, I've recently
figured out what irritates me about this: it couples data and behavior
too closely.
In the new world, there are three ways to create a sort ordering. If
your ordering depends on explicit fields (possibly modified), you can
use a straightforward key function. If the ordering of a data element
is strictly computable from a single element (for example, a 'distance'
metric that's easy to determine), you can use a key function which
synthetically computes an element's ordering and returns it. And if
neither of these holds and you can only really determine a relative
ordering, you can define a __lt__ method on your objects.
The problem with the last approach is that, of course, you can only have
one __lt__ method and thus only one sort ordering. What's happened
is that you've been forced to couple the raw data with the behavior of
a particular sort ordering. Getting around this requires various hacks,
such as synthetic wrapper objects with different __lt__ functions.
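A toy sketch of the wrapper-object hack: each wrapper class carries one ordering in its `__lt__`, and passing the class as the key function wraps every element for the duration of the sort. The orderings here are trivially expressible as plain key functions and are only stand-ins for a genuinely relative comparison; the class names are invented.

```python
class ByLength(object):
    def __init__(self, value):
        self.value = value

    def __lt__(self, other):
        return len(self.value) < len(other.value)

class ByLastChar(object):
    def __init__(self, value):
        self.value = value

    def __lt__(self, other):
        return self.value[-1] < other.value[-1]

words = ["pear", "fig", "banana"]

# The data stays as plain strings; only the wrapper decides the ordering.
print(sorted(words, key=ByLength))     # ['fig', 'pear', 'banana']
print(sorted(words, key=ByLastChar))   # ['banana', 'fig', 'pear']
```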
(The other problem is that your data needs to be actual objects. While
this is usually the case for anything complex enough that you can
only do a relative ordering, sometimes you're getting the data from an
outside source and it would be handy to leave it in its native form.)
While this is only a theoretical concern for me, it still irritates me
a bit that Python 3 has chosen to move towards closer, less flexible
coupling between data and ordering. I maintain that the two are separate
and we can see this in the fact that there are many possible orderings
for complex data depending on what you want to do with it.
By the way, I can see several reasons why Python 3 did this and I
sympathize with them (even if I still don't like dropping cmp). The
Python 3 documentation notes that key is more efficient since
it's called only once per object you're sorting. On top of that,
it's relatively easy to make mistakes with complex cmp functions
that create inconsistent ordering, which potentially causes sorting
algorithms to malfunction mysteriously.
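The efficiency point is easy to check with a counting sketch. The key function runs exactly once per element, while a comparison function (adapted here through functools.cmp_to_key(), since Python 3 has no cmp argument) runs once per comparison the sort makes. The data and the counter names are arbitrary.

```python
from functools import cmp_to_key

calls = {"key": 0, "cmp": 0}

def key_func(x):
    calls["key"] += 1
    return x

def cmp_func(a, b):
    calls["cmp"] += 1
    return (a > b) - (a < b)

data = [5, 3, 8, 1, 9, 2, 7]

by_key = sorted(data, key=key_func)
by_cmp = sorted(data, key=cmp_to_key(cmp_func))

print(calls["key"])   # one call per element
print(calls["cmp"])   # at least len(data) - 1 comparisons, usually more
```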
Python3SortCmpFunction written at 02:03:10
2011-12-27
Python 3 from the perspective of someone writing new Python code
I've talked about Python 3 from the perspective of a Unix sysadmin and Python 3 from the perspective of someone with
existing Python code; now it's time for the
final viewpoint, that of someone writing new code.
There are a bunch of practical difficulties with this, things like
having Python 3 installed on machines and third party modules being
ported to Python 3, but they're either gone or going away (and most of
what I write doesn't depend on third party modules). Ignoring those
issues as ultimately unimportant, I don't think there's any reason not
to write new, non-sysadmin code in Python 3. It's clearly the future
of Python and although I may grump about some decisions, there's a fair
amount to like about it. Yes it's different but much of that difference
is good.
(I've made a vaguely similar transition in Python programming before,
when I moved from 1.x to 2.x. It was a more backwards compatible change
and I felt it was less wrenching, but it had just the same sort of
generally neat new things in the new version. Today, for example, if I
write an old-style class it's by accident.)
I have to admit that this is a theoretical view right now, because I
haven't tried to write anything new in Python 3 yet. Most of what I've
written recently is sysadmin tools and those need to be in Python 2
for the foreseeable future. But the next time I come up with a Python
program to write I'm going to keep this in mind and try to write it in
Python 3 instead of Python 2, no matter what my inertia is saying.
(A good step would be to make sure that as many of our machines as
possible actually have Python 3 installed. Now that I look, some of them
don't have it installed by default, which isn't going to help Python 3's
adoption any.)
PS: the one Python 3 change that's going to be irritating me for years
is the whole Unicode-ification of everything in sight. This deserves
a longer discussion than fits within the margins of this entry and
besides, this entry is a positive one. Also, I suspect that once I start
actually using Python 3, the Unicode stuff will prove to be less of a
pain than I currently expect it to be.
Python3NewCode written at 03:41:11
2011-12-21
Python 3 from the perspective of someone with existing Python code
Last time, I talked about Python 3 from the perspective of a Unix
sysadmin. Today I want to talk about Python 3 from the
perspective of someone who has a not insignificant amount of current
Python code. I don't have huge (by Python standards) programs, but I do
have various things (not all large) currently running
live, for real, doing things that I care about.
Recently I read Armin Ronacher's Thoughts on Python 3, where
he wrote (among other things):
Because as it stands, Python 3 is the XHTML of the programming
language world. It's incompatible to what it tries to replace
but does not offer much besides being more "correct".
I'm kind of sad to say this, but what he said (down to the comparison
with XHTML).
Some of my code has a decent amount of tests but not all of it, and all
of it currently works. Migrating it to Python 3 requires a significant
amount of effort and testing, even for the code that has tests, and in
exchange I get basically nothing except a warm fuzzy feeling that I am
'modern'. It would be pure make-work. Worse, it would be make-work that
runs a good risk of destabilizing working code.
There are two aspects to the problem. The first is simply that Python 3
is a big change from Python 2. I'm willing to make small or moderate
changes purely for compatibility purposes, but I've certainly been left
with the impression that Python 3 requires some significant changes
(even if a number of them will work in Python 2.7, the issue is the
amount of changes to the current code). The second is that Python 3's
handling of strings and Unicode demands an architectural change in code
that is currently ignoring the issue and just shoving around plain byte
strings, which describes all of my current code. Part of this is just
switching to Unicode by itself, but part of it is that since conversions
to and from Unicode can fail I now need to find all of these places and
figure out what I want to do.
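A minimal sketch of the kind of failure involved, using a made-up byte string that is valid Latin-1 but not valid UTF-8. Every place that decodes needs a decision like one of these three.

```python
raw = b"caf\xe9"   # 'café' encoded as Latin-1; not valid UTF-8

# Option 1: decode strictly and handle the failure explicitly.
try:
    strict = raw.decode("utf-8")
except UnicodeDecodeError:
    strict = None

# Option 2: never fail, at the cost of replacement characters.
lenient = raw.decode("utf-8", errors="replace")

# Option 3: pick an encoding in which every byte is valid.
latin = raw.decode("latin-1")

print(strict is None)        # True
print(lenient == u"caf\ufffd")   # True
print(latin == u"caf\xe9")       # True
```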
(This also increases the risk of the changes. If I miss a place where
a conversion can fail, my code may blow up at some point in the future
with uncaught exceptions in a situation where it works today. This is
not really an attractive selling point and yes, I would rather have
mojibake than explosive
failures. Among other reasons, to a first order approximation mojibake
is caused by someone else's mistake while uncaught exceptions are clearly
my fault.)
The result is that I can't possibly justify migrating any significant
amount of my current code to Python 3 (either to myself or to others).
It will remain Python 2 code unless and until I have no choice, and if I
stop having a choice I'm going to fiercely resent it.
(This is entirely apart from any pragmatic issues such as dependencies
that haven't yet been ported to Python 3. Most of my code doesn't use
third-party modules or code anyways, just standard library stuff.)
Python3ExistingCode written at 22:53:14
2011-12-17
Python 3 from the perspective of a Unix sysadmin
I've been thinking about Python 3 for a while, mulling over things like
how I feel about it and how likely I am to use it, and I've decided that
one reason my feelings are complex is that I have three different views
of it, from three different perspectives. Today is the day for the first
perspective: Python 3 from the perspective of a Unix sysadmin who uses
Python to program important parts of our systems.
I don't have any way to put this nicely, so I'll say it right up front:
for a Unix sysadmin, Python 3 is currently highly radioactive and should
be completely avoided. Our current systems are written in Python 2;
there is no prospect of this changing and I am going to keep writing
sysadmin things in Python 2 for the indefinite future. I will stop this
only when the systems we use stop packaging Python 2, and I certainly
hope that that doesn't happen for, oh, a decade or more.
The fundamental problem is that Python 3 wants the operating system
environment to be Unicode, and Unix is not. When Python 3 comes into
contact with messy reality, bad things happen
and things fail. These failures are vaguely tolerable for ordinary
user programs; they are intolerable for programs used for system
management. I cannot afford to write programs that silently omit names
from os.listdir()'s results, that don't see some environment variables
sometimes, or that die with mysterious error messages if given the wrong
arguments. There are workarounds for some of these issues (but none yet
for the sys.argv issue), but they are limited
in scope and unlikely to be pervasive (in, eg, third party modules that
I want to use).
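One workaround that does exist in the standard library is the bytes form of os.listdir(): given a bytes path, it returns the names as bytes with no decoding step that could go wrong. This sketch assumes a reasonably recent Python 3 (os.fsencode() and tempfile.TemporaryDirectory() both arrived in 3.2) and uses a scratch directory with a harmless file name.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as dirname:
    # Create one ordinary file so there is something to list.
    with open(os.path.join(dirname, "plain.txt"), "w"):
        pass

    # A bytes path in means bytes names out; nothing is decoded.
    byte_names = os.listdir(os.fsencode(dirname))
    # A str path in means str names out.
    text_names = os.listdir(dirname)

print(byte_names)   # [b'plain.txt']
print(text_names)   # ['plain.txt']
```

This only helps in code that you control, which is exactly the limitation mentioned above.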
So long as Python 3 is busy denying Unix reality (and causing all sorts
of complications as a result of this), the sysadmin side of me can't and
isn't going to touch it. I doubt that the Python 3 developers care about
this and I doubt that anything is going to change in Python 3, which is
kind of a pity.
(I could probably write system tools in Python 3 if I wanted to and
tried hard enough and had to, but I don't see any reason to do so
given that Python 2 is there and going to be there for a long time to
come. Python 2 works, it works without huge contortions, and I don't
really see anything compelling in Python 3 so far.)
Sidebar: on the long term availability of Python 2
At this point in time I see essentially no prospect of Python 2 being
removed from Linux distributions in the next five years (minimum). The
very first step along the long path of removing Python 2 would be for
distributions to migrate Python based system tools from Python 2 to
Python 3, and that hasn't even started yet (distributions are just now
starting to talk about maybe moving some of their Python-based tools
to Python 3 for their next release).
The chances of Python 2 disappearing any time soon from more conservative
and slow moving Unixes like FreeBSD and Solaris (and Mac OS X) are best
described as 'laughable'.
Python3Sysadmin written at 02:59:31
2011-12-13
DWiki's code is now on Github (among other things)
As a followup to my first experiment with coding in public, I've put a few other Python projects up on Github.
They are:
I've made an index page for all of my Github things that I intend to keep up to date, or you can
of course just look at things on Github.
DWikiGithub written at 12:01:45
2011-11-25
Python instance dictionaries, attribute names, and memory use
In a comment on my entry on what __slots__ are good for, Max wrote:
On the other hand, having __slots__ saves the strings that the
instance dictionary entries would point to for the attribute names. On
a 4 byte string platform, that adds up quickly too.
Although one might naturally think that this is the case, CPython is
actually sufficiently clever that it is not so; using __slots__ doesn't
save you any memory for attribute names because the string values of
attribute names are already only stored once. However understanding
how and why requires a reasonable amount of knowledge about CPython
internals.
(Or you have to know to look at the documentation for the intern()
function, which
casually mentions this in passing.)
Like many similar languages, Python has string interning and the CPython
internals make liberal use of interned strings for any code-related
string that might look like it's going to be repeated. Attribute names
are one such example of this; starting right in the code itself, all attribute names are fully interned. So you always
have the same set of interned strings for attribute names regardless of
how the attributes are stored and regardless of how many instances of
the class you have.
(This is quite similar to part of the concept of 'symbols' in languages
like Lisp and Ruby, although both of those expose symbols directly to
user-level code.)
More specifically, all names used directly as attributes are interned.
There are a number of ways where you can use real strings as attribute
names and these will not be interned. The most prominent example is
actually __slots__ itself, although things get confusing here. Consider:
class A(object):
    __slots__ = ('attrone', 'attrtwo')
    def __init__(self):
        self.attrone = 10
    def report(self):
        return self.attrone
The two string literals in __slots__ are not interned. However, the same
string value ('attrone') is interned in __init__ and report(). If
you have lots of code that all refers to '<something>.attrone', all of
it will do all attribute lookups using the same interned string value.
(Note that attribute names are interned globally, not on a
per-class basis or the like. The 'attrone' in the attribute name
module1.cls1.attrone is the same interned string value as in
module2.cls2.attrone.)
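You can observe the global sharing from Python itself by comparing the key objects in two unrelated instance dictionaries. This is a sketch that relies on CPython behaviour (other implementations may differ), and the class names are made up.

```python
class Cls1(object):
    def __init__(self):
        self.attrone = 10

class Cls2(object):
    def __init__(self):
        self.attrone = "something else"

def first_attribute_name(obj):
    # The keys of an instance dictionary are the attribute name strings.
    return next(iter(obj.__dict__))

name1 = first_attribute_name(Cls1())
name2 = first_attribute_name(Cls2())

# 'is' tests object identity, not just equal values.
shared = name1 is name2
print(name1, shared)
```

On CPython, `shared` comes out True: both classes use one and the same string object for the name.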
An even more complicated example can be had with 'setattr(obj,
"astring", value)'. If you write this twice in two different functions,
the "astring" literals are not interned (and thus are different
strings). However, 'astring' as the attribute name in obj.astring
is interned (this is done in setattr()). If you call one function
with one object and the other function with another object, the
attribute name is still a common interned string.
(In theory direct manipulation of obj.__dict__ might allow you to
create a non-interned attribute name on an instance, although actual
code that accesses it as obj.attr would use an interned version.)
If you are testing this, note that all single-character strings
are interned for you; you need to use
multi-character attribute names to avoid false positives.
(This is undoubtedly far more about this issue than most people want to
know. I'm peculiar that way; I can't resist peeking under the hood.)
Sidebar: interned versus non-interned versions of a string value
In some languages, once you intern a string value all future occurrences
of that string value, anywhere, are automatically converted to the
interned version. CPython doesn't work this way; instead, something has
to explicitly convert a string value into an interned version of it and
otherwise string values are left alone. It's thus entirely possible,
even easy, to have an interned version of a string value as well as one
or more non-interned versions of it.
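A sketch of the explicit conversion, again assuming CPython. In Python 3 the function is sys.intern() (Python 2 had it as the builtin intern()). The string is assembled at runtime so that nothing interns it automatically; `Holder` is an invented class that exists only to get hold of an interned attribute name.

```python
import sys

class Holder(object):
    def __init__(self):
        self.attrone = 1

# The attribute name as stored in the instance dictionary (interned).
attr_name = next(iter(Holder().__dict__))

# A string with the same value, built at runtime.
built = "".join(["attr", "one"])
equal_value = (built == attr_name)

# Explicit interning hands back the canonical interned object.
interned = sys.intern(built)
same_after_intern = interned is attr_name

print(equal_value, same_after_intern)
```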
InstanceStringUsage written at 00:11:49
2011-11-21
A cheap caching trick with a preforking server in Python
When the load here climbs, DWiki (the software behind this blog)
transmogrifies itself into an SCGI based
preforking server. I'm always looking
for cheap ways to speed DWiki up for Slashdot style load surges (however unlikely it is that I'll ever
need such tuning), and it recently occurred to me that there was an
obvious way to exploit a preforking server: cache rendered pages in
memory in each preforked process. Well, not even rendered pages; the
simplest way to implement this is to cache your response objects.
(DWiki already has various layers of caching, but its page cache is disk based. A
separate cache has various advantages (such as cache sharing between
preforked instances) and a disk based cache means that you don't have to
worry about memory exhaustion, only disk space, but both aspects slow
the cache down.)
A simple brute force in-memory cache like this has a number of
attractions. Caching ready to use response objects (combined with
simple time-based invalidation) means that this cache is about as fast
as your application will ever go. It's quite simple to add to your
application, especially if your application already has the concept of a
flexible processing pipeline; you can just add a request-stealing step
early on, and cache the response objects that you're already bubbling up
through the pipeline. Assuming that you're having processes exit after
handling some moderate number of requests, using a per-process cache
creates a natural limit on any inadvertent cache leaks, memory usage,
and cache expiry and invalidation issues; after not too long the entire
process goes away, caches and all.
(You can also size the cache quite low; you might make it one tenth or
one fifth the number of requests that a single process will serve before
exiting. A large cache is obviously relatively pointless; as the cache
size rises, the number of cache hits that the 'tail' of the cache can
ever have drops.)
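For illustration, here is a minimal sketch of such a per-process cache with time-based invalidation and a small size limit. It is not DWiki's actual code; the class name, the parameters and the injectable clock (there so the expiry logic can be exercised without sleeping) are all assumptions.

```python
import time
from collections import OrderedDict

class ResponseCache(object):
    """A small per-process cache of finished response objects."""

    def __init__(self, max_entries=50, ttl=30.0, clock=time.time):
        self._entries = OrderedDict()   # key -> (stored_at, response)
        self._max_entries = max_entries
        self._ttl = ttl
        self._clock = clock

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, response = entry
        if self._clock() - stored_at > self._ttl:
            # Simple time-based invalidation.
            del self._entries[key]
            return None
        return response

    def put(self, key, response):
        self._entries.pop(key, None)
        self._entries[key] = (self._clock(), response)
        # Evict the oldest entries once the cache is over its size limit.
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)

# Usage with a fake clock so that time can be advanced by hand.
now = [0.0]
cache = ResponseCache(max_entries=2, ttl=10.0, clock=lambda: now[0])
cache.put("/a", "response A")
cache.put("/b", "response B")
cache.put("/c", "response C")     # pushes out "/a"
print(cache.get("/a"))            # None
print(cache.get("/c"))            # response C
now[0] = 11.0
print(cache.get("/c"))            # None, it has expired
```

Because the whole process exits after a moderate number of requests, nothing here bothers with background cleanup.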
Adding such an in-memory cache to the preforking version of DWiki
did expose one assumption that I was making. For this cache to work,
response objects have to be immutable after they are finished being
generated. It turned out that DWiki's code for conditional GET cheated
by directly mutating response objects; when I added response object
caching this resulted in a very odd series of HTTP responses that were
half conditional GET replies and half regular replies. I had a certain
amount of head-scratching confusion until I worked out what was going
on and why, for example, I was seeing 304 responses with large response
bodies.
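The fix amounts to building a new response for the conditional GET case instead of changing the cached one. This is a hedged sketch with a hypothetical `Response` class and `conditional_get()` function, not DWiki's real code.

```python
class Response(object):
    def __init__(self, status, body, etag):
        self.status = status
        self.body = body
        self.etag = etag

def conditional_get(response, if_none_match):
    """Return a 304 reply when the client's ETag matches, else the response."""
    if if_none_match is not None and if_none_match == response.etag:
        # Build a fresh object; the cached response is left untouched.
        return Response(304, "", response.etag)
    return response

cached = Response(200, "a large page body", "v1")

reply = conditional_get(cached, "v1")
print(reply.status, repr(reply.body))   # 304 ''
print(cached.status, cached.body)       # 200 a large page body
```

Mutating `cached.status` in place instead would produce exactly the sort of 304-with-a-body oddity described above on the next cache hit.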
PreforkingCacheTrick written at 23:56:27
These are my WanderingThoughts
This is part of CSpace, and is written by ChrisSiebenmann.