Wandering Thoughts: Recent Entries

2012-01-27

Why metaclasses work in Python

I've covered what you can do with metaclasses (1, 2, 3, 4) and even, sort of, the low level details of how they work (1, 2, 3). But I've never covered the high level view of why metaclasses work, ie what overall Python features make them go (partly because I am so immersed in Python arcana that much of that stuff feels obvious to me, although I doubt it actually is).

To start with, in Python everything is an object and all objects are an instance of something (yes, there are spots where this gets recursive). This includes even things that you wouldn't normally think of as objects, such as functions. Crucially, this includes classes: classes are objects. Any time you have an object in Python, a lot of its behavior is usually provided by whatever it is an instance of (to avoid confusion, I'll call this the type of the object). Classes are no exception to this; a lot of how classes behave is handled by their type, even things like how a new object gets created when you call the class.

(For simplicity, I'm going to ignore old-style Python 1.x classes from here onwards and assume that all classes are new-style Python 2 classes that ultimately subclass object.)

To avoid a point of confusion: classes have ancestor ('base') classes that they inherit from (or just object(), the root class). However, classes are not instances of their base class; we can see why this has to be when we note that a class can inherit from multiple base classes. You can't be an instance of several different things at once. So classes exist in a two-dimensional relationship; they inherit from one or more base classes, and at the same time they are instances of something that provides much of their 'class' behavior. The type of classes (the thing that provides the 'class' behavior) is called type().

(This two dimensional structure can get a bit weird.)

In some languages, the creation of classes is black magic that happens deep in the interpreter and isn't something you can do inside the language (even if the classes are visible as objects). Python has instead chosen to expose the ability to create classes by hand; you can do this by calling type() with the right arguments (and then binding the class object to a name), just as you create instances of normal classes by calling the class itself. As part of creating classes yourself by hand, you can obviously manipulate class creation; you can create a new class with whatever methods, base classes, and so on you want.
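As a minimal sketch of creating a class by hand: the three-argument form of type() takes a class name, a tuple of base classes, and a dictionary of attributes. The names Greeter, greet and prefix are made up for illustration.

```python
def greet(self):
    # An ordinary function; it becomes a method once it is in the class dict.
    return self.prefix + " world"

# Roughly equivalent to writing:
#   class Greeter(object):
#       prefix = "hello"
#       def greet(self): ...
Greeter = type("Greeter", (object,), {"prefix": "hello", "greet": greet})

g = Greeter()
print(g.greet())            # hello world
print(type(Greeter) is type)  # True: the new class is an instance of type
print(type(g) is Greeter)     # True: one-argument type() reports an object's type
```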

(What's odd about type() is that despite it being a class, you can call it with a single object to get the type of the object.)

Python is also an unusual language in another way; in Python, things like defining functions and classes are themselves executable statements. Python doesn't parse your program, create all the functions and classes, and then start running your code; instead it starts running your code and things like def and class execute on the fly (as does import and so on). So it's natural to have your code running as classes are being created.
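A small sketch of what 'executable statements' means here (the names events, Example and helper are invented for the example): the class body runs as ordinary code at the moment the class statement is reached, and def can sit inside normal control flow.

```python
events = []

events.append("before class")

class Example(object):
    # The class body is ordinary code that runs when the class statement runs.
    events.append("inside class body")

    def method(self):
        return "called"

events.append("after class")

# def (and class) can also appear inside ordinary control flow.
if len(events) == 3:
    def helper():
        return "defined conditionally"

print(events)
```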

The combination of these two things means that Python can easily provide a way to hook your own code into the process of creating the class objects for classes that are written in straight Python, with 'class X(object): ....'. Python is already running code in general when this happens, and the mechanism for creating classes by hand means it's relatively easy for Python to hand you the bits of the class-to-be so you can modify them and then have everything continue onwards to create a new class. This is why metaclasses can change classes as they are being created.
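Here is a hedged sketch of that hook. Tagging, Base and created_by are hypothetical names. To sidestep the different metaclass syntax of Python 2 (__metaclass__) and Python 3 (metaclass=), the base class is made by calling the metaclass directly; subclasses written with a plain class statement then inherit the metaclass.

```python
class Tagging(type):
    """Hypothetical metaclass that adds an attribute during class creation."""

    def __new__(mcls, name, bases, namespace):
        # We are handed the pieces of the class-to-be and may change them
        # before the class object is actually created.
        namespace = dict(namespace)
        namespace["created_by"] = mcls.__name__
        return super(Tagging, mcls).__new__(mcls, name, bases, namespace)

# Calling the metaclass by hand works the same way as calling type().
Base = Tagging("Base", (object,), {})

class X(Base):      # an ordinary class statement; Tagging.__new__ runs here
    value = 1

print(X.created_by)        # Tagging
print(type(X) is Tagging)  # True
```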

The other half of why metaclasses work is that Python allows classes to be instances of something other than type(). Since classes get a lot of their 'class' behavior through normal instance method inheritance from type(), a class being an instance of something other than type() lets the other thing intercept or change the normal as-a-class behavior for that class (for example, what happens when you call the class). This is why metaclasses can do things with a class after the class has been created.
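As an illustration of intercepting as-a-class behavior after creation, this sketch overrides what happens when the class is called. CountingMeta, Widget and instances_made are invented names, and this is only one of many things a metaclass __call__ could do.

```python
class CountingMeta(type):
    """Hypothetical metaclass that counts how often its classes are called."""

    def __init__(cls, name, bases, namespace):
        super(CountingMeta, cls).__init__(name, bases, namespace)
        cls.instances_made = 0

    def __call__(cls, *args, **kwargs):
        # Calling a class looks up __call__ on the class's type, so this
        # runs in place of type.__call__ whenever Widget(...) is evaluated.
        cls.instances_made += 1
        return super(CountingMeta, cls).__call__(*args, **kwargs)

# Widget is an instance of CountingMeta rather than a direct instance of type.
Widget = CountingMeta("Widget", (object,), {})

Widget()
Widget()
print(Widget.instances_made)   # 2
```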

WhyMetaclassesWork written at 00:39:39

2012-01-16

Understanding isinstance() on Python classes

Suppose that you have:

class A(object):
  pass

class B(A):
  pass

As previously mentioned, the type of classes is type, which is to say that class objects are instances of type:

>>> isinstance(A, type)
True
>>> isinstance(B, type)
True

Both A and B are clearly subclasses of object; A is a direct subclass and B is indirectly a subclass through A. In fact every new-style Python class is a subclass of object, since object is the root of the class inheritance tree. However, class type is not the same as class inheritance:

>>> issubclass(B, A)
True
>>> isinstance(B, A)
False

Although B is a subclass of A, it is not an instance of A; it is a direct instance of type (we can see this with 'type(B)'). Now, given that A and B are instances of type, one might expect that they would not be instances of object since they merely inherit from it, as B inherits from A:

>>> isinstance(A, object)
True

Well, how about that. We're wrong (well, I'm wrong, you may already have known the correct answer). Here is why:

>>> issubclass(type, object)
True

A and B are instances of type and, like all other classes and types, type is a subclass of object. So A and B are also instances of object (at least in an abstract, Python level view of things), in the same way that an instance of B would also be an instance of A.

I believe that this implies that 'isinstance(X, object)' is always true for anything involved in the new-style Python object system. The corollary is that this is an (almost) surefire test to see if the random object you are dealing with is an old style class or an instance of one:

class C:
  pass

>>> issubclass(C, object)
False
>>> isinstance(C, object)
False

(This goes away in Python 3, where there are only new-style classes and there is much rejoicing, along with people no longer having to explicitly inherit from object for everything.)

PS: as originally noted by Peter Donis on a comment here, object is also an instance of type because object is itself a class. type is an instance of itself in addition to being a subclass of object. Try not to think about the recursion too much.
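The recursive relationships can be checked directly. This is just a sketch that collects the results in a dictionary (checks is a made-up name); every entry should come out True.

```python
class A(object):
    pass

checks = {
    "object is an instance of type": isinstance(object, type),
    "type is an instance of type": isinstance(type, type),
    "type is a subclass of object": issubclass(type, object),
    "type is an instance of object": isinstance(type, object),
    "an instance of A is an instance of object": isinstance(A(), object),
}

for description in sorted(checks):
    print("%s: %s" % (description, checks[description]))
```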

(This isinstance() surprise is an easy thing to get wrong, which is why I'm writing it down; I almost made this mistake in another entry I'm working on.)

Sidebar: isinstance() and metaclasses

If A (or B) has a metaclass, it is an instance of the metaclass instead of a direct instance of type. In any sane Python program, 'isinstance(A, type)' will continue to be True because A's metaclass will itself be a subclass of type.

(I'm not even sure it's possible to create a working metaclass class that doesn't directly or indirectly subclass type (cf), but I'm not going to bet against it.)

This implies that I was dead wrong when I said, back in ClassesAndTypes, that 'type(type(obj))' would always be 'type' for any arbitrary Python object, as Daniel Martin noted at the time and I never acknowledged (my bad). In the presence of metaclasses, type(type(obj)) can be the metaclass instead of type itself. Since metaclasses can themselves have metaclasses, there is no guarantee that any fixed number of type() invocations will wind up at type.
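A short sketch of such a chain, built by calling the metaclasses directly so that it does not depend on Python 2 or Python 3 metaclass syntax. MetaMeta, Meta and A are hypothetical names.

```python
class MetaMeta(type):
    pass

# Meta is a metaclass whose own type is MetaMeta rather than type.
Meta = MetaMeta("Meta", (type,), {})

# A is a class whose metaclass is Meta.
A = Meta("A", (object,), {})
obj = A()

print(type(obj) is A)                      # True
print(type(type(obj)) is Meta)             # True: the metaclass, not type
print(type(type(type(obj))) is MetaMeta)   # True: still not type
print(isinstance(A, type))                 # True: Meta subclasses type
```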

ClassesAndIsinstance written at 22:32:55

2012-01-02

An example sort that needs a comparison function

In reaction to my entry on Python 3 dropping the comparison function for sorting, some people may feel that a sorting order that is neither simple field-based nor based on a computed 'distance' (the two cases easily handled by a key function) is unrealistic. As it happens I can give you a great example of a sort order that cannot be handled in any other way: software package versions on Linux systems.

For simplicity (and because I know RPM best), I'm going to talk about RPM-based version numbers. RPM version numbers have three components, an epoch, a version, and a release, and ordering is based on comparing each successive component in turn. The epoch is a simple numeric comparison (higher epochs are more recent), but both the version and release can have sub-components and each sub-component must be compared piecewise using a relatively complex comparison for each piece (they can be all digits, letters, or mixed letters and digits). Something with extra sub-components is more recent than something without it, so version 1.6.1 is more recent than version 1.6. A full package version can look like '1:2.4.6-4.fc16.cks.0'; '1:' denotes the epoch, the version is '2.4.6', and the release is '4.fc16.cks.0'.

(Most RPM packages have an epoch of '0', which is conventionally omitted when reporting package versions.)

In the presence of potential letter-based subcomponents and the complex comparison rules, you can't compare these version numbers using simple field-based rules, not even if you split sub-components up into tuples and then compare a tuple-of-tuples (it's possible if all sub-components are simple numbers). Nor can you compute some sort of single numerical 'distance' value for a particular version number, especially since version numbers are sort of like the rational numbers in that you can always add an essentially unlimited number of additional versions between any two apparently adjacent versions. The only real operation you have is a pure comparison, where you answer the question 'is X a higher version than Y', and this comparison requires relatively intricate code.

(Having said that, DanielMartin showed a nice way to transform things so that a key-function based sort can be used for a comparison function sort in comments on the earlier entry.)
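To make the shape of the problem concrete, here is a deliberately simplified comparison function. It is not RPM's real algorithm; the piece-splitting regular expression, the 'digits beat letters' rule, and all function names are assumptions made for illustration. It is plugged into sorting through functools.cmp_to_key, which is in the standard library as of Python 2.7 and 3.2.

```python
import re
from functools import cmp_to_key

def _cmp(a, b):
    return (a > b) - (a < b)

def compare_part(a, b):
    """Compare two version or release strings piece by piece."""
    pieces_a = re.findall(r"[0-9]+|[A-Za-z]+", a)
    pieces_b = re.findall(r"[0-9]+|[A-Za-z]+", b)
    for pa, pb in zip(pieces_a, pieces_b):
        if pa.isdigit() and pb.isdigit():
            result = _cmp(int(pa), int(pb))
        elif pa.isdigit() != pb.isdigit():
            # Assumed rule: a numeric piece is newer than an alphabetic one.
            result = 1 if pa.isdigit() else -1
        else:
            result = _cmp(pa, pb)
        if result:
            return result
    # All shared pieces matched; extra sub-components make a version newer.
    return _cmp(len(pieces_a), len(pieces_b))

def split_evr(full):
    """Split 'epoch:version-release'; a missing epoch counts as 0."""
    epoch, _, rest = full.rpartition(":")
    version, _, release = rest.partition("-")
    return int(epoch or 0), version, release

def compare_versions(x, y):
    ex, vx, rx = split_evr(x)
    ey, vy, ry = split_evr(y)
    return _cmp(ex, ey) or compare_part(vx, vy) or compare_part(rx, ry)

versions = ["2.4.6-4.fc16", "1:1.0-1", "2.4.6-4.fc16.cks.0", "2.4-9", "2.4.10-1"]
ordered = sorted(versions, key=cmp_to_key(compare_versions))
print(ordered)
```

With these sample strings the result should run from '2.4-9' up through '2.4.10-1', with the epoch 1 package last even though its version is only 1.0.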

ExampleSortComparison written at 01:49:41

2011-12-30

Why I don't like Python 3 dropping the comparison function for sorting

One of the changes that Python 3 has made is that, to quote the documentation:

builtin.sorted() and list.sort() no longer accept the cmp argument providing a comparison function. Use the key argument instead. [...]

I feel unreasonably annoyed about this change. At least on the surface there's no obvious reason why; basically all of the uses of a comparison function I've ever used are to pick a specific field out, and that's handled much better by the key argument. However, I've recently figured out what irritates me about this: it couples data and behavior too closely.

In the new world, there are three ways to create a sort ordering. If your ordering depends on explicit fields (possibly modified), you can use a straightforward key function. If the ordering of a data element is strictly computable from a single element (for example, a 'distance' metric that's easy to determine), you can use a key function which synthetically computes an element's ordering and returns it. And if neither of these holds and you can only really determine a relative ordering, you can define a __lt__ method on your objects.

The problem with the last approach is that, of course, you can only have one __lt__ method and thus only one sort ordering. What's happened is that you've been forced to couple the raw data with the behavior of a particular sort ordering. Getting around this requires various hacks, such as synthetic wrapper objects with different __lt__ functions.
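A sketch of the wrapper-object hack: a factory that builds a throwaway class whose __lt__ defers to whatever comparison function you hand it, so the same data can be sorted in several orders. The names ordering, by_length and by_last_char are invented, and the two orderings are trivial ones chosen only to keep the example short.

```python
def ordering(compare):
    """Build a wrapper class whose __lt__ is defined by compare(a, b) < 0."""
    class Wrapper(object):
        __slots__ = ("value",)

        def __init__(self, value):
            self.value = value

        def __lt__(self, other):
            return compare(self.value, other.value) < 0

    return Wrapper

def by_length(a, b):
    return len(a) - len(b)

def by_last_char(a, b):
    return (a[-1] > b[-1]) - (a[-1] < b[-1])

words = ["pear", "fig", "banana", "kiwi"]

# The wrapper class itself serves as the key function.
print(sorted(words, key=ordering(by_length)))
print(sorted(words, key=ordering(by_last_char)))
```

This is essentially the job that functools.cmp_to_key does for you in Python 2.7 and 3.2 onwards.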

(The other problem is that your data needs to be actual objects. While this is usually the case for anything complex enough that you can only do a relative ordering, sometimes you're getting the data from an outside source and it would be handy to leave it in its native form.)

While this is only a theoretical concern for me, it still irritates me a bit that Python 3 has chosen to move towards closer, less flexible coupling between data and ordering. I maintain that the two are separate and we can see this in the fact that there are many possible orderings for complex data depending on what you want to do with it.

By the way, I can see several reasons why Python 3 did this and I sympathize with them (even if I still don't like dropping cmp). The Python 3 documentation notes that key is more efficient since it's called only once per object you're sorting. On top of that, it's relatively easy to make mistakes with complex cmp functions that create inconsistent ordering, which potentially causes sorting algorithms to malfunction mysteriously.
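The efficiency difference is easy to observe by counting calls. This is only a sketch with made-up names; the exact number of comparison calls depends on the input order and the sort implementation, so it is printed but not stated here.

```python
from functools import cmp_to_key

data = [5, 3, 8, 1, 9, 2, 7]
key_calls = []
cmp_calls = []

def key_func(x):
    key_calls.append(x)
    return x

def cmp_func(a, b):
    cmp_calls.append((a, b))
    return (a > b) - (a < b)

sorted(data, key=key_func)
sorted(data, key=cmp_to_key(cmp_func))

print(len(key_calls))   # 7: exactly one call per element
print(len(cmp_calls))   # one call per comparison the sort had to make
```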

Python3SortCmpFunction written at 02:03:10

2011-12-27

Python 3 from the perspective of someone writing new Python code

I've talked about Python 3 from the perspective of a Unix sysadmin and Python 3 from the perspective of someone with existing Python code; now it's time for the final viewpoint, that of someone writing new code.

There are a bunch of practical difficulties with this, things like having Python 3 installed on machines and third party modules being ported to Python 3, but they're either gone or going away (and most of what I write doesn't depend on third party modules). Ignoring those issues as ultimately unimportant, I don't think there's any reason not to write new, non-sysadmin code in Python 3. It's clearly the future of Python and although I may grump about some decisions, there's a fair amount to like about it. Yes it's different but much of that difference is good.

(I've made a vaguely similar transition in Python programming before, when I moved from 1.x to 2.x. It was a more backwards compatible change and I felt it was less wrenching, but it had just the same sort of generally neat new things in the new version. Today, for example, if I write an old-style class it's by accident.)

I have to admit that this is a theoretical view right now, because I haven't tried to write anything new in Python 3 yet. Most of what I've written recently is sysadmin tools and those need to be in Python 2 for the foreseeable future. But the next time I come up with a Python program to write I'm going to keep this in mind and try to write it in Python 3 instead of Python 2, no matter what my inertia is saying.

(A good step would be to make sure that as many of our machines as possible actually have Python 3 installed. Now that I look, some of them don't have it installed by default, which isn't going to help Python 3's adoption any.)

PS: the one Python 3 change that's going to be irritating me for years is the whole Unicode-ification of everything in sight. This deserves a longer discussion than fits within the margins of this entry and besides, this entry is a positive one. Also, I suspect that once I start actually using Python 3, the Unicode stuff will prove to be less of a pain than I currently expect it to be.

Python3NewCode written at 03:41:11

2011-12-21

Python 3 from the perspective of someone with existing Python code

Last time, I talked about Python 3 from the perspective of a Unix sysadmin. Today I want to talk about Python 3 from the perspective of someone who has a not insignificant amount of current Python code. I don't have huge (by Python standards) programs, but I do have various things (not all large) currently running live, for real, doing things that I care about.

Recently I read Armin Ronacher's Thoughts on Python 3, where he wrote (among other things):

Because as it stands, Python 3 is the XHTML of the programming language world. It's incompatible to what it tries to replace but does not offer much besides being more "correct".

I'm kind of sad to say this, but what he said (down to the comparison with XHTML).

Some of my code has a decent amount of tests but not all of it, and all of it currently works. Migrating it to Python 3 requires a significant amount of effort and testing, even for the code that has tests, and in exchange I get basically nothing except a warm fuzzy feeling that I am 'modern'. It would be pure make-work. Worse, it would be make-work that runs a good risk of destabilizing working code.

There are two aspects to the problem. The first is simply that Python 3 is a big change from Python 2. I'm willing to make small or moderate changes purely for compatibility purposes, but I've certainly been left with the impression that Python 3 requires some significant changes (even if a number of them will work in Python 2.7, the issue is the amount of changes to the current code). The second is that Python 3's handling of strings and Unicode demands an architectural change in code that is currently ignoring the issue and just shoving around plain byte strings, which describes all of my current code. Part of this is just switching to Unicode by itself, but part of it is that since conversions to and from Unicode can fail I now need to find all of these places and figure out what I want to do.

(This also increases the risk of the changes. If I miss a place where a conversion can fail, my code may blow up at some point in the future with uncaught exceptions in a situation where it works today. This is not really an attractive selling point and yes, I would rather have mojibake than explosive failures. Among other reasons, to a first order approximation mojibake is caused by someone else's mistake while uncaught exceptions are clearly my fault.)
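A tiny sketch of the two outcomes, using a made-up byte string that is valid Latin-1 but not valid UTF-8. Strict decoding raises an exception that has to be caught everywhere it can happen; decoding with errors="replace" never fails but substitutes U+FFFD for the bytes it can't interpret.

```python
# The word 'cafe' with an acute accent on the e, encoded as Latin-1.
raw = b"caf\xe9"

try:
    raw.decode("utf-8")
    outcome = "decoded"
except UnicodeDecodeError:
    # The explosive failure: an exception, unless every such place is guarded.
    outcome = "UnicodeDecodeError"

# The lenient alternative: never raises, but loses the original character.
lenient = raw.decode("utf-8", errors="replace")

print(outcome)
print(repr(lenient))
```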

The result is that I can't possibly justify migrating any significant amount of my current code to Python 3 (either to myself or to others). It will remain Python 2 code unless and until I have no choice, and if I stop having a choice I'm going to fiercely resent it.

(This is entirely apart from any pragmatic issues such as dependencies that haven't yet been ported to Python 3. Most of my code doesn't use third-party modules or code anyways, just standard library stuff.)

Python3ExistingCode written at 22:53:14

2011-12-17

Python 3 from the perspective of a Unix sysadmin

I've been thinking about Python 3 for a while, mulling over things like how I feel about it and how likely I am to use it, and I've decided that one reason my feelings are complex is that I have three different views of it, from three different perspectives. Today is the day for the first perspective: Python 3 from the perspective of a Unix sysadmin who uses Python to program important parts of our systems.

I don't have any way to put this nicely, so I'll say it right up front: for a Unix sysadmin, Python 3 is currently highly radioactive and should be completely avoided. Our current systems are written in Python 2; there is no prospect of this changing and I am going to keep writing sysadmin things in Python 2 for the indefinite future. I will stop this only when the systems we use stop packaging Python 2, and I certainly hope that that doesn't happen for, oh, a decade or more.

The fundamental problem is that Python 3 wants the operating system environment to be Unicode, and Unix is not. When Python 3 comes into contact with messy reality, bad things happen and things fail. These failures are vaguely tolerable for ordinary user programs; they are intolerable for programs used for system management. I cannot afford to write programs that silently omit names from os.listdir()'s results, that don't see some environment variables sometimes, or that die with mysterious error messages if given the wrong arguments. There are workarounds for some of these issues (but none yet for the sys.argv issue), but they are limited in scope and unlikely to be pervasive (in, eg, third party modules that I want to use).
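As one illustration of the sort of workaround that exists (this is an assumption about Python 3 behavior that the entry does not spell out, and it relies on os.fsencode(), which I believe appeared in Python 3.2): giving os.listdir() a bytes path makes it return file names as raw bytes, with no decoding step involved. The temporary directory and file name are made up for the example.

```python
import os
import shutil
import tempfile

directory = tempfile.mkdtemp()
try:
    with open(os.path.join(directory, "example.txt"), "w") as f:
        f.write("data")

    # A bytes path asks for names back as bytes rather than decoded strings.
    names = os.listdir(os.fsencode(directory))
finally:
    shutil.rmtree(directory)

print(names)
```

The cost is that every piece of code that touches those names, including third party modules, has to be prepared to handle bytes.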

So long as Python 3 is busy denying Unix reality (and causing all sorts of complications as a result of this), the sysadmin side of me can't and isn't going to touch it. I doubt that the Python 3 developers care about this and I doubt that anything is going to change in Python 3, which is kind of a pity.

(I could probably write system tools in Python 3 if I wanted to and tried hard enough and had to, but I don't see any reason to do so given that Python 2 is there and going to be there for a long time to come. Python 2 works, it works without huge contortions, and I don't really see anything compelling in Python 3 so far.)

Sidebar: on the long term availability of Python 2

At this point in time I see essentially no prospect of Python 2 being removed from Linux distributions in the next five years (minimum). The very first step along the long path of removing Python 2 would be for distributions to migrate Python based system tools from Python 2 to Python 3, and that hasn't even started yet (distributions are just now starting to talk about maybe moving some of their Python-based tools to Python 3 for their next release).

The chances of Python 2 disappearing any time soon from more conservative and slow moving Unixes like FreeBSD and Solaris (and Mac OS X) are best described as 'laughable'.

Python3Sysadmin written at 02:59:31

2011-12-13

DWiki's code is now on Github (among other things)

As a followup to my first experiment with coding in public, I've put a few other Python projects up on Github. They are:

  • dwiki, the code for DWiki itself (the software that runs this blog), plus the basic page templates and so on that I use. I'm not entirely happy with the actual organization of the code, but I have no energy to reform it at this point (or, more likely, rewrite it from scratch).

    (At the moment the specific additional templates for WanderingThoughts are not bundled in.)

  • portnanny is a powerful inetd-like frontend for a single TCP service, with a great deal of filtering power. It's also the Python code that I'm probably most proud of, since I think I did a decent job of structuring it and writing tests.

    (The quality of its code may be related to the fact that it was a total rewrite of an earlier attempt.)

  • python-netblock is a Python module for dealing with sets of IP address ranges; as part of this it has a module for sets of integer ranges in general. It comes with a command line netblock calculator that I use all the time (although there's no manpage for it right now).

I've made an index page for all of my Github things that I intend to keep up to date, or you can of course just look at things on Github.

DWikiGithub written at 12:01:45

2011-11-25

Python instance dictionaries, attribute names, and memory use

In a comment on my entry on what __slots__ are good for, Max wrote:

On the other hand, having __slots__ saves the strings that the instance dictionary entries would point to for the attribute names. On a 4 byte string platform, that adds up quickly too.

Although one might naturally think that this is the case, CPython is actually sufficiently clever that it is not so; using __slots__ doesn't save you any memory for attribute names because the string values of attribute names are already only stored once. However understanding how and why requires a reasonable amount of knowledge about CPython internals.

(Or you have to know to look at the documentation for the intern() function, which casually mentions this in passing.)

Like many similar languages, Python has string interning and the CPython internals make liberal use of interned strings for any code-related string that might look like it's going to be repeated. Attribute names are one such example of this; starting right in the code itself, all attribute names are fully interned. So you always have the same set of interned strings for attribute names regardless of how the attributes are stored and regardless of how many instances of the class you have.

(This is quite similar to part of the concept of 'symbols' in languages like Lisp and Ruby, although both of those expose symbols directly to user-level code.)

More specifically, all names used directly as attributes are interned. There are a number of ways where you can use real strings as attribute names and these will not be interned. The most prominent example is actually __slots__ itself, although things get confusing here. Consider:

class A(object):
  __slots__ = ('attrone', 'attrtwo')

  def __init__(self):
    self.attrone = 10

  def report(self):
    return self.attrone

The two string literals in __slots__ are not interned. However, the same string value ('attrone') is interned in __init__ and report(). If you have lots of code that all refers to '<something>.attrone', all of it will do all attribute lookups using the same interned string value.

(Note that attribute names are interned globally, not on a per-class basis or the like. The 'attrone' in the attribute name module1.cls1.attrone is the same interned string value as in module2.cls2.attrone.)

An even more complicated example can be had with 'setattr(obj, "astring", value)'. If you write this twice in two different functions, the "astring" literals are not interned (and thus are different strings). However, 'astring' as the attribute name in obj.astring is interned (this is done in setattr()). If you call one function with one object and the other function with another object, the attribute name is still a common interned string.

(In theory direct manipulation of obj.__dict__ might allow you to create a non-interned attribute name on an instance, although actual code that accesses it as obj.attr would use an interned version.)

If you are testing this, note that all single-character strings are interned for you; you need to use multi-character attribute names to avoid false positives.
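A minimal way to poke at this, assuming Python 3 where the function is sys.intern() (in Python 2 it is the builtin intern()). The strings are assembled at run time so that neither starts out as a literal from the source code; first and second are made-up names.

```python
import sys

# Build two equal multi-character strings at run time.
first = "".join(["attr", "one"])
second = "".join(["attr", "one"])

same_value = first == second
# Interning either one hands back the single shared interned object.
same_object_after_intern = sys.intern(first) is sys.intern(second)

print(same_value)                 # True
print(same_object_after_intern)   # True
```

Printing 'first is second' before interning is the interesting experiment, but what it reports is a CPython implementation detail.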

(This is undoubtedly far more about this issue than most people want to know. I'm peculiar that way; I can't resist peeking under the hood.)

Sidebar: interned versus non-interned versions of a string value

In some languages, once you intern a string value all future occurrences of that string value, anywhere, are automatically converted to the interned version. CPython doesn't work this way; instead, something has to explicitly convert a string value into an interned version of it and otherwise string values are left alone. It's thus entirely possible, even easy, to have an interned version of a string value as well as one or more non-interned versions of it.

InstanceStringUsage written at 00:11:49

2011-11-21

A cheap caching trick with a preforking server in Python

When the load here climbs, DWiki (the software behind this blog) transmogrifies itself into an SCGI based preforking server. I'm always looking for cheap ways to speed DWiki up for Slashdot style load surges (however unlikely it is that I'll ever need such tuning), and it recently occurred to me that there was an obvious way to exploit a preforking server: cache rendered pages in memory in each preforked process. Well, not even rendered pages; the simplest way to implement this is to cache your response objects.

(DWiki already has various layers of caching, but its page cache is disk based. A separate cache has various advantages (such as cache sharing between preforked instances) and a disk based cache means that you don't have to worry about memory exhaustion, only disk space, but both aspects slow the cache down.)

A simple brute force in-memory cache like this has a number of attractions. Caching ready to use response objects (combined with simple time-based invalidation) means that this cache is about as fast as your application will ever go. It's quite simple to add to your application, especially if your application already has the concept of a flexible processing pipeline; you can just add a request-stealing step early on, and cache the response objects that you're already bubbling up through the pipeline. Assuming that you're having processes exit after handling some moderate number of requests, using a per-process cache creates a natural limit on any inadvertent cache leaks, memory usage, and cache expiry and invalidation issues; after not too long the entire process goes away, caches and all.
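The general idea can be sketched as follows. This is not DWiki's actual code; ResponseCache, handle, render and the default sizes are all invented for illustration, and responses are plain tuples here.

```python
import time
from collections import OrderedDict

class ResponseCache(object):
    """Small per-process cache with time-based expiry (illustrative only)."""

    def __init__(self, max_entries=20, ttl=30.0, clock=time.time):
        self.max_entries = max_entries
        self.ttl = ttl
        self.clock = clock
        self._entries = OrderedDict()   # key -> (expiry time, response)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires, response = entry
        if self.clock() >= expires:
            del self._entries[key]      # stale; force regeneration
            return None
        return response

    def put(self, key, response):
        self._entries.pop(key, None)
        self._entries[key] = (self.clock() + self.ttl, response)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)   # drop the oldest entry

def handle(request_path, cache, render):
    """The request-stealing step: answer from the cache or fall through."""
    response = cache.get(request_path)
    if response is None:
        response = render(request_path)
        cache.put(request_path, response)
    return response

renders = []

def render(path):
    renders.append(path)
    return ("200 OK", "page for " + path)

cache = ResponseCache(max_entries=2, ttl=30.0)
handle("/a", cache, render)
handle("/a", cache, render)
print(renders)   # ['/a']: the second request was served from the cache
```

Because each preforked process has its own ResponseCache and exits after a modest number of requests, nothing here needs locking or explicit cleanup.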

(You can also size the cache quite low; you might make it one tenth or one fifth the number of requests that a single process will serve before exiting. A large cache is obviously relatively pointless; as the cache size rises, the number of cache hits that the 'tail' of the cache can ever have drops.)

Adding such an in-memory cache to the preforking version of DWiki did expose one assumption that I was making. For this cache to work, response objects have to be immutable after they are finished being generated. It turned out that DWiki's code for conditional GET cheated by directly mutating response objects; when I added response object caching this resulted in a very odd series of HTTP responses that were half conditional GET replies and half regular replies. I had a certain amount of head-scratching confusion until I worked out what was going on and why, for example, I was seeing 304 responses with large response bodies.
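One way to avoid this class of bug is to have conditional GET handling build a new reply instead of altering the one it was given. The sketch below uses a hypothetical namedtuple-based Response; DWiki's real response objects are different.

```python
from collections import namedtuple

# Hypothetical immutable response type, used only for this example.
Response = namedtuple("Response", ["status", "etag", "body"])

def conditional_get(response, if_none_match):
    """Return a 304 reply when the ETag matches, leaving response untouched."""
    if if_none_match is not None and if_none_match == response.etag:
        # _replace builds a new object, so a cached response is not altered.
        return response._replace(status=304, body="")
    return response

cached = Response(status=200, etag='"abc"', body="<html>...</html>")
reply = conditional_get(cached, '"abc"')

print("%s %s" % (reply.status, cached.status))   # 304 200
```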

PreforkingCacheTrick written at 23:56:27


This is part of CSpace, and is written by ChrisSiebenmann.
