Wandering Thoughts archives

2011-11-25

Python instance dictionaries, attribute names, and memory use

In a comment on my entry on what __slots__ are good for, Max wrote:

On the other hand, having __slots__ saves the strings that the instance dictionary entries would point to for the attribute names. On a 4 byte string platform, that adds up quickly too.

Although one might naturally think that this is the case, CPython is actually sufficiently clever that it is not so; using __slots__ doesn't save you any memory for attribute names, because the string values of attribute names are already stored only once. However, understanding how and why requires a reasonable amount of knowledge about CPython internals.

(Or you have to know to look at the documentation for the intern() function, which casually mentions this in passing.)

Like many similar languages, Python has string interning and the CPython internals make liberal use of interned strings for any code-related string that might look like it's going to be repeated. Attribute names are one such example of this; starting right in the code itself, all attribute names are fully interned. So you always have the same set of interned strings for attribute names regardless of how the attributes are stored and regardless of how many instances of the class you have.

(This is quite similar to part of the concept of 'symbols' in languages like Lisp and Ruby, although both of those expose symbols directly to user-level code.)

More specifically, all names used directly as attributes are interned. There are a number of ways in which you can use plain strings as attribute names, and these will not be interned. The most prominent example is actually __slots__ itself, although things get confusing here. Consider:

class A(object):
  __slots__ = ('attrone', 'attrtwo')

  def __init__(self):
    self.attrone = 10

  def report(self):
    return self.attrone

The two string literals in __slots__ are not interned. However, the same string value ('attrone') is interned in __init__ and report(). If you have lots of code that all refers to '<something>.attrone', all of it will do all attribute lookups using the same interned string value.

(Note that attribute names are interned globally, not on a per-class basis or the like. The 'attrone' in the attribute name module1.cls1.attrone is the same interned string value as in module2.cls2.attrone.)
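A quick way to see this (CPython-specific, since interning is an implementation detail rather than a language guarantee) is to compare the instance dictionary keys of two unrelated classes:

```python
class One(object):
    def __init__(self):
        self.longname = 1

class Two(object):
    def __init__(self):
        self.longname = 2

a, b = One(), Two()
(ka,) = a.__dict__.keys()
(kb,) = b.__dict__.keys()
# both instance dictionaries key on the very same interned string object,
# even though the classes are completely unrelated
assert ka == kb == 'longname'
assert ka is kb
```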

An even more complicated example can be had with 'setattr(obj, "astring", value)'. If you write this twice in two different functions, the "astring" literals are not interned (and thus are different strings). However, 'astring' as the attribute name in obj.astring is interned (this is done in setattr()). If you call one function with one object and the other function with another object, the attribute name is still a common interned string.
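You can watch setattr() do this interning directly (again CPython-specific; the class and attribute names here are made up for illustration):

```python
import sys

class Obj(object):
    pass

o = Obj()
name = ''.join(['my', 'attr'])   # built at runtime, so not interned
setattr(o, name, 1)

(key,) = o.__dict__.keys()
# CPython's setattr() interned the attribute name before storing it,
# so the instance dictionary key is the canonical interned string
assert key == 'myattr'
assert key is sys.intern('myattr')
```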

(In theory direct manipulation of obj.__dict__ might allow you to create a non-interned attribute name on an instance, although actual code that accesses it as obj.attr would use an interned version.)

If you are testing this, note that all single-character strings are interned for you; you need to use multi-character attribute names to avoid false positives.
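For example, in CPython:

```python
s1 = chr(97)                       # 'a' built at runtime
s2 = 'a'
assert s1 is s2                    # CPython shares all single-character strings

t1 = ''.join(['long', 'name'])     # multi-character, built at runtime
t2 = 'longname'
assert t1 == t2 and t1 is not t2   # a distinct, non-interned object
```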

(This is undoubtedly far more about this issue than most people want to know. I'm peculiar that way; I can't resist peeking under the hood.)

Sidebar: interned versus non-interned versions of a string value

In some languages, once you intern a string value all future occurrences of that string value, anywhere, are automatically converted to the interned version. CPython doesn't work this way; instead, something has to explicitly convert a string value into an interned version of it and otherwise string values are left alone. It's thus entirely possible, even easy, to have an interned version of a string value as well as one or more non-interned versions of it.
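sys.intern() makes this coexistence easy to see:

```python
import sys

canonical = sys.intern('attrone')         # the interned version
runtime = ''.join(['attr', 'one'])        # same value, separate object
assert runtime == canonical
assert runtime is not canonical           # both versions coexist
assert sys.intern(runtime) is canonical   # intern() maps back to one object
```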

InstanceStringUsage written at 00:11:49

2011-11-21

A cheap caching trick with a preforking server in Python

When the load here climbs, DWiki (the software behind this blog) transmogrifies itself into an SCGI-based preforking server. I'm always looking for cheap ways to speed DWiki up for Slashdot-style load surges (however unlikely it is that I'll ever need such tuning), and it recently occurred to me that there was an obvious way to exploit a preforking server: cache rendered pages in memory in each preforked process. Well, not even rendered pages; the simplest way to implement this is to cache your response objects.

(DWiki already has various layers of caching, but its page cache is disk based. A separate cache has various advantages (such as cache sharing between preforked instances) and a disk based cache means that you don't have to worry about memory exhaustion, only disk space, but both aspects slow the cache down.)

A simple brute-force in-memory cache like this has a number of attractions. Caching ready-to-use response objects (combined with simple time-based invalidation) means that this cache is about as fast as your application will ever go. It's quite simple to add to your application, especially if your application already has the concept of a flexible processing pipeline; you can just add a request-stealing step early on and cache the response objects that you're already bubbling up through the pipeline. Assuming that you have processes exit after handling some moderate number of requests, a per-process cache also puts a natural limit on any inadvertent cache leaks, memory usage, and cache expiry and invalidation issues; before too long the entire process goes away, caches and all.

(You can also size the cache quite low; you might make it one tenth or one fifth the number of requests that a single process will serve before exiting. A large cache is obviously relatively pointless; as the cache size rises, the number of cache hits that the 'tail' of the cache can ever have drops.)
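To make this concrete, here is a minimal sketch of such a per-process cache with time-based invalidation and a crude size cap. The class name and numbers are mine for illustration; this is not DWiki's actual code:

```python
import time

class ResponseCache(object):
    """Per-process response cache: time-based expiry plus a size cap."""
    def __init__(self, maxsize=50, ttl=60):
        self.maxsize = maxsize
        self.ttl = ttl
        self.store = {}                  # key -> (timestamp, response)

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        when, resp = entry
        if time.time() - when > self.ttl:
            del self.store[key]          # expired; forget it
            return None
        return resp

    def put(self, key, resp):
        if key not in self.store and len(self.store) >= self.maxsize:
            # crude eviction: drop the oldest entry
            oldest = min(self.store, key=lambda k: self.store[k][0])
            del self.store[oldest]
        self.store[key] = (time.time(), resp)
```

In the processing pipeline, the request-stealing step would consult get() early on and put() the finished response object as it bubbles back up.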

Adding such an in-memory cache to the preforking version of DWiki did expose one assumption that I was making. For this cache to work, response objects have to be immutable after they are finished being generated. It turned out that DWiki's code for conditional GET cheated by directly mutating response objects; when I added response object caching this resulted in a very odd series of HTTP responses that were half conditional GET replies and half regular replies. I had a certain amount of head-scratching confusion until I worked out what was going on and why, for example, I was seeing 304 responses with large response bodies.

PreforkingCacheTrick written at 23:56:27

2011-11-11

My view of the right way to copy lists in Python

I recently wound up reading Python: copying a list the right way (it's old, but it made things like Hacker News recently); in it, its author argues that the clearer way to copy a list in Python is not 'new = old[:]' but 'new = list(old)'. In thinking about it since then, I've wound up more or less disagreeing with this.

It's true that the effects of old[:] are not immediately obvious (although this idiom has the advantage that it looks clearly magical; if you don't understand it, you know that you don't understand it). But I maintain that list(old) is also magical in an important but less obvious way: it's not obvious that list() returns a copy of the list.

One of the understandings of list() that I think you can reasonably have is that it's a function that turns things into lists, if the thing is plausibly list-like in the first place (otherwise it fails). In this view, if you call list() on something that's already a list it's perfectly sensible for list() to return it as-is; you asked list() to make it into a list, and list() is saying 'here you go, it already is'. In fact you may want this behavior.

If you are just learning Python, it is far from obvious that this understanding of list() is in fact wrong and that it is guaranteed to make a copy of old. Further, there is no straightforward semantic reason that list() returns a copy; it does so ultimately because that's how it's documented to work. If list() is an 'intuitive' way of making copies of lists, it's a misleading intuition. If you relied on it, you've just gotten lucky.
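The documented guarantee is easy to check:

```python
old = [1, 2, 3]
new = list(old)
new.append(4)
assert new == [1, 2, 3, 4]
assert old == [1, 2, 3]      # old is unaffected: list() really made a copy
assert new is not old
```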

(Note that there are list()-like things which do not make copies.)

I'm not going to claim that old[:] is any more intuitive a way of making copies than list(old). I am just going to claim that it is not any less intuitive; in either case, you really have to learn this as an idiom. I personally prefer old[:], both because it looks more clearly magical and because I think it signals more strongly that a copy is being made (since slicing generally has to make new things).

(Note that if you have an arbitrary indexable object, there is no strong semantic requirement that old[:] not return old (and in fact tuples behave this way). There probably is a cultural expectation that it not do so if old is a mutable object, but that's because 'old[:]' has become the general Python idiom for 'make me a copy of this indexable thing'.)
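In CPython you can see both behaviors directly:

```python
t = (1, 2, 3)
assert t[:] is t             # tuple slicing hands back the tuple itself
assert tuple(t) is t         # and so does tuple(); no copy is needed

old = [1, 2, 3]
assert old[:] is not old     # but a mutable list really is copied
assert old[:] == old
```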

Sidebar: a slight dive into the semantics of list()

If you know Python at a decent level, some of what I wrote above is making you frown. This is because list() is not actually a function; instead it is a class (aka a type). Once you know that list() is actually a type, you have a much stronger case for expecting list(old) to return a copy; all classes/types return new instances when called (unless they go well out of their way to do otherwise).

Since this behavior is strongly embedded in what classes do and how they operate, I expect that people do assume as a matter of course that all classes return new instances even if they're called (validly) with existing instances of the class as initializers. The times when people are okay with some variant of singletons are when this creates almost no semantic difference, ie when the class instances are immutable once created.
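CPython already does exactly this for some immutable built-in types; since the result can never be mutated, handing back the existing instance is indistinguishable from a copy:

```python
s = ''.join(['he', 'llo'])
assert str(s) is s           # str() returns the same immutable object
f = frozenset([1, 2])
assert frozenset(f) is f     # frozenset() does too
```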

(And now I think I have wandered far enough down this particular rathole for one entry.)

OnCopyingLists written at 01:01:10

