Wandering Thoughts archives

2005-11-28

What Python's global interpreter lock does (and doesn't) protect

Most people doing threading in Python know about Python's Global Interpreter Lock (GIL), which causes only one thread to be running in the CPython interpreter at any one time. This means that threaded code already has a certain amount of implicit locking going on, making certain operations thread-atomic without you writing explicit locks.

The important thing about the GIL for this is that it protects bytecodes, not Python statements. If something happens in a single bytecode, it's protected; otherwise, you need explicit locking. The most important operations that happen in a single bytecode are calls to methods implemented in C, such as all of the methods on builtin types. So things like list .append() or .pop() are atomic.

(All bets are off if you're dealing with someone's Python subclass of a builtin type, since it depends on what exactly they overrode.)

An important note is that in-place operations like '+=' are not actually single bytecodes, so counters like 'self.active += 1' need locks.

(Because they must be able to work on immutable objects, the actual INPLACE_* bytecodes result in an object, which is then stored back to the left hand side in a separate instruction. In-place operations on mutable builtin types can be atomic, but there aren't many mutable builtin types that support '+' et al.)
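
As an illustration, here's a minimal sketch of guarding such a counter with an explicit lock (the class and attribute names are just for illustration, not from any real program):

import dis
import threading

class Worker(object):
  def __init__(self):
    self.active = 0
    self._lock = threading.Lock()

  def start_job(self):
    # 'self.active += 1' compiles to several bytecodes (roughly LOAD_ATTR,
    # INPLACE_ADD, STORE_ATTR), so a thread switch can land in the middle;
    # the lock makes the whole update atomic.
    self._lock.acquire()
    try:
      self.active += 1
    finally:
      self._lock.release()

# Show the multi-bytecode sequence that makes the unlocked version unsafe.
dis.dis(Worker.start_job)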

By default, Python only switches between runnable threads every hundred bytecodes or so, which can disguise code that isn't threadsafe. You can make Python switch almost every bytecode with sys.setcheckinterval(); see the sys module documentation. (Python opcodes don't always allow thread switching after themselves, so it's never literally every bytecode.)
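
For example, here's a rough sketch of using this to shake a race out of hiding (the specific counts are just illustration):

import sys
import threading

sys.setcheckinterval(1)   # consider a thread switch after (almost) every bytecode

counter = {'n': 0}

def work():
  for i in range(100000):
    counter['n'] += 1     # not atomic: load, add, store

threads = [threading.Thread(target=work) for i in range(4)]
for t in threads:
  t.start()
for t in threads:
  t.join()

# With the default check interval the race can go unnoticed on a lucky run;
# with the interval at 1 the lost updates are far harder to miss.
print(counter['n'])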

(Again we see that implementation details matter, although there is an argument that this is too much black magic.)

WhatTheGILProtects written at 01:44:10

2005-11-12

Why using local variables is fast in Python

One piece of Python optimization lore is that access to local variables is very fast, so if you repeatedly refer to something in a loop in a function you can optimize it by copying it to a local variable first (this often comes up when you are repeatedly calling a function). This fast access comes about because of some implementation details in CPython.
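
For example, here's a minimal sketch of the idiom (the function and names are just illustration):

import math

def distances(points):
  # Copying math.sqrt and result.append into locals turns a global-plus-
  # attribute lookup on every iteration into a cheap local lookup.
  result = []
  sqrt = math.sqrt
  append = result.append
  for x, y in points:
    append(sqrt(x * x + y * y))
  return result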

The CPython interpreter doesn't run Python code directly, but first compiles it to bytecode for a relatively simple stack-based virtual machine. You can see the actual bytecode of things with the dis module, whose documentation also describes most of the bytecodes themselves.

The bytecode puts local variables (and the function arguments) into a fixed-size array, statically assigning variable names to indexes, instead of a dictionary; it can do this because you can't add local variables to a function on the fly. Getting a local variable just requires bumping the reference count on the object in the appropriate slot, which is about as fast an operation as you can get.

Looking up global names (and attributes, such as self.foo or math.pow) is significantly slower, since it involves at least one and possibly several dictionary lookups. LOAD_GLOBAL is fairly heavily optimized, going so far as to inline a chunk of dictionary lookup code; LOAD_ATTR is far less so, plus has the likely overhead of multiple dictionary lookups.

This is also why the global statement is necessary; stores to global variables use a different bytecode than stores to local variables, so the bytecode compiler has to know which to generate when it compiles the function. If a variable is assigned to anywhere in a function (and there's no global statement), the bytecode compiler treats it as a local variable and generates STORE_FAST for stores to it.
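
You can watch all of this with the dis module mentioned above; here's a small sketch (the names are mine):

import dis

scale = 3

def f(x):
  total = x * scale        # x: LOAD_FAST, scale: LOAD_GLOBAL, total: STORE_FAST
  return len(str(total))   # len and str are LOAD_GLOBAL too

dis.dis(f)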

(This is a followup to KnowingImplementationsMatters.)

WhyLocalVarsAreFast written at 03:07:01

2005-11-11

The importance of understanding Python's implementation

Joel Spolsky likes to talk about leaky abstractions, how the lower level details underneath abstractions inevitably leak through and affect your work with those abstractions. High level languages like Python are no exception, because how their abstractions are implemented has significant effects on how fast different sorts of code will run.

For instance, let's take lists and tuples, both of which CPython implements as arrays of objects. This has a number of consequences:

  • indexing a list element is cheap, and all list elements are equally cheap to access.
  • taking the length of a list is cheap, since the length is always known.
  • popping the last element off is cheap.
  • popping the first element is expensive, because you have to slide everything down.
  • similarly, adding an element at the end is cheap but adding it at the front is expensive.
  • concatenating lists is not cheap, because it requires copying one list onto the end of the other.

Other languages use different approaches, with different consequences; for example, Lisp uses singly-linked lists, making the front of the list the cheapest place to access and to add and remove elements. This matters when implementing things on top of lists; a Python stack should have the top of the stack at the end of the list, while a Lisp stack should have it at the start.

A Python stack implemented the 'wrong' way will have significantly worse performance, and there is very little in the language documentation to tell you which is wrong and which is right; you need to know something about how Python implements lists. (Perhaps you can guess it from Python having array indexing of lists.)
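
As a minimal sketch, the 'right' Python stack keeps the top of the stack at the end of the list:

class Stack(object):
  def __init__(self):
    self.items = []

  def push(self, item):
    self.items.append(item)   # cheap: grows the array at the end

  def pop(self):
    return self.items.pop()   # cheap: removes from the end

# The 'wrong' way would be .insert(0, item) and .pop(0), which slide every
# existing element up or down on each operation.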

(My conscious awareness of this issue owes a significant debt to the discussion in the comments on my 'Minimizing object churn' entry.)

KnowingImplementationsMatters written at 00:33:29

2005-11-09

Using Python introspection for semi-evil

One of the things I write in Python is network daemons. Because it works so nicely, network daemons usually take input as text lines that start with a command word and have whitespace separated arguments. There's a certain amount of eye-rolling tedium in writing code to check that the command word is valid and has the right number of arguments. When I wrote a program that does a lot of this, the tedium finally got to me and I sort of snapped and automated all of the validation through Python's introspection features.

The program's basic structure has an object for each connection. Each command is handled by a method function on the object, and the functions all take normal argument lists. (So they are not passed the command line as a big list or anything.)

The name of each command method is 'op_<command>', e.g. 'op_sample' to handle the 'sample' command. To check whether the line's first word was a valid command, I just looked to see if the connection's object had an attribute with the appropriate name.

(This is a fairly common pattern in Python; see, for example, how BaseHTTPServer handles dispatching GET and POST and so on.)

To check that the number of arguments was right, I reached into the handler function I'd just found and fished out how many arguments it expected to be called with (compensating for the additional 'self' argument that method functions get). This isn't at all general, but I didn't need generality; the network daemon's commands all have a fixed number of arguments.

The code wound up looking like this:

# in a class:
def op_sample(self, one, two):
  ...

def dispatch(self, line):
  # Do we have a line at all?
  n = line.split()
  if not n:
    return False
  cword = 'op_' + n[0]

  # Valid command word?
  cfunc = getattr(self, cword, None)
  if not cfunc:
    return False

  # Right argument count?
  acnt = cfunc.func_code.co_argcount
  if acnt != len(n):
    return False

  return cfunc(*n[1:])

(The real version used more elaborate error handling for 'empty line', 'no such command', and 'wrong number of arguments'.)
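
As a purely hypothetical usage sketch, assuming the methods above live in a class called Conn with op_sample as its only command:

c = Conn()
c.dispatch("sample one two")    # calls c.op_sample("one", "two")
c.dispatch("sample one")        # wrong argument count: returns False
c.dispatch("nosuch one two")    # no op_nosuch method: returns False
c.dispatch("")                  # empty line: returns False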

Normally I would have to account for the extra self argument in the function's argument count. But in this code the n list has one extra element (the command word) too, so it balances out. This has the subtle consequence that you can't make op_sample a staticmethod function, because then it would have the wrong argument count.

(I did say this was semi-evil.)

SemiEvilIntrospection written at 02:30:10

2005-11-07

Examining Python's string concatenation optimization

In the comments on the other day's 'minimizing object churn' entry, it was noted that Python string concatenation had been optimized somewhat in the recent 2.4 release of the CPython interpreter. The 'What's New in Python 2.4' release notes say:

  • String concatenations in statements of the form s = s + "abc" and s += "abc" are now performed more efficiently in certain circumstances.

But what is 'more efficiently' and what are the 'certain circumstances'? What does it take to get the optimized behavior?

Here's the summary: counting on this optimization is unwise, as it turns out to depend on low-level details of system memory allocation that can vary from system to system and workload to workload. Also, this is only for plain (byte) strings, not for Unicode strings; as of Python 2.4.2, Unicode string concatenation remains un-optimized.

The bits of string concatenation that can theoretically be optimized away are allocating a new string object, copying one of the old string objects into it, and freeing that old object. (No matter what, CPython has to copy the other operand's data into the result.)

The new 2.4 optimization attempts to reuse the left side string object and enlarge it in place. For this to be possible, the object needs to have a reference count of one (and not be intern'd). To help create this situation, the CPython interpreter peeks ahead to see if the object has exactly two references and the next bytecode is an assignment to a local variable that currently points to the same object; if so, the variable gets zapped on the spot, dropping the reference count to one.

(Nit: this assignment reference dropping also happens for module-level code that refers to global variables.)

Even when the left side string object can be reused, we're not actually saving anything unless it can be enlarged in place. For that, it needs to be at least about 235 bytes long (on 32-bit systems), and the C library realloc() needs enough free memory after it to grow it into.

(CPython uses an internal allocator that always grows things by copying for all allocations of 256 bytes or less; string objects have about 21 bytes of overhead on a normal 32-bit platform. For more details, see the CPython source. If you're interested in the gory details, start with the string_concatenate function in Python/ceval.c and work outwards.)

I suspect that the optimization is most likely to trigger when you are repeatedly growing a fairly large string by a small bit, as C libraries often use allocation strategies that leave reasonable amounts of space free after large objects. If your program mostly deals with smaller strings, you may not benefit much (if at all) from this optimization.
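
If you want to poke at this on your own machine, here's a rough sketch (the function names are mine, and whether the concatenation version actually benefits depends on your platform's malloc, as described above):

import time

def grow_by_concat(count, piece):
  s = ''
  for i in range(count):
    s = s + piece          # the form the 2.4 optimization targets
  return s

def grow_by_join(count, piece):
  return ''.join([piece for i in range(count)])

for func in (grow_by_concat, grow_by_join):
  start = time.time()
  func(20000, 'x' * 50)
  print('%s: %.3f seconds' % (func.__name__, time.time() - start))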

One consequence of all this is that none of the following will be optimized, because in each case the assignment target is of the wrong type, so the left side reference (implicit in the case of '+=') cannot be reduced to a reference count of one:

def foo(self, alst, a):
  global bar; bar += a
  self.foo += a
  alst[0] += a

This isn't too surprising, since neither self.foo nor alst[0] is necessarily a simple store operation once the dust settles. The global case could probably be optimized, but may be considered too uncommon to bother writing the code for.

However, 's = strfunc(...) + strfunc2(...)' and the like can get optimized, unless there's another reference to strfunc()'s return value hanging around somewhere. In fact, a whole series of string concatenations of function return values can get optimized down. (Always assuming that the string sizes and the free memory and so on work out.)

Sidebar: what about optimizing string prepending?

By string prepending, I mean 's = "abc" + s' (versus 's = s + "abc"'). CPython doesn't optimize this, because it only looks at the left side string object.

You can't significantly optimize string prepending in CPython, because you always have to move s's current contents up to make room at the start for the string you're sticking on the front. This means that all you could possibly save is the malloc() and free() overhead and the overhead of setting up the new string's object header. This is probably small relative to copying the existing string data, so not a very compelling optimization.

(I suspect that string prepending is also uncommon in Python code.)

ExaminingStringConcatOpt written at 01:31:48

2005-11-05

Minimizing object churn to optimize Python code

One of the interesting things about writing code in high level garbage collected languages is how your optimization concerns can change. One optimization in all GC'd languages, Python included, is minimizing the creation of new objects and maximizing the reuse of old ones; the fewer new objects you create, the fewer cycles garbage collection eats, the less memory gets fragmented, and you often even use less memory overall.

Minimizing object churn is behind several well-known Python (anti-)patterns, such as 'avoid repeated string concatenation to build a string'.

(The simple approach to accumulating one big string from a bunch of function calls is to concatenate each value to the building result string as you get it. But this churns objects (and copies data) at every concatenation; the right pattern is to put all the function results in a list and then use "".join(reslst) at the end.)
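
Here's a minimal sketch of the two approaches, where get_chunk stands in for whatever function produces each piece:

def build_slow(get_chunk, count):
  result = ''
  for i in range(count):
    result = result + get_chunk(i)    # a new string object every time around
  return result

def build_fast(get_chunk, count):
  reslst = []
  for i in range(count):
    reslst.append(get_chunk(i))       # just grows the list
  return "".join(reslst)              # one final string object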

A lot of object churn avoidance can be summed up as leave objects alone if at all possible. This is especially important for large immutable objects like strings; any time you fiddle with a string you wind up with a new object. This means that you want to push fiddling with objects as close to their creation as possible, because this maximizes the chance for sharing the same object.

When you write code, look at it with an eye to how many objects it churns through, and how necessary those are. Some techniques include:

  • if you always modify constants in some way, pre-compute the modified version and store it too.
  • batch up things and do them all at once ("".join() is an example of this).
  • use things like list comprehensions and string .join(); being built in to the interpreter, they can optimize in ways not possible for your Python code.
  • it's better to return an existing constant string than the 'same' string (same value) that you parsed from another string. (This generalizes to all objects.)
  • if you have to lowercase strings that are usually lower-cased (or upper-cased, or trimmed of surrounding whitespace, and so on) to start with, it may be worthwhile to do something like 'nstr = ostr.lower(); if nstr == ostr: nstr = ostr'. This doesn't reduce object churn much, but it will save memory and it minimizes fragmentation by destroying any unnecessary nstr's almost immediately. (This too can be generalized beyond strings.)
  • consider caches; one of their side effects is reducing object churn by returning references to existing objects (a minimal sketch follows this list). Of course, the devil is in the details, including cache aging.
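
Here's that cache sketch (with no aging at all, which a real cache would need):

_seen = {}

def dedup(s):
  # Hand back an existing equal object if we have one, instead of keeping
  # many separate copies of the same value alive.
  try:
    return _seen[s]
  except KeyError:
    _seen[s] = s
    return s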

Sidebar: saving string memory with intern

Python lets you 'intern' strings via the built-in intern() function, which maps duplicate strings into the same object. This makes it the easy way to eliminate memory wasted due to duplicate strings in your program, at the cost of a hash lookup in intern() for every string that you want to de-duplicate. (Really, the documentation explains this better than I do.)
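
For instance, here's a small sketch of interning parsed words so that duplicates share one object (intern() is a builtin in Python 2 and only works on plain byte strings):

counts = {}

def count_word(word):
  word = intern(word)          # equal words now share a single object
  counts[word] = counts.get(word, 0) + 1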

You really want to be using Python 2.3 or later before you use this much, because before Python 2.3, any interned strings were immortal. (The interning code held a reference to them so they never got garbage collected.)

MinimizingObjectChurn written at 03:17:32

