2007-08-11
On the performance of Python longs being used as bitmaps
A comment on the previous entry got me
interested in just what was costing all the time when using Python
longs as bitmaps.
Some poking around shows that the time & takes scales with the size of
the smaller long involved, while the time | takes scales with the size
of the larger long involved. While this is not particularly surprising,
it means that setting and clearing any bit has a cost based on how big
the bitmap is, i.e., where the largest set bit is.
(Checking low bits in a large bitmap is relatively cheap, since one
side of the & will be small.)
There are two core performance issues with longs as bitmaps: there is
no bit addressing, and you cannot mutate a long in place, so any change
to your bitmap requires allocating a new object. The first is the more
important, because it means that merely to deal with one bit you are
making new longs left and right and then doing big, expensive
full-width operations on them.
While Python could in theory add bit addressing to longs (allowing you to get and set a specific bit), I doubt it will, since I suspect that the need is relatively esoteric and you can probably do about as well with a list of Python integers (or a C level module if you really need speed).
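To make the list-of-integers alternative concrete, here is a minimal sketch (in modern Python syntax, where ints and longs are unified); the class name and the 32-bit word size are my choices for illustration, not anything standard:

```python
class WordBitmap:
    """Bitmap stored as a list of fixed-width words, so setting or
    clearing a bit only touches one small integer instead of
    rebuilding one huge long."""
    WORDBITS = 32

    def __init__(self):
        self.words = []

    def _grow(self, widx):
        # Extend the word list so that index widx exists.
        while len(self.words) <= widx:
            self.words.append(0)

    def set(self, bit):
        widx, boff = divmod(bit, self.WORDBITS)
        self._grow(widx)
        self.words[widx] |= 1 << boff

    def clear(self, bit):
        widx, boff = divmod(bit, self.WORDBITS)
        if widx < len(self.words):
            self.words[widx] &= ~(1 << boff)

    def test(self, bit):
        widx, boff = divmod(bit, self.WORDBITS)
        return widx < len(self.words) and bool(self.words[widx] & (1 << boff))
```

The point of the word list is that clearing bit 4097 only rewrites one 32-bit word, instead of allocating a brand new 4098-bit long.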
Sidebar: some numbers
Here's a performance chart for bmap & ~(1L << bit), where bmap has
only bit and bit+1 set:
| bit | time (µsec) | doing the << | doing the ~ | doing the & |
| 0 | 0.59 | 40.1% | 20.2% | 39.7% |
| 1 | 0.59 | 41.1% | 21.4% | 37.4% |
| 8 | 0.60 | 40.4% | 20.0% | 39.6% |
| 16 | 0.63 | 39.7% | 18.2% | 42.1% |
| 32 | 0.62 | 39.1% | 19.4% | 41.5% |
| 64 | 0.65 | 37.8% | 20.0% | 42.3% |
| 128 | 0.67 | 37.5% | 19.2% | 43.3% |
| 256 | 0.79 | 33.8% | 19.2% | 47.0% |
| 512 | 0.94 | 31.2% | 21.9% | 46.9% |
| 1024 | 1.28 | 26.0% | 21.1% | 52.9% |
| 1536 | 1.61 | 23.3% | 21.7% | 55.0% |
| 2048 | 2.35 | 20.5% | 23.3% | 56.2% |
| 4096 | 3.74 | 17.7% | 23.2% | 59.1% |
The percentages are the relative proportions of the overall time that the left shift, the inversion, and the binary AND take. Timings are from Python 2.4.4 on a 64-bit Fedora Core 6 machine with an AMD Athlon 4600+ CPU.
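If you want to poke at this yourself, a sketch of the measurement using the standard timeit module (in modern Python syntax, where the 1L literal is just 1; exact numbers will of course differ on other machines and versions):

```python
import timeit

def time_clear(bit, number=10000):
    """Time 'bmap & ~(1 << bit)' where bmap has only bit and
    bit+1 set, returning microseconds per operation."""
    setup = "bmap = (1 << %d) | (1 << %d); bit = %d" % (bit, bit + 1, bit)
    stmt = "bmap & ~(1 << bit)"
    secs = timeit.timeit(stmt, setup=setup, number=number)
    return secs / number * 1e6

for bit in (0, 64, 1024, 4096):
    print("bit %4d: %.2f usec" % (bit, time_clear(bit)))
```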
2007-08-10
Clever large integers
Where bitmap is a Python long, consider the innocent looking Python
expression:
bitmap = bitmap & ~(1L << bit)
This is the perfectly normal way of clearing a bit in a bitmap, so it looks like there's nothing peculiar going on here.
Except that Python longs are of indefinite (and theoretically infinite)
length, and what makes this expression work with normal fixed-size
integers is that the two operands have the same (finite) size. So
Python is clearly not just doing straight binary operations here; the
implementation of ~ and & for longs has to be more complicated than
simply manipulating bit representations.
All of this goes to illustrate that sometimes features aren't as simple as they look.
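You can see the cleverness directly: Python behaves as if longs had an infinite two's complement representation, so the ~ of a non-negative number is conceptually an infinite run of leading one bits, stored as a small negative number (this demonstration uses modern Python syntax, where plain ints behave the way longs did):

```python
bit = 100
mask = ~(1 << bit)            # conceptually ...11101...1, stored as a negative int
print(mask == -(2**100 + 1))  # ~n is -(n+1) in two's complement

bitmap = (1 << bit) | (1 << 3)
bitmap &= mask                # clears bit 100, leaves bit 3 alone
print(bitmap == 1 << 3)
```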
As it turns out, this cleverness has a cost, at least in Python. If
you already know that the bit is set, clearing it with bitmap - (1L
<< bit) starts out being about twice as fast as the AND approach
(when operating on bit 0) and gets better from there. I'm ambivalent
about whether or not you want to check the bit's status ahead of
setting or clearing it; while it may avoid extra object allocations
and deallocations, it's relatively expensive and I haven't completely
convinced myself that it actually avoids all that much object churn.
(Checking plus subtraction is usually slower than the AND approach alone; it only becomes slightly faster once you start operating around bit 2048 and up. Of course, such large bit numbers are exactly where you probably want a faster representation for your bitmaps than a large integer, which should speed everything up.)
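The two approaches agree whenever the bit is known to be set, but the subtraction version silently corrupts the bitmap if it isn't; a quick check (in modern Python syntax):

```python
def clear_and(bitmap, bit):
    # Works whether or not the bit is set.
    return bitmap & ~(1 << bit)

def clear_sub(bitmap, bit):
    # Faster, but only correct if the bit is already set;
    # otherwise the subtraction borrows from other bits.
    return bitmap - (1 << bit)

bm = (1 << 2048) | (1 << 7)
assert clear_and(bm, 2048) == clear_sub(bm, 2048) == 1 << 7

# With the bit already clear, subtraction goes wrong:
assert clear_and(1 << 7, 2048) == 1 << 7
assert clear_sub(1 << 7, 2048) != 1 << 7
```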
2007-08-06
Implementing a preforking network server in Python
I recently read a great explanation of the general Apache 2 preforking server model, and found it so lucid and simple that I was immediately inspired to implement a generic preforking server myself. Their approach has a bit of overhead in parent/child communication, but it does mean that you do not have to pass file descriptors between processes; all you need is some way for the parent to talk with its children.
(Since Python's standard library doesn't have support for file descriptor passing, the pragmatic advantages of this approach are immense.)
To summarize, the core of this model is that the parent is the only
thing that notices when there is a new connection, but it does not do
the accept() itself; instead it picks an idle child and commands the
child to handle the connection. In implementing my version I ran into
some subtleties, so I figure I might as well write them down here for
posterity:
- the server socket must be non-blocking and the entire scheme must
be able to cope with accept()s that fail, because this can happen.
If the kids ever block in accept(), things start going horribly
wrong; at a minimum you descend into a thundering herd situation.
(Remember to set the newly accept()'d socket back to blocking;
there are some platforms where it will inherit the server socket's
non-blocking status.)
- the 'accept a new connection' step has to be synchronous between
the parent and the child, and the child must reply with its status
only after it has done the accept(). Otherwise you can have a race
between the parent returning to its select() and the child pulling
the new connection off; if the parent wins the race it will order
more than one child to handle the same connection. This means that
there are only two times kids send asynchronous status messages to
the parent: their initial 'I am idle' message on startup, and the
messages they send after they've completed processing a connection.
- the parent should deal with status reports from kids before trying
to dispatch a new pending connection; this maximizes the chance of
knowing that you have idle kids.
- I found that I needed a status code from the child to the parent
to signal 'the processing function has asked the server to shut
down'. This is because the simplest place to put checks for
signals to shut down is in the application's 'new connection
handler' function, which is only called in kids.
(An alternate approach is to have a second function that's called every time through the parent's main loop.)
- because I was concerned about resilience, I added timeouts for
synchronous communication (for example, commanding a child to
accept a new connection and getting a status reply from it). If
the timeout expires without a good answer, the parent assumes
that something horrible has gone wrong and kills the child.
- asynchronously starting kids (for example to maintain a minimum
pool of idle workers) raises the issue of dealing with kids that
don't start properly for some reason. My approach is to put such
kids on a list until they report their initial 'idle' status. If the
parent needs to dispatch a new connection and there are no idle kids,
it picks a pending kid and synchronously waits for its status report;
if this times out it kills the kid and tries again.
- there are two ways of handling too many connections. The graceful
way is that if the parent cannot dispatch a new connection to a kid,
it stops checking the server socket until it has at least one idle
kid; otherwise the parent will just continually spin between
select() and failing to dispatch the new connection. The abrupt way
to handle overload is to have the parent accept() and then
immediately close the new connection if it cannot dispatch it to a
kid.
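A sketch of the careful accept() handling from the first point, in modern Python; the specific set of errnos treated as non-fatal is my guess at the minimum you need:

```python
import errno
import socket

def child_accept(serversock):
    """Try to accept one connection on a non-blocking server socket.
    Returns the new (blocking) socket, or None if there was nothing
    to accept after all."""
    try:
        conn, addr = serversock.accept()
    except socket.error as e:
        # The client may have gone away already, or the connection
        # may have vanished out from under us; both are non-fatal.
        if e.errno in (errno.EAGAIN, errno.EWOULDBLOCK, errno.ECONNABORTED):
            return None
        raise
    # Some platforms have the new socket inherit the server socket's
    # non-blocking status, so explicitly set it back to blocking.
    conn.setblocking(True)
    return conn
```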
While the original description uses socketpairs to communicate between
the parents and kids, my implementation uses a pair of pipes because
Python 2.3.3 doesn't have socket.socketpair() (it only appears in
Python 2.4).
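The pipe-pair substitute for socketpair() can be sketched like this (the function name is mine); each side gets one fd to read from and one to write to:

```python
import os

def pipepair():
    """Return two (read_fd, write_fd) endpoint tuples that can talk
    to each other, as a stand-in for socket.socketpair()."""
    # One pipe for each direction of the conversation.
    p2c_r, p2c_w = os.pipe()   # parent writes, child reads
    c2p_r, c2p_w = os.pipe()   # child writes, parent reads
    parent_end = (c2p_r, p2c_w)
    child_end = (p2c_r, c2p_w)
    return parent_end, child_end
```

After a fork(), each process closes the fds of the end it doesn't own; I've left that bookkeeping out of the sketch.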
I'm quite happy with how relatively easy it was to write an implementation of this approach and the performance of my code. It is now the core of DWiki's SCGI server, and has handily dealt with the performance issues of the previous approach without having any of the drawbacks of the other solutions I've considered.
(And it stood up fine to a recent pounding.)
If anyone is interested, the code is available as prefork.py, unfortunately without unit tests but hopefully with enough documentation to be usable (the comments probably need some revision). It is currently GPL2 because I know the verbiage to slap on a file for that license; I would be happy to redo it under the Python license once I know the right incantation.
2007-08-05
Thinking about more text formatting for DWiki
What DWiki is used to write about has drifted a fair bit from what I originally wrote it for, and some of the choices I originally made no longer seem so ideal. I've been thinking for a while that DWikiText, the wiki-text formatting language here, needs some more formatting features; this entry is at least in part me thinking out loud about them.
So, here are the new features I think I want:
- 'literal words', words that appear exactly as written without
any special characters in them doing things.
I really want a way of writing about the x86_64 architecture without having to quote it all the time; currently I write it as [[x86_64|]], and you don't want to know how I had to write that quoted version.
- a more aesthetic way of quoting things in general; the current
approach is too ugly to make me happy if it's used very often.
I think the syntax I like best for this is ``...''.
- a way of doing literal text that linewraps,
because almost always this is what I want. If I was doing DWikiText
over from scratch, this would be the default for literal text and
non-linewrapping literal text would be the special case, but as it
is I am basically stuck.
- a way of turning off special characters in text, especially the
_ character. Underscore was a great special character when
I was planning to mostly write about Unix commands; it is a bad
one when I am talking about programs, because it's a common
character in identifiers.
- I like the idea of some sort of simple macro scheme, on the order of what Mark Dominus did for his Blosxom installation, for much the same reason as he wrote his.
I've come up with trial implementations of most of these features (I've actually been sitting on the last two for some time), but so far I can't decide if I really like them all. Since once I start actually using them, I am more or less stuck (especially if they appear in comments), I feel I need to be extra sure before I commit to them.
(I have already been stuck with one or two bad choices that linger only for backwards compatibility with existing text.)
I could avoid a fair amount of the problem if I was willing to move seriously to using Unicode characters in my entries. For example, my previous ugly example could be rewritten as a big quoted block with literal Unicode arrow characters. Using Unicode characters would also give me lots of choices for special formatting characters, and would actually make escaping them reasonably acceptable if I ever needed to write them literally; I could just use the existing mechanism I have for arbitrary Unicode characters.
(And I am peculiar enough to get amusement out of using, say, 「 and 」 as my literal quotation characters.)
2007-08-03
Using WSGI for performance tuning
Doing profiling and performance tuning on a web application is normally a big pain. It's hard to profile things when processes (or threads) are always coming and going, and the typical web environment has so many sources of random delays that it is difficult to do fine-grained performance comparisons between two bits of code.
With WSGI, you can get out of this. Your WSGI application doesn't deal with the web directly, so you can write a WSGI frontend that is just a timing and profiling harness: it mocks up WSGI requests, fires them into your application, and discards the results. Unless your WSGI application itself forks or is threaded, profiling is easy, and in any case you can now time just your application instead of the entire system.
(This works in reverse, too; you can make up a 'null WSGI application' for when you are doing performance tuning on your WSGI frontend.)
The potential fly in this ointment is that your frontend needs to initialize your application somehow. Hopefully this is something you can extract from your usual WSGI framework.
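A minimal version of such a harness might look like this (the environ keys are the standard WSGI ones; the function names and the trivial demo application are mine):

```python
import io
import time

def make_environ(url):
    """Mock up a minimal WSGI environ for a GET of the given path."""
    return {
        "REQUEST_METHOD": "GET",
        "PATH_INFO": url,
        "SERVER_NAME": "localhost",
        "SERVER_PORT": "80",
        "SERVER_PROTOCOL": "HTTP/1.0",
        "wsgi.version": (1, 0),
        "wsgi.url_scheme": "http",
        "wsgi.input": io.BytesIO(b""),
        "wsgi.errors": io.StringIO(),
        "wsgi.multithread": False,
        "wsgi.multiprocess": False,
        "wsgi.run_once": False,
    }

def time_app(app, url, count):
    """Fire count requests for url into app, discarding the results,
    and return the average seconds per request."""
    def start_response(status, headers):
        pass
    start = time.time()
    for _ in range(count):
        result = app(make_environ(url), start_response)
        for _chunk in result:   # consume and discard the body
            pass
    return (time.time() - start) / count

# A trivial demo application to exercise the harness.
def demo_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello from " + environ["PATH_INFO"].encode()]
```

For profiling instead of timing, you would wrap the same loop in the standard profile or cProfile module.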
My tuning frontend for DWiki is done this way; it's a command line program that I give a URL and how many requests to make, and it either profiles the run or tells me the time per request. It's quite convenient and easy to use, which makes it more likely that I'll bother to check the performance effects of code changes.
(Opinions may differ on whether this is a good thing; premature worrying about optimization is one of the roots of evil.)