Wandering Thoughts archives

2007-08-11

On the performance of Python longs being used as bitmaps

A comment on the previous entry got me interested in what exactly costs all the time when using Python longs as bitmaps. Some poking shows that the time & takes scales with the size of the smaller long involved, while the time | takes scales with the size of the larger long. While this is not particularly surprising, it means that setting and clearing any bit has a cost based on how big the bitmap is, i.e., what the largest set bit is.

(Checking low bits in a large bitmap is relatively cheap, since one side of the & will be small.)
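You can see this scaling for yourself with a quick and dirty timeit comparison; this is a rough sketch in modern Python terms (where plain int behaves the way long did), and the operand sizes are arbitrary:

```python
import timeit

# Rough sketch: compare x & y and x | y when the operands have very
# different sizes. & can stop at the end of the smaller operand, while
# | has to produce all the bits of the larger one.
small = (1 << 64) - 1        # a 64-bit value
large = (1 << 100000) - 1    # a roughly 100,000-bit value

ns = {'small': small, 'large': large}
t_and = timeit.timeit('small & large', globals=ns, number=10000)
t_or = timeit.timeit('small | large', globals=ns, number=10000)

# t_or should come out well above t_and on any version I know of.
print('&: %.4fs  |: %.4fs' % (t_and, t_or))
```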

There are two core performance issues with longs as bitmaps: there's no bit addressing, and you can't mutate a long in place, so any change to your bitmap requires allocating a new object. The first is the more important of the two, because it means you are creating new longs left and right and then doing big, expensive operations merely to deal with one bit.

While Python could in theory add bit addressing to longs (allowing you to get and set a specific bit), I doubt it will, since I suspect that the need is relatively esoteric and you can probably do about as well with a list of Python integers (or a C level module if you really need speed).
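As a sketch of what the list-of-integers alternative might look like, here is a hypothetical mutable bitmap built from fixed-size integer chunks; the class and chunk size are my own invention, not anything from the standard library:

```python
# Hypothetical sketch: a mutable bitmap as a list of 32-bit chunks, so
# setting or clearing a bit only touches one list element instead of
# rebuilding the whole number the way long operations must.
class IntListBitmap:
    CHUNK = 32

    def __init__(self, nbits):
        self.chunks = [0] * ((nbits + self.CHUNK - 1) // self.CHUNK)

    def set(self, bit):
        self.chunks[bit // self.CHUNK] |= 1 << (bit % self.CHUNK)

    def clear(self, bit):
        self.chunks[bit // self.CHUNK] &= ~(1 << (bit % self.CHUNK))

    def test(self, bit):
        return bool(self.chunks[bit // self.CHUNK] & (1 << (bit % self.CHUNK)))

bm = IntListBitmap(4096)
bm.set(2000)
print(bm.test(2000))   # True
bm.clear(2000)
print(bm.test(2000))   # False
```

The cost of an update no longer depends on where the largest set bit is, only on finding the right chunk.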

Sidebar: some numbers

Here's a performance chart for bmap & ~(1L << bit), where bmap has only bit and bit+1 set:

bit    time (µsec)   doing the <<   doing the ~   doing the &
0         0.59           40.1%         20.2%         39.7%
1         0.59           41.1%         21.4%         37.4%
8         0.60           40.4%         20.0%         39.6%
16        0.63           39.7%         18.2%         42.1%
32        0.62           39.1%         19.4%         41.5%
64        0.65           37.8%         20.0%         42.3%
128       0.67           37.5%         19.2%         43.3%
256       0.79           33.8%         19.2%         47.0%
512       0.94           31.2%         21.9%         46.9%
1024      1.28           26.0%         21.1%         52.9%
1536      1.61           23.3%         21.7%         55.0%
2048      2.35           20.5%         23.3%         56.2%
4096      3.74           17.7%         23.2%         59.1%

The percentages are the relative proportions of the overall time taken by the left shift, the inversion, and the binary AND. Timings are from Python 2.4.4 on a 64-bit Fedora Core 6 machine with an AMD Athlon 4600+ CPU.
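I gathered my numbers with timeit; a rough reconstruction of the breakdown (in modern Python, with plain int standing in for long, and the function name being my own) might look like this:

```python
import timeit

def breakdown(bit, number=100000):
    # bmap has only bits `bit` and `bit+1` set, as in the table above.
    ns = {'bmap': 0b11 << bit, 'bit': bit}
    # Isolate each stage by subtracting the time of the previous stages;
    # this is crude and noisy, but good enough to show the trend.
    shift = timeit.timeit('1 << bit', globals=ns, number=number)
    inv = timeit.timeit('~(1 << bit)', globals=ns, number=number) - shift
    band = timeit.timeit('bmap & ~(1 << bit)', globals=ns,
                         number=number) - shift - inv
    total = shift + inv + band
    return total, shift / total, inv / total, band / total
```

Calling breakdown(1024) versus breakdown(0) should show the AND taking a growing share of the time as the bit number rises.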

BitmapLongPerformance written at 14:19:58

2007-08-10

Clever large integers

Where bitmap is a Python long, consider the innocent looking Python expression:

bitmap = bitmap & ~(1L << bit)

This is the perfectly normal way of clearing a bit in a bitmap, so it looks like there's nothing peculiar going on here.

Except that Python longs are of indefinite (and theoretically infinite) length, and what makes this expression work on normal integers is that the two operands have the same (finite) size. Here ~(1L << bit) is a negative long, conceptually an infinite run of 1 bits with a single 0 at position bit. So Python is clearly not just doing straight binary operations here; the implementation of ~ and & for longs has to be more complicated than simply manipulating bit representations.
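In modern Python terms (where plain int behaves the way long did), the rule being emulated is that integers act as if they had an infinite two's complement representation, which gives the identity ~x == -x - 1 no matter how large x is:

```python
# Integers behave as if they had an infinite two's complement
# representation, so ~x is always -x - 1 regardless of x's size.
bit = 100
mask = ~(1 << bit)   # conceptually ...1110111...1, with a single 0 at `bit`

print(mask == -(1 << bit) - 1)                  # True: the ~x == -x - 1 rule
print((0b11 << bit) & mask == 1 << (bit + 1))   # True: & clears just that bit
```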

All of this goes to illustrate that sometimes features aren't as simple as they look.

As it turns out, this cleverness has a cost, at least in Python. If you already know that the bit is set, clearing it with bitmap - (1L << bit) starts out being about twice as fast as the AND approach (when operating on bit 0) and gets better from there. I'm ambivalent about whether or not you want to check the bit's status ahead of setting or clearing it; while it may avoid extra object allocations and deallocations, it's relatively expensive and I haven't completely convinced myself that it actually avoids all that much object churn.

(Checking plus subtraction is usually slower than the AND approach alone; it only gets slightly faster when we start operating around bit 2048 and up. Of course, large bit numbers are exactly where you probably want a faster representation for your bitmaps than a large integer, which should speed everything up.)
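To make the tradeoff concrete, here are the two approaches side by side (written with modern int in place of long; the function names are just for illustration):

```python
def clear_bit_and(bmap, bit):
    # The general approach: correct whether or not the bit is set.
    return bmap & ~(1 << bit)

def clear_bit_sub(bmap, bit):
    # The faster alternative, but ONLY correct if the bit is already set;
    # if it is clear, the subtraction borrows and corrupts lower bits.
    return bmap - (1 << bit)

bmap = 0b1011
print(bin(clear_bit_and(bmap, 1)))   # 0b1001
print(bin(clear_bit_sub(bmap, 1)))   # 0b1001: same answer when bit 1 is set
print(bin(clear_bit_sub(bmap, 2)))   # 0b111: wrong, because bit 2 was clear
```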

CleverLargeIntegers written at 23:21:47

2007-08-06

Implementing a preforking network server in Python

I recently read a great explanation of the general Apache 2 preforking server model, and found it so lucid and simple that I was immediately inspired to implement a generic preforking server myself. Their approach has a bit of overhead in parent/child communication, but it does mean that you do not have to pass file descriptors between processes; all you need is some way for the parent to talk with its children.

(Since Python's standard library doesn't have support for file descriptor passing, the pragmatic advantages of this approach are immense.)

To summarize, the core of this model is that the parent is the only thing that notices when there is a new connection, but it does not do the accept() itself; instead it picks an idle child and commands the child to handle the connection. In implementing my version I ran into some subtleties, so I figure I might as well write them down here for posterity:

  • the server socket must be non-blocking and the entire scheme must be able to cope with accept()s that fail, because this can happen. If the kids ever block in accept(), things start going horribly wrong; at a minimum you descend to a thundering herd situation.

    (Remember to set the newly accept()'d socket back to blocking; there are some platforms where it will inherit the server socket's non-blocking status.)

  • the 'accept a new connection' step has to be synchronous between the parent and the child, and the child must reply with its status only after it has done the accept(). Otherwise you can have a race between the parent returning to its select() and the child pulling the new connection off; if the parent wins the race it will order more than one child to handle the same connection.

    This means that there are only two times kids send asynchronous status messages to the parent: their initial 'I am idle' message on startup, and the messages they send after they've completed processing a connection.

  • the parent should deal with status reports from kids before trying to dispatch a new pending connection; this maximizes the chance of knowing that you have idle kids.

  • I found that I needed a status code from the child to the parent to signal 'the processing function has asked the server to shut down'. This is because the simplest place to put checks for signals to shut down is in the application's 'new connection handler' function, which is only called in kids.

    (An alternate approach is to have a second function that's called every time through the parent's main loop.)

  • because I was concerned about resilience, I added timeouts for synchronous communication (for example, commanding a child to accept a new connection and getting a status reply from it). If the timeout expires without a good answer, the parent assumes that something horrible has gone wrong and kills the child.

  • asynchronously starting kids (for example to maintain a minimum pool of idle workers) raises the issue of dealing with kids that don't start properly for some reason. My approach is to put such kids on a list until they report their initial 'idle' status. If the parent needs to dispatch a new connection and there are no idle kids, it picks a pending kid and synchronously waits for its status report; if this times out it kills the kid and tries again.

  • there are two ways of handling too many connections. The graceful way is that if the parent cannot dispatch a new connection to a kid, it stops checking the server socket until it has at least one idle kid; otherwise the parent will just continually spin between select() and failing to dispatch the new connection.

    The abrupt way to handle overload is to have the parent accept() and then immediately close the new connection if it cannot dispatch the new connection to a kid.

While the original description uses socketpairs to communicate between the parent and its kids, my implementation uses a pair of pipes because Python 2.3.3 doesn't have socket.socketpair() (it only appeared in Python 2.4).
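For concreteness, here is a hypothetical, heavily simplified sketch of the protocol described above, in modern Python. It is not the actual prefork.py; it omits the timeouts, overload handling, and dead-child cleanup that I discussed, and all of the names are mine:

```python
import os, select, socket

STATUS_IDLE = b'I'       # async: 'I am idle' (at startup and after handling)
STATUS_ACCEPTED = b'A'   # sync reply: 'I have done the accept()'
CMD_ACCEPT = b'G'        # parent -> child: go accept a connection
CMD_EXIT = b'Q'          # parent -> child: shut down

def child_loop(server_sock, cmd_r, status_w, handler):
    os.write(status_w, STATUS_IDLE)            # initial idle message
    while True:
        if os.read(cmd_r, 1) != CMD_ACCEPT:
            os._exit(0)
        try:
            conn, _addr = server_sock.accept()
        except OSError:
            # accept() can fail; report idle and let the parent retry.
            os.write(status_w, STATUS_IDLE)
            continue
        # Reply only *after* the accept(), so the parent cannot hand the
        # same connection to two children.
        os.write(status_w, STATUS_ACCEPTED)
        conn.setblocking(True)   # some platforms inherit non-blocking status
        handler(conn)
        conn.close()
        os.write(status_w, STATUS_IDLE)        # async: done, idle again

def parent_loop(server_sock, nchildren, handler, max_conns):
    server_sock.setblocking(False)
    kids = []                                  # (pid, cmd_w, status_r)
    for _ in range(nchildren):
        cmd_r, cmd_w = os.pipe()
        status_r, status_w = os.pipe()
        pid = os.fork()
        if pid == 0:
            child_loop(server_sock, cmd_r, status_w, handler)
        os.close(cmd_r)
        os.close(status_w)
        kids.append((pid, cmd_w, status_r))
    idle, handled = [], 0
    while handled < max_conns:
        # Deal with status reports from kids before dispatching anything,
        # and only watch the server socket while at least one kid is idle.
        rlist = [k[2] for k in kids if k not in idle]
        if idle:
            rlist.append(server_sock)
        ready, _, _ = select.select(rlist, [], [])
        for kid in kids:
            if kid[2] in ready:
                os.read(kid[2], 1)             # an async 'idle' report
                idle.append(kid)
        if server_sock in ready and idle:
            kid = idle.pop()
            os.write(kid[1], CMD_ACCEPT)
            # Synchronous step: wait for the post-accept() status before
            # returning to select(), to avoid the double-dispatch race.
            if os.read(kid[2], 1) == STATUS_ACCEPTED:
                handled += 1                   # kid is busy now
            else:
                idle.append(kid)               # accept() failed; retry
    for pid, cmd_w, _status_r in kids:
        os.write(cmd_w, CMD_EXIT)
        os.waitpid(pid, 0)
```

A real version would wrap the reads and writes with timeouts, cope with EINTR, and kill kids that stop responding, as described above.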

I'm quite happy with how relatively easy it was to write an implementation of this approach and the performance of my code. It is now the core of DWiki's SCGI server, and has handily dealt with the performance issues of the previous approach without having any of the drawbacks of the other solutions I've considered.

(And it stood up fine to a recent pounding.)

If anyone is interested, the code is available as prefork.py, unfortunately without unit tests but hopefully with enough documentation to be usable (the comments probably need some revision). It is currently GPL2 because I know the verbiage to slap on a file for that license; I would be happy to redo it under the Python license once I know the right incantation.

PreforkingNetworkServer written at 22:19:14

2007-08-05

Thinking about more text formatting for DWiki

What DWiki is used to write about has drifted a fair bit from what I originally wrote it for, and some of the choices I originally made no longer seem so ideal. I've been thinking for a while that DWikiText, the wiki-text formatting language here, needs some more formatting features; this entry is at least in part me thinking out loud about them.

So, here are the new features I think I want:

  • 'literal words', words that appear exactly as written, without any special characters in them doing things.

    I really want a way of writing about the x86_64 architecture without having to quote it all the time; currently I write it as [[x86_64|]], and you don't want to know how I had to write that quoted version.

  • a more aesthetic way of quoting things in general; the current approach is too ugly to make me happy if it's used very often. I think the syntax I like best for this is ``...''.

  • a way of doing literal text that linewraps, because almost always this is what I want. If I was doing DWikiText over from scratch, this would be the default for literal text and non-linewrapping literal text would be the special case, but as it is I am basically stuck.

  • a way of turning off special characters in text, especially the _ character. Underscore was a great special character when I was planning to mostly write about Unix commands; it is a bad one when I am talking about programs, because it's a common character in identifiers.

  • I like the idea of some sort of simple macro scheme, on the order of what Mark Dominus did for his Blosxom installation, for much the same reason as he wrote his.

I've come up with trial implementations of most of these features (I've actually been sitting on the last two for some time), but so far I can't decide if I really like them all. Since once I start actually using them, I am more or less stuck (especially if they appear in comments), I feel I need to be extra sure before I commit to them.

(I have already been stuck with one or two bad choices that linger only for backwards compatibility with existing text.)

I could avoid a fair amount of the problem if I was willing to move seriously to using Unicode characters in my entries. For example, my previous ugly example could be rewritten as a big quoted block with literal Unicode arrow characters. Using Unicode characters would also give me lots of choices for special formatting characters, and would actually make escaping them reasonably acceptable if I ever needed to write them literally; I could just use the existing mechanism I have for arbitrary Unicode characters.

(And I am peculiar enough to get amusement out of using, say, 「 and 」 as my literal quotation characters.)

MoreDWikiFormatting written at 22:52:45

2007-08-03

Using WSGI for performance tuning

Doing profiling and performance tuning on a web application is normally a big pain. It's hard to profile things when processes (or threads) are always coming and going, and the typical web environment has so many sources of random delays that it is difficult to do fine-grained performance comparisons between two bits of code.

With WSGI, you can get out of this. Your WSGI application doesn't deal with the web directly, so you can write a WSGI frontend that is just a timing and profiling harness: it mocks up WSGI requests, fires them into your application, and discards the results. Unless your WSGI application itself forks or is threaded, profiling is easy, and in any case you can now time just your application instead of the entire system.

(This works in reverse, too; you can make up a 'null WSGI application' for when you are doing performance tuning on your WSGI frontend.)

The potential fly in this ointment is that your frontend needs to initialize your application somehow. Hopefully this is something you can extract from your usual WSGI framework.
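A minimal sketch of such a harness might look like the following; the environ is a hand-built mockup per the WSGI spec, and the function names are my own:

```python
import io
import time

# Hypothetical sketch of a WSGI timing harness: build a fake environ,
# call the application directly, and throw the response away.
def make_environ(url):
    return {
        'REQUEST_METHOD': 'GET',
        'PATH_INFO': url,
        'QUERY_STRING': '',
        'SERVER_NAME': 'localhost', 'SERVER_PORT': '80',
        'SERVER_PROTOCOL': 'HTTP/1.0',
        'wsgi.version': (1, 0), 'wsgi.url_scheme': 'http',
        'wsgi.input': io.BytesIO(b''), 'wsgi.errors': io.StringIO(),
        'wsgi.multithread': False, 'wsgi.multiprocess': False,
        'wsgi.run_once': False,
    }

def time_app(app, url, count):
    def start_response(status, headers, exc_info=None):
        pass
    start = time.time()
    for _ in range(count):
        result = app(make_environ(url), start_response)
        for _chunk in result:          # consume and discard the body
            pass
        if hasattr(result, 'close'):
            result.close()
    return (time.time() - start) / count

# A 'null application', for timing a frontend instead of an application:
def null_app(environ, start_response):
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'null\n']

print('%.6f sec/request' % time_app(null_app, '/', 1000))
```

Wrapping this in a command line program that takes a URL and a request count gets you the tool described below.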

My tuning frontend for DWiki is done this way; it's a command line program that I give a URL and how many requests to make, and it either profiles the run or tells me the time per request. It's quite convenient and easy to use, which makes it more likely that I'll bother to check the performance effects of code changes.

(Opinions may differ on whether this is a good thing; premature worrying about optimization is one of the roots of evil.)

WSGIForTuning written at 00:54:38

