Wandering Thoughts archives

2006-05-29

An obvious way to do bulk initialization of dictionaries

Every so often in my Python programs I need to initialize a dictionary with a whole bunch of values (and then pass it off somewhere). For a long time, my usual approach to this was:

d = {}
d['a'] = b.what
d['c'] = foo(c)
....

Recently I stumbled over the better way to do this, which is embarrassingly obvious in retrospect:

d = {
      'a': b.what,
      'c': foo(c),
      'e': bar(f, 28),
      ....
      }

As my example shows, the initial values can of course be any expression, not just simple values (which has been one of the reasons I tended to wind up writing the 'd['a'] = b.what' form). And with conditional expressions (either the current 'A and B or C' hack or the real version that will show up in Python 2.5), you can go even further in what can be swallowed into a one-liner initializer.
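A quick sketch of a conditional expression inside a dictionary literal; the names here are invented for illustration. The 'A and B or C' hack works in 2.4-era Python (but only as long as B can never be falsy), and 2.5 adds the real 'B if A else C' form:

```python
# Made-up names, purely for illustration.
debug = True
d = {
    'level': debug and 'verbose' or 'quiet',   # the 2.4-era and/or hack
    'retries': 3 if debug else 0,              # the real form in 2.5+
}
print(d)
```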

Of course, you can also use this to add several things to an existing dictionary:

d.update({
           'a': (b1, b2),
           'c': foo(d),
           ...
           })

(Although at this point I start thinking about creating a temporary dictionary to stuff all the values in and then doing 'd.update(tempd)', because otherwise the code looks a bit peculiar to me.)
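A sketch of that temporary-dictionary variant, with invented names; one nice property is that expressions in the literal still see the old contents of d:

```python
# Invented names, for illustration only.
d = {'existing': 1}
tempd = {
    'a': ('b1', 'b2'),
    'c': len(d),    # expressions here can still refer to the old d
}
d.update(tempd)
print(d)
```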

It's humbling to keep discovering Python idioms like this, even after years of off and on programming in Python. Since I often discover them by reading other people's code, I probably should make more of an effort to seek out and read good Python code.

(I believe I stumbled over this idiom in someone's WSGI server code.)

BulkDictionaryInitialization written at 00:43:10

2006-05-20

Things that irritate me about Python's socket module

I am afraid that I am too busy listening to the Men Without Hats to write too much tonight, so I am just going to trot out some of my irritations with Python's socket module. All of these are on Unix machines.

(I suspect that portability to Windows is the reason for some of these, but it doesn't mean that I'm not irritated by them.)

  • socket.error. Let me count the ways:

    • most instances should actually be IOError or OSError instead.
    • it has two entirely different formats; often the only sensible thing you can do with a socket error is to str() it.

    At a minimum there could be a socket.ioerror sub-class, like socket.herror, that guaranteed a single (errno, string) format.
  • sockets do not support file-like .read() and .write(), so all code you write has to know that it is dealing specifically with a socket. Almost all code doesn't care, and would be better off without this difference. (A great deal of code is probably incorrectly using .send() instead of .sendall() as a result, too.)

  • SSL sockets reverse this; they have .read() and .write(), but not the usual socket set. So code you write has to care whether your socket connection is SSL-wrapped or not.
  • SSL sockets lack any means of closing down the connection. Not only do you not get a .shutdown(), you don't even get a .close(). Apparently you are supposed to shoot them in the head or something.
  • for extra fun, an SSL connection throws an exception when it closes down. Even if it closes in a perfectly orderly manner because the other end told it using the right SSL magic.
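One partial workaround for the missing file-like interface, in both 2006-era and modern Python, is socket.makefile(), which wraps a socket in an object with .read() and .write() so downstream code need not know it is talking to a socket. A minimal sketch:

```python
import socket

# socketpair() gives two connected sockets; makefile() wraps each in a
# file-like object so the rest of the code can use the file interface.
a, b = socket.socketpair()
wf = a.makefile('wb')
rf = b.makefile('rb')

wf.write(b'hello\n')
wf.flush()           # makefile() buffers writes; flush before the read
line = rf.readline()
print(line)
```

(This doesn't help with SSL sockets, which is part of the irritation.)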

I have a Python program that tries to do a simple network cat for SSL connections; these interface issues make it absurdly annoying, and the comments are full of grumpy rants about it. (The program does work, and I suppose I should put it up on the net sometime.)

Despite all this, I have to say that the socket module makes dealing with the BSD sockets API relatively simple and clear while keeping pretty much all of the features available. Writing socket-using programs in Python is significantly easier than doing so in C. (And somewhat easier than doing so in Perl, because Perl forces a somewhat lower level view of the whole mess.)

(Update: I think I've unfairly maligned Perl in my aside; see the comments.)

SocketModuleIrritations written at 03:29:45

2006-05-16

A Python limit I never expected to run into

I've known for a while that the Python interpreter has a few little internal limits and implementation quirks. I can accept that; limits on the sizes of internal objects are not unexpected, and you have to draw the line somewhere.

But I was surprised to discover that the Python interpreter has a limit on how many arguments you can write in a function call; 255, as it happens.

(The limitation is on explicit arguments in the source code; you can call functions with many more arguments if you build a list and then use 'func(*lst)' notation.)

You might ask how anyone writes a function call with 255 arguments. In my case, one argument at a time; the function in question compiles a bunch of IP netblocks and ranges, in string form, into a set type object that other IP addresses get matched against. When I was only putting a few netblocks in the set, having them be function arguments made sense. (In fact I think that it still makes sense; I have to list them somehow, unless I exile them to a separate data file, so I might just do it directly as function arguments.)

I probably should have noticed that something was wrong earlier, but I'm bad at noticing my programs growing unless they do it in large jumps. If they just grow by a function here and some entries in a list there, without forcing me to step back for an overall view, a few hundred lines can transform into a two thousand line hulk without me really noticing.

The nitty gritty details

Because I got curious and looked it up in the CPython source: the actual limit is 255 positional arguments plus an additional 255 keyword arguments. It exists because of how the argument counts are encoded into the Python bytecode; each one is effectively restricted to a byte. Explicit func(*lst) style function invocation sidesteps this.

Some quick experimentation with compile() suggests that there is no limit on how many parameters a function can be declared with.
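A quick sketch of poking at the limit (the names here are illustrative): generate a call with 300 explicit arguments as source text and see whether compile() accepts it. On the CPython versions of this era it fails with "more than 255 arguments", while the func(*lst) form sidesteps the bytecode encoding entirely:

```python
# Build source text for a call with 300 explicit arguments.
src = "max(%s)" % ", ".join(str(i) for i in range(300))
try:
    compile(src, "<test>", "eval")
    explicit_ok = True    # later CPython versions lifted this limit
except SyntaxError:
    explicit_ok = False   # the 255-argument bytecode limit in effect

# The *lst form works regardless of the explicit-argument limit.
lst = list(range(300))
print(max(*lst))
```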

MaxFunctionArgs written at 00:31:58

2006-05-12

The problem with preforking Python network servers

I've been thinking about ways around the practical cost of forking in Python. There are two common alternatives: preforking servers and threads. However, both of them have issues that make me unhappy with them.

The best setup for a preforking SCGI server is a central dispatcher that parcels new connections out to a pool of worker processes; this requires the ability to pass file descriptors to other processes. While Unix can do this (with SCM_RIGHTS messages over Unix domain sockets), Python doesn't support this part of the Unix sockets API.
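(Python itself eventually grew this ability: socket.sendmsg() and socket.recvmsg() with SCM_RIGHTS support arrived in Python 3.3. As a sketch of what descriptor passing looks like with that later API; the helper names are invented here:)

```python
import array, os, socket

def send_fd(sock, fd):
    # Attach one file descriptor as SCM_RIGHTS ancillary data; the
    # one-byte payload just gives the message something to carry.
    sock.sendmsg([b"x"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS,
                           array.array("i", [fd]))])

def recv_fd(sock):
    # Receive the message plus enough ancillary space for one int fd.
    fds = array.array("i")
    msg, ancdata, flags, addr = sock.recvmsg(1,
                                             socket.CMSG_LEN(fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            fds.frombytes(data[:fds.itemsize])
            return fds[0]
    raise RuntimeError("no descriptor received")
```

A master process would send_fd() each accepted connection down a Unix domain socket to an idle worker, which recv_fd()s it and serves the request.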

This leaves you with the preforked workers all sitting around waiting in select() for a new SCGI connection or instructions from the master process (such as 'please exit now'). When a new SCGI connection comes in, all of them wake up in a thundering herd; one of them wins the race to accept() the new connection and everyone else goes back to select() to wait. The more worker processes, the bigger the herd.

Pragmatically, the thundering herd issue is unlikely to be noticed on a modern computer, partly because you don't want to run that many worker processes anyway. But its mere existence annoys me, and the lack of a central dispatcher means that you have to pre-start all the workers and can't start and stop them based on connection flux. (This has a silver lining: just starting a fixed number of workers and keeping them running is less code.)
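The worker side of this scheme can be sketched roughly as follows; the function and variable names are invented for illustration. Every worker blocks in select() on the shared listening socket plus a control pipe from the master, and whichever worker wins the race to accept() handles the connection while the losers go back to waiting:

```python
import select

def worker_loop(listen_sock, control_fd, handle):
    # One preforked worker's main loop (illustrative sketch).
    while True:
        ready, _, _ = select.select([listen_sock, control_fd], [], [])
        if control_fd in ready:
            return              # the master said 'please exit now'
        if listen_sock in ready:
            try:
                conn, _addr = listen_sock.accept()
            except OSError:     # another worker won the race
                continue
            handle(conn)
            conn.close()
```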

I may still code a preforking version of the SCGI server just to see how it goes and for the experience, but I suspect I'm not going to run it in production. Systems speed up, but unappetizing code is forever.

The problems with threads

There are several annoyances with threads:

  • I'd lose process isolation, so a code bug could rapidly contaminate the entire SCGI server.
  • This isn't a good match for Python threads because my SCGI server is mostly CPU bound.
  • Due to the Linux NPTL thread issue, the process would use up a lot of virtual memory, and it just makes me twitchy to see my SCGI server sitting around using many megabytes of virtual memory.

I could do a threaded or thread-pool based SCGI server, but I'd be left with the feeling that it was a big hack. It'd barely be a step up from a single-threaded server that only handled one connection at a time. (There's some disk IO and network IO that multiple threads might be able to take advantage of, but probably not too much. Unfortunately, measuring true parallelism opportunities is a bit tricky.)

PreforkingProblem written at 02:05:59

2006-05-10

The practical cost of forking in Python

I spent part of the other day working to speed up an SCGI based program, and wound up hitting a vivid illustration of the practical cost of forking in Python. I'll start with the numbers:

  • 5.3 milliseconds per request when the program forked a child to handle each request.
  • 1.1 milliseconds per request when the forking was stubbed out so each request ran in the main process.

Benchmarking was done with Apache's ab, running on the same machine (and with only one request at a time, since the non-forking version obviously can't handle concurrent requests).

These numbers are pure SCGI overhead; the program had its usual response handler stubbed out to a special null handler that just returned a short hard-coded response, and it was directly connected to lighttpd. (Some work suggests that most of the remaining 1.1 millisecond is in decoding the request's initial headers; I'm not sure how to speed this up.)

Since I have a thread pool package lying around, I hacked the SCGI server up to use it instead of forking; the performance stayed around 1.1 milliseconds per request, somewhat to my surprise.

I don't have any explanation of why Python takes 4.2 milliseconds more when I fork for each request. The direct cost of fork() with all of the program's modules imported is about 1.3 milliseconds (the fork tax varies with how many dynamic libraries the Python process has loaded, so it's important to measure with your program's actual set of imports). Forking does require extra management code to do things like track and reap dead children, but 2.9 milliseconds seems a bit high for it.
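A rough sketch of how one might measure the bare fork tax, not the original benchmark: time a series of fork()+_exit()+waitpid() round trips and report milliseconds per fork.

```python
import os, time

def fork_cost(n=100):
    # Time n fork/exit/reap round trips; the child does nothing,
    # mirroring a request handler that has been stubbed out.
    start = time.time()
    for _ in range(n):
        pid = os.fork()
        if pid == 0:
            os._exit(0)
        os.waitpid(pid, 0)
    return (time.time() - start) * 1000.0 / n
```

Since the fork tax varies with how many shared libraries the process has mapped, this should be run with the program's real set of imports already loaded.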

PythonForkCost written at 02:27:52
