Wandering Thoughts archives

2006-03-09

Making things simple for busy webmasters

It's always nice when people's software saves me from having to wonder if they're up to no good by handing out obvious signs of it. Take, for example, the spate of people whose web crawling software advertises itself by having the User-Agent string of:

User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)

Evidently no one told them not to stutter. (There are a couple of variations in what they claim to be, but that one is the most common. Needless to say, no real User-Agent string (MSIE's included) has an extra 'User-Agent: ' on the front.)

The IP addresses that sourced these are scattered all over; a couple of them are (still) on the XBL, and a couple are in SPEWS.

(And I give bonus points to the person with the User-Agent string "W3C standards are important. Stop fucking obsessing over user-agent already.", which I stumbled over while scanning our logs today. I can certainly agree with the sentiment.)

Another good one is the stealth spider that sends a completely blank Referer: header, instead of omitting it; it stands out like a sore thumb in my log scans. This comes from all over, with 157 different IP addresses over the past 28 days or so, 50 of them currently listed in the XBL.

web/ObviousNogoodniks written at 16:42:15; Add Comment

Closures versus classes for capturing state

Possibly like a lot of other people, I have a 'thread pool' Python module for doing decoupled asynchronous queries of things. The basic interface looks like this:

wp = WorkPool(HOWMANY)
for w in what:
  wp.enqueue(wrap_up(w))
wp.waitDone()
res = wp.getanswers()

The important operation is .enqueue(<THING>), which queues up <THING> to be called in a worker thread; its return value is saved, to be disgorged later by .getanswers().

To keep things simple, <THING> is called with no arguments. But usually what I really want to do is to call the same function with a bunch of different arguments, for example calling is_in_sbl for a bunch of IP addresses. To make this work, I need to capture the argument (or arguments) and create a simple callable object that remembers them; in this example, that's done by wrap_up.

There's two ways of wrapping up this sort of state in Python: classes (okay, objects) and closures. In the class-based approach, wrap_up is actually a class; its __init__ method saves w, and its __call__ method uses the saved w to do the actual function call. By contrast, in a closure-based approach wrap_up is a plain function that creates and returns closures, like this:

def wrap_up(w):
  return lambda: real_func(w)

The class approach is the more obvious and natural one, because the state saving is explicit. But when I wrote two structurally identical programs, the first one using a class and the second one with closures, I found that I plain liked closures better.

For me, the drawback of the class based approach is that it's too verbose. The same wrap_up would be:

class wrap_up:
  def __init__(self, w):
    self.w = w
  def __call__(self):
    return real_func(self.w)

(and it gets longer if you have more than one argument.)

I have a visceral reaction to verbosity; in cases like this it feels like make-work. The retort is that closures are far more magic than classes for most people and it's better to avoid magic, even if the result is more verbose.

For now, my view is that programming should be pleasant, not annoying, and that the verbosity of the class-based approach to state capture annoys me. So closures it is.

(Perhaps the real answer is that WorkPool should have an .enqueue_call(func, *args, **kwargs) method, as well as the plain .enqueue() one. This strays towards the Humane Interfaces arguments, though, so I'm a bit nervous.)

python/CapturingState written at 02:50:11; Add Comment


Page tools: See As Normal.
Search:
Login: Password:
Atom Syndication: Recent Pages, Recent Comments.

This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.