Wandering Thoughts archives

2014-03-17

Simple versus complex marshalling in Python (and benchmarks)

If you have an external caching layer in your Python application, any caching layer, one of the important things that dictates its speed is how fast you can turn Python data structures into byte blobs, stuff them into the cache, and then get byte blobs back from the cache and turn them back into data structures. Many caches will store arbitrary blobs for you, so your choice of marshalling protocol (and code) can make a meaningful difference. And there are a lot of potential options: marshal, cPickle, JSON, Google protobuf, msgpack, and so on.

One of the big divisions here is what I could call the JSON versus pickle split, namely whether you can encode and decode something close to full Python objects or whether you can only encode and decode primitive types. All else being equal it seems like you should use simple marshalling, since creating an actual Python class instance necessarily has some overhead over and above just decoding primitive types. But this leaves you with a question; put simply, how is your program going to manipulate the demarshalled entities?
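As a quick illustration of the split (a sketch; Point is just a made-up example class, and the addresses will obviously differ on your machine):

>>> import cPickle, json
>>> class Point(object):
...     def __init__(self, x, y):
...         self.x, self.y = x, y
...
>>> p = Point(1, 2)
>>> cPickle.loads(cPickle.dumps(p)).x    # the full object comes back
1
>>> json.dumps(p)
Traceback (most recent call last):
  ...
TypeError: <__main__.Point object at 0x...> is not JSON serializable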

In many Python programs these entities would normally be objects, partly because objects are the natural primitive of Python (among other reasons, classes provide convenient namespaces). This basically leaves you with two options. If you work with objects but convert them to and from simple types around the cache layer, you've really built your own two-stage complex marshalling system. If you work with simple entities throughout your code you're probably going to wind up with more awkward and un-Pythonic code. In many situations what I think you'll really wind up doing is converting those simple cache entities back to objects at some point (and converting from objects to simple cache entities when writing cache entries).
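Here is a minimal sketch of what that first option, the two-stage system, tends to look like in practice (the Item class and the cache's set/get interface here are hypothetical):

import json

class Item(object):
    def __init__(self, name, count):
        self.name, self.count = name, count

    def to_simple(self):
        # first stage: object down to primitive types
        return {"name": self.name, "count": self.count}

    @classmethod
    def from_simple(cls, d):
        # and back up: primitive types to object
        return cls(d["name"], d["count"])

def cache_put(cache, key, item):
    # second stage: primitive types to a byte blob
    cache.set(key, json.dumps(item.to_simple()))

def cache_get(cache, key):
    blob = cache.get(key)
    return Item.from_simple(json.loads(blob)) if blob is not None else None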

Which brings me around to the subject of benchmarks. You can find a certain number of marshalling benchmarks out there on the Internet, but what I've noticed is that they're basically all benchmarking the simple marshalling case. This is perfectly understandable (since many marshalling protocols can only do primitive types) but not quite as useful for me as it looks. As suggested above, what I really want to get into and out of the cache in the long run is some form of objects, whether the marshalling layer handles them for me or I have to do the conversion by hand. The benchmark that matters for me is the total time starting from or finishing with the object.
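A sketch of what measuring that total time might look like, reusing the hypothetical Item class from above (the item instance and the mymod module are made up too):

import timeit

# cPickle handles the object directly, so the round trip is one step.
t_pickle = timeit.timeit(
    "cPickle.loads(cPickle.dumps(item, -1))",
    setup="import cPickle; from mymod import item",
    number=100000)

# json needs the by-hand conversions on both sides, and those
# conversions are part of the total cost being measured.
t_json = timeit.timeit(
    "Item.from_simple(json.loads(json.dumps(item.to_simple())))",
    setup="import json; from mymod import Item, item",
    number=100000)

print "cPickle:", t_pickle, "json + conversion:", t_json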

With that said, if caches are going to be an important part of your system it likely pays to think about how you're going to get entries into and out of them efficiently. You may want to have deliberately simplified objects near the cache boundaries that are mostly thin wrappers around primitive types. Plus Python gives you a certain number of brute force hacks, like playing games with obj.__dict__.
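For example, one such brute force hack (a sketch, again using the hypothetical Item class from above):

import cPickle

def encode(obj):
    # marshal just the instance dictionary, not the full object
    return cPickle.dumps(obj.__dict__, -1)

def decode(blob):
    # recreate the object without running __init__, then put
    # the attributes back wholesale
    obj = Item.__new__(Item)
    obj.__dict__.update(cPickle.loads(blob))
    return obj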

(I don't have any answers here, or benchmark results for that matter. And I'm sure there's situations where it makes sense to go with just primitive types and more awkward code instead of using Python objects.)

Sidebar: The other marshalling benchmark problem

Put simply, different primitive types generally encode and decode at different speeds (and the same is true for different sizes of primitive types like strings). This means you need to pay attention to what people are encoding and decoding, not just what the speed results are; if they're not encoding something representative of what you want to, all bets may be off.

(My old tests of marshal versus cPickle showed some interesting type-based variations of this nature.)
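A quick way to see this sort of variation for yourself (a sketch; the actual numbers will depend on your Python version and your hardware):

import timeit

# encode several different primitive values with the same protocol;
# the relative speeds can differ quite a bit by type and size
for label, expr in [("int", "12345"),
                    ("short string", "'abc'"),
                    ("long string", "'abc' * 1000")]:
    t = timeit.timeit("cPickle.dumps(v, -1)",
                      setup="import cPickle; v = %s" % expr,
                      number=100000)
    print "%-12s %f" % (label, t)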

You can also care more about decoding speed than encoding speed, or vice versa. My gut instinct is that you probably want to care more about decoding speed if your cache is doing much good, because getting things back from the cache (and the subsequent decodes) should be more frequent than putting things into it.

SimpleVsComplexMarshalling written at 23:00:53

2014-03-13

The argument about unbound methods versus functions

As mentioned in How functions become bound or unbound methods, in Python 2 when you access a function on a class (eg you do cls.func) it becomes an 'unbound method'. In Python 3 this is gone, as Peter Donis mentioned in a comment on that entry; if you do cls.func you get back the plain function. I'm not entirely sure how I feel about this, so let's start by asking a relevant question: what's the difference between an unbound method and the underlying function?

The answer is that calling an unbound method adds extra type checking on the first argument. When cls.func is an unbound method and you call it, the first argument must be an instance of cls or a subclass of it (ie, something for which isinstance() would return True). If it isn't you get a TypeError (much like the ones we saw back here). Calling the function directly has no such requirement; you can feed it anything at all, even though as a method of the class it's probably expecting an instance of the class as its first argument.
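A quick interactive demonstration, using the A class from the entry below:

>>> a = A()
>>> A.fred(a, 10)            # a is an instance of A, so this is fine
fred <__main__.A object at 0x1b9c210> 10
>>> A.fred("not an A", 10)
Traceback (most recent call last):
  ...
TypeError: unbound method fred() must be called with A instance as first argument (got str instance instead)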

I'll admit that it's an open argument whether this type checking is a good thing or not. It's certainly atypical for Python, and the conversion from a plain function into an unbound method is a bit surprising to people. There aren't that many situations in Python where making something an attribute of a simple class magically changes it into something else; normally you expect 'A.attr = something; A.attr' to give you 'something' again. The argument in defense of the checking is that it's a useful safety measure for functions that are almost certainly coded with certain assumptions, and that directly calling a class's methods on the class itself is not exactly a common thing.

Now that I've written this entry, I can see why Python 3 took unbound methods out. They might be handy but they're not actually essential to how things work (unlike bound methods) and Python's mechanics are mostly about what actively has to be there. I guess my view is now that I don't mind them in Python 2 but I doubt I'm going to miss them in Python 3 (if I ever do anything with Python 3).

UnboundMethodsVsFunctions written at 01:18:25

2014-03-11

How functions become bound or unbound methods

Suppose that you have a class:

class A(object):
    def fred(self, a):
        print "fred", self, a

Then we have:

>>> a = A()
>>> A.fred
<unbound method A.fred>
>>> b = a.fred
>>> b
<bound method A.fred of <__main__.A object at 0x1b9c210>>

An unbound method is essentially a function with some trimmings. A 'bound method' is called that because the first argument (ie self) is already set to a; you can call b(10) and it works just the same way as if you had done a.fred(10) (this is actually necessary given how CPython operates). So far so good, but how does Python make this all work?

One way that people sometimes explain how Python makes this work is to say that A.fred has been turned into a Python descriptor. This is sort of true but it is not quite the full story. What is really going on is that functions are leading a double life: functions are also descriptors. All functions are descriptors all of the time, whether or not they're in a class. At this point you might rationally ask how a bare function (outside of a class) manages to still work; after all, when you look at it or try to call it, shouldn't the descriptor stuff kick in? The answer is that descriptors only work inside classes. Outside of classes, descriptors just sort of sit there and you can access them without triggering their special behavior; in the case of functions, this means that you can call them (and look at their attributes if you feel so inclined).
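You can see the descriptor protocol at work directly if you poke at the class dictionary (a transcript sketch, using the A and a from above; the addresses will differ for you):

>>> A.__dict__['fred']                   # raw dict access skips the descriptor
<function fred at 0x1b9a5f0>
>>> A.__dict__['fred'].__get__(a, A)     # what a.fred does under the hood
<bound method A.fred of <__main__.A object at 0x1b9c210>>
>>> A.__dict__['fred'].__get__(None, A)  # what A.fred does
<unbound method A.fred>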

So the upshot is that if you look at a function outside of a class, it is a function and you can do all of the regular functiony things with it. If you look at it inside a class it instantly wraps itself up inside a bound or unbound method (which you can then pry the original function back out of if you're so inclined). This also neatly explains why other callables don't get wrapped up as bound or unbound methods; they aren't (normally) also descriptors that do this.
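For instance, an ordinary callable instance stored on a class doesn't get wrapped (Adder here is a made-up example):

>>> class Adder(object):
...     def __call__(self, x, y):
...         return x + y
...
>>> class B(object):
...     plus = Adder()
...
>>> B.plus                # no wrapping; Adder has no __get__
<__main__.Adder object at 0x1b9d3d0>
>>> B().plus(1, 2)        # called as-is, no instance gets inserted
3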

This is rather niftier than I expected it to be when I started digging. I'm impressed with Python's cleverness here; I would never have expected function objects to be living a double life. And I have to agree that this is an elegantly simple way to make everything work out just right.

(This entry was inspired by a question Pete Zaitcev asked, which started me wondering about how things actually worked.)

PS: All of this is actually described in the descriptor documentation in the Functions and Methods section. I just either never read that or never fully understood it (or forgot it since then).

Sidebar: Why CPython has to expose bound methods

Given how CPython makes calls, returning bound methods all the time is actually utterly essential. CPython transforms Python code to bytecode and in its bytecode there is no 'call <name>' operation; instead you look up <name> with a general lookup operation and then call the result. Since the attribute lookup doesn't know what the looked up value is going to be used for, it has to always return a bound method.
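You can see this in the bytecode with the dis module (output from a Python 2.7 interpreter; the details vary between versions):

>>> import dis
>>> def callit(a):
...     return a.fred(10)
...
>>> dis.dis(callit)
  2           0 LOAD_FAST                0 (a)
              3 LOAD_ATTR                0 (fred)
              6 LOAD_CONST               1 (10)
              9 CALL_FUNCTION            1
             12 RETURN_VALUE

The LOAD_ATTR is the general lookup; only afterwards does CALL_FUNCTION call whatever it produced.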

Of course, bound methods are also the right thing to do with method functions in general if you believe that functions are first class entities. It'd be very strange for 'b = a.fred; a.fred(10); b(10)' to have the two function calls behave differently.

(The argument over returning unbound methods instead of the bare function is a bit more abstract but I think it's probably the right thing to do.)

HowFunctionsToMethods written at 23:50:57

