Wandering Thoughts archives

2014-03-17

Simple versus complex marshalling in Python (and benchmarks)

If you have an external caching layer in your Python application, any caching layer, one of the important things that dictates its speed is how fast you can turn Python data structures into byte blobs, stuff them into the cache, and then get byte blobs back from the cache and turn them back into data structures. Many caches will store arbitrary blobs for you, so your choice of marshalling protocols (and code) can make a meaningful difference. And there are a lot of potential options: marshal, cPickle, JSON, Google protobuf, msgpack, and so on.

One of the big divisions here is what I'll call the JSON versus pickle split, namely whether you can encode and decode something close to full Python objects or whether you can only encode and decode primitive types. All else being equal it seems like you should use simple marshalling, since creating an actual Python class instance necessarily has some overhead over and above just decoding primitive types. But this leaves you with a question: put simply, how is your program going to manipulate the demarshalled entities?

In many Python programs these entities would normally be objects, partly because objects are the natural primitive of Python (among other reasons, classes provide convenient namespaces). This basically leaves you with two options. If you work with objects but convert them to and from simple types around the cache layer, you've really built your own two-stage complex marshalling system. If you work with simple entities throughout your code you're probably going to wind up with more awkward and un-Pythonic code. In many situations what I think you'll really wind up doing is converting those simple cache entities back to objects at some point (and converting from objects to simple cache entities when writing cache entries).
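To make the 'two-stage' point concrete, here is a minimal sketch of what converting objects to and from simple types around the cache layer tends to look like. The `User` class, its fields, and the `to_simple` / `from_simple` names are all made up for illustration; only the `json` calls are real library API.

```python
import json

class User:
    """Hypothetical application object (not from any real program)."""
    def __init__(self, name, uid, groups):
        self.name = name
        self.uid = uid
        self.groups = groups

    def to_simple(self):
        # Stage one on the way in: object -> primitive types only.
        return {"name": self.name, "uid": self.uid, "groups": list(self.groups)}

    @classmethod
    def from_simple(cls, d):
        # Stage two on the way out: primitive types -> object.
        return cls(d["name"], d["uid"], d["groups"])

def encode_for_cache(user):
    # Stage two on the way in: primitive types -> byte blob.
    return json.dumps(user.to_simple()).encode("utf-8")

def decode_from_cache(blob):
    # Stage one on the way out: byte blob -> primitive types.
    return User.from_simple(json.loads(blob.decode("utf-8")))

blob = encode_for_cache(User("alice", 1000, ["staff", "wheel"]))
user = decode_from_cache(blob)
```

The marshalling protocol itself only ever sees dictionaries, lists, strings and numbers, but the program as a whole is still paying for object construction on every cache read.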

Which brings me around to the subject of benchmarks. You can find a certain number of marshalling benchmarks out there on the Internet, but what I've noticed is that they're basically all benchmarking the simple marshalling case. This is perfectly understandable (since many marshalling protocols can only do primitive types) but not quite as useful for me as it looks. As suggested above, what I really want to get into and out of the cache in the long run is some form of objects, whether the marshalling layer handles them for me or I have to do the conversion by hand. The benchmark that matters for me is the total time starting from or finishing with the object.
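A rough sketch of the kind of benchmark I mean, timing the whole trip from object to blob and back to object, might look like the following. The `Entry` class and its contents are invented for the example, and this uses Python 3's `pickle` (which uses the C implementation when it's available) as a stand-in for cPickle. It produces timings on whatever machine you run it on; it is not a set of results.

```python
import json
import pickle
import timeit

class Entry:
    """Hypothetical cached object with only primitive attributes."""
    def __init__(self, key, value, tags):
        self.key = key
        self.value = value
        self.tags = tags

def pickle_round_trip(obj):
    # Complex marshalling: the protocol rebuilds the instance itself.
    return pickle.loads(pickle.dumps(obj, pickle.HIGHEST_PROTOCOL))

def json_round_trip(obj):
    # Simple marshalling plus the by-hand conversion back to an object.
    blob = json.dumps({"key": obj.key, "value": obj.value, "tags": obj.tags})
    d = json.loads(blob)
    return Entry(d["key"], d["value"], d["tags"])

def time_round_trips(number=10000):
    entry = Entry("page:42", "x" * 200, ["a", "b", "c"])
    return {
        "pickle": timeit.timeit(lambda: pickle_round_trip(entry), number=number),
        "json+convert": timeit.timeit(lambda: json_round_trip(entry), number=number),
    }

if __name__ == "__main__":
    for name, seconds in time_round_trips().items():
        print(f"{name}: {seconds:.4f}s")
```

The important part is that both timed functions start with an `Entry` and end with an `Entry`, so the cost of the hand conversion is counted instead of being left out of the comparison.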

With that said, if caches are going to be an important part of your system it likely pays to think about how you're going to get entries into and out of them efficiently. You may want to have deliberately simplified objects near the cache boundaries that are mostly thin wrappers around primitive types. Plus Python gives you a certain amount of brute force hacks, like playing games with obj.__dict__.
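As an illustration of the `obj.__dict__` sort of hack, here is a sketch of a thin wrapper object whose instance dictionary is dumped and restored wholesale. The `CacheEntry` class and the function names are hypothetical, and this assumes every attribute is a type that JSON can encode and that the class doesn't use `__slots__`.

```python
import json

class CacheEntry:
    """Hypothetical thin wrapper; all attributes are primitive types."""
    def __init__(self, key, hits, payload):
        self.key = key
        self.hits = hits
        self.payload = payload

def dump_entry(entry):
    # vars(entry) is entry.__dict__, which here holds only primitives.
    return json.dumps(vars(entry)).encode("utf-8")

def load_entry(blob):
    # Create a bare instance without running __init__, then fill in
    # its attributes directly from the decoded dictionary.
    entry = CacheEntry.__new__(CacheEntry)
    entry.__dict__.update(json.loads(blob.decode("utf-8")))
    return entry

original = CacheEntry("page:42", 3, "rendered text")
restored = load_entry(dump_entry(original))
```

This skips whatever validation or setup `__init__` would normally do, which is both why it can be fast and why it's a hack.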

(I don't have any answers here, or benchmark results for that matter. And I'm sure there are situations where it makes sense to go with just primitive types and more awkward code instead of using Python objects.)

Sidebar: The other marshalling benchmark problem

Put simply, different primitive types generally encode and decode at different speeds (and the same is true for different sizes of primitive types like strings). This means you need to pay attention to what people are encoding and decoding, not just what the speed results are; if they're not encoding something representative of what you want to, all bets may be off.

(My old tests of marshal versus cPickle showed some interesting type-based variations of this nature.)

You can also care more about decoding speed than encoding speed, or vice versa. My gut instinct is that you probably want to care more about decoding speed if your cache is doing much good, because getting things back from the cache (and the subsequent decodes) should be more frequent than putting things into it.
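If you want to check this for your own data, a sketch along the following lines times encoding and decoding separately for a few different kinds of primitive values. The sample values and sizes are arbitrary choices made for the example, not anything representative; the point is to swap in data that looks like what your cache will actually hold.

```python
import json
import pickle
import timeit

# Arbitrary sample data; replace with something shaped like your own.
SAMPLES = {
    "small ints": list(range(1000)),
    "floats": [i * 0.5 for i in range(1000)],
    "short strings": ["s%d" % i for i in range(1000)],
    "one long string": "x" * 10000,
}

def time_codec(dumps, loads, value, number=200):
    # Time encoding and decoding separately, since you may care
    # about one much more than the other.
    blob = dumps(value)
    encode = timeit.timeit(lambda: dumps(value), number=number)
    decode = timeit.timeit(lambda: loads(blob), number=number)
    return encode, decode

def run(number=200):
    results = {}
    for label, value in SAMPLES.items():
        results[label] = {
            "json": time_codec(json.dumps, json.loads, value, number),
            "pickle": time_codec(pickle.dumps, pickle.loads, value, number),
        }
    return results

if __name__ == "__main__":
    for label, codecs in run().items():
        for codec, (enc, dec) in codecs.items():
            print(f"{label:16} {codec:7} encode {enc:.4f}s decode {dec:.4f}s")
```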

python/SimpleVsComplexMarshalling written at 23:00:53;

Rebooting the system if init dies is a hack

I feel like I should say this explicitly: rebooting the system if init dies is a hack. It's the easy thing to do but not the right thing. V7 Unix more or less ignored the possibility of init failing; when BSD started considering this situation they took the easy way out of 'handling' it by just rebooting the system. Everyone since then has copied BSD (probably partly out of compatibility, since 'everyone knows' that if init dies the system reboots, and partly because it's the easy way).

You can argue that if init dies something terrible is going on (especially after the kernel has armored init so that you have to work very hard to terminate it) and this is generally true. But rebooting the system is the lazy way out, especially when this is determined by the kernel instead of user level. It might certainly be sensible to configure your system to immediately start a reboot if init ever dies and is restarted by the kernel, but at that point it's something you control at user level; you might instead ring lots of alarms and see if the system could limp on. And so on. From some perspectives, 'reboot the system if init dies' is the kernel meddling in policy that should be left to other levels.

The right thing is to provide some way to recover from this situation. I outlined two plausible approaches yesterday; there are probably more. Of course this is more work to design and program than just rebooting the machine, but that's common when you do the right thing instead of the easy thing.

It's kind of sad that almost everyone since BSD has simply followed or copied the BSD quick hack approach (even the people who reimplement things from scratch, like Linux) but this is pretty typical for Unix. If some Unix did try to do it differently I suspect that there would be people complaining that that Unix was over-complicating init.

unix/InitDeathAndRebootsII written at 01:33:05;



This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.