Wandering Thoughts archives

2009-04-20

Some ways to add versioning to pickled objects

To follow on from yesterday's entry, suppose that you are using (c)Pickle to save data in your Python program and you want to version your data somehow. I can think of a number of approaches:

  • take the 'pickle as stored JSON' approach; serialize your complex objects to simple objects (dictionaries, lists, etc), add version numbers, and only pickle the simple objects. Then you can do all of the usual version mismatch fixups when you de-serialize the reloaded simple objects back to your complex objects.

  • version your class names; instead of having, say, a Comment class and pickling several different versions of it, have a CommentV1 class, a CommentV2 class, and so on. (I imagine that you will want a Comment abstract class to have all of the common behavior.)

  • don't explicitly version things. Take advantage of the fact that pickle doesn't actually initialize objects as such, just stuffs data into their __dict__, and write your object methods such that they will deal with any set of data they could get from any version of your objects. (Renaming instance variables may help.)

    The easiest approach to this is probably to call a fixup method on newly loaded objects; this method can then canonicalize old data versions into the current world.

  • define a custom __setstate__ method that works out which version of the data it's restoring based on the contents of the dictionary that it's handed. This is essentially the fixup method approach, just automated, and you have to copy the data onto the object yourself.
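As a minimal sketch of the __setstate__ approach, assume a hypothetical Comment class whose first version stored the author under 'user' and had no 'score' field. The class, the field names and that history are all invented for illustration; the only real piece is the __setstate__ hook itself.

```python
class Comment:
    """Hypothetical example class; the field names are invented."""

    def __init__(self, author, text, score=0):
        self.author = author
        self.text = text
        self.score = score

    def __setstate__(self, state):
        # When this method exists, pickle hands it the saved state
        # (normally the old __dict__) instead of updating
        # self.__dict__ itself.
        state = dict(state)  # work on a copy of what we were given
        # Assumed history: the first version called this field 'user'.
        if "user" in state:
            state["author"] = state.pop("user")
        # Assumed history: 'score' was added in a later version.
        state.setdefault("score", 0)
        # Copying the data onto the object is now our job.
        self.__dict__.update(state)
```

Since pickle does not call __init__ when it loads an object, this method is the only code that runs on a reloaded Comment, which is why it has to cope with every old layout on its own.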

All of these have drawbacks, and some of them are ugly. If I had to do any of these I would probably take the 'pickle as stored JSON' approach; although it is one of the more annoying choices (since you write a bunch of code), it is the least ugly.
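A rough sketch of what the 'pickle as stored JSON' approach could look like, written for Python 3's plain pickle module. The Comment class, its fields, the 'version' key and the helper function names are all assumptions made up for this example; only simple dictionaries and lists ever reach pickle.

```python
import pickle

CURRENT_VERSION = 2  # hypothetical version number for the Comment layout


class Comment:
    """Hypothetical example class; the field names are invented."""

    def __init__(self, author, text, score=0):
        self.author = author
        self.text = text
        self.score = score


def comment_to_simple(comment):
    # Serialize to a plain dictionary and stamp it with a version.
    return {
        "version": CURRENT_VERSION,
        "author": comment.author,
        "text": comment.text,
        "score": comment.score,
    }


def comment_from_simple(data):
    # Do the version mismatch fixups here, on plain data.
    version = data.get("version", 1)
    if version > CURRENT_VERSION:
        raise ValueError("comment data is from a newer version: %r" % version)
    if version == 1:
        # Assumed history: version 1 had no 'score' field.
        score = 0
    else:
        score = data["score"]
    return Comment(data["author"], data["text"], score)


def save_comments(comments):
    # Only a list of simple dictionaries is pickled.
    return pickle.dumps([comment_to_simple(c) for c in comments])


def load_comments(blob):
    return [comment_from_simple(d) for d in pickle.loads(blob)]
```

This is the 'bunch of code' part, but all of the version handling sits in one ordinary function, and the pickle itself contains no references to your classes.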

(The custom __setstate__ approach has a pleasing minimalism but involves a little bit too much magic to make me happy.)

python/VersioningPickle written at 23:46:24;

Why pickle is not a good way to save your data

On the surface, the (c)Pickle module looks like a good, simple way for your Python program to save and load its data; much like XML, it means you don't have to write a parser or even save and load routines as such, just some file and object manipulation code. However, through my experience in writing DWiki I've come to understand that giving in to this temptation would be a mistake (one that I've actually half-made; DWiki's caching layer uses pickling).

Fundamentally the problems with pickle for saving data are inherent in what it exists to do; it exists to persist and recover Python objects, not save and restore data. These sound similar enough on first look, but in the longer term I think you run into some significant issues:

  • pickle has no concept of versioning for your data structures, which makes it hard to change the data that you store for a particular sort of thing. If you need this (and you will), you will have to resort to various workarounds to build it yourself.

    (In fact pickle doesn't even notice if there is a mismatch between what instance data was pickled for an object and what the object should now have.)

  • your data files are not easily inspectable. Yes, I know, pickle has an ASCII version of its storage protocol, but this is still not very readable by hand, and I don't think it's modifiable by hand at all (well, not practically). Essentially pickled things are opaque; the only way to deal with them sensibly is through pickle itself.

  • I don't think that pickle has any real concept of error recovery, or with it any way to get partial information out of a partially complete data structure. You either get the whole object (or object hierarchy) or you get nothing.
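To illustrate the first point, here is a small sketch of pickle silently accepting mismatched instance data. The Comment class, its fields and both helper functions are hypothetical; the old pickle is simulated by building an object without the newer 'score' attribute.

```python
import pickle


class Comment:
    """Hypothetical example class; the field names are invented."""

    def __init__(self, author, text):
        self.author = author
        self.text = text
        # Assumed later addition; objects pickled before this line
        # existed will not have it.
        self.score = 0


def pickle_old_style_comment():
    """Simulate a pickle written before 'score' existed."""
    old = Comment.__new__(Comment)  # skip __init__, much as unpickling does
    old.__dict__.update(author="cks", text="hello")
    return pickle.dumps(old)


def missing_attributes(obj, expected=("author", "text", "score")):
    # Report which of the attributes we expect are absent.
    return [name for name in expected if not hasattr(obj, name)]
```

Loading the result of pickle_old_style_comment() with pickle.loads() raises no error, because __init__ is never run on the reloaded object. The problem only appears later, as an AttributeError when some code finally touches the missing 'score'.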

This is not to say that pickle is pointless. It's just that if you're using it, you need to be sure that you really do want objects, not just data.

If you still want to use pickle as your save format because it's easy, I've come around to the idea that you should not attempt to pickle your objects directly. Instead I think that you should treat pickle like you would JSON, and first serialize your actual objects into simple data structures (dictionaries, lists, etc) and pickle only the data structures.

(Admittedly, this is easy for me to say because my use of pickle to date has been for objects that are relatively easily represented this way.)

python/PickleNotForSaving written at 01:46:52;

