Python and the 'bags of unstructured data' approach

March 20, 2018

These days I write code in both Go and Python, which sometimes gives me interesting new perspectives on each language as I shift back and forth. I was recently hacking on a Python program to mutate it into what I wanted, and as I did so what struck me is how Python's dynamic typing and everything around it enabled a specific approach that I'll call the 'bag of data' approach.

The base code I was starting with parses Linux's /proc/self/mountstats to get at all of the NFS statistics found there. All of the data fields in these statistics have defined meanings, meanings that this Python code knew, so it could have opted to use some kind of structures for them with actual named fields (perhaps using namedtuple). However, it didn't. Instead the code dumps everything into a small collection of dicts using various named keys, then yanks bits back out again as it needs them (and knows what structure each key's data will have).
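To make the contrast concrete, here is a minimal sketch of the two styles. The sample line, the field names, and both helper functions are invented for illustration; they are not the real mountstats layout or the names the original program uses.

```python
from collections import namedtuple

# Hypothetical, simplified per-operation line; NOT the real mountstats format.
line = "READ: 120 121 0 4096"

# Structured style: every field gets a fixed name up front.
OpStats = namedtuple("OpStats", ["ops", "transmissions", "timeouts", "bytes_sent"])

def parse_structured(line):
    name, rest = line.split(":", 1)
    return name, OpStats(*(int(x) for x in rest.split()))

# Bag-of-data style: dump things into a dict under named keys and trust
# later code to know what shape each key's value has.
def parse_into_bag(bag, line):
    name, rest = line.split(":", 1)
    bag.setdefault("ops", []).append(name)
    bag[name] = [int(x) for x in rest.split()]
    return bag

name, stats = parse_structured(line)
bag = parse_into_bag({}, line)
```

With the namedtuple you write stats.timeouts; with the bag you have to remember that bag['READ'][2] is the timeout count.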

This 'bag of data' approach only works in a dynamic language like Python, because there's no structure or typing to what goes where. A given key may give you a string, a number, a list of either, a sub-dict, or whatever. On the one hand this is harder to follow than something with fixed, named fields. On the other hand it's marvelously flexible and easy to manipulate and transform, especially in bulk. In theory you could do the same sort of thing with named fields (in Python), but in practice it is just easier to write code when you're dealing with dictionary keys and values, because getting lists of them and accessing arbitrary ones and doing indirection is really simple. With a dict that's a bag of data, it's natural to write code like this:

datas = [self.__rpc_data[x] for x in self.__rpc_data['ops']]
sumrpc = [sum(x) for x in zip(*datas)]

This isn't necessarily the best way to do things in final code, once you're sure you know what you need, but in the meantime this plasticity makes it very easy to experiment by transforming, mutating, and remixing various pieces of data in the bag in convenient, quick-to-write ways. When 'sum fields together across all of the different NFS RPC operations' is a two-liner, you're much more likely to try it if you think something interesting might result.

(One way that this is potentially flawed is that not all statistics fields in NFS RPC operations may make sense to sum together. But that's up to you to keep track of and sort out, because that's the tradeoff you get here.)
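For anyone who wants to run the summation, here is a self-contained sketch of it. The bag's contents are made-up numbers, and I'm assuming the layout that the snippet above implies: an 'ops' key listing the operation names, with each operation name also being a key whose value is a list of counters.

```python
# Hypothetical bag: 'ops' lists the operation names, and each operation
# name is itself a key whose value is a list of per-field counters.
rpc_data = {
    "ops": ["READ", "WRITE", "GETATTR"],
    "READ": [10, 10, 0, 400],
    "WRITE": [5, 6, 1, 900],
    "GETATTR": [20, 20, 0, 80],
}

# Pull out the counter lists for every operation named in 'ops' ...
datas = [rpc_data[op] for op in rpc_data["ops"]]
# ... then sum each field position across all operations.
sumrpc = [sum(fields) for fields in zip(*datas)]

print(sumrpc)  # [35, 36, 1, 1380]
```

Note that zip() stops at the shortest list, so if one operation has fewer fields than the others, the extra fields are silently dropped from the sums. That's one more thing the bag leaves for you to keep track of.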

There are other nice things you can do with the bag of data approach. For example, it gives you a relatively natural way to deal with data that isn't always there. Python has lots of operations for checking if keys are in dicts, getting the value of a key with a default value if it's not there, and so on. You can build equivalents of all of these for named fields, but it's more work and isn't likely to feel as natural as 'if key in databag: ...'.
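As a rough sketch of what those operations look like in practice (the key names here are invented for the example, not taken from the real program):

```python
# Hypothetical bag; the 'xprt' key may or may not have been parsed out.
databag = {"age": 3600, "events": [1, 2, 3]}

# Membership test before using an optional key.
if "xprt" in databag:
    transport = databag["xprt"]
else:
    transport = None

# Fetch with a default when the key is absent.
events = databag.get("events", [])
bytes_stats = databag.get("bytes", [])

# Create the entry on first use, then append to it.
databag.setdefault("warnings", []).append("no xprt line seen")
```

All three are ordinary dict operations ('in', .get() and .setdefault()), which is a large part of why handling optional data this way feels so natural.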

Another thing I've done in code is to successively refine my bag of data in multiple passes. My bag of data starts out with only very basic raw fields, then I generate some calculated fields and add them to the bag, then another pass derives an additional set of fields, and so on. Again, you can do this pattern with named fields, but it isn't a natural fit; probably the right way to do it with structured data is a series of different structures, perhaps embedding the previous ones. Pragmatically it's easier to write passes that simply add and update dictionary keys, though, partly because the knowledge of what fields exist when and where can be very localized.
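Here is a minimal sketch of that multi-pass pattern. The passes, the key names, and the assumption that 'age' is a number of elapsed seconds are all made up for the example.

```python
def add_rates(bag):
    # First pass: derive per-second rates from raw counters.
    # Assumes 'age' is elapsed seconds (an assumption of this sketch).
    age = bag.get("age", 0)
    if age > 0:
        bag["read_rate"] = bag.get("bytes_read", 0) / age
        bag["write_rate"] = bag.get("bytes_written", 0) / age
    return bag

def add_totals(bag):
    # Second pass: builds on fields the first pass added, if they exist.
    if "read_rate" in bag and "write_rate" in bag:
        bag["total_rate"] = bag["read_rate"] + bag["write_rate"]
    return bag

# The bag starts out with only raw fields and grows with each pass.
bag = {"age": 100, "bytes_read": 5000, "bytes_written": 2500}
for refine in (add_rates, add_totals):
    bag = refine(bag)
```

Each pass only needs to know about the keys it reads and the keys it writes, which is the localized knowledge I mean.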

(This is especially the case if what you're dealing with is a tree of data and you may want to run a pass over each node in the tree, where different nodes may be of different types. When everything is a dictionary it's easy to write generic code that only acts on certain things; life gets messier if you must carefully sniff out if the object you have is the right type of object for your pass to even start looking at.)
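A small sketch of such a generic pass, under the assumption (mine, for this example) that child nodes live under a 'children' key and that only some nodes carry an 'ops' count:

```python
def walk(node):
    # Yield every dict in a tree where children live under 'children'.
    yield node
    for child in node.get("children", []):
        yield from walk(child)

def mark_busy(node, threshold=100):
    # Only acts on nodes that carry an 'ops' count; others are left alone.
    if "ops" in node:
        node["busy"] = node["ops"] >= threshold

# Hypothetical tree mixing different kinds of nodes.
tree = {
    "name": "host",
    "children": [
        {"name": "/home", "ops": 250, "children": []},
        {"name": "/scratch", "ops": 12},
        {"name": "local-disks", "children": [{"name": "/var"}]},
    ],
}

for node in walk(tree):
    mark_busy(node)
```

The pass never asks what type of node it has; it just checks whether the key it cares about is there.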


Comments on this page:

As an aside – pedantically speaking, you can of course do this just as well in a non-dynamic language. So in Go, you can use maps and casts and so on to achieve the same thing. It’s just that a non-dynamic language forces you to be explicit about the exact degree of dynamism you want. Or, to quote Larry Wall, “languages differ not so much in what you can say but in what you must say”.

Written on 20 March 2018.


Last modified: Tue Mar 20 00:13:27 2018