The accumulator mini-pattern and .setdefault()

February 13, 2009

What I'm going to call the accumulator (mini-)pattern is a common operation when summarizing a stream of data: you have some keys, which can repeat, and you want to accumulate some data each time each key comes up, to count it or sum it all up or keep a list of all of the data for each key.

In pretty much every language that has them, this pattern is done with dictionaries (or hashes or the language's equivalent). In Python, this creates the minor annoyance of initializing a key's entry the first time you see the key, so you wind up with annoying code that looks like this:

store = {}
def accum(k, v):
    if k not in store:
        store[k] = []

(In awk, one of the Unix progenitors of this pattern, unset elements default to zero so you can usually write just 'store[$1] = store[$1] + $2' or the like.)

There's a number of variations on the same basic idea; I have seen people write 'store[k] = store.get(k, 0) + v', for example. Which one you settle on depends partly on what operation you're doing (a default-value .get() is convenient for math, an 'if not in, add it' bit of code is convenient for data structures) and partly on which particular idiom feels natural to you.

For the 'if not in, add it' case one can often use the dict .setdefault() method to shorten the code:

def accum(k, v):
    store.setdefault(k, []).append(v)

(Opinions may be divided on whether this is uglier and more complicated in practice than the more verbose version.)

As it happens, I have to remind myself of .setdefault() every so often, and I've seen other people miss it too. I'm not sure why .setdefault() keeps slipping out of my mind; it may partly be because it has such an odd name for the operation it does, although I have to admit that coming up with a better one would be a challenge.

There is at least one case where .setdefault() is clearly worse. Consider:

def accum(k, v):
    if k not in store:
        store[k] = SomeClass()

If you wrote this with .setdefault(), you would be creating and then throwing away a SomeClass object every time the key had already been seen before, churning memory in the process. The more verbose code avoids this by only creating SomeClass objects when you actually need them.

Comments on this page:

From at 2009-02-13 02:06:53:

I prefer collections.defaultdict myself.


By cks at 2009-02-13 11:21:53:

You're completely right; collections.defaultdict does just this. I'll have to remember the collections module for future usage in general.

(Most of my large Python programming was several years ago, so a bunch of my reflexes are still partly back around Python 2.2 or so. Someday I need to do something moderately big with modern Python, so all of these neat new things stick in my mind.)

From at 2009-02-15 21:06:35:

I'll note that ruby has this problem solved easily:

irb(main):001:0> h = {|h,k|h[k]=[]}
=> {}
irb(main):002:0> h[5] += [4,2,1,"asdf"]
=> [4, 2, 1, "asdf"]
irb(main):003:0> h[2] += [1,2]
=> [1, 2]
irb(main):004:0> h
=> {5=>[4, 2, 1, "asdf"], 2=>[1, 2]}

You can even get strangely recursive and make an autohash:

irb(main):005:0> a ={|h,k| h[k] &h.default_proc}
=> {}
irb(main):006:0> a[2][1]=2
=> 2
irb(main):007:0> a[2][2][3]=4
=> 4
irb(main):008:0> a[3][1][1][1]=1
=> 1
irb(main):009:0> a
=> {2=>{1=>2, 2=>{3=>4}}, 3=>{1=>{1=>{1=>1}}}}

Note that the autohash that you can twist around and give ruby is what perl has all the time, by default; this:

sub addstuff {
  my($h, $k, @a) = @_;   # hash, key, stuff-to-add
  push @{$h->{$k}}, @a;

will just do the right thing, and cause array refs to spring into being as necessary.

From at 2011-06-10 07:34:28:

foo.setdefault(bar, []) is equivalent to foo.setdefault(bar, list()), so it it also constructs objects needlessly. Following defaultdict, setdefault should accept a function, so that you can do foo.setdefault(bar, list) or foo.setdefault(bar, lambda: bar / 2 + 1). -- Andreas

By cks at 2011-06-10 10:54:53:

The difference between the two cases is that construction of new empty lists and dictionaries is, as I understand it, quite optimized in CPython and is thus quite cheap and fast. Construction of arbitrary Python objects is much more expensive.

(Part of the reason that [] and {} can be so cheap is that they are language primitives.)

By cks at 2011-06-12 01:14:26:

More on how cheap [] is is now in CheapListDictTupleCreation.

Written on 13 February 2009.
« Recognizing non-interactive shells and 'shell levels'
Some of my assumptions about Python object allocation »

Page tools: View Source, View Normal, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Fri Feb 13 00:39:02 2009
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.