How collections.defaultdict is good for your memory usage

November 5, 2017

There is a classical pattern in code that uses entries in dictionaries to accumulate data. In the simplest form, it looks like this:

 e = dct.get(ky, None)
 if e is None:
    e = []
    dct[ky] = e

 # now we work on e without
 # caring if it's new or old

There is an obvious variation of this that gets rid of the whole bureaucracy involving the if:

e = dct.setdefault(ky, [])
# work on e

On the surface, this looks very much like what you get with collections.defaultdict. At this level you might reasonably think that defaultdict is just a convenience, giving you a slightly shorter and nicer way to write this code so you don't have to do either the if or use .setdefault() instead of just doing a simple dct[ky]. However, there's an important way that both defaultdict and the if-based version are better than the .setdefault() version.

To see it, let's change what the individual elements are:

e = dct.setdefault(ky, ExpensiveItem())
....

When I write things this way, the problem may jump out right away. The issue with this version is that we always create a new ExpensiveItem object regardless of whether ky is already in dct. If ky is not in dct, we use the new object and all is good, but if there already is one, we throw away the new object we created. If we're dealing with a lot of keys that already exist, this is a lot of objects being created and then immediately thrown away. Both the if-based version and defaultdict avoid this problem because they only create a new object if and when they actually need it, and a defaultdict version is just as short as the .setdefault() version.

(The other subtle advantage of defaultdict is that you specify the default item only once, when you create the dictionary, instead of having to duplicate it in every section of code where you need to do this update-or-add pattern.)

On the one hand, this advantage of defaultdict feels obvious once I write it out like this. On the other hand, Python doesn't really encourage people to think about how often objects are created and other aspects of memory churn. Also, even if you know about the issue (as I generally do), it's tempting to go with the setdefault() version instead of the if version just because it's shorter and you probably aren't dealing with enough objects for this to matter. Using collections.defaultdict lets you have your cake and eat it too; you get short code and memory efficiency.

Written on 05 November 2017.
« Some early notes on WireGuard
My new Linux machine for fall 2017 (planned) »

Page tools: View Source, Add Comment.
Search:
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Sun Nov 5 23:56:04 2017
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.