How collections.defaultdict
is good for your memory usage
There is a classical pattern in code that uses entries in dictionaries to accumulate data. In the simplest form, it looks like this:
e = dct.get(ky, None) if e is None: e = [] dct[ky] = e # now we work on e without # caring if it's new or old
There is an obvious variation of this that gets rid of the whole
bureaucracy involving the if
:
e = dct.setdefault(ky, []) # work on e
On the surface, this looks very much like what you get with
collections.defaultdict
.
At this level you might reasonably think that defaultdict
is just
a convenience, giving you a slightly shorter and nicer way to write
this code so you don't have to do either the if
or use .setdefault()
instead of just doing a simple dct[ky]
. However, there's an
important way that both defaultdict
and the if
-based version
are better than the .setdefault()
version.
To see it, let's change what the individual elements are:
e = dct.setdefault(ky, ExpensiveItem()) ....
When I write things this way, the problem may jump out right away.
The issue with this version is that we always create a new
ExpensiveItem
object regardless of whether ky
is already in
dct
. If ky
is not in dct
, we use the new object and all is
good, but if there already is one, we throw away the new object we
created. If we're dealing with a lot of keys that already exist,
this is a lot of objects being created and then immediately thrown
away. Both the if
-based version and defaultdict
avoid this
problem because they only create a new object if and when they
actually need it, and a defaultdict
version is just as short as
the .setdefault()
version.
(The other subtle advantage of defaultdict
is that you specify
the default item only once, when you create the dictionary, instead
of having to duplicate it in every section of code where you need
to do this update-or-add pattern.)
On the one hand, this advantage of defaultdict
feels obvious once
I write it out like this. On the other hand, Python doesn't really
encourage people to think about how often objects are created and
other aspects of memory churn. Also, even if you know about the
issue (as I generally do), it's tempting to go with the setdefault()
version instead of the if
version just because it's shorter and
you probably aren't dealing with enough objects for this to matter.
Using collections.defaultdict
lets you have your cake and eat it
too; you get short code and memory efficiency.
|
|