Are Python dictionaries necessarily constant-time data structures?

April 16, 2015

The general view of all forms of hash tables, Python's dictionaries included, is that they are essentially constant-time data structures under normal circumstances. This is not quite true under sufficiently perverse conditions where you have a high degree of collisions in the hashes of the keys, but let's assume that you don't have that for the moment. Ignoring hash collisions, can you treat dictionaries as fast constant-time data structures?

The answer is 'not always', and the path to it has some interesting and perhaps surprising consequences. The potential problem is custom hash functions. If the objects you're using as dictionary keys have a __hash__ method, this method must be called to return the object hash. This is Python code, so it's not necessarily going to be fast by comparison with regular dictionary operations. It may also take a visibly non-constant amount of time, depending on just how it's computing the hash.

(For instance, hashing even Python strings is actually not a constant time operation; it's linear with the length of the string. It's just that all of the code is in C, so by normal standards you're never going to notice the time differences.)

One of the things that you may want to consider as a result of this is memoizing the results of any expensive __hash__ method. This is what Python strings do; if you call hash() on a string, the actual hash computation is done only once and afterwards the already computed (and saved) hash value is just repeated back to you. This only works for things with immutable hash values, but then if your objects have hash values at all they should be immutable ones.

The real answer is that all of this is relatively theoretical. I'm pretty sure that almost no one uses complex custom __hash__ functions for objects defined in Python, although it seems relatively common to define simple ones that just delegate to the hash of another object (probably mostly or always a primitive object with a fast C level hash function). And if you do have objects with complex __hash__ functions that take noticeable amounts of time, you're probably not going to be using them as dictionary keys or set members very often because if you do, well, you'll notice.

On the other hand, the amount of work that the standard library's decimal.Decimal does in its __hash__ function is a little alarming (especially in Python 2.7). Having looked, I wouldn't encourage using them as dictionary keys or set members any time soon, at least not in high-volume dictionaries or sets. The Python 3 version of datetime is another potentially interesting case, since it does a certain amount of grinding away in Python __hash__ functions.

(In Python 2.7, datetime is a C-level module so all of its hashing operations presumably go really fast in general.)

Sidebar: Custom hashes and the Global Interpreter Lock

Let's ask another question: is adding a new key and value to a dictionary an atomic operation that's inherently protected by the GIL? After all, the key might have a custom __hash__ function that runs Python code (and thus bytecode) during any dictionary operation. As far as I can tell from peering at the CPython code, the answer is more or less yes. Although dictionary or set operations may require calling Python code for __hash__ (and for that matter for custom __eq__ methods as well), this is all done conceptually 'before' the actual dictionary modification takes place. The actual modification happens all at once, so you'll never see a dictionary with eg a key set but not its corresponding value.

This does mean that writing 'dct[ky] = val' may involve much more Python bytecode running than you expect (and thus a much higher chance that Python switches to another thread before the new key and value are added to the dictionary). But it's always been the case that Python might switch to another thread at almost any bytecode, so this hasn't created a new race, just widened the time window of an existing one you already had.

Comments on this page:

By Zev Weiss at 2015-04-16 14:11:23:

Well...but when people talk about an associative data structure (such as a hash table) being constant-time, it just means it's not dependent on the number of objects in the collection, which (unless you have a custom __hash__ going out of its way to do something really weird) is still going to be the case for python's dicts.

By cks at 2015-04-16 14:45:01:

In a computer science theoretical way, that is indeed what we mean when we talk about 'constant time' or O(1) algorithms like hash tables. In actual usage, though, most programmers expect things labeled as constant time to run with a flat and fast time profile unless you're throwing a lot at them. A factor of ten slowdown between different sorts of objects for key insertion is still 'constant time' in the computer science sense, but it will be surprising (and unwelcome) for many programmers using hash tables. So will noticeable slowdowns in key insertions just between different objects (as you might get with an in-Python algorithm that is not itself constant time across all objects).

(This is why I carefully said 'fast constant time' in the opening paragraph. Python programmers normally think of dicts not just as constant time but as fast on top of it. Having slow Python __hash__ methods can make dictionary operations both slow enough and variable enough to be noticed.)

Written on 16 April 2015.
« Illusory security is terrible and is worse than no security
In practice, programmers mostly understand complexity by superstition »

Page tools: View Source, View Normal, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Thu Apr 16 00:28:20 2015
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.