Are Python dictionaries necessarily constant-time data structures?
The general view of all forms of hash tables, Python's dictionaries included, is that they are essentially constant-time data structures under normal circumstances. This is not quite true under sufficiently perverse conditions where you have a high degree of collisions in the hashes of the keys, but let's assume that you don't have that for the moment. Ignoring hash collisions, can you treat dictionaries as fast constant-time data structures?
The answer is 'not always', and the path to it has some interesting and perhaps surprising consequences. The potential problem is custom hash functions. If the objects you're using as dictionary keys have a custom __hash__ method, that method must be called to obtain each key's hash. This is Python code, so it's not necessarily going to be fast by comparison with regular dictionary operations, and it may take a visibly non-constant amount of time, depending on just how it computes the hash.
(For instance, hashing even Python strings is actually not a constant-time operation; it's linear in the length of the string. It's just that all of the code is in C, so by normal standards you're never going to notice the time differences.)
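As a concrete illustration, here is a sketch of a made-up class whose __hash__ is pure Python and whose cost grows with the amount of data it hashes over (the class name and fields are purely illustrative):

    class Record:
        def __init__(self, fields):
            # 'fields' is assumed to be a sequence of hashable values.
            self.fields = tuple(fields)

        def __eq__(self, other):
            return isinstance(other, Record) and self.fields == other.fields

        def __hash__(self):
            # This loop runs as Python bytecode, so hashing a Record with
            # many fields costs proportionally more, and every dictionary
            # operation using a Record key pays that cost again.
            h = 0
            for f in self.fields:
                h = (h * 31 + hash(f)) & 0xFFFFFFFF
            return h

    d = {Record(range(3)): 'small', Record(range(10000)): 'large'}
    # Looking up d[Record(range(10000))] re-runs the loop over all
    # 10000 fields before the dictionary can even find the bucket.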
One of the things that you may want to consider as a result of this
is memoizing the results of any expensive __hash__
method. This
is what Python strings do; if you call hash()
on a string, the
actual hash computation is done only once and afterwards the already
computed (and saved) hash value is just repeated back to you. This
only works for things with immutable hash values, but if your objects are hashable at all, their hash values should be immutable anyway.
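A minimal sketch of what memoizing an expensive __hash__ might look like, assuming the object's fields never change after construction (again, the names here are made up for illustration):

    class Record:
        def __init__(self, fields):
            self.fields = tuple(fields)
            self._hash = None              # hash not computed yet

        def __eq__(self, other):
            return isinstance(other, Record) and self.fields == other.fields

        def __hash__(self):
            # Only the first call pays for the real computation; later
            # calls just return the saved value. This is only safe
            # because self.fields never changes after __init__.
            if self._hash is None:
                self._hash = hash(self.fields)
            return self._hash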
The real answer is that all of this is relatively theoretical. I'm
pretty sure that almost no one uses complex custom __hash__
functions for objects defined in Python, although it seems relatively
common to define simple ones that just delegate to the hash of
another object (probably mostly or always a primitive object with
a fast C-level hash function). And if you do have objects with
complex __hash__
functions that take noticeable amounts of time,
you're probably not going to be using them as dictionary keys or
set members very often because if you do, well, you'll notice.
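For contrast, the simple delegating case mentioned above might look something like this hypothetical class, where __hash__ just hands the work off to the C-level hash of a primitive value:

    class UserId:
        def __init__(self, uid):
            self.uid = uid                 # assumed to be an int or a string

        def __eq__(self, other):
            return isinstance(other, UserId) and self.uid == other.uid

        def __hash__(self):
            return hash(self.uid)          # one fast C-level hash call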
On the other hand, the amount of work that the standard library's
decimal.Decimal
does in its __hash__
function is a little alarming (especially
in Python 2.7). Having looked, I wouldn't encourage using them as
dictionary keys or set members any time soon, at least not in
high-volume dictionaries or sets. The Python 3 version of datetime is another potentially
interesting case, since it does a certain amount of grinding away in
Python __hash__
functions.
(In Python 2.7, datetime is a C-level module so all of its hashing operations presumably go really fast in general.)
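If you want to see the relative costs on your own Python, a rough timeit comparison along these lines will do; the exact numbers will vary with the version and build, so treat this only as a way of eyeballing the differences:

    import timeit

    setup = """
    import decimal, datetime
    i = 123456789
    d = decimal.Decimal('12345.6789')
    t = datetime.datetime(2014, 6, 1, 12, 0, 0)
    """
    for expr in ("hash(i)", "hash(d)", "hash(t)"):
        cost = timeit.timeit(expr, setup=setup, number=1000000)
        print("%-8s %.3f seconds" % (expr, cost))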
Sidebar: Custom hashes and the Global Interpreter Lock
Let's ask another question: is adding a new key and value to a
dictionary an atomic operation that's inherently protected by the
GIL? After all, the key might have a custom
__hash__
function that runs Python code (and thus bytecode) during
any dictionary operation. As far as I can tell from peering at the
CPython code, the answer is more or less yes. Although dictionary or
set operations may require calling Python code for __hash__
(and
for that matter for custom __eq__
methods as well), this is all
done conceptually 'before' the actual dictionary modification takes
place. The actual modification happens all at once, so you'll never see
a dictionary with, e.g., a key added but not its corresponding value.
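One way to convince yourself of this is a key whose __hash__ fails outright; the dictionary is left completely untouched, which wouldn't be the case if the modification were interleaved with the Python-level hash call. (This is purely an illustrative sketch, not something you'd write in real code.)

    class ExplodingKey(object):
        def __hash__(self):
            raise RuntimeError("hash failed")

    d = {'a': 1}
    try:
        d[ExplodingKey()] = 2
    except RuntimeError:
        pass
    print(d)        # still {'a': 1}: no half-added key, no dangling value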
This does mean that writing 'dct[ky] = val' may involve much more
Python bytecode running than you expect (and thus a much higher
chance that Python switches to another thread before the new key
and value are added to the dictionary). But it's always been the
case that Python might switch to another thread at almost any
bytecode, so this hasn't created a new race, just widened the time
window of an existing one you already had.