Understanding hashing in Python

April 9, 2012

I've talked about the __hash__ special method a fair bit here (and in a note in the middle of another entry), but today it's time to write enough background that people can understand why you might want to implement it, what you need to do, and what the whole story is. The first and biggest question is simple: why is it interesting to give objects what I've called a 'hash identity'?

The motivation for hash identities is what I'll call the dictionary key problem. Ignoring the actual implementation of dictionaries for a moment, the core operation in dictionaries is looking up an entry based on a key. If the only thing dictionaries could do with keys was an equality check (is the user-supplied key the same as the key for this entry), such lookups would be extremely inefficient; for example, you would have to do an equality check on every key in the dictionary to find out that you didn't have an entry for a particular key. Hash identities are the answer. Instead of checking for equality with every entry's key, dictionaries only have to check the keys with the same hash identity. If things work out (ie, there are only a few keys in the dictionary with the same hash identity), this will be a lot fewer entries and thus the lookup will run much faster.
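
To make that concrete, here's a toy sketch of the bucketing idea. This is an illustration of the principle only, not how CPython actually implements dictionaries, and all of the names in it are made up:

NBUCKETS = 8
buckets = [[] for _ in range(NBUCKETS)]

def bucket_insert(key, value):
  # Use the key's hash identity to pick a bucket.
  buckets[hash(key) % NBUCKETS].append((key, value))

def bucket_lookup(key):
  # Only the keys that share this bucket ever get an equality
  # check, instead of every key in the table.
  for k, v in buckets[hash(key) % NBUCKETS]:
    if k == key:
      return v
  raise KeyError(key)

bucket_insert("fred", 1)
print bucket_lookup("fred")    # 1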

(Fast lookups are useful for other data structures in addition to dictionaries. For example, sets use hash identities for this purpose, although I don't know if it's explicitly documented.)

That's nicely theoretical, but there's more. Python dictionaries are hash tables, which means that they use hash identities for more than speeding up key lookups; hash identities are how the internal data structures are organized. Hash identities are literally how dictionaries find entries. This means that hash identities for keys in a dictionary can never change. If they do change, the dictionary probably can't find the entry any more. This is one of the few ways of making a Python dictionary malfunction.

(If this happens, the only way to get the items again is with the .items() method. You can't use the key to delete or retrieve the item, even if you get the key from the dictionary with .keys().)
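
Here's a demonstration of this malfunction. MutableKey is a made-up class whose hash identity follows a mutable attribute, which is exactly what you must not do:

class MutableKey(object):
  def __init__(self, val):
    self.val = val
  def __eq__(self, other):
    return isinstance(other, MutableKey) and self.val == other.val
  def __hash__(self):
    # Danger: the hash identity changes whenever .val does.
    return hash(self.val)

k = MutableKey(10)
dct = {k: "found"}
k.val = 20             # the key's hash identity just changed
print k in dct         # False; the dictionary can't find the entry
print dct.items()      # the entry is still in there, orphaned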

This leads to the two rules of hash identities:

  • hash(a) should never change.
  • if a and b should be considered the same item, a == b and hash(a) == hash(b) should both be true.

    (The simplified but often accurate version of this is that a == b should usually mean that hash(a) == hash(b).)
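
The built-in types already follow the second rule. For instance, 1 and 1.0 compare equal, so they hash the same and are interchangeable as dictionary keys:

print 1 == 1.0                 # True
print hash(1) == hash(1.0)     # True
dct = {1: "int key"}
print dct[1.0]                 # int key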

As covered earlier, plain equality comparisons are kind of a minefield for hash identities. This is because there are really three sorts of objects: unhashable objects, objects with the default hash identity, and objects with custom hash identities.

Unhashable objects have a __hash__ of None:

class A(object):
  # Setting __hash__ to None marks instances of A as unhashable.
  __hash__ = None

Such objects can't be used as keys in dictionaries, put into sets, or used with anything else that wants hash identities; if you try, you'll get a TypeError about the object being unhashable. Because they're unhashable, these objects can have whatever equality comparison they want.
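
For instance, using an instance of the A class above as a dictionary key fails on the spot:

a = A()
try:
  dct = {a: 1}
except TypeError as e:
  print e                # unhashable type: 'A'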

(It looks like defining some of the comparison methods but not a __hash__ can also make your objects unhashable, but I'm not clear about the exact circumstances here. The documentation claims that defining __eq__ or __cmp__ but not __hash__ is enough to make your objects unhashable, but this is not what I see in Python 2.7.2.)

Objects with the default hash identity have id() as their hash, ie their unique (current) memory address. If you're using only such objects as keys in a dictionary, it's guaranteed that no two keys will have the same hash identity, so it doesn't matter if two keys would compare equal to each other (the dictionary will never make that equality check). It is thus generally safe to give such objects 'broad' equality comparisons, comparisons that are willing to say that the object is equal to, say, a number or a string.

(Even in non-CPython implementations, Python is going to guarantee that an object's default hash identity doesn't change over the object's lifetime. Doing otherwise breaks too many things.)
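
A quick demonstration with a plain class that defines no comparison or hash methods:

class Plain(object):
  pass

a, b = Plain(), Plain()
print a == b           # False; default equality is identity
dct = {a: 1, b: 2}     # distinct hash identities, so distinct entries
print len(dct)         # 2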

Of course the problem with objects with the default hash identity is that they often aren't really useful as dictionary keys. For useful dictionary keys, we usually want this to work:

dct = {}
a = AnObject(1, 2)
dct[a] = 10
b = AnObject(1, 2)
assert a is not b and a == b
print dct[b]    # 10

To make this work, AnObject needs a custom hash identity so that a and b hash to the same value. A lot of normal Python types have custom hash identities; numbers and strings of all sorts do, for example (small integers hash to themselves). The moment you use things with custom hash identities as dictionary keys, you can have hash identity collisions. This is where you want to be very certain that your equality comparisons really mean 'a is the same thing as b', which means that you should not have an equality comparison that is willing to declare your object equal to an object of another class (including to numbers, strings, and so on).

(Note that this holds even for classes with a default hash identity, if you mix them as keys in a dictionary with keys that have a custom hash identity.)
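
I haven't shown what AnObject actually looks like; here's a minimal sketch of an implementation that makes the example above work (the class and its fields are just illustrations):

class AnObject(object):
  def __init__(self, x, y):
    self.x, self.y = x, y
  def __eq__(self, other):
    # Be strict: only other AnObject instances can be equal to us.
    return isinstance(other, AnObject) and \
           (self.x, self.y) == (other.x, other.y)
  def __ne__(self, other):
    # Python 2 doesn't derive __ne__ from __eq__ for you.
    return not self.__eq__(other)
  def __hash__(self):
    # Hash exactly the fields that determine equality.
    return hash((self.x, self.y))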

The rule of thumb for what goes into a custom hash identity is clear: it's whatever makes a the same thing as b. Given that the hash identity of objects can't change over their lifetime (otherwise dictionaries explode), this implies that the parts that create equality can't change either. This creates the usual Python rule that objects with custom hash identities should be immutable once created.

(There are a number of ways to violate this rule without having your dictionaries explode, but I think that all of them are going to result in peculiar results. So don't do that.)

When writing a custom hash function it's important to follow the usual rule for hash functions: the output values should be evenly distributed across the output range (given the expected input values, whatever they are for your objects). In Python, your hash values should be evenly distributed across at least the 31-bit integer range.
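
The easy way to get a decent distribution is to delegate to hash() on a tuple of your fields, as the AnObject sketch above does, rather than inventing your own arithmetic. For contrast, a degenerate hash is legal but ruins lookup performance:

class BadHash(object):
  # Legal but terrible: every instance collides with every other,
  # so dictionary lookups decay toward checking every key.
  def __hash__(self):
    return 1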

Sidebar: hash identities and long integers

__hash__ is documented as returning an integer. Given the unification between integers and long integers, you might wonder if (C)Python really means 'a machine integer' or 'any integer, even a long integer'. The answer is that as of Python 2.5, CPython will accept a __hash__ that returns a long integer but silently converts the returned value to some smaller value. Since the conversion is unpredictable, I think that in practice you should confine the values that your __hash__ returns to a safe range; today, the safest range is probably 31-bit integers.

(Note that the conversion is not as simple as truncating your long integer to its lowest N bits, where N is 32 or 64 or whatever depending on the CPython platform.)
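
You can watch the conversion happen; LongHash here is a made-up illustration:

class LongHash(object):
  def __hash__(self):
    # Deliberately return a long integer.
    return 2 ** 100

o = LongHash()
print 2 ** 100         # what __hash__ returned
print hash(o)          # the smaller value CPython converted it to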
