|
2012-04-09 Understanding hashing in PythonI've talked about the The motivation for hash identities is what I'll call the dictionary key problem. Ignoring the actual implementation of dictionaries for a moment, the core operation in dictionaries is looking up an entry based on a key. If the only thing dictionaries could do with keys was an equality check (is the user-supplied key the same as the key for this entry), such lookups would be extremely inefficient; for example, you would have to do an equality check on every key in the dictionary to find out that you didn't have an entry for a particular key. Hash identities are the answer. Instead of checking for equality with every entry's key, dictionaries only have to check the keys with the same hash identity. If things work out (ie, there are only a few keys in the dictionary with the same hash identity), this will be a lot fewer entries and thus the lookup will run much faster. (Fast lookups are useful for other data structures in addition to dictionaries. For example, sets use hash identities for this purpose, although I don't know if it's explicitly documented.) That's nicely theoretical, but there's more. Python dictionaries are hash tables, which means that they use hash identities for more than speeding up key lookups; hash identities are how the internal data structures are organized. Hash identities are literally how dictionaries find entries. This means that hash identities for keys in a dictionary can never change. If they do change, the dictionary probably can't find the entry any more. This is one of the few ways of making a Python dictionary malfunction. (If this happens, the only way to get the items again is with the
This leads to the two rules of hash identities:
As covered earlier, plain equality comparisons are kind of a minefield for hash identities. This is because there are really three sorts of objects: unhashable objects, objects with the default hash identity, and objects with custom hash identities. Unhashable objects have a class A(object): __hash__ = None Such objects can't be used as keys in dictionaries, put into sets, or anything else that wants to use hash identities; if you try, you'll get an exception about it being unhashable. Because they're unhashable, these objects can have whatever equality comparison they want to. (It looks like defining some of the comparison methods but not a
Objects with the default hash identity have their hash being (Even in non-CPython implementations, Python is going to guarantee that an object's default hash identity doesn't change over the lifetime of your program. Doing otherwise breaks too many things.) Of course the problem with objects with the default hash identity is that they often aren't really useful as dictionary keys. For useful dictionary keys, we usually want this to work: a = AnObject(1, 2) dct[a] = 10 b = AnObject(1, 2) assert(a is not b and a == b) print dct[b] To make this work, (Note that this holds even for classes with a default hash identity, if you mix them as keys in a dictionary with keys that have a custom hash identity.) The rule of thumb for what goes into a custom hash identity is clear:
it's whatever makes (There are a number of ways to violate this rule without having your dictionaries explode, but I think that all of them are going to result in peculiar results. So don't do that.) When writing a custom hash function it's important to follow the usual rule for hash functions that the output values should be evenly distributed across the output range (given the expected input values, whatever they are for your objects). In Python, you should be evenly distributed across at least 31-bit integer values. Sidebar: hash identities and long integers
(Note that the conversion is not as simple as truncating your long integer to its lowest N bits, where N is 32 or 64 or whatever depending on the CPython platform.)
|
These are my WanderingThoughts GettingAround This is part of CSpace, and is written by ChrisSiebenmann. * * * Atom feeds are available; see the bottom of most pages. Categories: links, linux, programming, python, snark, solaris, spam, sysadmin, tech, unix, web |