Wandering Thoughts archives

2012-04-29

The Python language tutorial is a gem

One of the things that my recent experience learning JavaScript has made me realize (or at least reminded me) is that the Python language tutorial is a gem. Specifically it's clearly a gem starting at section 3, An Informal Introduction to Python.

What the tutorial excels at is getting someone who already knows programming up to speed in Python very fast. It doesn't try to teach you everything (although it covers quite a lot) and it assumes that you can pick things up on the fly (eg, a certain amount of syntax only appears in the examples), but if you can keep up you can absorb a significant amount of Python in a few pages. Someone who really finishes the tutorial will have easily learned enough Python to do quite a lot of useful work. Oh, and the tutorial does all of this without glossing over complexities; the tutorial may not cover everything, but what it covers it covers accurately.

(The fastest way to get me to distrust any language introduction is to tell me something that I know is not true. I ran into several quick introductions to JavaScript that over-simplified its peculiar ===, for example.)

At a pragmatic level what the tutorial really is, where it's important, is that it's a great tool for turning general programmers into Python programmers; it makes it easy, almost effortless, to take a look in. If you have a not too large problem that you think Python might be the answer for, you can read the tutorial and know one way or another in an hour or two. In the process you'll have picked up enough to get a feel for the language and whether or not you particularly like it.

But pragmatism isn't everything. Beyond its usefulness for recruiting, the tutorial is simply a gem of teaching. Somehow it has managed to boil things down just right; it knows what to leave out, what to elide, and what to put into examples and let you follow along. If you look at it objectively the tutorial makes a an awful lot of courageous assumptions about what the reader can follow, but I firmly believe that it gets them right. Its boldness and willingness to assume intelligent interest on the reader's part is refreshing.

(Although I don't know the history of the tutorial, I salute the kind of authorial courage that I suspect it took to not make the tutorial slower, more verbose, and more timid. There's always an urge to put in an extended explanation just in case part of the audience didn't get it or didn't have the background you expected.)

Oh, and on a side note, one of the things that tutorial is also great for is quickly coming up to speed on major new Python features. I started on Python in the 1.5 era, and I'm pretty sure that I learned list comprehensions, generator expressions, and a few other things from reading new sections of the tutorial.

It vaguely boggles me that so few other languages seem to try to have any equivalent of the Python tutorial. In today's environment, many people who already know programming in general will be trying to pick up your language specifically; what they want is exactly this sort of quick tutorial to bring them up to speed. Yet I've barely seen any attempts in the various languages I've looked at over the years, even though it probably wouldn't be all that difficult to take the Python tutorial and rewrite the specific syntax details for your own language.

(This whole issue was brought to mind partly by not finding anything like this for JavaScript, even for the basic language. I found some crude, flawed attempts but none that I thought were half as good as the Python tutorial.)

PythonTutorialGem written at 22:47:08; Add Comment

2012-04-09

Understanding hashing in Python

I've talked about the __hash__ special method a fair bit in here (and a note in the middle of this), but today it's time to actually write enough background so that people can actually understand why you might want to write a special method for it, what you need to do, and what the whole story is. The first and biggest question is simple; why is it interesting to give objects what I called a 'hash identity'?

The motivation for hash identities is what I'll call the dictionary key problem. Ignoring the actual implementation of dictionaries for a moment, the core operation in dictionaries is looking up an entry based on a key. If the only thing dictionaries could do with keys was an equality check (is the user-supplied key the same as the key for this entry), such lookups would be extremely inefficient; for example, you would have to do an equality check on every key in the dictionary to find out that you didn't have an entry for a particular key. Hash identities are the answer. Instead of checking for equality with every entry's key, dictionaries only have to check the keys with the same hash identity. If things work out (ie, there are only a few keys in the dictionary with the same hash identity), this will be a lot fewer entries and thus the lookup will run much faster.

(Fast lookups are useful for other data structures in addition to dictionaries. For example, sets use hash identities for this purpose, although I don't know if it's explicitly documented.)

That's nicely theoretical, but there's more. Python dictionaries are hash tables, which means that they use hash identities for more than speeding up key lookups; hash identities are how the internal data structures are organized. Hash identities are literally how dictionaries find entries. This means that hash identities for keys in a dictionary can never change. If they do change, the dictionary probably can't find the entry any more. This is one of the few ways of making a Python dictionary malfunction.

(If this happens, the only way to get the items again is with the .items() method. You can't use the key to delete or retrieve the item, even if you get the key from the dictionary with .keys().)

This leads to the two rules of hash identities:

  • hash(a) should never change.
  • if a and b should be considered the same item, a == b and hash(a) == hash(b) should both be true.

    (The simplified but often accurate version of this is that a == b should usually mean that hash(a) == hash(b).)

As covered earlier, plain equality comparisons are kind of a minefield for hash identities. This is because there are really three sorts of objects: unhashable objects, objects with the default hash identity, and objects with custom hash identities.

Unhashable objects have a __hash__ of None:

class A(object):
  __hash__ = None

Such objects can't be used as keys in dictionaries, put into sets, or anything else that wants to use hash identities; if you try, you'll get an exception about it being unhashable. Because they're unhashable, these objects can have whatever equality comparison they want to.

(It looks like defining some of the comparison methods but not a __hash__ can also make your objects unhashable, but I'm not clear about the exact circumstances here. The documentation claims that defining __eq__ or __cmp__ but not __hash__ is enough to make your objects unhashable, but this is not what I see in Python 2.7.2.)

Objects with the default hash identity have their hash being id(), ie their unique (current) memory address. If you're using only such objects as keys in a dictionary, it's guaranteed that no two keys will have the same hash identity and so it doesn't matter if two keys would compare equal to each other (the dictionary will never make that equality check). It is thus generally safe to make such objects have 'broad' equality comparisons, comparisons that are willing to say that the object is equal to, say, a number or a string.

(Even in non-CPython implementations, Python is going to guarantee that an object's default hash identity doesn't change over the lifetime of your program. Doing otherwise breaks too many things.)

Of course the problem with objects with the default hash identity is that they often aren't really useful as dictionary keys. For useful dictionary keys, we usually want this to work:

a = AnObject(1, 2)
dct[a] = 10
b = AnObject(1, 2)
assert(a is not b and a == b)
print dct[b]

To make this work, AnObject needs a custom hash identity so that a and b hash to the same value. A lot of normal Python types have custom hash identities; for example, numbers and strings of all sorts have custom hash identities (numbers hash to themselves, for example). The moment you use things with custom hash identities as dictionary keys, you can have hash identity collisions. This is where you want to be very certain that your equality comparisons really mean 'a is the same thing as b', which means that you should not have an equality comparison that is willing to declare your object equal to an object of another class (including to numbers, strings, and so on).

(Note that this holds even for classes with a default hash identity, if you mix them as keys in a dictionary with keys that have a custom hash identity.)

The rule of thumb for what goes into a custom hash identity is clear: it's whatever makes a be the same thing as b. Given that the hash identity of objects can't change over their lifetime (otherwise dictionaries explode), this implies that those parts that create equality can't change either. This creates the usual Python rule that objects with custom hash identities should be immutable once created.

(There are a number of ways to violate this rule without having your dictionaries explode, but I think that all of them are going to result in peculiar results. So don't do that.)

When writing a custom hash function it's important to follow the usual rule for hash functions that the output values should be evenly distributed across the output range (given the expected input values, whatever they are for your objects). In Python, you should be evenly distributed across at least 31-bit integer values.

Sidebar: hash identities and long integers

__hash__ is documented as returning an integer. Given the unification between integers and long integers, you might wonder if (C)Python really means 'a machine integer' or 'any integer, even a long integer'. The answer is that as of Python 2.5, CPython will accept a __hash__ that returns a long integer but silently converts the returned value to some smaller value. Since the conversion is unpredictable, I think that in practice you should confine the values that your __hash__ returns to a safe range; today, the safest range is probably 31-bit integers.

(Note that the conversion is not as simple as truncating your long integer to its lowest N bits, where N is 32 or 64 or whatever depending on the CPython platform.)

UnderstandingHashing written at 01:05:50; Add Comment

2012-04-04

More on equality in Python (well, mostly Python 2.7)

There are undoubtedly some languages where if you create a new class, instances of the class don't support equality checks at all until you tell the language how to do this. Python is friendlier, so it gives all classes a default idea of ordinary equality, the equality of ==. This default is object identity; for an uncustomized class, 'a == b' is the same as 'a is b'. This is kind of minimal but it's hard to argue that an object is not equal to itself.

(There are some situations where this is not true, or at least not desired, but we won't go there for now.)

Python has two ways of customizing equality checks, the old one and the new one. The old one is to create a general __cmp__ method that returns the same sort of result as the cmp() builtin does. The new one is to define __eq__ and __ne__ special methods, which are called for == and != comparisons respectively. Note that you need to define both special methods; Python will not infer one from the other and will instead just default to object identity.

(Personally I disagree with this lack of inference. Defining __eq__ but not __ne__ is both an easy mistake to make and a great way to shoot yourself in the foot in a far from obvious way.)

Discussions of __eq__ and __ne__ generally lump them in with the other comparison special methods (this happens even in the official documentation) and some also imply that you should implement all of these special methods if you have any. In my opinion this is a mistake; there are plenty of situations where equality and inequality between entities is well defined but you do not necessarily have an order (or at least any order that you make up will be either arbitrary or very complex or both). For a concrete example, consider sets. Equality and inequality of sets is clear and obvious (it's whether they contain the same elements), but what does it mean to compare two unrelated sets to see if one is greater than or less than than the other? And how often are such comparisons going to actually be useful?

(Even if you can handwave an ordering, there is such a thing as being too clever.)

It's my strong opinion that you should not define comparison operators unless you can actually define an ordering for your class and make that ordering coherent with respect to equality. Let's use sets again. If we take 's1 > s2' to mean 's1 is bigger (has more elements) than s2', then we can come up with a straightforward implementation of the comparison methods. But now we have a problem, because we've created a situation where it's possible for 's1 < s2', 's1 > s2', and 's1 == s2' to all be False at once (s1 and s2 have the same number of elements but the elements are different). This is at least very odd and may have consequences for other code.

In Python it's legitimate to define __eq__ without defining __hash__ to have a custom hash function; 'a == b' does not imply or require that 'hash(a) == hash(b)'. There are many classes where it will not, for instance pretty much any sort of mutable thing.

Sidebar: default comparisons

In Python 2.x, class instances have a default idea of ordering that is based on id(); in an uncustomized class, 'a < b' means 'id(a) < id(b)'. This gives you a comparison result and allows you to do things like sort class instances into a consistent and stable ordering (at least in CPython), but it's not actually very meaningful (since id() is just the address in memory of the (C-level) instance object, which is highly unpredictable). In Python 3, this default has been eliminated and attempts to compare uncustomized class instances this way will get you a TypeError about unorderable types.

Note that there is no general guarantee that the id() results for a given object are stable over time. They are stable in CPython mostly because CPython does not relocate objects when it does garbage collection. Some other Python implementations do have relocating garbage collectors and so in them, 'id(obj)' may well change over time.

(The id() help text is very explicit that the id() result only has to be unique among all currently existing objects right now. Even in CPython, the id() of a newly created object may be the same as the old id() of an old and now garbage collected object. In short, id() is not a unique serial number for objects.)

EqualityNotes written at 01:16:17; Add Comment

2012-04-03

Python's two versions of equality, with a long digression on hash()

Python has three two versions of equality; in order of pickiness, they are == (plain equality) and is identity. When I started writing this entry I was going to say that there were three and the third was hash identity, but I am wrong about that. We'll talk about hash identity at the end.

Plain equality is actually surprisingly complicated if you want it to be. A class can control how all of the comparison operators work with special methods (aka 'magic methods'), and it can make them behave differently if it wants to. Or you can just define an old style __cmp__ method and be done with it. == can compare different types of objects and return useful results if desired. Two different objects can and often do compare as equal to each other.

Actual is identity, as in 'a is b', tests whether the two things are in fact the exact same object. This is not something that the class can control using special methods; it's directly implemented inside the interpreter (as is 'id()'). It's legitimate and common for 'a == b' to be true but 'a is b' to be false. On the other hand, sometimes it's true even if you got the objects from different places. Two different objects can never be is-identical to each other, pretty much by definition.

What I have called 'hash identity' is the value that 'hash(obj)' returns, if it returns anything. Hash identity is used in dictionaries and similar types to find things inside the dictionary, both to retrieve them and to check to see if the key you are trying to set already exists in the dictionary. An object's hash identity doesn't have to be unique and in fact isn't necessarily unique, which means that you can't directly use hash identity as an equality check (although, like dictionaries, you can use it to see if a full equality check is worth doing). A class can control the hash value of objects by defining a __hash__ special method, but you need to be very careful about how it's defined.

(I wouldn't be surprised if hash identities are also used in things like sets in order to enable relatively efficient 'a in set' checks and the like.)

You might ask why hash methods are necessary; the answer is that they enable your objects to be useful keys in dictionaries in some circumstances. Consider a complex number class (one where the numbers are immutable once created). Two complex number objects with the same real and imaginary parts should compare equal, and more than that if you do 'dct[Complex(1, -1)] = 10; print dct[Complex(1, -1)]' you should clearly get 10 instead of 'no such key'. In order to do this you need a custom hash function so that all Complex instances with the same real and imaginary values have the same hash value.

(Although it's commonly said that custom hash functions should only be defined for immutable objects, this is not quite the true requirement. You can have mutable objects with custom hash functions provided that the mutation you can do does not change either the hash function's result or the equality comparison. You may get some odd results from dictionary operations, but presumably you know what you're doing.)

Now we get into here be dragons territory: you should be very careful about combining custom hash functions with custom equality functions that are willing to do comparisons with things outside your class. The easiest way to show what can go wrong is with a simple example:

class Boom(object):
  def __init__(self, x):
    self.x = x
  def __eq__(self, other):
    if other == "YES":
      return True
    else:
      return other.x == self.x
  def __hash__(self):
    return self.x

d = {}
d[Boom(hash("YES"))] = 10
print d["YES"]

This should get a KeyError, because we sure didn't add a "YES" to the dictionary. But because our Boom() object both has the same hash value as "YES" and compares equal to it the dictionary thinks that it is the same key and says 'sure, here you go', and our example prints 10.

(You can argue that Python dictionaries should also check the class of the key objects they are comparing and only do the == comparison if they are of the same class, but the plain fact is that CPython doesn't behave this way today. There are probably obscure legitimate uses for this.)

PS: yes, this is one of the places where duck typing might go badly wrong (cf the limits of duck typing). Despite my usual opposition to isinstance() checks, this is a case where I might accept one. Remember to put in a comment about it.

(For more on the various special methods involved in this and what you can do with them, you can see the official documentation.)

TwoEqualitiesAndHash written at 00:05:34; Add Comment


Page tools: See As Normal.
Search:
Login: Password:
Atom Syndication: Recent Pages, Recent Comments.

This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.