2014-07-08
Exploring a surprise with equality in Python
Taken from @hackedy's tweet, here's an interesting Python surprise:
>>> {1: "one", True: "two"}
{1: 'two'}
>>> {0: "one", False: "two"}
{0: 'two'}
There are two things happening here to create this surprise. The starting point is this:
>>> print hash(1), hash(True)
1 1
At one level, Python has made True have the same hash value as 1. Actually that's not quite right, so let me show you the real story:
>>> isinstance(True, int)
True
Python has literally made bool, the type that True and False are instances of, be a subclass of int. They do not merely look like numbers; they are numbers. As numbers their hash identity is their literal value of 1 or 0, and of course they also compare equal to literal 1 or 0. Since they hash to the same identity and compare equal, we run into the issue with 'custom' equalities and hashes in dictionaries, where Python considers the two different objects to be the same key and everything gets confusing.
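You can reproduce the general case of this without bool at all. Here is a minimal sketch (the Key class is my own illustration, not from the tweet) of a deliberately 'custom' equality and hash, where two distinct objects that hash the same and compare equal get treated as a single dictionary key:

```python
class Key(object):
    def __init__(self, name, value):
        self.name = name      # only for telling instances apart
        self.value = value

    def __eq__(self, other):
        # compare equal on .value alone, the way True == 1
        return isinstance(other, Key) and self.value == other.value

    def __hash__(self):
        # hash on .value alone, the way hash(True) == hash(1)
        return hash(self.value)

a = Key("a", 1)
b = Key("b", 1)
d = {a: "one", b: "two"}
print(len(d))               # 1: a and b collapsed into one entry
print(next(iter(d)).name)   # a: the key object is still the first one
print(d[a])                 # two: but the value came from the second store
```

This is exactly the True-versus-1 situation in miniature: same hash, equal comparison, one dictionary slot.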
(That True and False hash to the same thing as 1 and 0 is probably not a deliberate choice. The internal bool type doesn't have a custom hash function; it just tacitly inherits the hash function of its parent, ie int. I believe that Python could change this if it wanted to, which would make the surprise here go away.)
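To see why a custom hash would make the surprise go away, here is a hypothetical sketch (MyBool is my own stand-in; real bool does not do this) of an int subclass whose hash differs from its parent's. Dictionaries check stored hashes before equality, so keys with different hashes never merge:

```python
class MyBool(int):
    def __hash__(self):
        # deliberately not the inherited int hash; this breaks the usual
        # rule that equal objects must hash equal, which is the point here
        return hash((bool(self),))

T = MyBool(1)
print(T == 1)               # True: still compares equal to 1
d = {1: "one", T: "two"}
print(len(d))               # 2: with different hashes, 1 and T stay separate keys
```

Of course, actually doing this for bool would violate the documented requirement that objects comparing equal have equal hashes, so things like `d[True]` and `d[1]` would stop agreeing; there's a reason Python hasn't made this change.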
The other thing is what happens when you create a dictionary with
literal syntax, which is that Python generates bytecode that stores
each initial value into the dictionary one after another in the
order that you wrote them. It turns out that when you do a redundant
store into a dictionary (ie you store something for a key that
already exists), Python only replaces the value, not both the key
and the value. This is why the result is not {True: 'two'}; only the value got overwritten in the second store.
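The same value-only replacement shows up without bool, using 1 and 1.0 (which also hash the same and compare equal):

```python
d = {1.0: "one"}
d[1] = "two"                     # a redundant store: 1 == 1.0, hash(1) == hash(1.0)
print(d)                         # {1.0: 'two'}
print(type(next(iter(d))))       # float: the original key object survived
```

The second store found an existing entry for an equal key and replaced just its value, leaving the float key in place.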
(This decision is a sensible one because it may avoid object churn and the overhead associated with it. If Python replaced the key as well it would at least be changing more reference counts on the key objects. And under normal circumstances you're never going to notice the difference unless you're going out of your way to look.)
PS: It turns out that @hackedy beat me to discovering that bools are ints. Also the class documentation for bool says this explicitly (and notes that bool is one of the rare Python classes that can't be subclassed).
Some thoughts on SAN long-term storage migration
In theory one of the advantages of having a SAN instead of simple disk servers is relatively painless storage migration over the long term, where given a suitable setup and suitable software you can more or less transparently migrate data from old storage backends to new ones without any particular user-visible downtime. In practice I've come around to the idea that we may never be able to do this in our fileserver environment and that in general it takes a particular sort of environment to make it work.
Put simply, the disks you get today are generally going to be significantly different than the disks of four or five years ago, in both capacity and perhaps performance (if you go to SSDs). These differences may well cause you to want to reshape how your storage is laid out to do things like consolidate to fewer spindles (or to spread out to even more to preserve IOPs while growing space). So in order to have transparent storage migration you need not just a SAN but frontend storage software that can do this sort of potentially drastic rearrangement of storage. Replacing individual disks with bigger disks is something that almost every storage system can do, but that's the simple case. It's less common to have support for transformations like moving from two smaller disks to one bigger disk or moving from a few big disks to more smaller disks (as might happen if you went from HDs to SSDs).
Put another way, you often want to design different storage setups today than four or five years ago even if you keep the same top level technology (eg ZFS). Given this, a transparent migration either requires some way to transmogrify the storage setups from five years ago into the storage setups of today or just living with a five year old setup (which is often less than desirable).
While our hand was forced by ZFS this time around, this is certainly one thing that probably would have biased us towards a non-transparent migration anyways. Moving from 750 GB and 1TB disks to 2TB disks caused us to roughly double our standard chunk size, but politics mean we can't give people double the space for free and anyways some people have space split across multiple ZFS pools (on different fileservers) that we'd like to consolidate into one. Doing the necessary reshaping transparently would take features that ZFS just doesn't have. I suspect that we'll run into similar issues in four or so years when we next migrate to another set of hardware and disks; the disks of four or five years from now are hopefully going to be significantly different from today.
Our migration from our previous generation DiskSuite based fileservers to our ZFS ones was obviously forced to be non-transparent because we were moving to ZFS, but in retrospect it also had the same issue; we moved from 35 GB 'chunks' to much larger ones and that would again have required support for major storage reshaping even if we'd stuck with DiskSuite.
(And in general we've been lucky that ZFS has remained a viable option for us across this many years. If Sun had not released ZFS and Solaris as open source we'd probably be migrating away from ZFS now, just as we migrated away from DiskSuite last time. This ties into the arrogance of trying to plan for long term storage management in the face of all of the uncertainty in the technology world.)