Strings in Python 2 and Python 3
This started life as a reply to a comment on my entry about my issues with Unicode in Python 3 but grew, so I'm making it into an entry of its own. A commentator wrote:
If you want to read a sequence of bytes -- from, say, a file -- you can do that in Python 3. You just have to explicitly ask for it, and the datatype you get back will not be str. It shouldn't be! A str is meant to represent an abstract sequence of characters, and bytes are not that.
I disagree with this view of str and strings. What strings represent
is a (subjective) language design decision, not the universal answer
presented here. Python 3 chooses to say that strings
represent Unicode code points, while Python 2 and plenty of other
languages have decided that they represent raw bytes. Neither is right
or wrong, although the second is both less abstract and far more common.
(Note that Unicode code points are both more and less than abstract characters; the two are definitely not the same thing.)
What Python 2 used to do was read in sequences of bytes and decode them for you, assuming they were ASCII-encoded. That led to oodles of problems where people would write code that worked fine until they received a non-ASCII character, and then crash horribly.
This is not what Python 2 did at all; if anything, it's more a description of how Python 3 works, since Python 3 really wants to automatically decode things to Unicode the moment your program looks at them. Both Python 2 and Python 3 use your locale's encoding as the default character encoding, not ASCII.
(ASCII comes into it because people operating in the C locale get ASCII as their character encoding, at least in CPython, and you wind up in this locale if your locale information is unset.)
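You can ask Python what it considers your locale's preferred encoding; here's a minimal sketch (the exact answer depends on your environment, and recent CPython versions may coerce an unset C locale to UTF-8 rather than ASCII):

```python
import locale

# Ask the locale machinery for the default character encoding.
# Passing False means "don't call setlocale() first"; what you get
# back reflects your environment (LANG, LC_CTYPE, and so on).
enc = locale.getpreferredencoding(False)
print("preferred encoding:", enc)
```

On a typical modern Linux setup this reports UTF-8; in a genuinely unset C locale on older CPython versions you'd see ASCII instead.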
The general difference between Python 2 and Python 3 comes down to two things.
First, Python 3's interfaces normally all return Unicode strings and
Python 2's interfaces normally return (byte) strings; for example, if
you .read() from a normally opened file you get back a byte string
in Python 2 and a Unicode string in Python 3.
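You can see this first difference directly in Python 3, where you have to explicitly ask for binary mode to get what a plain Python 2 open() and .read() gave you (this sketch uses a throwaway temporary file just for illustration):

```python
import os
import tempfile

# Write a few raw bytes to a scratch file.
fd, path = tempfile.mkstemp()
os.write(fd, b"hello")
os.close(fd)

# A normally opened file in Python 3 is in text mode, so .read()
# decodes the bytes and hands you a Unicode str.
with open(path) as f:
    data = f.read()
print(type(data).__name__)   # str

# Binary mode gives you bytes, which is what Python 2's plain
# open() plus .read() returned.
with open(path, "rb") as f:
    raw = f.read()
print(type(raw).__name__)    # bytes

os.remove(path)
```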
Second, Python 2 will try to convert byte strings to Unicode strings
if you try to do something that combines the two and Python 3 will
not (you'll get various error messages about being unable to mix
bytes and str). Note that both Python 2 and Python 3 will try to
convert back and forth between Unicode and bytes if you're trying
to interact with the outside world with Unicode. If anything Python 3
does more automatic conversions here because more of its interfaces
with the outside world default to using Unicode.
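The second difference is easy to demonstrate: where Python 2 would quietly decode the byte string and attempt the combination, Python 3 refuses outright.

```python
# In Python 2, b"abc" + "def" would implicitly decode the byte string
# (using the default encoding) and give you "abcdef". Python 3 refuses
# to mix the two types and raises a TypeError instead.
try:
    result = b"abc" + "def"
except TypeError as exc:
    print("TypeError:", exc)
```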
(This means that quite a lot of operations can raise UnicodeDecodeError in Python 3, which has consequences for any code that believes it's handling all file IO errors by catching EnvironmentError.)
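A sketch of how this bites: the decode happens inside .read(), and the resulting UnicodeDecodeError descends from ValueError, not from OSError (Python 3's name for EnvironmentError), so an IO-only exception handler sails right past it.

```python
import os
import tempfile

# Write bytes that can never be valid UTF-8 (0xff is not a legal
# UTF-8 start byte) to a scratch file.
fd, path = tempfile.mkstemp()
os.write(fd, b"\xff\xfe")
os.close(fd)

# Opening in text mode means .read() must decode, and the failure
# surfaces as UnicodeDecodeError rather than any sort of IO error.
try:
    with open(path, encoding="utf-8") as f:
        f.read()
except UnicodeDecodeError as exc:
    print("decode failed:", exc.reason)

# A handler that catches only OSError/EnvironmentError misses this.
print(issubclass(UnicodeDecodeError, OSError))   # False

os.remove(path)
```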
Python 2 code works fine with random non-ASCII characters if you don't ever try to convert things to Unicode (I have plenty of code like this). What trips people up is mixing Unicode and non-Unicode strings because then you have bytestrings being decoded to Unicode at random times where you didn't realize it (and so didn't catch decoding errors).
Python 3 solves this problem by force majeure, in that it no longer does these automatic up-conversions. If it had been content to stop there things would be fine; instead, it decided to also add a lot more automatic conversions (for various reasons). These automatic conversions are just as problematic as before but have the minor improvement that they now occur mostly at the boundaries of your program instead of at random points throughout it.
In other words, the failure points were still there in Python 2. They were just implicitly called instead of explicitly.
As should now be clear, I strongly disagree with this. It takes a significant amount of effort to use Python 3 without implicit failure points and is in fact relatively unnatural, while it's easy to use Python 2 without them.