Strings in Python 2 and Python 3

July 18, 2012

This started life as a reply to a comment on my entry about my issues with Unicode in Python 3, but it grew, so I'm making it into an entry of its own. A commenter wrote:

"If you want to read a sequence of bytes -- from, say, a file -- you can do that in Python 3. You just have to explicitly ask for it, and the datatype you get back will not be str. It shouldn't be! A str is meant to represent an abstract sequence of characters, and bytes are not that."

I disagree with this view of str and strings. What strings represent is a (subjective) language design decision, not the universal answer presented here. Python 3 chooses to say that strings and str should represent Unicode code points, while Python 2 and plenty of other languages have decided that they represent raw bytes. Neither is right or wrong, although the second is both less abstract and far more common.

(Note that Unicode code points are both more and less than abstract characters; the two are definitely not the same thing.)

"What Python 2 used to do was read in sequences of bytes and decode them for you, assuming they were ASCII-encoded. That led to oodles of problems where people would write code that worked fine until they received a non-ASCII character, and then crash horribly."

This is not what Python 2 did at all; if anything, it's more a description of how Python 3 works, since Python 3 really wants to automatically decode things to Unicode the moment your program looks at them. Both Python 2 and Python 3 use your locale's encoding as the default character encoding, not ASCII.

(ASCII comes into it because people operating in the C locale get ASCII as their character encoding, at least in CPython, and you wind up in this locale if your locale information is unset.)
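To see which encodings a given Python 3 process will actually use, you can ask it directly. This is a minimal sketch, assuming a CPython 3 without UTF-8 mode forced on; the variable names are only for illustration, and the values printed depend entirely on your environment (run it with LANG unset or LC_ALL=C to see the C locale case).

```python
import locale
import sys

# Encoding that open() uses for text mode when no encoding= is given
# (assumption: UTF-8 mode is not enabled, which would override the locale).
text_io_default = locale.getpreferredencoding(False)

# Encoding used to convert file names and other OS-level strings.
filesystem_default = sys.getfilesystemencoding()

print("text IO default:", text_io_default)
print("filesystem encoding:", filesystem_default)
```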

Python 2 and Python 3 differ in two general ways. First, Python 3's interfaces normally return Unicode strings while Python 2's interfaces normally return (byte) strings; for example, if you call .read() on a normally opened file you get back a byte string in Python 2 and a Unicode string in Python 3. Second, Python 2 will try to convert byte strings to Unicode strings if you do something that combines the two, while Python 3 will not (you'll get various error messages about being unable to mix bytes and str). Note that both Python 2 and Python 3 will try to convert back and forth between Unicode and bytes if you're using Unicode to interact with the outside world. If anything, Python 3 does more automatic conversions here because more of its interfaces with the outside world default to using Unicode.
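Both differences are easy to see from the Python 3 side. The following is a small sketch for Python 3 only; the scratch file and variable names are made up for the example, and the encoding is passed explicitly so that the result does not depend on your locale.

```python
import os
import tempfile

# Write a few ASCII bytes to a scratch file.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"hello\n")

# Text mode (the default) decodes for you and returns str.
with open(path, "r", encoding="ascii") as f:
    text = f.read()

# Binary mode must be asked for explicitly and returns bytes.
with open(path, "rb") as f:
    raw = f.read()

os.remove(path)

# Python 3 refuses to combine the two types instead of converting.
try:
    text + raw
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(type(text).__name__, type(raw).__name__, mixed_ok)
# prints: str bytes False
```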

(This means that quite a lot of operations can raise UnicodeDecodeError in Python 3, which has consequences for any code that believes it's handling all file IO errors by catching EnvironmentError.)
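A short Python 3 sketch of that consequence: a plain read of a file that contains a byte invalid in the chosen encoding raises UnicodeDecodeError, which is a ValueError and so passes straight through an except EnvironmentError clause. The scratch file and the caught_by variable are hypothetical, and the encoding is given explicitly only to make the failure reproducible regardless of locale.

```python
import os
import tempfile

# A scratch file holding one byte (0xE9) that is not valid UTF-8 here.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"caf\xe9\n")

caught_by = None
try:
    try:
        with open(path, "r", encoding="utf-8") as f:
            f.read()
    except EnvironmentError:
        # The handler that is supposed to cover "all file IO errors".
        caught_by = "EnvironmentError"
except UnicodeDecodeError:
    # The decoding failure escapes the inner handler and lands here.
    caught_by = "UnicodeDecodeError"
finally:
    os.remove(path)

print(caught_by)
# prints: UnicodeDecodeError
print(issubclass(UnicodeDecodeError, EnvironmentError))
# prints: False
print(issubclass(UnicodeDecodeError, ValueError))
# prints: True
```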

Python 2 code works fine with random non-ASCII characters if you don't ever try to convert things to Unicode (I have plenty of code like this). What trips people up is mixing Unicode and non-Unicode strings, because then byte strings get decoded to Unicode at random points where you didn't realize it was happening (and so didn't catch decoding errors).

Python 3 solves this problem by force majeure, in that it no longer does these automatic up-conversions. If it had been content to stop there things would be fine; instead, it decided to also add a lot more automatic conversions (for various reasons). These automatic conversions are just as problematic as before but have the minor improvement that they now occur mostly at the boundaries of your program instead of at random points throughout it.

"In other words, the failure points were still there in Python 2. They were just implicitly called instead of explicitly."

As should now be clear, I strongly disagree with this. It takes a significant amount of effort to use Python 3 without implicit failure points, and doing so is in fact relatively unnatural, while it's easy to use Python 2 without them.
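For a sense of what avoiding the implicit decoding looks like in Python 3, here is one possible sketch that stays in bytes from end to end. The function name upcase_ascii_file and the file names are hypothetical; the point is only that every file has to be opened in binary mode explicitly, after which arbitrary non-ASCII bytes pass through untouched.

```python
import os
import shutil
import tempfile

def upcase_ascii_file(src, dst):
    """Copy src to dst, upper-casing ASCII letters, without ever decoding."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        for line in fin:
            # bytes.upper() only changes ASCII letters; other bytes are kept as-is.
            fout.write(line.upper())

tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, "in.dat")
dst = os.path.join(tmpdir, "out.dat")

# Input with bytes that are not valid UTF-8.
with open(src, "wb") as f:
    f.write(b"na\xefve \xff text\n")

upcase_ascii_file(src, dst)
with open(dst, "rb") as f:
    result = f.read()
shutil.rmtree(tmpdir)

print(result)
# prints: b'NA\xefVE \xff TEXT\n'
```

The same opt-in is needed at the other boundaries: sys.stdin.buffer and sys.stdout.buffer instead of sys.stdin and sys.stdout, bytes arguments to functions such as os.listdir(), and (on Unix) os.environb instead of os.environ.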


Comments on this page:

From 78.86.151.9 at 2012-07-18 09:37:49:

Looks like you hit a gotcha in your markup system in the paragraph starting "The general difference between ..."! A numbered list entry appears in the middle of what I assume should read "If anything Python 3 does more ...".

By cks at 2012-07-18 10:05:58:

Whoops, yes; thank you for letting me know. I've fixed it now. Making bare numbers be the start of numbered lists turns out to be one of the worst mistakes I made in DWikiText.

(It's almost annoying enough to make me take it out and fix up any content with affected markup. Almost. I hate rewriting markup.)

Written on 18 July 2012.


Last modified: Wed Jul 18 02:02:30 2012