python/StringsPython2And3 written at 02:02:30
Strings in Python 2 and Python 3
This started life as a reply to a comment on my entry about my issues with Unicode in Python 3 but grew, so I'm making it into an entry of its own. A commentator wrote:
I disagree with this view of
(Note that Unicode code points are both more and less than abstract characters; the two are definitely not the same thing.)
This is not what Python 2 did at all; if anything, it's more a description of how Python 3 works, since Python 3 really wants to automatically decode things to Unicode the moment your program looks at them. Both Python 2 and Python 3 use your locale's encoding as the default character encoding, not ASCII.
(ASCII comes into it because people operating in the C locale get ASCII as their character encoding, at least in CPython, and you wind up in this locale if your locale information is unset.)
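If you want to see what your own environment reports, here is a minimal sketch for Python 3. The function name `report_encodings` is my own invention for illustration. The results depend on your locale settings and Python version (as far as I know, newer Python 3 releases may coerce the C locale to UTF-8), so treat the output as something to inspect rather than something to rely on.

```python
import locale
import sys


def report_encodings():
    """Collect the encodings this interpreter reports (hypothetical helper)."""
    return {
        # What open() falls back to in text mode when no encoding= is given
        # (outside of UTF-8 mode).
        "preferred": locale.getpreferredencoding(False),
        # What is used for file names and similar OS-level strings.
        "filesystem": sys.getfilesystemencoding(),
        # What str.encode() and bytes.decode() use with no argument.
        "default": sys.getdefaultencoding(),
    }


if __name__ == "__main__":
    for name, encoding in sorted(report_encodings().items()):
        print(name, encoding)
```

Running this once normally and once with `LC_ALL=C` set in the environment is a quick way to see how much your locale changes things.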
The general difference between Python 2 and Python 3 comes down to two things.
First, Python 3's interfaces normally all return Unicode strings and
Python 2's interfaces normally return (byte) strings; for example, if
(This means that quite a lot of operations can raise UnicodeDecodeError in Python 3, which has consequences for any code that believes it's handling all file I/O errors by catching EnvironmentError; UnicodeDecodeError is a ValueError, not an EnvironmentError.)
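To make the UnicodeDecodeError issue concrete, here is a small Python 3 sketch. The helper names (`read_text_catching_io_errors`, `demo`) and the sample bytes are made up for illustration, and I pass `encoding="utf-8"` explicitly so the behaviour doesn't depend on your locale. The reader function looks like it handles every file I/O error, but a decoding failure sails right past it.

```python
import os
import tempfile


def read_text_catching_io_errors(path):
    """Read a text file, 'handling' errors by catching EnvironmentError only."""
    try:
        with open(path, encoding="utf-8") as f:
            return f.read()
    except EnvironmentError:
        # Catches missing files, permission problems and so on,
        # but not decoding failures.
        return None


def demo():
    """Show that a decoding failure escapes the EnvironmentError handler."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"caf\xe9\n")  # Latin-1 bytes; not valid UTF-8
        try:
            read_text_catching_io_errors(path)
        except UnicodeDecodeError as e:
            return (
                type(e).__name__,
                isinstance(e, EnvironmentError),
                isinstance(e, ValueError),
            )
        return None
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

Code that wants to survive badly encoded input has to either catch UnicodeDecodeError as well, open the file in binary mode, or pass an `errors=` handler to `open()`.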
Python 2 code works fine with arbitrary non-ASCII bytes as long as you never try to convert things to Unicode (I have plenty of code like this). What trips people up is mixing Unicode and non-Unicode strings, because then bytestrings get decoded to Unicode at random times where you didn't realize it was happening (and so didn't catch decoding errors).
Python 3 solves this problem by force majeure, in that it no longer does these automatic up-conversions. If it had been content to stop there, things would be fine; instead, it decided to also add a lot more automatic conversions (for various reasons). These automatic conversions are just as problematic as before but have the minor improvement that they now occur mostly at the boundaries of your program instead of at random points throughout it.
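As a rough illustration of the first half of this, here is a Python 3 sketch of what happens when you mix the two string types. The function names `mix` and `explicit` are hypothetical. Where Python 2 would have quietly tried to decode the bytestring, Python 3 refuses outright, and any decoding has to be something you wrote down yourself.

```python
def mix(bytestring, text):
    """Try to concatenate bytes and str directly, Python 2 style."""
    try:
        return bytestring + text
    except TypeError:
        # Python 3 does no implicit decoding here; mixing is simply an error.
        return None


def explicit(bytestring, text, encoding="utf-8"):
    """Decode explicitly, so the possible failure point is visible in the code."""
    return bytestring.decode(encoding) + text


if __name__ == "__main__":
    print(mix(b"caf\xc3\xa9", " au lait"))
    print(explicit(b"caf\xc3\xa9", " au lait"))
```

The explicit version can still raise UnicodeDecodeError on bad input, but at least it does so at a place you chose.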
As should now be clear, I strongly disagree with this. It takes a significant amount of effort to use Python 3 without implicit failure points and is in fact relatively unnatural, while it's easy to use Python 2 without them.