Link: Armin Ronacher's 'More About Unicode in Python 2 and 3'
Armin Ronacher's More About Unicode in Python 2 and 3 contains a lot of information about the subject from someone who works with this stuff and so is much better informed about it in practice than I am. A sample quote:
I will use this post to show that from the pure design of the language and standard library why Python 2 the better language for dealing with text and bytes.
Since I have to maintain lots of code that deals exactly with the path between Unicode and bytes this regression from 2 to 3 has caused me lots of grief. Especially when I see slides by core Python maintainers about how I should trust them that 3.3 is better than 2.7 makes me more than angry.
I learned at least two surprising things from reading this. The first was that I hadn't previously realized that string formatting is not available for bytes in Python 3, only for Unicode strings. The second is that Mercurial has not and is not being ported to Python 3. As Ronacher notes, it turns out that these two issues are not unrelated.
For me, the lack of formatting for bytes adds another reason for not using Python 3 even for new code because it forces me into more Unicode conversion even if I know exactly what I'm doing with those unconverted bytes. Since I use Unix, with its large collection of non-Unicode byte APIs, there are times when this matters.
(For instance, it is perfectly sensible to manipulate Unix file paths as bytes without trying to convert them to Unicode. You can split them into path components, add prefixes and suffixes, and so on all without having to interpret the character sets of the file name components. In fact, in degenerate situations the file name components may be in different character sets, with a directory name in UTF-8 and file name inside a subdirectory in something else. At that point there is no way to properly decode the entire file path to meaningful Unicode. But I digress from Armin Ronacher's article.)
|
|