Python 3 forced its own hand so that standard input had to be Unicode

October 31, 2021

In a comment on my entry on dealing with bad characters in stdin, Peter Donis said:

I've always thought it was a bad idea for Python 3 to make the standard streams default to Unicode text instead of bytes. [...]

I'm sure this was a deliberate design choice on the part of the Python developers, but they tied their own hands so that it would have been infeasible in early versions of Python 3 to make sys.stdin bytes instead of a Unicode stream. The problem is that the initial version of the bytes type was fairly minimal; in particular, it had no formatting operator until Python 3.5 added the traditional % formatting through PEP 461.

(Even today bytes doesn't have a .format() method. I'm honestly surprised that the Python 3.0 bytes type had as many string methods as it did.)
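As a quick illustration of the current state of things (in any Python 3.5 or later):

```python
# PEP 461 (Python 3.5) restored %-style formatting for bytes:
line = b"%s: %d bytes" % (b"stdin", 128)
assert line == b"stdin: 128 bytes"

# But bytes still has no .format() method, unlike str:
assert not hasattr(bytes, "format")
assert hasattr(str, "format")
```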

Since the Python 3.0 bytes type is basically the Python 2 str type, this pretty much has to have been deliberately removed code, not code that the Python developers simply didn't write. As part of the philosophy of Python 3, they decided that you should only be able to do what the PEP calls 'interpolation' on Unicode strings, not on un-decoded bytes.

Without a formatting operation on bytes, you can't really do too much with them in Python 3.0 other than turn them into Unicode strings. You certainly can't do the kind of stream processing of standard input (and writing to standard output) that's normal for a lot of filter style Unix programs. In this environment, making sys.stdin return bytes instead of Unicode strings is only going to annoy people. It's also asymmetric with sys.stdout and sys.stderr, again partly because of formatting. Since you can only format Unicode strings and people are going to want to format quite a lot of what they print out, those pretty much have to accept Unicode strings. Unless you want to make Unicode strings automatically convert to bytes, this pushes you to sys.stdout and sys.stderr being text, not bytes.

(Python 3 has to do this automatic conversion from Unicode to bytes on output somewhere, but doing it when the actual IO happens and making sys.stdout be (Unicode) text is more readily understood. Then the magic conversion is in the OS specific magic layer.)

All of this fits with Python 3's general philosophy, of course. Python 3 really wants the world of text to be Unicode, and that includes input and output. Providing standard input as bytes and making it easy to process those bytes without ever turning them into Unicode would invite a return to the Python 2 world where people processed text in non-Unicode ways. Arguably, Unicode text processing is the reason for Python 3 to exist, so it's not surprising that the Python developers were so strongly against anything that smelled like it.


Comments on this page:

I agree that the Python 3 developers really wanted the world of text to be Unicode. The problem is that the programmer still has to translate between "the world of text" that the program processes internally and the stuff that comes in from and goes out to the rest of the world. And that means the programmer needs to have control over those I/O operations.

With file objects, that's not a problem, because the programmer controls when they get opened, and they have an encoding parameter, so the programmer can always control the translation. But the standard streams are not like that: the programmer doesn't control when they get opened, and (until Python 3.9--way, way, way too long) can't control their encoding.

So not only does the programmer now see a UnicodeDecodeError or UnicodeEncodeError where no error at all used to be thrown, he has no way of fixing his program to avoid it (at least not without extremely ugly hacks), because the error is not in the program, it's deep in the bowels of the Python interpreter where the standard streams do their implicit decode/encode operations. Ironically, the Python 3 developers violated one of the key items in the Zen of Python--explicit is better than implicit--in what was, as you note, probably the main reason Python 3 existed in the first place.
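For illustration, the implicit decoding that the text-mode standard streams perform is essentially this, in miniature:

```python
bad = b"abc\xff\n"   # not valid UTF-8

# With strict decoding (the text stream's default behaviour),
# you get an exception:
try:
    bad.decode("utf-8")
except UnicodeDecodeError:
    pass

# With the 'surrogateescape' error handler, the bad byte survives
# as a lone surrogate and can be re-encoded losslessly later:
text = bad.decode("utf-8", errors="surrogateescape")
assert text == "abc\udcff\n"
assert text.encode("utf-8", errors="surrogateescape") == bad
```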

One suspicion I've had ever since the Python 3 transition started is that the "all text is Unicode" paradigm was driven by Windows developers. Windows tries very hard to maintain the illusion that the stuff coming in from the outside world is actually Unicode, whereas Unix does not.

By Phillip at 2021-10-31 03:15:50:

This blog article from 2014 (and a follow up one) summarize the Unicode mess quite nicely, imho: https://lucumr.pocoo.org/2014/1/5/unicode-in-2-and-3/

By Perry Lorier at 2021-10-31 08:00:00:

Java (IMHO) made the same mistake. Originally all filenames were Unicode, so if you wanted to get a directory listing and there were non-Unicode filenames in there, they would get skipped.

This meant that you couldn't safely implement rm -rf in java. So java had to shell out to run /bin/rm -rf directly to reliably clean up a directory tree. I ran into this because the java code I was using had a security manager that checked that executables it ran were not world writable. When /bin/rm became a symlink to /usr/bin/rm it would break, because the security manager did a lstat(2) instead of stat(2) to verify this.

The outside world needs to be "bytes", and then you need a quick and easy way to convert them to Unicode (if that's what you want to do with them), or you just treat them as "bytes" and don't touch them.

Sigh.

By Joseph at 2021-10-31 08:08:15:

This is why I argue that Python's approach was fundamentally flawed. The far better approach is to treat the outside world as bytes and then provide facilities in the standard library to handle UTF-8 where appropriate. I think this is one area that Go got mostly right.

https://go.dev/blog/strings

Which is not surprising, since Rob Pike is one of the creators of UTF-8.

From 193.219.181.219 at 2021-10-31 09:37:36:

Every time I have to deal with binary stdio in Python 3, I tell myself, "it could have been worse – it could've been like Perl". (Perl is pretty much the complete opposite of how Python 3 handled it – there's just one string type for both situations and it can consist of bytes or 'wide' Unicode characters, or a mix of both if you accidentally concatenate two strings of different flavours, and the result will end up displaying wrong no matter what you do with it. This has caused so many headaches...)

But the standard streams are not like that: the programmer doesn't control when they get opened, and (until Python 3.9--way, way, way too long) can't control their encoding

No, in all Python 3.x versions you could get binary stdio via sys.stdin.detach() and re-wrap it as desired, it was just a bit more annoying than the new reconfigure() that we have now.
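A sketch of that detach-and-rewrap approach, written against a generic text stream so it can be demonstrated with io.BytesIO; with the real standard input you would do sys.stdin = rewrap(sys.stdin, "latin-1") or similar:

```python
import io

def rewrap(text_stream, encoding, errors="strict"):
    # .detach() peels off the text layer and returns the underlying
    # binary buffer, which we wrap again with the encoding we want.
    # (Python 3.7+ offers text_stream.reconfigure() instead.)
    return io.TextIOWrapper(text_stream.detach(),
                            encoding=encoding, errors=errors)
```

After detach(), the original text stream object is unusable; you have to use (or assign back) the new wrapper.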

Java (IMHO) made the same mistake. Originally all filenames were unicode. So if you wanted to get a directory listing and there were non-unicode filenames in there they would get skipped.

Wow, that sounds seriously unpleasant.

Now I checked Python's os.listdir out of curiosity, and fortunately it handles this situation reasonably well: by default it'll implicitly use "surrogate escape" decoding for non-UTF-8 filenames, but if you pass the path as bytes all returned filenames will be raw bytes as well.
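Assuming a Unix system whose filesystem allows arbitrary bytes in filenames, that os.listdir behaviour can be demonstrated like this:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    # Create a file whose name is not valid UTF-8.
    name = b"bad-\xff-name"
    open(os.path.join(os.fsencode(d), name), "w").close()

    as_str = os.listdir(d)[0]                 # decoded with surrogateescape
    as_bytes = os.listdir(os.fsencode(d))[0]  # raw bytes, untouched

    assert as_bytes == name
    # The surrogate-escaped str round-trips back to the original bytes:
    assert os.fsencode(as_str) == name
```

Under a UTF-8 locale the str form comes back as 'bad-\udcff-name'; either way, os.fsencode() recovers the original bytes.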

By cks at 2021-11-01 16:23:51:

One problem with using bytes for everything is that operating system filenames are not always bytes. Some Unix filesystems require UTF-8 (sometimes with some form of normalization), and I believe that Windows is natively UTF-16. Also, I believe that the Windows console is similarly at least sometimes natively UTF-16 too, which affects stdin, stdout, and stderr when your program is running interactively.

The whole situation is a mess.

Written on 31 October 2021.