Python 3 forced its own hand so that standard input had to be Unicode

October 31, 2021

In a comment on my entry on dealing with bad characters in stdin, Peter Donis said:

I've always thought it was a bad idea for Python 3 to make the standard streams default to Unicode text instead of bytes. [...]

I'm sure this was a deliberate design choice on the part of the Python developers, but they tied their own hands so that it would have been infeasible in early versions of Python 3 to make sys.stdin bytes instead of a Unicode stream. The problem is that the initial version of the bytes type was fairly minimal, and in particular the bytes type did not have any formatting operator until Python 3.5 added the traditional % formatting through PEP 461.

(Even today bytes doesn't have a .format() method. I'm honestly surprised that the Python 3.0 bytes type had as many string methods as it did.)
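As a rough illustration of where bytes formatting stands today, here is a minimal sketch that assumes Python 3.5 or later. The names line, bytes_has_format and str_has_format are made up for this example.

```python
# PEP 461 brought %-interpolation back for bytes in Python 3.5.
# On Python 3.0 through 3.4 this line raises TypeError instead.
line = b"%s: %d bytes" % (b"stdin", 42)

# bytes still has no .format() method, although str does.
bytes_has_format = hasattr(b"", "format")
str_has_format = hasattr("", "format")

print(line)              # b'stdin: 42 bytes'
print(bytes_has_format)  # False
print(str_has_format)    # True
```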

Since Python 3.0 bytes are basically the Python 2 str type, the missing formatting pretty much has to be code that was deliberately removed, not code that the Python developers never wrote. As part of the philosophy of Python 3, they decided that you should only be able to do what the PEP calls 'interpolation' on Unicode strings, not on un-decoded bytes.

Without a formatting operation on bytes, you couldn't really do much with them in Python 3.0 other than turn them into Unicode strings. You certainly couldn't do the kind of stream processing of standard input (and writing to standard output) that's normal for a lot of filter-style Unix programs. In this environment, making sys.stdin return bytes instead of Unicode strings would only have annoyed people. It would also have been asymmetric with sys.stdout and sys.stderr, again partly because of formatting. Since you could only format Unicode strings and people want to format quite a lot of what they print out, those pretty much have to accept Unicode strings. Unless you want to make Unicode strings automatically convert to bytes, this pushes you to sys.stdout and sys.stderr being text, not bytes.
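For contrast, here is a sketch of the kind of bytes-only filter that became practical once bytes got % formatting back. It assumes Python 3.5 or later. The function number_lines is a hypothetical example, and the io.BytesIO objects are stand-ins for sys.stdin.buffer and sys.stdout.buffer, which are the binary streams underneath the text ones.

```python
import io


def number_lines(src, dst):
    """Copy lines from one binary stream to another, numbering each line.

    Nothing is ever decoded, so input that isn't valid UTF-8 passes through.
    """
    for n, line in enumerate(src, start=1):
        # bytes % formatting; this is what Python 3.0 through 3.4 lacked.
        dst.write(b"%d: %s" % (n, line))


# Stand-ins for sys.stdin.buffer and sys.stdout.buffer. The first line has
# a lone Latin-1 byte (0xE9) that would not decode as UTF-8.
src = io.BytesIO(b"caf\xe9\nplain\n")
dst = io.BytesIO()
number_lines(src, dst)
print(dst.getvalue())  # b'1: caf\xe9\n2: plain\n'
```

In a real filter you would call number_lines(sys.stdin.buffer, sys.stdout.buffer) instead of using the in-memory streams.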

(Python 3 has to do this automatic conversion from Unicode to bytes on output somewhere, but doing it when the actual IO happens and making sys.stdout be (Unicode) text is more readily understood. Then the magic conversion is in the OS-specific magic layer.)
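One way to see where that conversion lives is to build a text stream by hand. This is only a sketch: the names raw and text_out are invented here, and the UTF-8 encoding is an assumption for the example, since the real sys.stdout picks its encoding from the environment.

```python
import io

# A binary sink standing in for the real file descriptor.
raw = io.BytesIO()

# sys.stdout is normally an io.TextIOWrapper around a binary buffer; this
# builds one by hand so that the encoding step is visible.
text_out = io.TextIOWrapper(raw, encoding="utf-8", newline="\n")

# Formatting happens on Unicode strings...
text_out.write("na\u00efve {}\n".format(3))
text_out.flush()

# ...and the bytes only appear once the wrapper hands data to the buffer.
encoded = raw.getvalue()
print(encoded)  # b'na\xc3\xafve 3\n'
```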

All of this fits with Python 3's general philosophy, of course. Python 3 really wants the world of text to be Unicode, and that includes input and output. Providing standard input as bytes and making it easy to process those bytes without ever turning them into Unicode would invite a return to the Python 2 world where people processed text in non-Unicode ways. Arguably, Unicode text processing is the reason for Python 3 to exist, so it's not surprising that the Python developers were so strongly against anything that smelled like processing text as bytes.

Written on 31 October 2021.
