Python 3 forced its own hand so that standard input had to be Unicode
In a comment on my entry on dealing with bad characters in stdin, Peter Donis said:
I've always thought it was a bad idea for Python 3 to make the standard streams default to Unicode text instead of bytes. [...]
I'm sure this was a deliberate design choice on the part of the
Python developers, but they tied their own hands so that it would
have been infeasible in early versions of Python 3 to make sys.stdin
bytes instead of a Unicode stream. The problem is that the initial
version of the bytes type was fairly minimal, and in particular the
bytes type did not have any formatting operator until Python 3.5
added the traditional % formatting through PEP 461.
(Even today bytes doesn't have a .format() method. I'm honestly
surprised that the Python 3.0 bytes type had as many string methods
as it did.)
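As a quick illustration of where bytes formatting stands, here is a minimal sketch that assumes Python 3.5 or later (on earlier Python 3 versions the % operation raises TypeError, which the sketch catches). The function name bytes_formatting_support is made up for this example.

```python
import sys


def bytes_formatting_support():
    """Report which formatting operations the bytes type offers.

    Hypothetical helper for illustration only.
    """
    # bytes has never gained a .format() method.
    has_format_method = hasattr(bytes, "format")
    try:
        # PEP 461 (Python 3.5+) restored % interpolation on bytes.
        percent_result = b"%s: %d" % (b"count", 3)
    except TypeError:
        # Python 3.0 through 3.4: no % formatting on bytes at all.
        percent_result = None
    return has_format_method, percent_result


has_format, formatted = bytes_formatting_support()
print(sys.version_info[:2], has_format, formatted)
```

On a pre-3.5 interpreter the second value should come back as None, which is the situation the rest of this entry is about.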
Since Python 3.0 bytes are basically the Python 2 str type, the
missing formatting pretty much has to be code that was deliberately
removed, not code that the Python developers didn't write. As part of the
philosophy of Python 3, they decided that you should only be able
to do what the PEP calls 'interpolation' on Unicode strings, not
on un-decoded bytes.
Without a formatting operation on bytes, you can't really do too
much with them in Python 3.0 other than turn them into Unicode
strings. You certainly can't do the kind of stream processing of
standard input (and writing to standard output) that's normal for
a lot of filter style Unix programs. In this environment, making
sys.stdin return bytes instead of Unicode strings is only going
to annoy people. It's also asymmetric with sys.stdout and
sys.stderr, again partly because of formatting. Since you can
only format Unicode strings and people are going to want to format
quite a lot of what they print out, those pretty much have to accept
Unicode strings. Unless you want to make Unicode strings automatically
convert to bytes, this pushes you to sys.stdout and sys.stderr
being text, not bytes.
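For contrast, here is a rough sketch of the kind of bytes-only filter that became practical once % formatting returned to bytes. It assumes Python 3.5 or later and would not work on Python 3.0. The number_lines function and the sample data are invented for illustration, and the in-memory io.BytesIO streams stand in for real standard input and output.

```python
import io


def number_lines(src, dst):
    """Copy lines from binary stream src to binary stream dst,
    prefixing each with its line number, without ever decoding.

    Hypothetical example filter; relies on bytes % formatting (3.5+).
    """
    for lineno, line in enumerate(src, start=1):
        # %s interpolates the raw bytes unchanged, invalid UTF-8 included.
        dst.write(b"%d: %s" % (lineno, line))


# The second line is deliberately not valid UTF-8.
demo_in = io.BytesIO(b"alpha\n\xff\xfe not valid UTF-8\n")
demo_out = io.BytesIO()
number_lines(demo_in, demo_out)
numbered = demo_out.getvalue()
print(numbered)
```

In an actual filter you would presumably pass sys.stdin.buffer and sys.stdout.buffer, the binary streams underneath the text ones, instead of the BytesIO stand-ins.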
(Python 3 has to do this automatic conversion from Unicode to
bytes on output somewhere, but doing it when the actual IO happens
and making sys.stdout be (Unicode) text is more readily understood.
Then the magic conversion is in the OS-specific magic layer.)
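A small sketch of that layering, using an io.TextIOWrapper over an in-memory io.BytesIO as a stand-in for sys.stdout and the underlying file descriptor. The UTF-8 encoding and the variable names are assumptions for this example; the real sys.stdout picks its encoding from the environment.

```python
import io

# Stand-in for the byte-oriented destination underneath sys.stdout.
raw = io.BytesIO()
# Text layer on top; newline="\n" turns off newline translation so the
# result is the same on every platform.
text_out = io.TextIOWrapper(raw, encoding="utf-8", newline="\n")

# Formatting happens on Unicode strings ...
text_out.write("caf\u00e9 {}\n".format(3))
text_out.flush()

# ... and the conversion to bytes happens when the IO is done.
encoded = raw.getvalue()
print(encoded)

# Writing a str straight to the byte layer is rejected.
try:
    raw.write("text")
    str_rejected = False
except TypeError:
    str_rejected = True
```

The program only ever hands Unicode strings to the text layer, and the encoding step stays out of sight in the wrapper.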
All of this fits with Python 3's general philosophy, of course. Python 3 really wants the world of text to be Unicode, and that includes input and output. Providing standard input as bytes and making it easy to process those bytes without ever turning them into Unicode would invite a return to the Python 2 world where people processed text in non-Unicode ways. Arguably, Unicode text processing is the reason for Python 3 to exist, so it's not surprising that the Python developers were so strongly against anything that smelled like a return to non-Unicode text processing.