Things to do in Python 3 when your Unix standard input is badly encoded
Today I had a little adventure with Python 3. I have a program that takes standard input, reads and lightly processes a bunch of headers before writing them out, then just copies the body (of an email message, as it happens) from standard input to standard output. Normally it gets well formed input, with no illegally encoded UTF-8. Today, there were some stray bytes and the world blew up. Dealing with this was far harder than it should have been, partly because the documentation has issues.
Although the documentation for sys.stdin will not tell you this, sys.stdin most likely has the API of io.TextIOWrapper. Otherwise, your only method for finding out what attributes and methods it supports is the ever friendly 'help(type(sys.stdin))' in a Python interpreter. If you're on Python 3.7 or later, what you probably want to do about a badly encoded standard input is change how it handles encoding errors with .reconfigure():
sys.stdin.reconfigure(errors="surrogateescape")
Now that I've learned about this, I think that you should generally do this as the first operation in any Python 3 program that reads from standard input, unless you are absolutely sure that the input being not well-formed UTF-8 is a fatal error (it almost never is).
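To make the effect concrete, here is a small sketch that uses an in-memory stream as a stand-in for sys.stdin (the make_text_stream helper and the sample bytes are made up for illustration). It assumes Python 3.7 or later for .reconfigure(). It shows that strict decoding fails on a stray byte, while "surrogateescape" turns the byte into a lone surrogate that encodes back to the original byte when the same handler is used on output.

```python
import io

def make_text_stream(raw: bytes) -> io.TextIOWrapper:
    """Wrap raw bytes in a UTF-8 text stream, standing in for sys.stdin."""
    return io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8")

bad_input = b"Subject: caf\xe9\n"   # 0xE9 is Latin-1 here, not valid UTF-8

# Default (strict) error handling raises on the stray byte.
strict = make_text_stream(bad_input)
try:
    strict.readline()
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# After reconfigure(), the stray byte is decoded to a lone surrogate.
lenient = make_text_stream(bad_input)
lenient.reconfigure(errors="surrogateescape")
line = lenient.readline()

# Encoding with the same handler restores the original bytes.
round_tripped = line.encode("utf-8", errors="surrogateescape")
```

One consequence worth keeping in mind: strings that contain such surrogates cannot be written through a strict UTF-8 text stream, so if you print them you will likely want sys.stdout.reconfigure(errors="surrogateescape") as well.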
Unfortunately for me, Ubuntu 18.04 LTS has Python 3.6.9 as its /usr/bin/python3, so I can't do this. One option appears to be to detach the underlying io.BufferedReader behind sys.stdin and recreate sys.stdin on top of it with your desired error handling. I believe this would be:
b = sys.stdin.detach()
sys.stdin = io.TextIOWrapper(b, errors="surrogateescape")
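A slightly more careful sketch of the same idea, wrapped in a helper so it can be tried on an in-memory stream. The name rewrap_with_errors is hypothetical, not anything from the standard library, and keeping the old stream's encoding is my assumption about what you would usually want.

```python
import io

def rewrap_with_errors(stream, errors="surrogateescape"):
    """Detach stream's binary buffer and wrap it in a new text layer.

    Hypothetical helper for illustration, not a standard library call.
    """
    encoding = stream.encoding   # capture before detach() makes stream unusable
    binary = stream.detach()
    return io.TextIOWrapper(binary, encoding=encoding, errors=errors)

# Stand-in for sys.stdin so the sketch can run anywhere.
fake_stdin = io.TextIOWrapper(io.BytesIO(b"caf\xe9\n"), encoding="utf-8")
fake_stdin = rewrap_with_errors(fake_stdin)
line = fake_stdin.readline()
```

In a real program this would be 'sys.stdin = rewrap_with_errors(sys.stdin)', done before anything has been read from sys.stdin. The sketch does not carry over other settings of the old stream, such as line buffering.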
Your options for errors= are documented in the codecs module's documentation on Error handlers. You may prefer something like "backslashreplace" or "replace", since they give you strings that can be written out as valid UTF-8. (Note that "namereplace" only applies when encoding, so it is no help for decoding standard input.) I'm old-fashioned, so I prefer to pass through the bad bytes exactly as they are.
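As a quick comparison, this sketch decodes the same made-up bad input with several of the decoding error handlers, then checks which result survives a strict UTF-8 encode.

```python
bad = b"caf\xe9!"   # sample input with one stray non-UTF-8 byte

# Decode the same bytes with different error handlers.
decoded = {
    name: bad.decode("utf-8", errors=name)
    for name in ("ignore", "replace", "backslashreplace", "surrogateescape")
}

# Only the surrogateescape result fails a strict UTF-8 encode;
# the others are ordinary text by now.
def encodes_strictly(text: str) -> bool:
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False

strict_ok = {name: encodes_strictly(text) for name, text in decoded.items()}
```

The tradeoff is that only "surrogateescape" keeps enough information to reproduce the original bytes, and it only does so if the output side uses the same handler.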
Another option is to directly use the underlying sys.stdin.buffer object without changing sys.stdin. This object supports all of the usual IO methods like .readline(), but it returns bytes instead of strings; you can then deal with the bytes however you want, with or without decoding them with some form of error handling. Similarly, sys.stdout.buffer takes bytes for .write(), not strings. This means that the trouble free way of copying standard input to standard output is:
sys.stdout.buffer.write( sys.stdin.buffer.read() )
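For a program shaped like mine (headers first, then a body), a bytes-only version might look like the following sketch. The copy_message function is hypothetical, and using shutil.copyfileobj() for the body is my choice here so that a large body is copied in chunks rather than read into memory all at once.

```python
import io
import shutil

def copy_message(src, dst):
    """Copy headers line by line, then stream the body, all as bytes."""
    for line in iter(src.readline, b""):
        dst.write(line)                # any header processing would go here
        if line in (b"\n", b"\r\n"):   # a blank line ends the headers
            break
    shutil.copyfileobj(src, dst)       # copy the rest in fixed-size chunks

# In-memory stand-ins for sys.stdin.buffer and sys.stdout.buffer.
src = io.BytesIO(b"Subject: caf\xe9\n\nbody with \xff bytes\n")
dst = io.BytesIO()
copy_message(src, dst)
```

In a real program the call would be 'copy_message(sys.stdin.buffer, sys.stdout.buffer)'. Since nothing is ever decoded, stray bytes pass through untouched.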
If you've previously written to the text mode sys.stdout, you need to flush it before you start this copy with 'sys.stdout.flush()'. If you omit this, Python may do odd and unhelpful things with your initial output.
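The ordering can be sketched with an in-memory stand-in for sys.stdout (the text_out and raw names are made up for illustration). The text layer may hold written text in its own buffer, so it is flushed before any bytes go to the layer underneath.

```python
import io

raw = io.BytesIO()   # stand-in for the real standard output
# Stand-in for sys.stdout; newline="\n" keeps the bytes the same on every platform.
text_out = io.TextIOWrapper(raw, encoding="utf-8", newline="\n")

text_out.write("Subject: processed header\n\n")   # text output, may sit in a buffer
text_out.flush()                                  # push it down before using .buffer
text_out.buffer.write(b"body with a stray \xe9 byte\n")
text_out.buffer.flush()

result = raw.getvalue()
```

With the flush in place, the header text reliably comes out ahead of the body bytes.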
(This is probably all well known in the community of frequent Python developers, but these days I'm an infrequent Python programmer.)