Things to do in Python 3 when your Unix standard input is badly encoded

October 29, 2021

Today I had a little adventure with Python 3. I have a program that takes standard input, reads and lightly processes a bunch of headers before writing them out, then just copies the body (of an email message, as it happens) from standard input to standard output. Normally it gets well formed input, with no illegally encoded UTF-8. Today, there were some stray bytes and the world blew up. Dealing with this was far harder than it should have been, partly because the documentation has issues.

Although the documentation for sys.stdin will not tell you this, sys.stdin is most likely an io.TextIOWrapper, and so has that class's API. Failing that, your best method for finding out what attributes and methods it supports is the ever friendly 'help(type(sys.stdin))' in a Python interpreter. If you're on Python 3.7 or later, what you probably want to do about a badly encoded standard input is change how it handles encoding errors with .reconfigure():

sys.stdin.reconfigure(errors="surrogateescape")

Now that I've learned about this, I think that you should generally do this as the first operation in any Python 3 program that reads from standard input, unless you are absolutely sure that the input being not well-formed UTF-8 is a fatal error (it almost never is).
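As a sketch of what this buys you (using an in-memory stand-in for sys.stdin, which is the same io.TextIOWrapper type and so supports the same call), a byte that can never appear in valid UTF-8 survives the trip:

```python
import io

# Stand-in for sys.stdin: a text wrapper over raw bytes that are not
# valid UTF-8 (0xff can never occur in well-formed UTF-8).
raw = io.BytesIO(b"Subject: hi\xffthere\n")
stdin_like = io.TextIOWrapper(raw, encoding="utf-8")

# Python 3.7+: swap the error handler in place.  On real standard
# input this is just sys.stdin.reconfigure(errors="surrogateescape").
stdin_like.reconfigure(errors="surrogateescape")
line = stdin_like.readline()  # no UnicodeDecodeError; 0xff becomes U+DCFF
```

Encoding the result back with errors="surrogateescape" reproduces the original bytes exactly, which is what makes this particular error handler suitable for pass-through programs.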

Unfortunately for me, Ubuntu 18.04 LTS has Python 3.6.9 as its /usr/bin/python3 so I can't do this. One option appears to be to detach the underlying io.BufferedReader behind sys.stdin and recreate it with your desired error handling. I believe this would be:

import io
import sys

b = sys.stdin.detach()
sys.stdin = io.TextIOWrapper(b, errors="surrogateescape")

Your options for errors= are documented in the codecs module's documentation on error handlers. You may prefer something like "backslashreplace", which turns bad bytes into visible escape sequences and so makes the output well-formed UTF-8 ("namereplace" is similar but only applies when encoding, not when decoding input). I'm old-fashioned, so I prefer to pass through the bad bytes exactly as they are.
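As a small illustration of the difference between the two handlers (the byte values here are just examples):

```python
bad = b"caf\xff"  # 0xff is not valid anywhere in UTF-8

# surrogateescape smuggles the bad byte through as the surrogate
# U+DCFF, and encoding with the same handler restores it exactly.
s = bad.decode("utf-8", errors="surrogateescape")

# backslashreplace instead turns the bad byte into a visible escape
# sequence, producing well-formed text at the cost of changing it.
visible = bad.decode("utf-8", errors="backslashreplace")
```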

Another option is to directly use the underlying sys.stdin.buffer object without changing sys.stdin. This object supports all of the usual IO methods like .readline(), but they return bytes instead of strings; you can then deal with the bytes however you want, with or without decoding them with some form of error handling. Similarly, sys.stdout.buffer takes bytes for .write(), not strings. This means that the trouble-free way of copying standard input to standard output is:

sys.stdout.buffer.write(sys.stdin.buffer.read())
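For a program like mine that processes headers and then copies the body, the byte-level approach might look something like this sketch (with an in-memory stand-in for sys.stdin.buffer; the header contents are made up):

```python
import io

# Stand-in for sys.stdin.buffer: headers, a blank line, then a body
# containing a byte sequence that is not valid UTF-8.
inp = io.BytesIO(b"Subject: test\nFrom: a@b\n\nbody \xff bytes\n")

headers = []
for line in inp:        # buffer objects iterate line by line, as bytes
    if line == b"\n":   # a blank line ends the headers
        break
    headers.append(line)

body = inp.read()       # the rest is copied wholesale, never decoded
```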

If you've previously written to the text mode sys.stdout, you need to flush it with 'sys.stdout.flush()' before you start this copy. If you omit this, your earlier text output may still be sitting in the text layer's buffer and can come out after (or interleaved with) the bytes you write directly to sys.stdout.buffer.
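A sketch of why the flush matters, again with an in-memory stand-in for sys.stdout (io.TextIOWrapper exposes its underlying binary stream as .buffer, just as sys.stdout does):

```python
import io

# Stand-in for sys.stdout: a text layer over a binary stream.
out = io.TextIOWrapper(io.BytesIO(), encoding="utf-8")

out.write("X-Processed: yes\n")  # sits in the text layer's buffer
out.flush()                      # push it down before any raw writes
out.buffer.write(b"raw body \xff bytes\n")  # bypasses the text layer

result = out.buffer.getvalue()
```

Without the flush(), the text line would only reach the binary stream later, landing after the raw bytes instead of before them.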

(This is probably all well known in the community of frequent Python developers, but these days I'm an infrequent Python programmer.)

Comments on this page:

I've always thought it was a bad idea for Python 3 to make the standard streams default to Unicode text instead of bytes. First, Python has to guess what the right encoding is, and that can be problematic. Second, what if the standard streams are pipes instead of TTYs? Then even guessing the encoding doesn't make sense. And third, if one of the main points of Python 3 was to improve how encoding/decoding between bytes and Unicode is handled, by making those things explicit instead of implicit, then making the standard streams Unicode text by default is a step backward, not a step forward.
