How to handle Unicode character decoding errors depends on your goals

January 27, 2019

In a comment on my entry mulling over DWiki's Python 3 Unicode issues and what I plan to do about them, Sean A. asked a very good question about how I'm planning to handle errors when decoding things from theoretical UTF-8 input:

Out of curiosity, why use backslashreplace instead of surrogateescape? (I ask because it seems to me that surrogateescape also loses no information, is guaranteed to work with any binary input, and is designed for reading unknown encodings.)

Oh. And is trivial to convert back into the original binary data.

The reason I think I want Python's 'backslashreplace' error handling instead of 'surrogateescape' is that my ultimate goal is not to reproduce the input (in all its binary glory) in my output, but to produce valid UTF-8 output (for HTML, Atom syndication feeds, and so on) even if some of the input isn't valid.
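As a minimal sketch of what this looks like in practice (the function name and the sample bytes here are made up for illustration and are not DWiki's actual code), decoding with 'backslashreplace' turns each undecodable byte into a visible escape, and the resulting string then encodes to UTF-8 without any special handling:

```python
def decode_for_output(raw: bytes) -> str:
    """Decode supposedly-UTF-8 bytes, keeping bad bytes visible as \\xNN escapes.

    Hypothetical helper for illustration; using 'backslashreplace' when
    decoding requires Python 3.5 or later.
    """
    return raw.decode("utf-8", errors="backslashreplace")


# 0xE9 is 'é' in Latin-1 but is not valid UTF-8 on its own; 0xFF is never valid.
raw = b"caf\xe9 and \xff"

text = decode_for_output(raw)
print(text)  # caf\xe9 and \xff

# The decoded text contains only ordinary characters, so a plain strict
# encode works and the output is valid UTF-8 (in fact plain ASCII here).
output = text.encode("utf-8")
print(output)
```

The price is that the original bytes can't be reliably recovered from the output, since a literal backslash sequence in valid input looks the same as an escaped bad byte. That is acceptable when the goal is readable, valid output rather than a round trip.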

(Another option is to abort processing if the input isn't valid, which is what Python's default 'strict' error handling does and is not what I want. It would be the most conservative and safe choice in some situations.)

Given that I'm going to produce valid UTF-8 no matter what, the choice comes down to what generates more useful results for the person reading what was invalid UTF-8 input. You can certainly get output from text decoded with 'surrogateescape' by just encoding it to straight UTF-8 using the 'surrogatepass' handler, but the resulting directly encoded surrogate code points are not strictly valid UTF-8, are not going to show up as anything useful, and might produce outright errors from some things (and possibly be misinterpreted under some circumstances).

(With 'surrogateescape', each undecodable byte is mapped to a code point in the range U+DC80 to U+DCFF, which is part of the 'low' half of the Unicode surrogates range. As Wikipedia notes, 'isolated surrogate code points have no general interpretation', and certainly they don't have a distinct visual representation.)
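Here is a small sketch of the 'surrogateescape' behaviour, using a made-up sample byte string. It shows the lossless round trip that Sean A. mentions, and also why the result is awkward when what you want is output that other programs will accept:

```python
raw = b"caf\xe9"  # 0xE9 is not valid UTF-8 here

# The bad byte 0xE9 becomes the lone surrogate U+DCE9.
text = raw.decode("utf-8", errors="surrogateescape")
print(ascii(text))  # 'caf\udce9'

# Encoding with the same handler recovers the original bytes exactly.
round_trip = text.encode("utf-8", errors="surrogateescape")
print(round_trip == raw)  # True

# A plain strict encode refuses to emit a lone surrogate.
try:
    text.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False
print(strict_encode_ok)  # False

# 'surrogatepass' forces the surrogate out as a three-byte sequence...
passed = text.encode("utf-8", errors="surrogatepass")
print(passed)  # b'caf\xed\xb3\xa9'

# ...but a strict UTF-8 decoder (Python's own, for one) rejects those bytes.
try:
    passed.decode("utf-8")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False
print(strict_decode_ok)  # False
```

So 'surrogateescape' is a good fit when the bytes go back out the way they came in (file names, for example), and a poor fit when the decoded text has to end up in HTML or an Atom feed.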

Out of all of Python's available codec error handlers that can be used when decoding from UTF-8 to Unicode, 'backslashreplace' is the one that preserves the most information in a visually clear manner while still allowing you to easily produce valid UTF-8 output that everyone is going to accept. The 'replace' handler has the drawback of making all invalid characters look the same and so leaves you with no clues as to what they were in the input, and 'ignore' just tosses them away entirely, leaving everyone oblivious to the fact that bad characters were there in the first place.

(In some situations this makes 'ignore' the right choice, because you may not want to give people any marker that something is wrong; such a marker might only confuse them about something they can't do anything about. But since I'm going to be looking at the rendered HTML and so on myself, I want to have at least a chance to know that DWiki is seeing bad input. And 'replace' has the advantage that it's visible but is less peculiar and noisy than 'backslashreplace'; you might use it when you want some visual marker present that things are a bit off, but don't want to dump a bucket of weird backslashes on people.)
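To make the differences between the handlers concrete, here is a rough side-by-side comparison on one invalid input. The helper name and the sample bytes are invented for this illustration:

```python
def compare_handlers(raw: bytes) -> dict:
    """Decode raw as UTF-8 under several error handlers (illustrative helper).

    Returns a dict mapping handler name to the decoded string, or to None
    if the handler raised UnicodeDecodeError.
    """
    results = {}
    for handler in ("strict", "ignore", "replace", "backslashreplace"):
        try:
            results[handler] = raw.decode("utf-8", errors=handler)
        except UnicodeDecodeError:
            results[handler] = None
    return results


# 0xFF and 0xFE can never appear in valid UTF-8.
results = compare_handlers(b"bad \xff\xfe bytes")
for handler, decoded in results.items():
    print(f"{handler:17} {ascii(decoded)}")

# strict            None
# ignore            'bad  bytes'
# replace           'bad \ufffd\ufffd bytes'
# backslashreplace  'bad \\xff\\xfe bytes'
```

With 'ignore' the only trace left is a doubled space, 'replace' shows two identical U+FFFD replacement characters, and 'backslashreplace' shows which bytes were actually there.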

PS: This does mean that my choice here is a bit focused on what's useful for me. For me, having some representation of the actual bad characters visible in what I see gives me some idea of what to look for in the page source and what I'm going to have to fix. For other people, it's probably going to be mostly noise.

Written on 27 January 2019.

Last modified: Sun Jan 27 01:19:32 2019