How to handle Unicode character decoding errors depends on your goals
In a comment on my entry mulling over DWiki's Python 3 Unicode issues and what I plan to do about them, Sean A. asked a very good question about how I'm planning to handle errors when decoding things from theoretical UTF-8 input:
Out of curiosity, why use backslashreplace instead of surrogateescape? (I ask because it seems to me that surrogateescape also loses no information, is guaranteed to work with any binary input, and is designed for reading unknown encodings.)
Oh. And is trivial to convert back into the original binary data.
The reason I think I want Python's 'backslashreplace' error handling instead of 'surrogateescape' is that my ultimate goal is not to reproduce the input (in all its binary glory) in my output, but to produce valid UTF-8 output (for HTML, Atom syndication feeds, and so on) even if some of the input isn't valid.
(Another option is to abort processing entirely if the input isn't valid. That would be the most conservative and safe choice in some situations, but it's not what I want here.)
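As a concrete sketch of both behaviors, here's what the strict default and 'backslashreplace' do with a made-up fragment of invalid UTF-8 (0xc3 starts a two-byte sequence that's never completed):

    data = b"abc \xc3 def"

    # The strict default aborts with an exception, which is the
    # 'most conservative' choice mentioned above.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as e:
        print(e.reason)            # invalid continuation byte

    # 'backslashreplace' turns the bad byte into a visible escape
    # sequence, and the result re-encodes to valid UTF-8 with no
    # special handling.
    s = data.decode("utf-8", "backslashreplace")
    print(s)                       # abc \xc3 def
    print(s.encode("utf-8"))       # b'abc \\xc3 def'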
Given that I'm going to produce valid UTF-8 no matter what, the choice comes down to what generates more useful results for the person reading what was invalid UTF-8 input. You can certainly do this with 'surrogateescape' by just encoding to straight UTF-8 using the 'surrogatepass' handler, but the resulting directly encoded surrogate characters are not going to show up as anything useful and might produce outright errors from some things (and possibly be misinterpreted under some circumstances).
(With 'surrogateescape', bad characters are encoded to U+DC80 to U+DCFF, which is the 'low' part of the Unicode surrogates range. As Wikipedia notes, 'isolated surrogate code points have no general interpretation', and certainly they don't have a distinct visual representation.)
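To make the surrogate issue concrete, here's a sketch of the 'surrogateescape' round trip and of what happens when you try to push the resulting lone surrogate out as UTF-8, using the same made-up bad input as before:

    data = b"abc \xc3 def"

    # 'surrogateescape' maps each bad byte 0xNN to the lone low
    # surrogate U+DCNN; decoding back the same way recovers the bytes.
    s = data.decode("utf-8", "surrogateescape")
    print(hex(ord(s[4])))               # 0xdcc3
    assert s.encode("utf-8", "surrogateescape") == data

    # A plain UTF-8 encode of the lone surrogate is an error...
    try:
        s.encode("utf-8")
    except UnicodeEncodeError as e:
        print(e.reason)                 # surrogates not allowed

    # ...and 'surrogatepass' produces bytes that strict UTF-8
    # decoders will reject as invalid.
    raw = s.encode("utf-8", "surrogatepass")
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as e:
        print(e.reason)                 # invalid continuation byte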
Out of all of Python's available codecs error handlers that can be used when decoding from UTF-8 to Unicode, 'backslashreplace' is the one that preserves the most information in a visually clear manner while still allowing you to easily produce valid UTF-8 output that everyone is going to accept. The 'replace' handler has the drawback of making all invalid characters look the same, and so leaves you with no clues about what was actually there in the input, and 'ignore' just tosses them away entirely, leaving everyone oblivious to the fact that bad characters were there in the first place.
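For comparison, here's what each of the three handlers actually produces from the same made-up bad input:

    data = b"abc \xc3 def"

    for handler in ("backslashreplace", "replace", "ignore"):
        print(handler, "->", data.decode("utf-8", handler))

    # backslashreplace -> abc \xc3 def
    # replace -> abc � def    (U+FFFD, the replacement character)
    # ignore -> abc  def      (the bad byte silently vanishes)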
(In some situations this makes 'ignore' the right choice, because you may not want to give people any marker that something is wrong; such a marker might only confuse them about something they can't do anything about. But since I'm going to be looking at the rendered HTML and so on myself, I want to have at least a chance to know that DWiki is seeing bad input. And 'replace' has the advantage that it's visible but is less peculiar and noisy than 'backslashreplace'; you might use it when you want some visual marker present that things are a bit off, but don't want to dump a bucket of weird backslashes on people.)
PS: This does mean that my choice here is a bit focused on what's useful for me. For me, having some representation of the actual bad characters visible in what I see gives me some idea of what to look for in the page source and what I'm going to have to fix. For other people, it's probably just going to be noise.