2008-11-15
Getting Python's encoding and decoding straight
Because I have recently been confusing myself with this, and because it is not clearly spelled out in the Python standard library documentation or the tutorial (at least not anywhere that I could spot):
.decode() decodes a plain string (ie, bytes in some encoding) to a Unicode string.
.encode() encodes a Unicode string into a plain string.
Feeding a plain string to the unicode() constructor is the same as
calling .decode() on the string.
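For concreteness, here is a minimal Python 2 sketch of both directions (the UTF-8 byte string is just an example):

b = "caf\xc3\xa9"          # a plain (byte) string, UTF-8 encoded
u = b.decode("utf-8")      # -> u'caf\xe9', a Unicode string
assert u.encode("utf-8") == b
# unicode() with an encoding is the same operation as .decode():
assert unicode(b, "utf-8") == u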
Things get weird if you call .encode() or .decode() on the 'wrong'
type of thing. Calling .decode() on a Unicode string appears to first
encode the Unicode string to a plain string, using Python's default
encoding (normally ASCII), and then call .decode() on the result.
Calling .encode() on a plain string attempts to decode the string to
Unicode, using the default encoding again, and then calls .encode()
on the result.
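In an ordinary Python 2 session, where the default encoding is normally ASCII, this looks something like the following (error messages abbreviated):

u"caf\xe9".decode("utf-8")
# UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' ...
# (the Unicode string was first encoded with ASCII, which failed)
"caf\xc3\xa9".encode("utf-8")
# UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 ...
# (the plain string was first decoded with ASCII, which failed)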
(Feeding a Unicode string to the unicode() constructor is not the
same thing as calling .decode() on the Unicode string; instead it is a
no-op if you do not supply an encoding and an error if you do.)
This assumes that you are not using one of the special-purpose codecs like base64. Things like base64 will happily map straight from plain strings to their encoding, even if the plain string cannot be decoded to Unicode in the default encoding. Special-purpose codecs generally will only encode things that are in their decoding output format; calling them on the wrong sort of thing will at best cause a conversion to the other sort of string first (eg, using base64 on a Unicode string), complete with possible encoding errors, and at worst will die with an internal error (eg, using idna on a plain string).
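A quick Python 2 illustration of this with base64 (the byte values here are arbitrary):

"\xff\x00".encode("base64")     # -> '/wA=\n', no Unicode involved
"/wA=\n".decode("base64")       # -> '\xff\x00'
# even though "\xff\x00".decode("ascii") would fail immediately.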
(Thus, in many ways it would be better if plain strings only had a
.decode() method and Unicode strings only had an .encode() method.
This would leave special-purpose codecs out in the cold, but there could
be a manual way of invoking them; as it is, Python has accepted a bunch
of somewhat puzzling complexity in the name of rarely used generality.)
2008-11-13
What the members of a Unicode conversion error object are
The codecs module's
register_error function has a rather scanty description of what
error handlers are called with. Since I have just dug into this,
here is what the members mean, at least for UnicodeDecodeError:
object   | The string that is being decoded to Unicode.
encoding | The encoding that the string is (theoretically) in.
start    | The index in object of the first character that could not be decoded.
end      | One past the end character of this decoding problem.
reason   | The human-readable error text.
The substring object[start:end] is the character or character sequence
that had decoding problems (perhaps I should say 'byte', but this is not
yet Python 3K). You get a character sequence instead of a character in
situations where the first character is a valid start of a multi-byte
sequence, but subsequent characters have an error.
For example, in a theoretical UTF-8 encoded string, the first two
bytes of the three-byte sequence 0xe0 0x81 0x58 are valid parts of
a three-byte encoding, but the third byte is not. You would get a
UnicodeDecodeError object where start pointed to 0xe0, the first byte,
and end was just past 0x58, the third.
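A small sketch of poking at these members from Python 2 (the values in the comments are approximate; the exact reason text varies between Python versions):

try:
    "abc\xffdef".decode("utf-8")
except UnicodeDecodeError, e:
    print e.encoding, e.start, e.end, e.reason
    # utf8 3 4 invalid start byte
    print repr(e.object[e.start:e.end])
    # '\xff'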
Also worth noting is that when things are being decoded to Unicode, the first element of the tuple your error handler returns (assuming that you want to keep going) has to be a unicode string.
Given all of this, we can put together a really simple decoding error handler that just replaces undecodable bytes with backslashed hex versions of themselves:
import codecs

def bsreplacer(uerr):
    # Replace the first undecodable byte with a backslashed hex escape
    # and resume decoding immediately after it. %02x keeps the escape
    # two digits wide even for bytes below 0x10.
    c = uerr.object[uerr.start]
    return (u"\\x%02x" % ord(c), uerr.start + 1)

codecs.register_error("bsreplace", bsreplacer)
Note that we are playing fast and loose with multi-byte sequences; instead of handling the entire sequence, we just replace the first byte and restart decoding after it. In some situations this can malfunction and produce garbled output.
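A quick sketch of the handler in action (Python 2, using the registration above):

print "abc\xff\xfedef".decode("utf-8", "bsreplace")
# abc\xff\xfedef   (the two bad bytes are now literal backslashed hex text)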
(One would think that the standard "backslashreplace" error handler would already do this, but unfortunately it doesn't handle decoding errors, only encoding errors.)
2008-11-12
Why not doing Unicode is easier than doing Unicode in Python
Okay, I have a confession: I continue to write Python programs that deal with text (to some degree) but which are not Unicode-aware in the approved manner. One of the reasons that I do this is that I can get away with it, but another reason is that not doing Unicode is easier than doing Unicode.
I don't mean this in the sense that there's less code if you don't do Unicode. I mean that doing Unicode confronts you with more decisions. When you do Unicode, you must convert between Unicode and encoded strings, which means that you must decide what to do when a conversion to or from Unicode fails. If you do not do Unicode, if you just slop plain strings around, at least your program will not explode, or mangle input (too much), or sprinkle ?'s all over things.
(Having to make these decisions is especially irritating if you are just passing the text through unaltered, which is a common case for me. And yes, in theory you'll only be dealing with well-encoded text and you can ignore all of this, but I am cautious enough to expect that sooner or later my programs will be fed text that is in a different encoding than the system locale or is otherwise mangled.)
It does not help that it is relatively hard to control what Python does
when a Unicode error happens. For example, there is no handy 'call this
function to handle unconvertible things' argument for .encode().
While you can register an error handler function with the codecs
module, the interfaces involved make it pretty clear that this is not
intended as a simple, casual thing.
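For example, the only route is a dance with globally registered handler names; a hypothetical Python 2 sketch (the handler and its name here are made up):

import codecs

def mark_unencodable(err):
    # Substitute a visible marker and continue after the problem character.
    return (u"<?>", err.end)

codecs.register_error("markrep", mark_unencodable)
print u"caf\xe9".encode("ascii", "markrep")   # -> caf<?>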
(There are decent standard options for converting from Unicode back to text, in the "xmlcharrefreplace" and "backslashreplace" conversion options, but the standard options for converting from random mangled text to Unicode seem sparser and less satisfying; I would at least like to know what was replaced.)
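For reference, the two encoding-side options look like this in Python 2:

u"caf\xe9".encode("ascii", "xmlcharrefreplace")   # -> 'caf&#233;'
u"caf\xe9".encode("ascii", "backslashreplace")    # -> 'caf\\xe9' (literal backslash text)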