Getting Python's encoding and decoding straight
Because I have recently been confusing myself with this, and because it is not clearly documented in the Python standard library or the tutorial (at least not that I could spot):
.decode()
    decodes a plain string (ie, bytes in some encoding) to a Unicode string.
.encode()
    encodes a Unicode string into a plain string.

Feeding a plain string to the unicode() constructor is the same as calling .decode() on the string.
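To make this concrete, here is a small Python 2 sketch; UTF-8 and the sample bytes are just illustrative choices, not anything special:

    data = 'caf\xc3\xa9'          # a plain string holding UTF-8 bytes
    text = data.decode('utf-8')   # -> u'caf\xe9', a Unicode string
    back = text.encode('utf-8')   # -> 'caf\xc3\xa9', a plain string again

    # unicode() on a plain string is the same as .decode() on it
    assert unicode(data, 'utf-8') == data.decode('utf-8')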
Things get weird if you call .encode() or .decode() on the 'wrong' type of thing. Calling .decode() on a Unicode string appears to first encode the Unicode string to a plain string, using Python's default encoding (normally ASCII), and then call .decode() on the result. Calling .encode() on a plain string attempts to decode the string to Unicode, using Python's default encoding again, and then calls .encode() on the result.
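Here is a Python 2 sketch of both 'wrong type' cases; the sample strings are arbitrary, and both blow up because the implicit conversion uses the default encoding instead of the UTF-8 you asked for:

    try:
        u'caf\xe9'.decode('utf-8')      # implicitly encoded with ASCII first
    except UnicodeEncodeError as e:
        print '.decode() on a Unicode string:', e

    try:
        'caf\xc3\xa9'.encode('utf-8')   # implicitly decoded with ASCII first
    except UnicodeDecodeError as e:
        print '.encode() on a plain string:', e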
(Feeding a Unicode string to the unicode() constructor is not the same thing as calling .decode() on the Unicode string; instead it is a no-op if you do not supply an encoding and an error if you do.)
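A quick Python 2 illustration of that:

    s = u'caf\xe9'
    print unicode(s) == s        # no encoding supplied: effectively a no-op
    try:
        unicode(s, 'utf-8')      # an encoding supplied: an error
    except TypeError as e:
        print 'TypeError:', e    # 'decoding Unicode is not supported'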
This assumes that you are not using one of the special-purpose codecs like base64. Things like base64 will happily map straight from plain strings to their encoding, even if the plain string cannot be decoded to Unicode in the default encoding. Special-purpose codecs generally will only encode things that are in their decoding output format; calling them on the wrong thing will at best cause a conversion to the other sort of string first (eg, using base64 on a Unicode string), complete with possible encoding errors, and at worst will die with an internal error (eg, using idna on a plain string).
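As a Python 2 sketch of the base64 case (the byte values here are arbitrary non-ASCII ones):

    raw = '\xff\xfe\x01'
    print raw.encode('base64')       # fine: plain string straight to base64
    print 'Zm9v\n'.decode('base64')  # fine: back to the plain string 'foo'

    # On a Unicode string, base64 converts to a plain string first,
    # and that conversion can fail with an encoding error.
    try:
        u'caf\xe9'.encode('base64')
    except UnicodeEncodeError as e:
        print 'implicit conversion failed:', e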
(Thus, in many ways it would be better if plain strings only had a .decode() method and Unicode strings only had an .encode() method. This would leave special-purpose codecs out in the cold, but there could be a manual way of invoking them; as it is, Python has accepted a bunch of somewhat puzzling complexity in the name of rarely used generality.)