Wandering Thoughts archives

2008-11-15

Getting Python's encoding and decoding straight

Because I have recently been confusing myself with this, and because it is not clearly documented in the Python standard library documentation or the tutorial (at least not that I could spot):

  • .decode() decodes a plain string (ie, bytes in some encoding) to a Unicode string.
  • .encode() encodes a Unicode string into a plain string.
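As a minimal sketch of the two directions, here is a round trip through UTF-8. The choice of UTF-8 and the sample text are assumptions made purely for illustration; the b'' and u'' literals are only there to make it explicit which kind of string is which.

```python
# Plain string: the UTF-8 encoding of "café", which is five bytes long.
raw = b'caf\xc3\xa9'

# .decode() goes from bytes in some encoding to a Unicode string.
text = raw.decode('utf-8')

# .encode() goes from a Unicode string back to bytes in some encoding.
round_trip = text.encode('utf-8')

print('%d bytes, %d characters' % (len(raw), len(text)))
```

The byte count and the character count differ because the accented character takes two bytes in UTF-8, which is a handy reminder of which side of the conversion you are on.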

Feeding a plain string to the unicode() constructor is the same as calling .decode() on the string.

Things get weird if you call .encode() or .decode() on the 'wrong' type of thing. Calling .decode() on a Unicode string appears to first encode the Unicode string to a plain string, using Python's default encoding (what sys.getdefaultencoding() reports, normally ASCII), and then call .decode() on the result. Calling .encode() on a plain string first attempts to decode the string to Unicode, again using the default encoding, and then calls .encode() on the result. One consequence is that either call can fail with an error from the opposite operation, such as a UnicodeEncodeError raised by .decode().
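To make the hidden step visible, here is a sketch that spells out both stages by hand instead of relying on the conversion happening behind the scenes. It assumes the implicit conversion uses ASCII; the two helper functions and the sample strings are hypothetical, invented for this illustration, and are only an approximation of what the interpreter does.

```python
# Assumption: the implicit conversion step uses ASCII.
DEFAULT = 'ascii'

def decode_unicode(text, encoding):
    """Roughly what calling .decode() on a Unicode string amounts to."""
    return text.encode(DEFAULT).decode(encoding)

def encode_plain(raw, encoding):
    """Roughly what calling .encode() on a plain string amounts to."""
    return raw.decode(DEFAULT).encode(encoding)

# Pure ASCII content survives the hidden step unharmed.
ok = decode_unicode(u'cafe', 'utf-8')

# Non-ASCII content fails in the hidden step, before the requested
# codec is ever used, so the error is for the opposite operation.
try:
    decode_unicode(u'caf\xe9', 'utf-8')
    decode_error = None
except UnicodeEncodeError as exc:
    decode_error = exc

try:
    encode_plain(b'caf\xc3\xa9', 'utf-8')
    encode_error = None
except UnicodeDecodeError as exc:
    encode_error = exc
```

This is why the behaviour is easy to miss in testing: everything works as long as the data happens to be plain ASCII.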

(Feeding a Unicode string to the unicode() constructor is not the same thing as calling .decode() on the Unicode string; instead it is a no-op if you do not supply an encoding and an error if you do.)

This assumes that you are not using one of the special-purpose codecs like base64. Things like base64 will happily map straight from plain strings to their encoding, even if the plain string cannot be decoded to Unicode in the default encoding. Special-purpose codecs will generally only encode the type of string that their decoding produces; calling them on the wrong type will at best cause a conversion to the other sort of string first (eg, using base64 on a Unicode string), complete with possible encoding errors, and at worst will die with an internal error (eg, using idna on a plain string).
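A small sketch of the base64 case follows. It goes through codecs.encode() and codecs.decode() from the standard library rather than the string methods, so that no implicit conversion can get in the way. The sample bytes are my own choice, picked because they are not valid ASCII.

```python
import codecs

# Two bytes that cannot be decoded as ASCII.
raw = b'\xff\xfe'

# base64 maps bytes straight to bytes, with no Unicode step involved.
encoded = codecs.encode(raw, 'base64')
decoded = codecs.decode(encoded, 'base64')

# The same bytes cannot be turned into a Unicode string via ASCII.
try:
    raw.decode('ascii')
    decodable = True
except UnicodeDecodeError:
    decodable = False
```

Here the base64 round trip succeeds even though the input could never have been decoded to Unicode with ASCII.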

(Thus, in many ways it would be better if plain strings only had a .decode() method and Unicode strings only had an .encode() method. This would leave special-purpose codecs out in the cold, but there could be a manual way of invoking them; as it is, Python has accepted a bunch of somewhat puzzling complexity in the name of rarely used generality.)

python/DecodingAndEncoding written at 01:54:21


