Getting Python's encoding and decoding straight

November 15, 2008

Because I have recently been confusing myself with this, and because it is not clearly documented in the Python standard library documentation or the tutorial (at least not that I could spot):

  • .decode() decodes a plain string (ie, bytes in some encoding) to a Unicode string.
  • .encode() encodes a Unicode string into a plain string.

Feeding a plain string to the unicode() constructor is the same as calling .decode() on the string.
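As a minimal sketch of the two directions: the sample bytes and variable names here are my own illustration, and the fallback of unicode to str is an assumption added only so that the same lines also run on Python 3, where the Unicode type is called str.

```python
# Hypothetical sample data: the UTF-8 bytes for "cafe" with an accented e,
# held in a plain string.
raw = b"caf\xc3\xa9"

# Python 3 has no unicode() builtin; fall back to str there. This alias is
# an assumption made for illustration, not part of the original discussion.
try:
    unicode
except NameError:
    unicode = str

text = raw.decode("utf-8")               # plain string -> Unicode string
round_trip = text.encode("utf-8")        # Unicode string -> plain string
via_constructor = unicode(raw, "utf-8")  # same result as raw.decode("utf-8")
```

Here text and via_constructor are the same Unicode string, and round_trip is the original bytes again.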

Things get weird if you call .encode() or .decode() on the 'wrong' type of thing. Calling .decode() on a Unicode string appears to first encode the Unicode string to a plain string, using Python's default encoding (normally ASCII; see sys.getdefaultencoding()), and then call .decode() on the result. Calling .encode() on a plain string attempts to decode the string to Unicode, again using the default encoding, and then calls .encode() on the result. This is why calling .encode() on a plain string with non-ASCII bytes in it can fail with a UnicodeDecodeError.

(Feeding a Unicode string to the unicode() constructor is not the same thing as calling .decode() on the Unicode string; instead it is a no-op if you do not supply an encoding and an error if you do.)
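The implicit two-step conversion on the 'wrong' type can be spelled out by hand. This is only a sketch of the apparent behaviour: the helper names are hypothetical, and DEFAULT_ENCODING is hard-coded to "ascii" on the assumption that the interpreter's default encoding has not been changed.

```python
# Assumption: the default encoding is the usual ASCII.
DEFAULT_ENCODING = "ascii"

def decode_unicode_string(text, encoding):
    """Mimic calling .decode(encoding) on a Unicode string (hypothetical helper)."""
    # Step 1: implicit encode to a plain string with the default encoding.
    plain = text.encode(DEFAULT_ENCODING)
    # Step 2: the decode that was actually asked for.
    return plain.decode(encoding)

def encode_plain_string(raw, encoding):
    """Mimic calling .encode(encoding) on a plain string (hypothetical helper)."""
    # Step 1: implicit decode to Unicode with the default encoding.
    text = raw.decode(DEFAULT_ENCODING)
    # Step 2: the encode that was actually asked for.
    return text.encode(encoding)

# Pure ASCII data survives both round trips untouched.
ok_text = decode_unicode_string(u"abc", "utf-8")
ok_raw = encode_plain_string(b"abc", "utf-8")

# Non-ASCII data fails in the hidden first step, with the 'opposite' error.
errors = []
try:
    decode_unicode_string(u"caf\xe9", "utf-8")
except UnicodeEncodeError:
    errors.append("decode raised UnicodeEncodeError")
try:
    encode_plain_string(b"caf\xc3\xa9", "utf-8")
except UnicodeDecodeError:
    errors.append("encode raised UnicodeDecodeError")
```

The point of writing it out is that the error comes from the step you did not ask for, which is what makes the tracebacks so confusing.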

This assumes that you are not using one of the special-purpose codecs like base64. Things like base64 will happily map straight from plain strings to their encoding, even if the plain string cannot be decoded to Unicode in the default encoding. Special-purpose codecs generally will only encode things that are in their decoding output format; calling them on the wrong thing will at best cause a conversion to the other sort of string first (eg, using base64 on a Unicode string), complete with possible encoding errors, and at worst will die with an internal error (eg, using idna on a plain string).
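A small sketch of the base64 case, using made-up sample bytes. codecs.encode() and codecs.decode() are used here as a spelling that works across Python versions rather than the string methods themselves.

```python
import codecs

# Hypothetical sample: two bytes that are not valid ASCII text.
raw = b"\xff\xfe"

# The plain string cannot be decoded to Unicode as ASCII...
try:
    raw.decode("ascii")
    decodable = True
except UnicodeDecodeError:
    decodable = False

# ...but base64 maps it straight to another plain string and back,
# with no trip through Unicode.
encoded = codecs.encode(raw, "base64")
restored = codecs.decode(encoded, "base64")
```

Here decodable ends up False while restored is equal to raw, since the codec never needs to interpret the bytes as text.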

(Thus, in many ways it would be better if plain strings only had a .decode() method and Unicode strings only had an .encode() method. This would leave special purpose codecs out in the cold, but there could be a manual way of invoking them; as it is, Python has accepted a bunch of somewhat puzzling complexity in the name of rarely used generality.)
