What the members of a Unicode conversion error object are

November 13, 2008

The codecs module's register_error function has a rather scanty description of what error handlers are called with. Since I have just dug into this, here is what the members are for at least UnicodeDecodeError:

object: The string that is being decoded to Unicode.
encoding: The encoding that the string is (theoretically) in.
start: The index in object of the first character that could not be decoded.
end: One past the index of the last character involved in this decoding problem.
reason: The human-readable error text.

The substring object[start:end] is the character or character sequence that had decoding problems (perhaps I should say 'byte', but this is not yet Python 3K). You get a character sequence instead of a character in situations where the first character is a valid start of a multi-byte sequence, but subsequent characters have an error. For example, in a theoretical UTF-8 encoded string, the first two bytes of the three-byte sequence 0xe0 0x81 0x58 are valid parts of a three-byte encoding, but the third byte is not. You would get a UnicodeDecodeError object where start pointed to 0xe0, the first byte, and end was just past 0x58, the third.
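To see these members in practice, here is a minimal sketch that catches a decoding error and looks at its attributes. The sample byte string is made up for illustration, and the b"" literal and 'except ... as' syntax assume Python 2.6 or later. The exact spelling of the encoding name and the reason text vary between Python versions, so they are only printed.

```python
# Sample data (hypothetical): the byte 0xff can never appear in valid UTF-8.
data = b"abc\xffdef"

try:
    data.decode("utf-8")
    err = None
except UnicodeDecodeError as e:
    err = e

# The problem span is object[start:end]; here it is the single byte 0xff.
bad = err.object[err.start:err.end]
print("%s %d %d %r" % (err.encoding, err.start, err.end, bad))
print(err.reason)
```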

Also worth noting is that when things are being decoded to Unicode, the first element of the tuple your error handler returns (assuming that you want to keep going) has to be a unicode string.
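As a sketch of that return value, the handler below replaces the whole problem span with a question mark and tells the decoder to resume at end. The names qmark_handler and "qmark" are made up for this example, and the sample bytes are hypothetical.

```python
import codecs

def qmark_handler(uerr):
    # This sketch only deals with decoding errors; re-raise anything else.
    if not isinstance(uerr, UnicodeDecodeError):
        raise uerr
    # First element: the replacement text, which must be a unicode string.
    # Second element: the index in uerr.object where decoding resumes.
    return (u"?", uerr.end)

codecs.register_error("qmark", qmark_handler)

result = b"abc\xffdef".decode("utf-8", "qmark")
print(repr(result))
```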

Given all of this, we can put together a really simple decoding error handler that just replaces undecodable bytes with backslashed hex versions of themselves:

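import codecs

def bsreplacer(uerr):
    # Replace only the first offending byte with a \xNN escape and
    # resume decoding at the byte right after it. %02x keeps the
    # escape two digits wide even for byte values below 0x10.
    c = uerr.object[uerr.start]
    return (u"\\x%02x" % ord(c), uerr.start+1)

codecs.register_error("bsreplace", bsreplacer)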

Note that we are playing fast and loose with multi-byte sequences; instead of handling the entire sequence, we just replace the first byte and restart decoding after it. In some situations this can malfunction and produce garbled output.
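A variant that handles the entire reported span instead might look like the sketch below. It escapes every byte in object[start:end] and resumes at end, which also means that any byte the decoder included in the span gets escaped, even one that would have been valid on its own. The name bsreplace_all is hypothetical, and bytearray is used on the assumption that you want the same code to see integer byte values on both Python 2.6+ and Python 3.

```python
import codecs

def bsreplace_all(uerr):
    # This sketch only deals with decoding errors; re-raise anything else.
    if not isinstance(uerr, UnicodeDecodeError):
        raise uerr
    # bytearray yields an integer for each byte on Python 2.6+ and 3.
    span = bytearray(uerr.object[uerr.start:uerr.end])
    escaped = u"".join(u"\\x%02x" % b for b in span)
    # Resume after the whole problem span, not just its first byte.
    return (escaped, uerr.end)

codecs.register_error("bsreplace_all", bsreplace_all)

result = b"a\xffb".decode("utf-8", "bsreplace_all")
print(result)
```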

(One would think that the standard "backslashreplace" error handler would already do this, but unfortunately it doesn't handle decoding errors, only encoding errors.)


Comments on this page:

From 65.172.155.230 at 2008-11-13 18:18:44:

Hmm, if you say so.

I recently just gave up with python's unicode handling and ported the utf8 functions from http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c into yum, you want at least:

http://yum.baseurl.org/gitweb?p=yum.git;a=commitdiff;h=ea1caa2bdd7e9d8fd22e2eeebef53ad810c94fdb

http://yum.baseurl.org/gitweb?p=yum.git;a=commitdiff;h=809f033c400b124668f74b44834cf71a76a4fe13

...if you want the copy/paste method. I'm probably going to turn them into their own module eventually (probably as soon as I/someone hit the same pain anywhere else with python code and want it).

Written on 13 November 2008.

Last modified: Thu Nov 13 00:18:24 2008