Some thoughts on Python 3 module APIs and Unicode conversion errors

September 2, 2016

After yesterday's concrete discussion I've kept thinking about some issues involved when designing Python 3 module APIs for things that may run into Unicode conversion errors. I won't say I've come up with principles, but I do have a few thoughts and ideas.

Let's suppose that you're dealing with some form of structured external data, something that at least nominally has a specification; maybe a file archive format, or XML, or a network protocol, or the like. The big split is whether you consider a Unicode decoding (or encoding) error to mean that the data is malformed, just like any other form of corruption. Sometimes this is going to be very clear, such as if the specification mandates a specific encoding or at least says that all data will be correctly encoded in a named encoding. Other times it will be equally clear that there is no such rule; the format simply shrugs its shoulders and says 'all bytes are valid except maybe one or two'.

(Even if the format officially is 'all bytes accepted', you might decide that your module is only going to deal with the subset of the format where things are correctly encoded. Presumably you will document this.)

If Unicode conversion errors are fatal things that mean the data is malformed, the only question is whether your module will encapsulate them and raise your own module.BadFormat exception or whether you will let the underlying exception escape untouched. My strong opinion is that your module should wrap underlying exceptions, because this is far more predictable and easier for callers. Python 3 makes this exceptionally easy, so I feel there's very little excuse for doing otherwise.
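As a rough sketch of what wrapping can look like, here is one way to turn a decoding failure into a module-level exception using Python 3's exception chaining. The BadFormat class follows the name used above; the parse_name function and the assumption that the field must be UTF-8 are made up for illustration.

```python
class BadFormat(Exception):
    """Raised for any malformed input, including badly encoded text."""


def parse_name(raw: bytes) -> str:
    """Decode a name field that this hypothetical format says is UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        # 'from exc' chains the original error, so it stays available
        # to callers (and in tracebacks) as __cause__.
        raise BadFormat(f"name field is not valid UTF-8: {exc.reason}") from exc
```

Callers then only need to catch BadFormat, and anyone who wants the details can still look at the exception's `__cause__`.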

(Of course various modules in the standard library disagree with me by not catching, say, IOError and OSError despite performing IO. In practice tarfile and zipfile can raise all sorts of exceptions, not just their own documented ones.)

If your format is not one where things must always be properly encoded, then obviously your module should provide some way to deal with this. The official Python 3 approach is PEP 383's surrogateescape, and your format might also have its own rules for how to try to decode and encode things, or at least common heuristics you should try before falling back on the blunt hammer of PEP 383. I'm not going to object if you provide a way to change the error behavior here, but after thinking it over I don't think it's always required.
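For illustration, this is roughly what the surrogateescape approach looks like in practice. The decode_field and encode_field helpers are hypothetical names, and UTF-8 as the nominal encoding is an assumption; the point is only that undecodable bytes survive a round trip.

```python
def decode_field(raw: bytes) -> str:
    # Bytes that are not valid UTF-8 become lone surrogates in the
    # range U+DC80..U+DCFF instead of raising UnicodeDecodeError.
    return raw.decode("utf-8", errors="surrogateescape")


def encode_field(text: str) -> bytes:
    # The same error handler turns those surrogates back into the
    # original bytes when encoding.
    return text.encode("utf-8", errors="surrogateescape")


raw = b"caf\xe9.txt"          # Latin-1 bytes, not valid UTF-8
text = decode_field(raw)      # no exception; the bad byte is escaped
roundtrip = encode_field(text)
```

Here `roundtrip` is byte-for-byte identical to `raw`, which is what makes this a reasonable blunt hammer when the format has no better rule.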

(There will be situations where your caller knows that everything should be properly encoded and wants to error out if this isn't the case, rather than having to carefully check each string you return.)
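One possible shape for such an option is sketched below. The read_name function, its strict flag, and the has_escaped_bytes helper are all invented for this example, as is the assumption that names are nominally UTF-8.

```python
class BadFormat(Exception):
    """Raised when the input is considered malformed."""


def read_name(raw: bytes, strict: bool = False) -> str:
    """Decode a name; in strict mode, bad encoding is a format error."""
    handler = "strict" if strict else "surrogateescape"
    try:
        return raw.decode("utf-8", errors=handler)
    except UnicodeDecodeError as exc:
        raise BadFormat("name is not valid UTF-8") from exc


def has_escaped_bytes(text: str) -> bool:
    # What a caller would otherwise have to run on every returned
    # string: look for the lone surrogates that surrogateescape uses.
    return any("\udc80" <= ch <= "\udcff" for ch in text)
```

With `strict=True` the caller gets one exception at the point of failure instead of having to run something like has_escaped_bytes over everything the module hands back.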

Your module is generally more broadly useful if you're willing to consider Unicode decoding problems to not be 'thing is corrupt' errors at least some of the time, regardless of what the format says. The reality of Internet life is that other people's software does make encoding mistakes, and sometimes it's nice to be able to process as much as possible anyways.

I do think that all modules need to document how they fall on the 'everything is properly encoded or bust' line (well, all modules that deal with the outside world). Even if this is just 'all <X> is encoded in UTF-8'.

Written on 02 September 2016.

Last modified: Fri Sep 2 02:34:32 2016