Some thoughts on Python 3 module APIs and Unicode conversion errors
After yesterday's concrete discussion I've kept thinking about some issues involved when designing Python 3 module APIs for things that may run into Unicode conversion errors. I won't say I've come up with principles, but I do have a few thoughts and ideas.
Let's suppose that you're dealing with some form of structured external data, something that at least nominally has a specification; maybe a file archive format, or XML, or a network protocol, or the like. The big split is whether you consider a Unicode decoding (or encoding) error to make the data malformed, just like other ways of corrupting the data. Sometimes this is going to be very clear, such as when the specification mandates either a specific encoding or at least says that all data will be correctly encoded in a named encoding. Other times it will be equally clear that there is no such rule; the format simply shrugs its shoulders and says 'all bytes are valid except maybe one or two'.
(Even if the format officially is 'all bytes accepted', you might decide that your module is only going to deal with the subset of the format where things are correctly encoded. Presumably you will document this.)
If Unicode conversion errors are fatal things that mean the data is malformed, the only question is whether your module will encapsulate them and raise your own module.BadFormat exception or whether you will let the underlying exception escape untouched. My strong opinion is that your module should wrap underlying exceptions, because this is far more predictable and easier for callers. Python 3 makes this exceptionally easy, so I feel there's very little excuse for doing otherwise.
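As a minimal sketch of what I mean, here is a hypothetical module with its own BadFormat exception that wraps a decode failure using Python 3's 'raise ... from ...' exception chaining (the names and the 'names must be UTF-8' rule are invented for illustration):

```python
class BadFormat(Exception):
    """Raised when the data is malformed, including Unicode encoding errors."""

def read_member_name(raw: bytes) -> str:
    # Our hypothetical format mandates UTF-8 member names, so a decode
    # failure means the data itself is malformed. We wrap the error in
    # our own exception while preserving the original one via chaining.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise BadFormat(f"invalid member name: {raw!r}") from exc
```

Callers now only have to catch BadFormat, and the original UnicodeDecodeError is still available on the exception's __cause__ attribute if they want the details.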
(Of course, various modules in the standard library disagree with me by not catching, say, OSError despite performing IO. In practice tarfile and zipfile can raise all sorts of exceptions, not just their own documented ones.)
If your format is not one where things must always be properly encoded, then obviously your module should provide some way to deal with this. The official Python 3 approach is PEP 383's surrogateescape error handler, and your format might also have its own rules for how to try to decode and encode things, or at least common heuristics you should try before falling back on the blunt hammer of PEP 383. I'm not going to object if you provide a way to change the error behavior here, but after thinking it over I don't think it's always required.
(There will be situations where your caller knows that everything should be properly encoded and wants to error out if this isn't the case, rather than having to carefully check each string you return.)
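One way to give callers that choice is to expose the standard errors= knob that Python's own codecs use, defaulting to surrogateescape but letting strict-minded callers pass 'strict'. A sketch (the function name is invented; the handler names are the standard codec ones):

```python
def decode_name(raw: bytes, errors: str = "surrogateescape") -> str:
    # PEP 383's surrogateescape smuggles undecodable bytes through as
    # lone surrogate code points, so re-encoding with the same handler
    # round-trips the original bytes exactly. Callers who want to fail
    # fast on bad data can pass errors="strict" instead.
    return raw.decode("utf-8", errors=errors)
```

The nice property of the surrogateescape default is that nothing is lost: a caller that later needs the original bytes back can re-encode the string with the same error handler.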
Your module is generally more broadly useful if you're willing to consider Unicode decoding problems to not be 'thing is corrupt' errors at least some of the time, regardless of what the format says. The reality of Internet life is that other people's software does make encoding mistakes, and sometimes it's nice to be able to process as much as possible anyways.
I do think that all modules need to document how they fall on the 'everything is properly encoded or bust' line (well, all modules that deal with the outside world). Even if this is just 'all <X> is encoded in UTF-8'.