== Some thoughts on Python 3 module APIs and Unicode conversion errors

After [[yesterday's concrete discussion Python3UnicodeAPIQuestion]] I've kept thinking about some of the issues involved in designing Python 3 module APIs for things that may run into Unicode conversion errors. I won't say I've come up with principles, but I do have a few thoughts and ideas.

Let's suppose that you're dealing with some form of structured external data, something that at least nominally has a specification; maybe a file archive format, or XML, or a network protocol, or the like. The big split is whether you consider a Unicode decoding (or encoding) error to make the data malformed, just like any other form of corruption. Sometimes this is going to be very clear, such as when the specification mandates a specific encoding or at least says that all data will be correctly encoded in a named encoding. Other times it will be equally clear that there is no such rule; the format simply shrugs its shoulders and says 'all bytes are valid except maybe one or two'.

(Even if the format officially is 'all bytes accepted', you might decide that your module is only going to deal with the subset of the format where things are correctly encoded. Presumably you will document this.)

If Unicode conversion errors are fatal things that mean the data is malformed, the only question is whether your module will encapsulate them and raise your own _module.BadFormat_ exception or whether you will let the underlying exception escape untouched. [[My strong opinion is that your module should wrap underlying exceptions WrappingExceptions]] because this is far more predictable and easier for callers. Python 3 makes this exceptionally easy so I feel there's very little excuse for doing otherwise.

(Of course various modules in the standard library disagree with me by not catching, say, _IOError_ and _OSError_ despite performing IO.
In practice tarfile and zipfile can raise all sorts of exceptions, not just their own documented ones.)

If your format is not one where things must always be properly encoded, then obviously your module should provide some way to deal with this. The official Python 3 approach is [[PEP 383 https://www.python.org/dev/peps/pep-0383/]]'s _surrogateescape_, and your format might also have its own rules for how to try to decode and encode things, or at least common heuristics you should try before falling back on the blunt hammer of [[PEP 383]]. I'm not going to object if you provide a way to change the error behavior here, but after thinking it over I don't think it's always required.

(There will be situations where your caller knows that everything should be properly encoded and wants to error out if this isn't the case, rather than having to carefully check each string you return.)

Your module is generally more broadly useful if you're willing to consider Unicode decoding problems to not be 'thing is corrupt' errors at least some of the time, regardless of what the format says. The reality of Internet life is that other people's software does make encoding mistakes, and sometimes it's nice to be able to process as much as possible anyways.

I do think that all modules (well, all modules that deal with the outside world) need to document where they fall on the 'everything is properly encoded or bust' line, even if this is just 'all is encoded in UTF-8'.
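
To make the two approaches discussed above a bit more concrete, here is a minimal sketch. Everything named in it is hypothetical and made up for illustration: the _BadFormat_ exception and the _decode_field()_ and _encode_field()_ helpers don't come from any real module. I'm also assuming that the convenient Python 3 way to wrap an exception is _raise ... from_, which keeps the original error available as ___cause___. The _errors_ argument is simply passed through to the standard _bytes.decode()_ and _str.encode()_ methods, so a caller can ask for PEP 383's _surrogateescape_ instead of the default strict behavior.

```python
class BadFormat(Exception):
    """Hypothetical module-level error: the input data is malformed."""


def decode_field(raw: bytes, encoding: str = "utf-8", errors: str = "strict") -> str:
    """Decode one field of the (imaginary) format.

    With errors="strict", a Unicode decoding failure is treated as
    malformed data and re-raised as BadFormat. With
    errors="surrogateescape", undecodable bytes are smuggled through
    as lone surrogates (PEP 383) instead of failing.
    """
    try:
        return raw.decode(encoding, errors)
    except UnicodeDecodeError as exc:
        # 'raise ... from' chains the original error as __cause__.
        raise BadFormat(f"field is not valid {encoding}: {exc}") from exc


def encode_field(text: str, encoding: str = "utf-8", errors: str = "strict") -> bytes:
    """Encode one field, wrapping encoding failures the same way."""
    try:
        return text.encode(encoding, errors)
    except UnicodeEncodeError as exc:
        raise BadFormat(f"field cannot be encoded as {encoding}: {exc}") from exc


if __name__ == "__main__":
    raw = b"caf\xe9.txt"  # Latin-1 bytes; not valid UTF-8

    # Strict mode: the caller only ever has to catch BadFormat.
    try:
        decode_field(raw)
    except BadFormat as exc:
        print("malformed:", exc)

    # Lenient mode: the bad byte survives a decode/encode round trip.
    name = decode_field(raw, errors="surrogateescape")
    print(encode_field(name, errors="surrogateescape") == raw)
```

Whether _errors_ should be a per-call argument like this, a module-wide setting, or not exposed at all is exactly the sort of API decision discussed above; this sketch just picks the simplest option. Note that a string decoded with _surrogateescape_ will fail if you later try to encode it strictly, so a caller who asks for the lenient mode has to keep using it on the way back out.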