Python 3 module APIs and the question of Unicode conversion errors
I have a little Python thing to log MIME attachment type information from Exim; as has been my practice for some time, it's currently written for Python 2. For various reasons beyond the scope of this entry, today I decided to see if I could get it running with Python 3. In the process, I ran into what I have decided to consider a Python 3 API design question.
My Python program peers inside tar, zip, and rar archives in order to get the extensions of files inside them, using the tarfile, zipfile, and rarfile modules for this; the first two are in the standard library, the third is a PyPi addon. This means that the modules (and I) are dealing with file names that may come in from the outside world in essentially any encoding, or even none (as some joker may have stuffed random bytes into the alleged filenames, especially for tar archives). So, how do the modules behave here?
Neither tarfile nor zipfile make any special comments about the
file names that they return; in Python 3, this means that they
should at least be returning regular (Unicode) strings all of the
time, with no surprise bytestrings if they can't decode things.
Rarfile supports both Python 2.7 and Python 3, so it sensibly
explicitly specifies that its filenames are always Unicode. Tarfile
has an explicit section on Unicode issues that
answers all of my questions; the default behavior is sensible and
you can change it if you want. Both zipfile and rarfile are
more or less silent about Unicode issues for reading filenames in
archives. Code inspection of
zipfile.py in Python 3.5 reveals
that it makes no attempt to handle Unicode decoding errors when
decoding filenames; if any occur, they will be passed up to you
(and there is nothing you can do to set an error handling strategy).
Rarfile attempts several encodings and if that fails, tries the
default charset with a hard-coded 'replace' error handler.
(On the other hand, many ZIP archives should theoretically not have filename decoding errors because the filenames should explicitly be in UTF-8 and zipfile decodes them as such. But I'm a sysadmin and I deal with network input.)
These three modules represent three different approaches to handling potential Unicode decoding errors in Python 3 in your API (and to documenting them); just assume that you're working in a properly encoded world (zipfile), fully delegate to the user (tarfile), or make a best effort and then punt (rarfile). Since two of these are in the standard library, I'm going to assume that there's no consensus so far on the right sort of API here among the Python 3 community.
My personal preference is for the tarfile approach, since it clearly is the most flexible and powerful. However I think there's a reasonably coherent argument for the zipfile approach under some situations, namely that the module is (probably) not designed to deal with malformed ZIP archives in general. I'd certainly like it if the zipfile module didn't blow up on malformed ZIP archives, but my usage case is a somewhat odd one; most people aren't parsing potentially malicious ZIP archives.
(Tarfile has no choice here, as there is no standard for what the filename encoding is in tar archives. A correctly formed ZIP archive that says 'this filename is UTF-8' should always have a filename that actually is UTF-8 and will decode without errors.)