== Python 3 module APIs and the question of Unicode conversion errors I have [[a little Python thing https://github.com/siebenmann/exim-attachment-logger]] to [[log MIME attachment type information from Exim ../sysadmin/EximOurAttachmentLogging]]; as has been my practice for some time, it's currently written for Python 2. For various reasons beyond the scope of this entry, today I decided to see if I could get it running with Python 3. In the process, I ran into what I have decided to consider a Python 3 API design question. My Python program peers inside tar, zip, and rar archives in order to get the extensions of files inside them, using the [[tarfile https://docs.python.org/3/library/tarfile.html]], [[zipfile https://docs.python.org/3/library/zipfile.html]], and [[rarfile https://pypi.python.org/pypi/rarfile/]] modules for this; the first two are in the standard library, the third is a PyPi addon. This means that the modules (and I) are dealing with file names that may come in from the outside world in essentially any encoding, or even none (as some joker may have stuffed random bytes into the alleged filenames, especially for tar archives). So, how do the modules behave here? Neither tarfile nor zipfile make any special comments about the file names that they return; in Python 3, this means that they should at least be returning regular (Unicode) strings all of the time, with no surprise bytestrings if they can't decode things. Rarfile supports both Python 2.7 and Python 3, so it sensibly explicitly specifies that its filenames are always Unicode. Tarfile has [[an explicit section on Unicode issues https://docs.python.org/3/library/tarfile.html#tar-unicode]] that answers all of my questions; the default behavior is sensible and you can change it if you want. Both [[zipfile]] and [[rarfile]] are more or less silent about Unicode issues for reading filenames in archives. Code inspection of _zipfile.py_ in Python 3.5 reveals that it makes no attempt to handle Unicode decoding errors when decoding filenames; if any occur, they will be passed up to you (and there is nothing you can do to set an error handling strategy). Rarfile attempts several encodings and if that fails, tries the default charset with a hard-coded 'replace' error handler. (On the other hand, many ZIP archives should theoretically not have filename decoding errors because the filenames should explicitly be in UTF-8 and [[zipfile]] decodes them as such. But I'm a sysadmin and [[I deal with network input ../web/BasicWebsiteSecurity]].) These three modules represent three different approaches to handling potential Unicode decoding errors in Python 3 in your API (and to documenting them); just assume that you're working in a properly encoded world (zipfile), fully delegate to the user (tarfile), or make a best effort and then punt (rarfile). Since two of these are in the standard library, I'm going to assume that there's no consensus so far on the right sort of API here among the Python 3 community. My personal preference is for the tarfile approach, since it clearly is the most flexible and powerful. However I think there's a reasonably coherent argument for the zipfile approach under some situations, namely that the module is (probably) not designed to deal with malformed ZIP archives in general. I'd certainly like it if the zipfile module didn't blow up on malformed ZIP archives, but my usage case is a somewhat odd one; most people aren't parsing potentially malicious ZIP archives. (Tarfile has no choice here, as there is no standard for what the filename encoding is in tar archives. A correctly formed ZIP archive that says 'this filename is UTF-8' should always have a filename that actually is UTF-8 and will decode without errors.)