2016-08-31
Python 3 module APIs and the question of Unicode conversion errors
I have a little Python thing to log MIME attachment type information from Exim; as has been my practice for some time, it's currently written for Python 2. For various reasons beyond the scope of this entry, today I decided to see if I could get it running with Python 3. In the process, I ran into what I have decided to consider a Python 3 API design question.
My Python program peers inside tar, zip, and rar archives in order to get the extensions of files inside them, using the tarfile, zipfile, and rarfile modules for this; the first two are in the standard library, the third is a PyPI add-on. This means that the modules (and I) are dealing with file names that may come in from the outside world in essentially any encoding, or even none (as some joker may have stuffed random bytes into the alleged filenames, especially for tar archives). So, how do the modules behave here?
Neither tarfile nor zipfile makes any special comments about the file names that they return; in Python 3, this means that they should at least be returning regular (Unicode) strings all of the time, with no surprise bytestrings if they can't decode things. Rarfile supports both Python 2.7 and Python 3, so it sensibly explicitly specifies that its filenames are always Unicode.
Tarfile has an explicit section on Unicode issues that answers all of my questions; the default behavior is sensible and you can change it if you want. Both zipfile and rarfile are more or less silent about Unicode issues for reading filenames in archives. Code inspection of zipfile.py in Python 3.5 reveals that it makes no attempt to handle Unicode decoding errors when decoding filenames; if any occur, they will be passed up to you (and there is nothing you can do to set an error handling strategy). Rarfile attempts several encodings and if that fails, tries the default charset with a hard-coded 'replace' error handler.
(On the other hand, many ZIP archives should theoretically not have filename decoding errors because the filenames should explicitly be in UTF-8 and zipfile decodes them as such. But I'm a sysadmin and I deal with network input.)
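As a rough illustration of what 'passed up to you' means for zipfile, here is a minimal sketch. The helper names (zip_member_names, make_zip) and the hand-damaged archive are assumptions made up for this example; the damage is simulated by overwriting the UTF-8 filename bytes of a valid archive with bytes that are not valid UTF-8 while leaving the 'this filename is UTF-8' flag set.

```python
import io
import zipfile


def zip_member_names(data):
    """Return the member names of a ZIP archive held in bytes, or None if
    zipfile could not decode a filename (hypothetical helper)."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            return zf.namelist()
    except UnicodeDecodeError:
        # zipfile offers no error handler for filenames, so catching the
        # exception is the only option left to the caller.
        return None


def make_zip(name):
    """Build a one-member, uncompressed ZIP archive in memory."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # A non-ASCII name is written as UTF-8 with the UTF-8 flag bit set.
        zf.writestr(zipfile.ZipInfo(name), b"hello")
    return buf.getvalue()


good = make_zip("caf\u00e9.txt")
# Simulate a hostile archive: same UTF-8 flag, invalid bytes in the name.
bad = good.replace("\u00e9".encode("utf-8"), b"\xff\xfe")

if __name__ == "__main__":
    print(zip_member_names(good))
    print(zip_member_names(bad))
```

The point of wrapping things this way is only that the whole archive becomes unreadable on a bad filename; there is no way to ask zipfile for a 'replace' style partial result.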
These three modules represent three different approaches to handling potential Unicode decoding errors in Python 3 in your API (and to documenting them); just assume that you're working in a properly encoded world (zipfile), fully delegate to the user (tarfile), or make a best effort and then punt (rarfile). Since two of these are in the standard library, I'm going to assume that there's no consensus so far on the right sort of API here among the Python 3 community.
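The 'best effort and then punt' approach can be sketched in a few lines. This is not rarfile's actual code; the function name, the list of encodings, and the ASCII fallback are all assumptions chosen to show the general shape of the technique.

```python
def decode_best_effort(raw, encodings=("utf-8",), fallback="ascii"):
    """Try each encoding strictly; if all fail, decode with the fallback
    charset and a hard-coded 'replace' error handler (illustrative only)."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Punt: never raise, at the cost of U+FFFD replacement characters.
    return raw.decode(fallback, "replace")


if __name__ == "__main__":
    print(decode_best_effort(b"caf\xc3\xa9"))
    print(decode_best_effort(b"caf\xe9", encodings=("utf-8", "latin-1")))
    print(ascii(decode_best_effort(b"\xff\xfe")))
```

The caller always gets a string back, but has no way to tell a clean decode from one that fell through to the replacement path unless it looks for U+FFFD itself.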
My personal preference is for the tarfile approach, since it is clearly the most flexible and powerful. However, I think there's a reasonably coherent argument for the zipfile approach in some situations, namely that the module is (probably) not designed to deal with malformed ZIP archives in general. I'd certainly like it if the zipfile module didn't blow up on malformed ZIP archives, but my use case is a somewhat odd one; most people aren't parsing potentially malicious ZIP archives.
(Tarfile has no choice here, as there is no standard for what the filename encoding is in tar archives. A correctly formed ZIP archive that says 'this filename is UTF-8' should always have a filename that actually is UTF-8 and will decode without errors.)
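Here is a minimal sketch of what delegating to the user looks like with tarfile, using the encoding and errors arguments that tarfile.open() accepts. The archive is built in memory with a deliberately non-UTF-8 member name; the helper names and the sample name are assumptions for illustration, and the old USTAR format is used only to keep the example simple.

```python
import io
import tarfile

RAW_NAME = b"bad\xff.txt"  # made-up member name that is not valid UTF-8


def make_tar(raw_name):
    """Build a one-member tar archive in memory with the given raw name."""
    # surrogateescape lets the undecodable byte ride along inside a str.
    name = raw_name.decode("utf-8", "surrogateescape")
    payload = b"hello"
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT,
                      encoding="utf-8", errors="surrogateescape") as tf:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tf.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def tar_member_names(data, errors):
    """List member names, decoding them with the caller's error handler."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:",
                      encoding="utf-8", errors=errors) as tf:
        return tf.getnames()


archive = make_tar(RAW_NAME)

if __name__ == "__main__":
    print(ascii(tar_member_names(archive, "replace")))
    print(ascii(tar_member_names(archive, "surrogateescape")))
```

With 'surrogateescape' (which, as far as I know, is the documented default in Python 3) the original bytes can be recovered by encoding the name again with the same handler; with 'replace' they are gone but the name is always printable.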
The various IDs of disks, filesystems, software RAID, LVM, et al in Linux
Once upon a time, you put simple /dev/sdX names in your /etc/fstab. These days that's boring and deprecated, and so there are a large number of different identifiers that you can use here. Since I just confused myself on this today, I want to write down what I know and looked up about the various levels and sorts of identifiers, and where they come from. What I care about here are identifiers that are tied to a specific piece of hardware or data, instead of where that hardware is plugged into the system (or the order in which it's recognized during boot, which can totally jump around even when no hardware changes).
Some filesystems have labels, or at least can have labels, and years ago it was common for Linux installs to set labels on your filesystems and use them in /etc/fstab via 'LABEL=...'. This has fallen out of favour since then, for reasons I can only theorize about. ExtN is one such filesystem, and labels can be inspected (and perhaps set) with e2label. Modern Linux distributions seem to no longer set a label on the extN filesystems that they create during installation. Just to confuse you, extN filesystems also keep track of where they were last mounted (or are mounted), which is different from the extN label, and some tools will present this as the 'name' of the filesystem.
(e2label is effectively obsolete today; you should use blkid.)
Many filesystems have UUIDs, as do swap areas, software RAID arrays, LVM objects, and a number of other things. UUIDs are what is commonly used in /etc/fstab these days, and can be displayed with eg 'lsblk -fs'. The blkid command is generally the master source of information about any particular thing. Like labels, UUIDs are embedded in the on-disk metadata of various things; for extN filesystems the filesystem UUID is in the superblock, for example. Where software RAID stores its metadata varies and can matter for some things. Note that software RAID has both a UUID for the overall array and a device UUID for each physical device in the array.
(As blkid will report, GPT partitions themselves have a partition label and a theoretically unique partition UUID. These can also be used in /etc/fstab, per the fstab manpage, but you probably don't want to. The GPT UUID is stored as part of the GPT partition table, not embedded in the partition itself.)
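To make the different /etc/fstab forms concrete, here is a small sketch that classifies the first field of each fstab line by the kind of identifier it uses. The tag names are the ones the fstab manpage documents (LABEL=, UUID=, PARTLABEL=, PARTUUID=); the sample fstab text, including every UUID and label in it, is invented for illustration, as are the function names.

```python
FSTAB_TAGS = ("LABEL", "UUID", "PARTLABEL", "PARTUUID")

# Entirely made-up example entries; none of these identifiers are real.
SAMPLE_FSTAB = """\
# <fs spec>  <mount point>  <type>  <options>  <dump>  <pass>
UUID=0a1b2c3d-1111-2222-3333-444455556666  /      ext4  defaults  0 1
LABEL=home                                 /home  ext4  defaults  0 2
PARTUUID=deadbeef-02                       /data  ext4  defaults  0 2
/dev/sdb1                                  /mnt   ext4  noauto    0 0
"""


def classify_fs_spec(spec):
    """Return (kind, value) for the first field of an fstab line."""
    tag, sep, value = spec.partition("=")
    if sep and tag in FSTAB_TAGS:
        return tag, value.strip('"')
    # Anything else is treated as a plain device path.
    return "PATH", spec


def fstab_specs(text):
    """Yield (kind, value) for every non-comment, non-blank fstab line."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        yield classify_fs_spec(line.split()[0])


if __name__ == "__main__":
    for kind, value in fstab_specs(SAMPLE_FSTAB):
        print(kind, value)
```

Note that UUID= and LABEL= refer to metadata inside the filesystem, while PARTUUID= and PARTLABEL= refer to the GPT partition table entry, which is part of why mixing them up is easy.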
Physical disks have serial numbers (and World Wide Names) that theoretically uniquely identify them. Where they're accessible, Linux reads these via SCSI, SAS, iSCSI, SATA, and so on inquiry commands, and uses this information to populate /dev/disk/by-id. In addition to actual disks, generally anything that appears as a disk-like device with a UUID (or a name) will also show up in /dev/disk/by-id. Thus you can find things like software RAID arrays (by name and UUID), LVM physical volumes, and LVM logical volumes (by name and ID).
(I believe that some USB disk enclosures don't pass through the necessary stuff for Linux to get the disk's serial number.)
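Since the entries under /dev/disk/by-id (and its siblings such as /dev/disk/by-uuid and /dev/disk/by-label) are symbolic links to the real device nodes, one way to see which identifiers point at which device is simply to walk those directories. This is a minimal sketch; the function name is made up, and what you actually see depends entirely on your system.

```python
import os


def disk_id_links(base="/dev/disk"):
    """Map each by-* directory under base to {identifier: real device path}."""
    result = {}
    if not os.path.isdir(base):
        return result
    for kind in sorted(os.listdir(base)):
        kind_dir = os.path.join(base, kind)
        if not os.path.isdir(kind_dir):
            continue
        links = {}
        for name in sorted(os.listdir(kind_dir)):
            path = os.path.join(kind_dir, name)
            if os.path.islink(path):
                # Resolve the symlink to the device it currently points at.
                links[name] = os.path.realpath(path)
        result[kind] = links
    return result


if __name__ == "__main__":
    for kind, links in disk_id_links().items():
        print(kind)
        for name, device in links.items():
            print("   ", name, "->", device)
```

Inverting the mapping (device to list of identifiers) is an easy way to see how many different names a single disk or volume ends up with.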
Sometimes this can get confusing because the same object winds up with multiple IDs at different levels. A software RAID array or an LVM logical volume that contains an extN filesystem has both a UUID for the filesystem and a UUID for the array or volume, and it may not be clear which UUID you're actually using unless you look in detail. Using blkid is generally fairly clear, fortunately; lsblk's default output is not so much from what I've seen.
(If you're looking at an /etc/fstab generated by an installer or the like, they generally use the filesystem UUID.)