Wandering Thoughts archives

2016-08-31

Python 3 module APIs and the question of Unicode conversion errors

I have a little Python thing to log MIME attachment type information from Exim; as has been my practice for some time, it's currently written for Python 2. For various reasons beyond the scope of this entry, today I decided to see if I could get it running with Python 3. In the process, I ran into what I have decided to consider a Python 3 API design question.

My Python program peers inside tar, zip, and rar archives in order to get the extensions of files inside them, using the tarfile, zipfile, and rarfile modules for this; the first two are in the standard library, the third is a PyPI add-on. This means that the modules (and I) are dealing with file names that may come in from the outside world in essentially any encoding, or even none (as some joker may have stuffed random bytes into the alleged filenames, especially for tar archives). So, how do the modules behave here?

Neither tarfile nor zipfile makes any special comments about the file names that it returns; in Python 3, this means that they should at least be returning regular (Unicode) strings all of the time, with no surprise bytestrings if they can't decode things. Rarfile supports both Python 2.7 and Python 3, so it sensibly explicitly specifies that its filenames are always Unicode.

Tarfile has an explicit section on Unicode issues that answers all of my questions; the default behavior is sensible and you can change it if you want. Both zipfile and rarfile are more or less silent about Unicode issues for reading filenames in archives. Code inspection of zipfile.py in Python 3.5 reveals that it makes no attempt to handle Unicode decoding errors when decoding filenames; if any occur, they will be passed up to you (and there is nothing you can do to set an error handling strategy). Rarfile attempts several encodings and if that fails, tries the default charset with a hard-coded 'replace' error handler.
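To make the tarfile behavior concrete, here is a minimal sketch rather than anything from my actual program. It builds a tiny in-memory tar archive whose one member name contains a byte that isn't valid UTF-8, then reads the name back under two different error handlers. The helper names (make_tar_with_raw_name, list_names) are made up for this illustration, and I force the ustar format and UTF-8 so that the result doesn't depend on the platform defaults.

```python
import io
import tarfile


def make_tar_with_raw_name(raw_name: bytes) -> bytes:
    """Build an in-memory ustar archive with one empty member.

    The member name is given as raw bytes; surrogateescape lets us
    smuggle undecodable bytes through a str and back out unchanged.
    """
    name = raw_name.decode("utf-8", "surrogateescape")
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT,
                      encoding="utf-8", errors="surrogateescape") as tf:
        tf.addfile(tarfile.TarInfo(name))
    return buf.getvalue()


def list_names(data: bytes, errors: str = "surrogateescape") -> list:
    """Return member names, decoding them with the given error handler."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r",
                      encoding="utf-8", errors=errors) as tf:
        return tf.getnames()


if __name__ == "__main__":
    data = make_tar_with_raw_name(b"caf\xe9.txt")  # Latin-1 bytes, not UTF-8
    # Default handler: the bad byte becomes a lone surrogate and can be
    # turned back into the original bytes later.
    print([ascii(n) for n in list_names(data)])
    # 'replace' handler: the bad byte becomes U+FFFD and is lost.
    print([ascii(n) for n in list_names(data, errors="replace")])
```

For my purposes (I only want the file extension), either handler would do, since the extension survives as long as the damage is confined to the rest of the name.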

(On the other hand, many ZIP archives should theoretically not have filename decoding errors because the filenames should explicitly be in UTF-8 and zipfile decodes them as such. But I'm a sysadmin and I deal with network input.)

These three modules represent three different approaches to handling potential Unicode decoding errors in Python 3 in your API (and to documenting them): just assume that you're working in a properly encoded world (zipfile), fully delegate to the user (tarfile), or make a best effort and then punt (rarfile). Since two of these are in the standard library, I'm going to assume that there's no consensus so far on the right sort of API here among the Python 3 community.
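Stripped of the archive handling, the three approaches come down to three shapes of decoding function. The following is only a sketch of those shapes; the function names, the default encodings, and the candidate list are my own inventions for illustration and are not what zipfile, tarfile, or rarfile actually do internally.

```python
def decode_strict(raw: bytes) -> str:
    """Assume a properly encoded world; any decoding error reaches the caller."""
    return raw.decode("utf-8")


def decode_delegated(raw: bytes, encoding: str = "utf-8",
                     errors: str = "surrogateescape") -> str:
    """Let the caller choose both the encoding and the error handler."""
    return raw.decode(encoding, errors)


def decode_best_effort(raw: bytes, candidates=("utf-8", "ascii"),
                       fallback: str = "utf-8") -> str:
    """Try several encodings strictly, then punt with a fixed 'replace'."""
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return raw.decode(fallback, "replace")


if __name__ == "__main__":
    bad = b"\xff.txt"  # 0xff can never appear in valid UTF-8
    print(ascii(decode_delegated(bad)))
    print(ascii(decode_best_effort(bad)))
    try:
        decode_strict(bad)
    except UnicodeDecodeError as e:
        print("strict decoding failed:", e.reason)
```

Only the delegated version lets the caller get the original bytes back; the best-effort version never fails but silently loses information, and the strict version pushes exception handling onto everyone who uses it.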

My personal preference is for the tarfile approach, since it clearly is the most flexible and powerful. However, I think there's a reasonably coherent argument for the zipfile approach under some situations, namely that the module is (probably) not designed to deal with malformed ZIP archives in general. I'd certainly like it if the zipfile module didn't blow up on malformed ZIP archives, but my use case is a somewhat odd one; most people aren't parsing potentially malicious ZIP archives.

(Tarfile has no choice here, as there is no standard for what the filename encoding is in tar archives. A correctly formed ZIP archive that says 'this filename is UTF-8' should always have a filename that actually is UTF-8 and will decode without errors.)

python/Python3UnicodeAPIQuestion written at 22:54:44

The various IDs of disks, filesystems, software RAID, LVM, et al in Linux

Once upon a time, you put simple /dev/sdX names in your /etc/fstab. These days that's boring and deprecated, and so there are a large number of different identifiers that you can use here. Since I just confused myself on this today, I want to write down what I know and looked up about the various levels and sorts of identifiers, and where they come from. What I care about here are identifiers that are tied to a specific piece of hardware or data, instead of where that hardware is plugged into the system (or the order in which it's recognized during boot, which can totally jump around even when no hardware changes).

Some filesystems have labels, or at least can have labels, and years ago it was common for Linux installs to set labels on your filesystems and use them in /etc/fstab via LABEL=.... This has fallen out of favour since then, for reasons I can only theorize about. ExtN is one such filesystem, and labels can be inspected (and perhaps set) with e2label. Modern Linux distributions seem to no longer set a label on the extN filesystems that they create during installation. Just to confuse you, extN filesystems also keep track of where they were last mounted (or are mounted), which is different from the extN label, and some tools will present this as the 'name' of the filesystem.

(e2label is effectively obsolete today; you should use blkid.)

Many filesystems have UUIDs, as do swap areas, software RAID arrays, LVM objects, and a number of other things. UUIDs are what is commonly used in /etc/fstab these days, and can be displayed with e.g. 'lsblk -fs'. The blkid command is generally the master source of information about any particular thing. Like labels, UUIDs are embedded in the on-disk metadata of various things; for extN filesystems the filesystem UUID is in the superblock, for example. Where software RAID stores its metadata varies and can matter for some things. Note that software RAID has both a UUID for the overall array and a device UUID for each physical device in the array.

(As blkid will report, GPT partitions themselves have a partition label and a theoretically unique partition UUID. These can also be used in /etc/fstab, per the fstab manpage, but you probably don't want to. The GPT UUID is stored as part of the GPT partition table, not embedded in the partition itself.)
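If you want to see which sort of identifier each line of an fstab is using, the first field can be sorted into these kinds mechanically. This is a rough sketch under my own assumptions: the function names are invented, the tag list (LABEL, UUID, PARTLABEL, PARTUUID) is what I understand the fstab manpage to accept, and anything without one of those prefixes is simply lumped together as a path.

```python
FSTAB_TAGS = ("LABEL", "UUID", "PARTLABEL", "PARTUUID")


def classify_fstab_spec(spec: str):
    """Return (kind, value) for the first field of an fstab line."""
    for tag in FSTAB_TAGS:
        prefix = tag + "="
        if spec.startswith(prefix):
            # Values may be quoted, eg LABEL="my root".
            return tag, spec[len(prefix):].strip('"')
    return "PATH", spec


def parse_fstab(text: str):
    """Return a list of (kind, value, mountpoint), skipping comments."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        kind, value = classify_fstab_spec(fields[0])
        entries.append((kind, value, fields[1]))
    return entries


if __name__ == "__main__":
    # Reads the real file; on a system without one this raises OSError.
    with open("/etc/fstab") as f:
        for kind, value, mountpoint in parse_fstab(f.read()):
            print(f"{mountpoint:20} {kind:10} {value}")
```

Note that this tells you what kind of identifier fstab claims to be using, not what sort of object that identifier actually belongs to; for that you still need blkid.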

Physical disks have serial numbers (and World Wide Names) that theoretically uniquely identify them. Where they're accessible, Linux reads these via inquiry commands over SCSI, SAS, iSCSI, SATA, and so on, and uses this information to populate /dev/disk/by-id. In addition to actual disks, generally anything that appears as a disk-like device with a UUID (or a name) will also show up in /dev/disk/by-id. Thus you can find things like software RAID arrays (by name and UUID), LVM physical volumes, and LVM logical volumes (by name and ID).
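Since everything under /dev/disk is a symlink to the real device node, one way to see all of the names that a single device has collected is to invert those symlinks. The following is a small sketch of that idea; the function name is made up, and it assumes the usual udev layout where /dev/disk contains by-id and sibling directories (by-uuid, by-label, and so on), each holding symlinks.

```python
import os


def map_disk_ids(base: str = "/dev/disk") -> dict:
    """Return {real device path: ["by-xxx/name", ...]} for symlinks under base."""
    result = {}
    if not os.path.isdir(base):
        return result
    for kind in sorted(os.listdir(base)):
        kind_dir = os.path.join(base, kind)
        if not os.path.isdir(kind_dir):
            continue
        for name in sorted(os.listdir(kind_dir)):
            link = os.path.join(kind_dir, name)
            if os.path.islink(link):
                # Resolve the symlink to the device node it points at.
                target = os.path.realpath(link)
                result.setdefault(target, []).append(f"{kind}/{name}")
    return result


if __name__ == "__main__":
    for device, names in sorted(map_disk_ids().items()):
        print(device)
        for name in names:
            print("   ", name)
```

On a machine with software RAID or LVM, this makes it fairly obvious how many different names point at the same md or dm device.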

(I believe that some USB disk enclosures don't pass through the necessary stuff for Linux to get the disk's serial number.)

Sometimes this can get confusing because the same object winds up with multiple IDs at different levels. A software RAID array or an LVM logical volume that contains an extN filesystem has both a UUID for the filesystem and a UUID for the array or volume, and it may not be clear which UUID you're actually using unless you look in detail. Fortunately, blkid's output is generally fairly clear about this; lsblk's default output is less so, from what I've seen.

(If you're looking at an /etc/fstab generated by an installer or the like, they generally use the filesystem UUID.)

linux/IDsForDisksAndFilesystems written at 00:12:52

