Wandering Thoughts archives

2019-12-22

Filenames and paths should be a unique type and not a form of strings

I recently read John Goerzen's "The Fundamental Problem in Python 3", which talks about Python 3's issues in environments where filenames (and other things) are not in a uniform and predictable encoding. As part of this, he says:

[...]. Critically, most of the Python standard library treats a filename as a String – that is, a sequence of valid Unicode code points, which is a subset of the valid POSIX filenames.

[...]

From a POSIX standpoint, the correct action would have been to use the bytes type for filenames; this would mandate proper encode/decode calls by the user, but it would have been quite clear. [...]

This is correct only from a POSIX standpoint, and then only sort of (it's correct in traditional Unix filesystems but not necessarily all current ones; some current Unix filesystems can restrict filenames to properly encoded UTF-8). The reality of modern life for a language that wants to work on Windows as well as Unix is that filenames must be presented as a unique type, not any form of strings or bytes.

How filenames and paths are represented depends on the operating system, which means that for portability filenames and paths need to be an opaque type that you have to explicitly insert string-like information into and extract string-like information out of, specifying the encoding if you don't want an opaque byte sequence of unpredictable contents. As with all encoding related operations, this can fail in both directions under some circumstances.
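
As a rough illustration of what such an opaque type could look like, here is a minimal Python sketch. The FileName class and its from_text(), to_text(), and to_os() methods are hypothetical names made up for this example, not anything in the standard library. It also assumes the Unix case, where the underlying representation is a byte sequence; a Windows version would presumably wrap something else.

```python
class FileName:
    """Hypothetical opaque filename type (Unix flavour: wraps raw bytes)."""

    __slots__ = ("_raw",)

    def __init__(self, raw):
        # Only the OS-level representation is accepted here; text has to
        # come in through from_text() with an explicit encoding.
        if not isinstance(raw, bytes):
            raise TypeError("use FileName.from_text() to build from a str")
        self._raw = raw

    @classmethod
    def from_text(cls, text, encoding):
        # Inserting string-like information; raises UnicodeEncodeError if
        # the text cannot be represented in the requested encoding.
        return cls(text.encode(encoding))

    def to_text(self, encoding):
        # Extracting string-like information; raises UnicodeDecodeError if
        # the stored bytes are not valid in the requested encoding.
        return self._raw.decode(encoding)

    def to_os(self):
        # The opaque byte sequence, for handing to the operating system.
        return self._raw

    def __eq__(self, other):
        if not isinstance(other, FileName):
            return NotImplemented
        return self._raw == other._raw

    def __hash__(self):
        return hash(self._raw)

    def __repr__(self):
        return "FileName(%r)" % (self._raw,)


good = FileName.from_text("café.txt", "utf-8")
print(good.to_text("utf-8"))

# A made-up name that is Latin-1 encoded, as found on older Unix systems.
legacy = FileName(b"caf\xe9.txt")
print(legacy.to_text("latin-1"))
```

Both directions can fail in this sketch: from_text() raises UnicodeEncodeError when the text doesn't fit the encoding, and to_text() raises UnicodeDecodeError when the bytes aren't valid in it (for example, asking for the legacy name as UTF-8).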

Of course this is not the Python 3 way. The Python 3 way is to pretend that everything is fine and that the world is all UTF-8 and Unicode. This is pretty much the pragmatically correct choice, at least if you want to have Windows as a first class citizen of your world, but it is not really the correct way. As with all aspects of its handling of strings and Unicode, Python 3 chose convenience over reality and correctness, and has been patching up the resulting mess on Unix since its initial release.
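
One of the more visible patches is, as far as I know, the 'surrogateescape' error handler from PEP 383, which Python 3 uses on Unix to smuggle undecodable filename bytes through str. Here is a small sketch of the round trip. The filename is a made-up Latin-1 encoded one, and the sketch calls the codec directly instead of going through os.fsdecode(), because the result of that depends on the system's filesystem encoding.

```python
# A made-up filename whose bytes are Latin-1, not valid UTF-8.
raw = b"caf\xe9.txt"

# Undecodable bytes are turned into lone surrogate code points instead of
# raising an error; the byte 0xE9 becomes U+DCE9.
name = raw.decode("utf-8", errors="surrogateescape")
print(ascii(name))

# Encoding with the same error handler gives back the original bytes.
round_tripped = name.encode("utf-8", errors="surrogateescape")
print(round_tripped == raw)

# The resulting str is not real Unicode text, so a strict encode fails.
try:
    name.encode("utf-8")
    strict_encoding_works = True
except UnicodeEncodeError:
    strict_encoding_works = False
print(strict_encoding_works)
```

The filename survives the trip, but what you have in the middle is a str that can't be printed or strictly encoded, which is the sort of thing a distinct type would have made explicit.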

If Python was going to do this correctly, Python 3 would have been the time to do it; since it was breaking things in general, it could have introduced a distinct type and required that everything involving file names change to taking and returning that type. But that would have made porting Python 2 code harder and would have made it less likely that Python 3 was accepted by Python programmers, which is probably one reason it wasn't done.

(I don't think it was the only one; early Python 3 shows distinct signs that the Python developers had more or less decided to only support Unix systems where everything was proper UTF-8. This turned out to not be a viable position for them to maintain, so modern Python 3 is somewhat more accommodating of messy reality.)

FilenamesUniqueType written at 01:46:30

2019-12-14

It's unfortunately time to move away from using '/usr/bin/python'

For a long time, the way to make Python programs runnable on Unix has been to start them with '#!/usr/bin/python' or sometimes '#!/usr/bin/env python' (and then chmod them executable, of course; this makes them scripts). Unfortunately this is no longer a good idea for general Python programs, for the simple reason that current Unixes now disagree on what version of Python is '/usr/bin/python'. Instead, we all need to start explicitly specifying what version of Python we want by using '/usr/bin/python3' or '/usr/bin/python2' (or by having env explicitly run python3 or python2).
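
As a minimal sketch of what this looks like in a script, here is one that names its Python version in the '#!' line and also checks it at runtime, in case someone runs it with the wrong interpreter by hand. The require_python() helper and REQUIRED_MAJOR are hypothetical names for this example, and the choice of Python 3 is just an assumption about what the script targets.

```python
#!/usr/bin/env python3
# Keep this file free of Python 3 only syntax, so that the check below can
# still run and report the problem if it is started under Python 2.
import sys

REQUIRED_MAJOR = 3  # assumption: this script targets Python 3


def require_python(required_major=REQUIRED_MAJOR, version_info=None):
    """Exit with an error message unless we run under the wanted major version."""
    if version_info is None:
        version_info = sys.version_info
    if version_info[0] != required_major:
        # sys.exit() with a string prints it to stderr and exits with status 1.
        sys.exit("this script needs Python %d, but is running under Python %d.%d"
                 % (required_major, version_info[0], version_info[1]))


require_python()
print("running under Python %d.%d" % sys.version_info[:2])
```

The runtime check is optional; the important part is that the '#!' line no longer depends on whatever 'python' happens to mean on the system.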

For a long time, even after Python 3 came out, it seemed like /usr/bin/python would remain Python 2 in many environments (ones where you had Python 2 and Python 3 installed side by side). I expected a deprecation of /usr/bin/python as Python 2 to take years after Python 2 itself was no longer supported, for the simple reason that there are a lot of programs and instructions out there that expect their '#!/usr/bin/python' or 'python' to run Python 2. Changing what that meant seemed reasonably disruptive, even if it was the theoretically correct and pure way.

In reality, as I recently found out, Fedora 31 switched what /usr/bin/python means, and apparently Arch Linux did it several years ago. In theory PEP 394 describes the behavior here and this behavior is PEP-acceptable. In practice, before early July of 2019, PEP 394 said that 'python' should be Python 2 unless the user had explicitly changed it or a virtual environment was active. Then, well, there was a revision that basically threw up its hands and said that people could do whatever they wanted to with /usr/bin/python.

(This makes PEP 394 a documentation standard. As with all documentation standards, it needs to describe reality to be useful, and the reality is that /usr/bin/python is now completely unpredictable.)
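
If you want to see what a particular system actually does, one option is to ask each command directly. This is only a sketch; python_major_version() is a hypothetical helper written for this example, and it has to be run under a reasonably modern Python 3 itself (subprocess.run() with capture_output needs 3.7 or later).

```python
import shutil
import subprocess

# Runs under both Python 2 and Python 3 and prints just the major version.
PROBE = "import sys; print(sys.version_info[0])"


def python_major_version(command):
    """Return the major version that 'command' runs, or None if we can't tell."""
    path = shutil.which(command)
    if path is None:
        # No such command on $PATH.
        return None
    try:
        result = subprocess.run(
            [path, "-c", PROBE],
            capture_output=True,
            text=True,
            check=True,
            timeout=10,
        )
        return int(result.stdout.strip())
    except (OSError, subprocess.SubprocessError, ValueError):
        # The command exists but didn't behave like a Python interpreter.
        return None


for command in ("python", "python2", "python3"):
    print(command, "->", python_major_version(command))
```

On one machine 'python' may report 2, on another 3, and on a third None because there is no plain 'python' at all, which is the unpredictability in question.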

Since Fedora and Arch Linux have led the way here, other Linux distributions will probably follow. In particular, since Red Hat Enterprise Linux is more or less based on Fedora, I wouldn't be surprised to see RHEL 9 have /usr/bin/python be Python 3. I don't think Debian and thus Ubuntu will be quite this aggressive just yet, but I wouldn't be surprised if in a couple of years /usr/bin/python at least defaults to Python 3 on Ubuntu. (Hopefully Python 2 will still be available as a package.)

UsrBinPythonNoMore written at 00:55:20


