Filenames and paths should be a unique type and not a form of strings

December 22, 2019

I recently read John Goerzen's The Fundamental Problem in Python 3, which talks about Python 3's issues in environments where filenames (and other things) are not in a uniform and predictable encoding. As part of this, he says:

[...]. Critically, most of the Python standard library treats a filename as a String – that is, a sequence of valid Unicode code points, which is a subset of the valid POSIX filenames.

[...]

From a POSIX standpoint, the correct action would have been to use the bytes type for filenames; this would mandate proper encode/decode calls by the user, but it would have been quite clear. [...]

This is correct only from a POSIX standpoint, and then only sort of (it's correct in traditional Unix filesystems but not necessarily all current ones; some current Unix filesystems can restrict filenames to properly encoded UTF-8). The reality of modern life for a language that wants to work on Windows as well as Unix is that filenames must be presented as a unique type, not any form of strings or bytes.

How filenames and paths are represented depends on the operating system, which means that for portability filenames and paths need to be an opaque type that you have to explicitly insert string-like information into and extract string-like information out of, specifying the encoding if you don't want an opaque byte sequence of unpredictable contents. As with all encoding related operations, this can fail in both directions under some circumstances.
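As a rough sketch of what such an opaque type could look like in Python, assuming a Unix-style system where the native form of a filename is a byte sequence: the `OpaquePath` class and its method names are hypothetical and are not part of any standard library. The point is only that text goes in and comes out through explicit calls that name an encoding, and that either direction can fail.

```python
class OpaquePath:
    """Hypothetical opaque filename type; assumes the native form is bytes."""

    __slots__ = ("_raw",)

    def __init__(self, raw: bytes):
        # Refuse str so that text can only enter through from_text().
        if not isinstance(raw, bytes):
            raise TypeError("OpaquePath wraps raw bytes; use from_text() for str")
        self._raw = raw

    @classmethod
    def from_text(cls, text: str, encoding: str) -> "OpaquePath":
        # Inserting string-like information: may raise UnicodeEncodeError.
        return cls(text.encode(encoding))

    def to_text(self, encoding: str) -> str:
        # Extracting string-like information: may raise UnicodeDecodeError.
        return self._raw.decode(encoding)

    def to_bytes(self) -> bytes:
        # The opaque byte sequence, with no promise about its contents.
        return self._raw


utf8_name = OpaquePath.from_text("résumé.txt", "utf-8")
legacy_name = OpaquePath(b"r\xe9sum\xe9.txt")  # Latin-1 bytes, not valid UTF-8

print(utf8_name.to_text("utf-8"))
print(legacy_name.to_text("latin-1"))
```

Asking for `legacy_name.to_text("utf-8")` raises `UnicodeDecodeError`, and `OpaquePath.from_text("ü", "ascii")` raises `UnicodeEncodeError`, which is the 'can fail in both directions' part.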

Of course this is not the Python 3 way. The Python 3 way is to pretend that everything is fine and that the world is all UTF-8 and Unicode. This is pretty much the pragmatically correct choice, at least if you want to have Windows as a first class citizen of your world, but it is not really the correct way. As with all aspects of its handling of strings and Unicode, Python 3 chose convenience over reality and correctness, and has been patching up the resulting mess on Unix since its initial release.
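(One example of this patching, as I understand it, is the `surrogateescape` error handler from PEP 383, which `os.fsdecode()` and `os.fsencode()` use on Unix. The sketch below applies the handler directly with an explicit UTF-8 codec so that it behaves the same on any platform; the filename itself is made up for illustration.)

```python
# A filename in Latin-1 bytes; 0xE9 followed by ASCII is not valid UTF-8.
raw_name = b"r\xe9sum\xe9.txt"

# Undecodable bytes are smuggled into the str as lone surrogates (U+DC80..U+DCFF).
as_str = raw_name.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler recovers the original bytes exactly.
round_tripped = as_str.encode("utf-8", errors="surrogateescape")

# But the str is not a well-formed Unicode string, so a strict encode fails.
try:
    as_str.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(ascii(as_str), round_tripped == raw_name, strict_ok)
```

The resulting `str` looks like a string but blows up later if something tries to print or encode it strictly, which is the sort of deferred failure a distinct filename type would surface up front.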

If Python were going to do this correctly, Python 3 would have been the time to do it; since it was breaking things in general, it could have introduced a distinct type and required that everything involving file names change to taking and returning that type. But that would have made porting Python 2 code harder and would have made it less likely that Python 3 was accepted by Python programmers, which is probably one reason it wasn't done.

(I don't think it was the only one; early Python 3 shows distinct signs that the Python developers had more or less decided to only support Unix systems where everything was proper UTF-8. This turned out to not be a viable position for them to maintain, so modern Python 3 is somewhat more accommodating of messy reality.)


Comments on this page:

By anon at 2019-12-29 09:32:41:

extra points for os x which does unicode normalisation differently, eg. "ü" on windows/linux/bsd is just that, whereas on os x it's "u¨". fun times transferring files between systems...
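A small illustration of the normalisation difference described in this comment, using Python's standard `unicodedata` module. It assumes the two forms in question are the usual composed (NFC) and decomposed (NFD) ones; the exact rules a given OS X filesystem applies may differ in detail.

```python
import unicodedata

composed = "\u00fc"      # "ü" as a single code point
decomposed = "u\u0308"   # "u" followed by a combining diaeresis

# The two look the same when rendered but are different strings.
same_as_is = composed == decomposed

# Normalising both to one form makes them comparable.
same_nfc = unicodedata.normalize("NFC", decomposed) == composed
same_nfd = unicodedata.normalize("NFD", composed) == decomposed

# As UTF-8 filenames they are also different byte sequences.
composed_bytes = composed.encode("utf-8")      # b'\xc3\xbc'
decomposed_bytes = decomposed.encode("utf-8")  # b'u\xcc\x88'

print(same_as_is, same_nfc, same_nfd)
```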

By mk-fg at 2020-01-25 23:41:19:

I suspect long-term py3 might come around to your suggestion via pathlib, actually.

It wasn't there at first, then as an option, then supported in all os.* stuff, then recommended over os.path or open() and such, and next step might be phasing-out path strings entirely. And by that point, it can accommodate any kind of underlying os quirks internally.

Anecdotally, came to always use it for paths these days (as it's way more clean and convenient than os.path), especially with convenience stuff like .read_text()/.read_bytes() and .write_text()/.write_bytes() (which is like 80%-90% of what paths are used for).
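For reference, a minimal sketch of the pathlib convenience methods mentioned in this comment. The directory and file names are made up, and a temporary directory is used so the example cleans up after itself.

```python
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Paths are built with the / operator rather than string concatenation.
    note = Path(tmp) / "notes" / "todo.txt"
    note.parent.mkdir(parents=True)

    # Text goes in and out with an explicit encoding...
    note.write_text("hello", encoding="utf-8")
    text = note.read_text(encoding="utf-8")

    # ...or the contents can be read as raw bytes.
    raw = note.read_bytes()

    # os.fspath() gives the str form that older APIs expect.
    as_str = os.fspath(note)
    name = note.name

print(text, raw, name)
```

Note that `Path` objects here still wrap a `str` underneath, so this is a nicer interface rather than the fully opaque type the entry argues for.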

Written on 22 December 2019.
Last modified: Sun Dec 22 01:46:30 2019