Wandering Thoughts archives

2018-12-15

Python 3's approach to filenames and arguments is pragmatically right

A while back I read John Goerzen's The Python Unicode Mess, which decries the Python 3 mess of dealing with filenames and command line arguments on Unix that are not encoded in the program's assumed encoding. As Goerzen notes:

So if you want to actually handle Unix filenames properly in Python, you:

  • Must have a processing path that fully avoids Python strings.
  • Must use sys.{stdin,stdout}.buffer instead of just sys.stdin/stdout
  • Must supply filenames as bytes to various functions. See PEP 0471 for this comment: “Like the other functions in the os module, scandir() accepts either a bytes or str object for the path parameter, and returns the DirEntry.name and DirEntry.path attributes with the same type as path. However, it is strongly recommended to use the str type, as this ensures cross-platform support for Unicode filenames. (On Windows, bytes filenames have been deprecated since Python 3.3).” So if you want to be cross-platform, it’s even worse, because you can’t use str on Unix nor bytes on Windows.
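As a rough illustration of what that bytes-only path looks like in practice, here is a minimal sketch. The helper name `write_names` is hypothetical; the only library behaviour it relies on is that os.scandir() returns bytes names when given a bytes path and that sys.stdout.buffer accepts bytes.

```python
import os
import sys


def write_names(dirpath: bytes, out=None) -> int:
    """Write each entry name in dirpath to a binary stream, one per line.

    Hypothetical helper for illustration. Returns the number of names written.
    """
    if out is None:
        out = sys.stdout.buffer  # the binary layer underneath sys.stdout
    count = 0
    # Passing a bytes path makes os.scandir() yield bytes names,
    # so no filename is ever decoded to str along the way.
    with os.scandir(dirpath) as entries:
        for entry in sorted(entries, key=lambda e: e.name):
            out.write(entry.name + b"\n")
            count += 1
    return count


if __name__ == "__main__":
    write_names(os.fsencode(sys.argv[1]) if len(sys.argv) > 1 else b".")
```

Nothing here ever touches a str filename, which is the point of the list above; it is also why this style is awkward to mix with code that expects text.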

Back in the days when it was new, Python 3 used to be very determined that Unix was Unicode/UTF-8. Years ago this was a big reason that I said you should avoid it from the perspective of a Unix sysadmin. These days things are better; we have things like os.environb and a relatively well defined way of handling sys.argv. This ultimately comes from PEP 383, which gave us the 'surrogateescape' error handler (see the codecs module).
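To make the 'surrogateescape' behaviour concrete, here is a small sketch. The helper names are made up for illustration, and it spells out UTF-8 explicitly instead of relying on whatever the locale's filesystem encoding happens to be. Each undecodable byte is mapped to a lone surrogate in the range U+DC80 to U+DCFF, and encoding with the same handler turns it back into the original byte.

```python
import os

RAW = b"caf\xe9.txt"  # Latin-1 encoded name; the \xe9 byte is not valid UTF-8


def to_str(raw: bytes) -> str:
    """Decode the way Python 3 does for Unix filenames, assuming UTF-8."""
    return raw.decode("utf-8", "surrogateescape")


def to_bytes(name: str) -> bytes:
    """Reverse to_str(), restoring any escaped bytes exactly."""
    return name.encode("utf-8", "surrogateescape")


def environ_bytes(key: bytes):
    """Raw environment value through os.environb, or None if unavailable.

    os.environb only exists where os.supports_bytes_environ is true,
    which in practice means Unix.
    """
    if os.supports_bytes_environ:
        return os.environb.get(key)
    return None


if __name__ == "__main__":
    name = to_str(RAW)
    print(ascii(name))            # the bad byte appears as a \udce9 escape
    print(to_bytes(name) == RAW)  # the round trip is lossless
```

The resulting str is not something you can encode strictly: calling `.encode("utf-8")` on it without the error handler raises UnicodeEncodeError, which is how these names tend to surface when a program tries to print them.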

All of this is irritating and unpleasant. Unfortunately, it's also the pragmatically right answer for reasons that PEP 383 alludes to, although it doesn't describe them the way that I would. PEP 383 says:

On the other hand, Microsoft Windows NT has corrected the original design limitation of Unix, and made it explicit in its system interfaces that these data (file names, environment variables, command line arguments) are indeed character data, by providing a Unicode-based API [...]

Let me translate this: filenames, command line arguments, and so on are no longer portable abstractions. They fundamentally mean different things on Unix and on Windows. On Windows, they are 'Unicode' (actually UTF-16) and may include characters not representable as single bytes, while on Unix they are and remain bytes and may include any byte value or sequence except 0 (and, within a single filename component, '/'). These are two incompatible types, especially once people start encoding non-ASCII filenames or command line arguments on Unix and want their programs to understand the decoded forms in Unicode.

(Or, if you prefer to flip this around, when people start using non-ASCII filenames and command line arguments and so on on Windows and want their programs to understand those as Unicode strings and characters.)
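A tiny sketch of why these really are different types: the same four-character name has several byte representations, and a Unix kernel sees each one as a different filename. The `byte_views` function is a hypothetical name chosen for this example.

```python
NAME = "caf\u00e9"  # 'cafe' with a precomposed e-acute as the last character


def byte_views(name: str) -> dict:
    """Return some byte sequences that could all stand for the same text."""
    return {
        "utf-8": name.encode("utf-8"),
        "latin-1": name.encode("latin-1"),
        "utf-16-le": name.encode("utf-16-le"),  # the form Windows APIs use
    }


if __name__ == "__main__":
    for encoding, raw in byte_views(NAME).items():
        print(encoding, raw)
```

Note that the UTF-16 form contains zero bytes even for the plain ASCII letters, so it cannot be used directly as a Unix filename, while the UTF-8 and Latin-1 forms are both perfectly legal Unix names that happen to disagree about what the last character is.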

This is a hard problem and modern Python 3 has made the pragmatic choice that it's not going to pretend that things are portable when they aren't (early Python 3 tried to some extent and that blew up in its face). If you are working in the happy path on Unix where you're dealing with properly encoded data, you can ignore this by letting Python 3 automatically decode things to Unicode strs; otherwise, you must work with the raw (Unix) values, and Python 3 will provide them if you ask (and will surface at least some of them by default).
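One way to stay on the happy path where you can and drop to raw values only where you must is sketched below. All three helper names are invented for illustration. The check relies on surrogateescape mapping undecodable bytes to U+DC80 through U+DCFF, and `raw_value` assumes a Unix system, where os.fsencode() reverses the decoding Python applied.

```python
import os
import sys


def is_cleanly_decoded(name: str) -> bool:
    """True if name contains no surrogate-escaped bytes."""
    return not any(0xDC80 <= ord(ch) <= 0xDCFF for ch in name)


def for_display(name: str) -> str:
    """A form that is safe to print; escaped bytes become backslash escapes."""
    return name.encode("utf-8", "backslashreplace").decode("utf-8")


def raw_value(name: str) -> bytes:
    """Recover the original bytes behind a str that Python decoded (Unix)."""
    return os.fsencode(name)


if __name__ == "__main__":
    for arg in sys.argv[1:]:
        if is_cleanly_decoded(arg):
            print("text:", arg)
        else:
            print("raw: ", for_display(arg), raw_value(arg))
```

Running this with a mix of well-encoded and mis-encoded arguments shows both branches; only the second one needs to care about bytes at all.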

(There are other possible answers but I think that they're all worse than Python 3's current ones for reasons beyond the scope of this entry. For instance, I think that having os.listdir() return a different type on Windows than on Unix would be a bad mistake.)

I'll note that Python 2 is not magically better than Python 3 here. It's just that Python 2 chose to implicitly prioritize Unix over Windows by deciding that filenames, command line arguments, and so on were bytestrings instead of Unicode strings. I rather suspect that this caused Windows people using Python a certain amount of heartburn; we probably just didn't hear as much from them for various reasons.

(You can argue about whether or not Python 3 should have made Unicode the fundamental string type, but that decision was never a pragmatic one and it was made by Python developers very early on. Arguably it's the single decision that created 'Python 3' instead of an ongoing evolution of Python 2.)

PS: This probably counts as me partially or completely changing my mind about things I've said in the past. So be it; time changes us all, and I certainly have different and more positive views on Python 3 now.

python/Python3PragmaticFilenames written at 00:58:48

