How to help programmers (part 1): the os.listdir() problem

December 7, 2008

How to help (Unix) programmers: silently omit data under certain circumstances. From the Python 3000 release notes:

Note that when os.listdir() returns a list of strings, filenames that cannot be decoded properly are omitted rather than raising UnicodeError.

(Background: in Python 3k, os.listdir() is normally called with a directory name that is a str, i.e. a Unicode string.)

Yes, os.listdir() had problems, but this is not a solution; this is making the problems worse. Before, at least you found out if you had a problem in this area. Now you will get mysterious reports that your program doesn't process all of the files that are there, on some platforms.

What this suggests is that os.listdir() is actually not a portable interface. On Unix it fundamentally deals with byte-strings, and attempts to paper over that cause explosions; on other platforms, in at least some circumstances, it fundamentally deals with Unicode strings, and you get the same explosions in the other direction. Hiding the explosions doesn't make them go away; it just makes the problem harder to diagnose.

(Of course, the problem is worse than just os.listdir(); all things that take or return filenames on Unix fundamentally deal with byte-strings, not Unicode strings.)
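To make the byte-string point concrete, here is a minimal sketch of a third option besides raising an error or silently dropping names: list the directory in bytes mode and report which names do not decode. The helper names split_decodable() and list_directory() are hypothetical, made up for this illustration, and using the filesystem encoding as the decoding target is an assumption.

```python
import os
import sys


def split_decodable(raw_names, encoding):
    """Partition byte-string filenames into (decoded, undecodable).

    Hypothetical helper: nothing is dropped and nothing raises; names that
    fail to decode are handed back as raw bytes.
    """
    decoded, undecodable = [], []
    for raw in raw_names:
        try:
            decoded.append(raw.decode(encoding))
        except UnicodeDecodeError:
            # Keep the raw bytes so the caller can report or process them.
            undecodable.append(raw)
    return decoded, undecodable


def list_directory(path="."):
    """List path in bytes mode, then decode whatever can be decoded."""
    # A bytes argument makes os.listdir() return bytes filenames.
    raw_names = os.listdir(os.fsencode(path))
    # Assumption: the filesystem encoding is the one we want to decode with.
    return split_decodable(raw_names, sys.getfilesystemencoding())
```

A caller can then decide what to do with the second list (log it, process the files by their byte names, or fail loudly) instead of never learning that those files exist.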

Comments on this page:

From at 2008-12-08 09:40:29:

Yes Chris -- yes. Making a problem worse is no way to move a language forward.

-- joe

From at 2008-12-08 14:37:41:

"os.listdir() returns a list of bytes instances if the argument is a bytes instance" (from the same bullet point on the python what's new document you linked)

So while it's not perfectly portable it's not like it's a problem without a solution. In fact, one could argue that the bytes solution is the portable one as it should work everywhere.

By cks at 2008-12-08 17:41:33:

The byte-string solution cannot work everywhere either, because there are systems that genuinely require Unicode for os.listdir() et al, and thus require you to .decode() your byte-strings before calling them, with all the possibilities for failures and problems that that implies.

(I believe that this implies that on such systems, os.listdir() also intrinsically returns Unicode. If you require os.listdir() to return byte-strings, you're back to the encoding problem again.)

From at 2008-12-10 23:47:06:

People have talked about a similar problem with os.environ and CGI -- specifically os.environ['PATH_INFO'] is in the browser's encoding, not the "system" encoding. So if you have a CGI script running UTF8 and a system with Latin1, you'll get corrupted data from os.environ.

This would not be so bad if there were a way to get bytes from os.environ -- then CGI scripts would use that, and other kinds of scripts would use os.environ (or maybe a little of both, depending on the information being passed around).

Generally there seems to be a problem that people are mistaking bytes for text. environ['REQUEST_METHOD'] is a byte value. There are no unicode HTTP methods. But it looks like text, so people mistakenly think it should be unicode in Python 3. -- Ian Bicking
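For what it's worth, later Python 3 releases (3.2 and up) did add a bytes view of the environment, os.environb, available wherever os.supports_bytes_environ is true. A minimal sketch of how a CGI script might use it follows; the helper name cgi_path_info() is hypothetical, and the UTF-8 default is an assumption about what the application expects, not something the environment tells you.

```python
import os


def cgi_path_info(environb, encoding="utf-8"):
    """Decode PATH_INFO from a bytes-keyed environment mapping.

    Hypothetical helper. The encoding is chosen by the application; the
    system's locale encoding plays no part here.
    """
    raw = environb.get(b"PATH_INFO", b"")
    return raw.decode(encoding)


if os.supports_bytes_environ:
    # os.environb only exists where the native environment is bytes (e.g. Unix).
    print(repr(cgi_path_info(os.environb)))
```

Because the function takes the mapping as an argument, it can be exercised with a plain dict such as {b"PATH_INFO": b"/some/path"} on platforms that lack os.environb.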

Written on 07 December 2008.

Last modified: Sun Dec 7 18:46:50 2008