Wandering Thoughts archives

2012-07-29

The periodic strangeness of idiomatic Python

Suppose that you want to do something N times, for whatever reason. In C, the straightforward and idiomatic way to do this is a for loop: 'for (i = 0; i < times; i++) { .... }'. Since Python doesn't have this form of for loop, the closest Python equivalent is a while loop. However, many people would probably say that this isn't idiomatic Python. What I think of as the idiomatic Python way to do 'do something N times' is:

for _ in range(0, times):
  ....

(Some people will use xrange() instead of range() here.)
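
For comparison, the while loop version mentioned above looks something like this (a sketch in the same style as the snippet above, with '....' standing in for the actual work):

i = 0
while i < times:
  ....
  i += 1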

The for/range() version is certainly what instantly popped into my head when I ran into this situation recently, and at first I didn't think any more of it. But once I actually began looking at it, it started seeming stranger and stranger, less like a clear language idiom and much more like a convention. Let me run down a number of the ways that this is strange:

  • It's a rather indirect way of expressing 'do something N times'. The C for loop is pretty direct by contrast.

    (With that said, I'm not sure a while loop would be that much more direct. The directness advantage that C has is that all parts of the for loop's control are there in one chunk; a while loop spreads them out across three different lines.)

  • We're doing things in this odd way partly to use as many builtins as possible, often in the name of (nominal) efficiency. Yes, this avoids a couple of extra lines to initialize and increment an otherwise unused counter, but I don't think that really makes it clearer.
  • In the pursuit of this idiom we're creating a list or at least an iterator and walking it, throwing away the result. In many languages this would be wince-inducingly inefficient (or at least much worse than basic integer arithmetic with a variable). It's a (probable) win in CPython because of the whole builtins vs non-builtins issue.

    (Not only is range() a builtin, but for with iterators has direct bytecode support.)

  • You pretty much need to know this idiom in order to understand this code without a bunch of thought (which is not the case for the C version). A particular tricky point is the use of `_' as a variable name to indicate 'I don't care about this variable, I just have to have something here'; this is entirely a convention in (some) Python programming circles, with no special meaning in the language itself.

    (As a corollary, I doubt that this is an idiom that would naturally occur to people who are not already immersed in Python.)

  • When using this idiom you'd better remember the exact effects of range()/xrange(), since e.g. 'range(1, times)' is very much not what you want.

    (Again the C equivalent has this clearly visible. There's a quick sketch of both this boundary issue and the earlier efficiency question just after this list.)
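
Here is that sketch. It's only an illustration: the value of 'times' is arbitrary, the helper function names are made up, and the exact timing numbers will vary from machine to machine.

import timeit

times = 5

# range(0, times) gives exactly 'times' iterations; range(1, times) is one short.
print(len(list(range(0, times))))   # 5
print(len(list(range(1, times))))   # 4

# A rough way to compare the for/range() idiom against an explicit counter.
def with_range():
  for _ in range(0, times):
    pass

def with_while():
  i = 0
  while i < times:
    i += 1

print(timeit.timeit(with_range, number=1000000))
print(timeit.timeit(with_while, number=1000000))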

The overall summary of this is that the Python idiom really is close to being an idiom in the literal sense of the word: it is an expression whose meaning is not clearly and immediately understandable from a quick read of its component parts. By contrast, the C version is much clearer (at least for me).

(I don't think that all of this makes the Python idiom bad; it remains the most compact and probably the most efficient way of expressing this. And even without knowing this idiom off the top of your head I think it's reasonably clear roughly what it does (and it's reasonably easy to work out all of the details).)

IdiomStrangeness written at 01:21:56

2012-07-18

Strings in Python 2 and Python 3

This started life as a reply to a comment on my entry about my issues with Unicode in Python 3 but grew, so I'm making it into an entry of its own. A commentator wrote:

    If you want to read a sequence of bytes -- from, say, a file -- you can do that in Python 3. You just have to explicitly ask for it, and the datatype you get back will not be str. It shouldn't be! A str is meant to represent an abstract sequence of characters, and bytes are not that.

I disagree with this view of str and strings. What strings represent is a (subjective) language design decision, not the universal answer presented here. Python 3 chooses to say that strings and str should represent Unicode code points, while Python 2 and plenty of other languages have decided that they represent raw bytes. Neither is right or wrong, although the second is both less abstract and far more common.

(Note that Unicode code points are both more and less than abstract characters; the two are definitely not the same thing.)

    What Python 2 used to do was read in sequences of bytes and decode them for you, assuming they were ASCII-encoded. That led to oodles of problems where people would write code that worked fine until they received a non-ASCII character, and then crash horribly.

This is not what Python 2 did at all; if anything, it's more a description of how Python 3 works, since Python 3 really wants to automatically decode things to Unicode the moment your program looks at them. Both Python 2 and Python 3 use your locale's encoding as the default character encoding, not ASCII.

(ASCII comes into it because people operating in the C locale get ASCII as their character encoding, at least in CPython, and you wind up in this locale if your locale information is unset.)

The general difference between Python 2 and Python 3 comes down to two things. First, Python 3's interfaces normally return Unicode strings while Python 2's interfaces normally return (byte) strings; for example, if you do .read() from a normally opened file you get back a byte string in Python 2 and a Unicode string in Python 3. Second, Python 2 will try to convert byte strings to Unicode strings if you do something that combines the two, while Python 3 will not (you'll get various error messages about being unable to mix bytes and str). Note that both Python 2 and Python 3 will try to convert back and forth between Unicode and bytes if you're interacting with the outside world using Unicode. If anything, Python 3 does more automatic conversions here because more of its interfaces with the outside world default to using Unicode.

(This means that quite a lot of operations can raise UnicodeDecodeError in Python 3, which has consequences for any code that believes it's handling all file IO errors by catching EnvironmentError.)
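
As a small illustration of both points, here's a Python 3 sketch (the string values are made up):

raw = b"some bytes"    # the sort of thing .read() returns from a binary-mode file
text = "some text"     # the sort of thing .read() returns from a text-mode file

try:
  raw + text           # Python 3 refuses to combine bytes and str
except TypeError as e:
  print(e)             # the error message complains about mixing str and bytes

# Decoding failures are ValueErrors, so catching EnvironmentError around
# file IO will not catch them.
print(issubclass(UnicodeDecodeError, ValueError))        # True
print(issubclass(UnicodeDecodeError, EnvironmentError))  # False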

Python 2 code works fine with random non-ASCII characters if you don't ever try to convert things to Unicode (I have plenty of code like this). What trips people up is mixing Unicode and non-Unicode strings because then you have bytestrings being decoded to Unicode at random times where you didn't realize it (and so didn't catch decoding errors).
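
To illustrate the implicit decoding, here's a sketch of the Python 2 behaviour (the values are invented):

label = u"name: "    # a Unicode string from somewhere in your program
data = "caf\xe9"     # a byte string in some unknown encoding

# Python 2 silently tries to decode 'data' using the default encoding in order
# to do the concatenation, and raises UnicodeDecodeError right here, possibly
# far away from where 'data' was originally read in.
result = label + data

(In Python 3 these two literals are both str, so this particular form of the problem simply can't happen.)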

Python 3 solves this problem by force majeure, in that it no longer does these automatic up-conversions. If it had been content to stop there things would be fine; instead, it decided to also add a lot more automatic conversions (for various reasons). These automatic conversions are just as problematic as before but have the minor improvement that they now occur mostly at the boundaries of your program instead of at random points throughout it.

    In other words, the failure points were still there in Python 2. They were just implicitly called instead of explicitly.

As should now be clear, I strongly disagree with this. It takes a significant amount of effort to use Python 3 without implicit failure points (and doing so is relatively unnatural), while it's easy to use Python 2 without them.

StringsPython2And3 written at 02:02:30

2012-07-16

My arrogance about Unicode and character encodings

Yesterday I described how I could get away with ignoring encoding issues and thus how forced Unicode was and is irritating. However, there is a gotcha in my approach, one that hides behind a bit of arrogance. Let me repeat the core bit of how my programs typically work:

    What they process is a mixture of ASCII (for keywords, directives, and so on, all of the things the program had to interpret) and uninterpreted bytestrings, which are simply regurgitated to the user as-is in appropriate situations.

This simple, reasonable description contains an assumption: that any encoding will be a superset of ASCII, because the approach relies on code being able to extract plain ASCII text from a file without knowing the file's encoding. This works if and only if the file's actual encoding is implemented as ASCII plus other stuff hiding around the edges, which is true for many encodings (including UTF-8) but not for all of them.
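
A quick way to see why this matters (a sketch; the text is arbitrary): an ASCII keyword can be found as raw bytes in UTF-8 encoded text, because UTF-8 is an ASCII superset, but not in UTF-16 encoded text.

text = u"keyword value"

print(b"keyword" in text.encode("utf-8"))    # True
print(b"keyword" in text.encode("utf-16"))   # False: the ASCII bytes are interleaved with NULs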

This is the arrogance of my blithe approach to ignoring character encoding issues. It assumes either that all character sets are a superset of ASCII or that any exceptions are sufficiently uncommon that I don't have to care about them. Of course, by assuming that my programs will never be used by people with such character sets, I've ensured that they never will be.

The conclusion that I draw from this is that I can't ignore character encoding unless I'm willing to be somewhat arrogant. The pain of dealing with decoding and encoding issues is simply the price of not being arrogant.

(On the other hand it's still very tempting to be arrogant this way, for reasons that boil down to 'I can get away with it because the environments where it matters are probably quite rare, and it's much easier'.)

UnicodeArrogance written at 01:16:11

2012-07-15

My general issue with Unicode in Python 3

I've written a number of things that amount to grumbles that Python 3 forces mandatory Unicode handling down people's throats where before I didn't need to deal with encoding issues. To start with it's worth explaining why I could say this with a straight face.

I could get away with this ignorance because my programs almost invariably work in a particular way. What they process is a mixture of ASCII (for keywords, directives, and so on, all of the things the program had to interpret) and uninterpreted bytestrings, which are simply regurgitated to the user as-is in appropriate situations. Since the bytestrings are simply repeated verbatim (without any alteration), my code doesn't need to know or care what encoding they're in; in fact, attempting to decode the bytestrings to Unicode and then re-encode them for output introduces two new failure points.

(Related to this is the pattern where a Unix program doesn't care what encoding its command line arguments are in because once again they're being used as uninterpreted bytestrings for, e.g., filenames. Forcing the program or runtime environment to decode these to Unicode then adds heartburn and a potential failure point for no gain.)
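
As a concrete sketch of this pattern (everything here is invented: io.BytesIO stands in for a real configuration file and 'print' is a made-up directive), written for Python 3 while staying at the bytes level throughout:

import io
import sys

conf = io.BytesIO(b"print caf\xe9 au lait\n")   # stand-in for a config file

for line in conf:
  directive, sep, rest = line.partition(b" ")
  if directive == b"print":
    # 'rest' may be in any encoding; it's never decoded, just passed through.
    sys.stdout.buffer.write(rest)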

The writing and attitudes around early versions of Python 3 made it clear that you weren't supposed to do this any more. The new Pythonic way to operate on 'strings' was to decode them to Unicode immediately and then work in Unicode, re-encoding on output. Where operating on plain un-decoded strings was even possible it was made at least somewhat annoying, partly by limiting how much you could do with such strings. Current versions of Python 3 seem to have relaxed a little bit but all sorts of things still push you in the 'decode immediately' direction.

All of this change in focus to 'decode immediately' was (and for that matter still is) irritating to me. If you decode, you have to deal with encoding issues, which means that now my programs could blow up parsing their configuration files or the like. This struck me as a lot like the experience of strict error handling in XHTML (where if anything anywhere went wrong you got nothing).

(Forced decoding turns out to cause all sorts of bad problems on Unix, because fundamental Unix interfaces are byte-string interfaces.)
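
For example (a sketch), Python 3 will hand you filenames as uninterpreted bytes if you ask for them at the bytes level, precisely because the underlying Unix interface deals in bytes:

import os

# Passing bytes to os.listdir() gets bytes back; nothing is decoded, so
# filenames that aren't valid in your locale's encoding can't cause errors here.
for name in os.listdir(b"."):
  print(name)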

What I've written here may sound seductive and reasonable, but there's a gotcha to it, a non-obvious bit of arrogance and blindness that I've only slowly woken up to recently. Since this entry is already long enough, that's a topic for the next entry.

Sidebar: about encoding mistakes in those uninterpreted bytestrings

My view of potential encoding mistakes in those uninterpreted bytestrings is a pragmatic one: if I'm just repeating them verbatim, any garbage output that results is not my problem and not my fault. My program is simply doing what you told it to do.

Python3UnicodeIssue written at 02:19:09

