Wandering Thoughts archives


An example of a subtle over-broad try in Python

Today I wrote some code to winnow a list of users to 'real' users with live home directories that looks roughly like the following:

for uname, hdir in userlist:
      st = os.stat(hdir)
      if not stat.S_ISDIR(st.st_mode) or \
         stat.S_IMODE(st.st_mode) == 0:
      # looks good:
      print uname
   except EnvironmentError:
      # accept missing homedir; might be a
      # temporarily missing NFS mount, we
      # can't tell.
      print uname

This code has a relatively subtle flaw because I've accidentally written an over-broad exception catcher here.

As suggested by the comment, when I wrote this code I intended the try block to catch the case where the os.stat failed. The flaw here is that print itself does IO (of course) and so can raise an IO exception. Since I have the print inside my try block, a print-raised IO exception will get caught by it too. You might think that this is harmless because the except will re-do the print and thus presumably immediately have the exception raised again. This contains two assumptions: that the exception will be raised again and that if it isn't, the output is in a good state (as opposed to, say, having written only partial output before an error happened). Neither are entirely sure things and anyways, we shouldn't be relying on this sort of thing when it's really easy to fix. Since both branches of the exception end up at the same print, all we have to do is move it outside the try: block entirely (the except case then becomes just 'pass').

(My view is that print failing is unusual enough that I'm willing to have the program die with a stack backtrace, partly because this is an internal tool. If that's not okay you'd need to put the print in its own try block and then do something if it failed, or have an overall try block around the entire operation to catch otherwise unexpected EnvironmentError exceptions.)

The root cause here is that I wasn't thinking of print as something that does IO that can throw exceptions. Basic printing is sufficiently magical that it feels different and more ordinary, so it's easy to forget that this is a possibility. It's especially easy to overlook because it's extremely uncommon for print to fail in most situations (although there are exceptions, especially in Python 3). You can also attribute this to a failure to minimize what's done inside try blocks to only things that absolutely have to be there, as opposed to things that are just kind of convenient for the flow of code.

As a side note, one of the things that led to this particular case is that I changed my mind about what should happen when the os.stat() failed because I realized that failure might have legitimate causes instead of being a sign of significant problems with an account that should cause it to be skipped. When I changed my mind I just did a quick change to what the except block did instead of totally revising the overall code, partly because this is a small quick program instead of a big system.

python/SubtleBroadTry written at 22:34:55; Add Comment

The potential issue with Go's strings

As I mentioned back in Things I like about Go, one of the Go things that I really like is its strings (and slices in general). From the perspective of a Python programmer, what makes them great is that creating strings is cheap because they often don't require a copy. In Python, any time you touch a string you're copying some or all of it and this can easily have a real performance impact. Writing performant Python code requires considering this carefully. In Go, pretty much any string operation that just takes a subset of the string (eg trimming whitespace from the front and the end) is copy-free, so you can throw around string operations much more freely. This can make a straightforward algorithm both the right solution to your problem and pretty efficient.

(Not all interesting string operations are copy-free, of course. For example, converting a string to all upper case requires a copy, although Go's implementation is clever enough to avoid this if the string doesn't change, eg because it's already all in upper case.)

But this goodness necessarily comes with a potential badness, which is that those free substrings keep the entire original string alive in memory. What makes Go strings (and slices) so cheap is that they are just references to some chunk of underlying storage (the real data for the string or the underlying array for a slice); making a new string is just creating a new reference. But Go doesn't (currently) do partial garbage collection of string data or arrays, so if even one tiny bit of it is referred to somewhere the entire object must be retained. In other words, a string that's a single character is (currently) enough to keep a big string from being garbage collected.

This is not an issue that many people will run into, of course. To hit it you need to either be dealing with very big original strings or care a lot about memory usage (or both) and on top of that you have to create persistent small substrings of the non-persistent original strings (well, what you want to be non-persistent). Many usage patterns won't hit this; your original strings are not large, your subsets cover most of the original string anyways (for example if you break it up into words), or even the substrings don't live very long. In short, if you're an ordinary Go programmer you can ignore this. The people who care are handling big strings and keeping small chunks of them for a long time.

(This is the kind of thing that I notice because I once spent a lot of effort to make a Python program use as little memory as possible even though it was parsing and storing chunks out of a big configuration file. This made me extra-conscious about things like string lifetimes, single-copy interned strings, and so on. Then I wrote a parser in Go, which made me consider all of these issues all over again and caused me to realize that the big string representing my entire input file was going to be kept in memory due to the bits of it that my parser was clipping out and keeping.)

By the way, I think that this is the right tradeoff for Go to make. Most people using strings will never run into this, while it's very useful that substrings are cheap. And this sort of cheap substrings also makes less work for the garbage collector; instead of a churn of variable length strings when code is using a lot of substrings (as happens in Python), you just have a churn of fixed-size string references.

Of course there's the obvious fix if your code starts running into this: create a function that 'minimizes' a string by turning it into a []byte and then back. This creates a minimized string at the cost of an extra copy over the theoretical ideal implementation and can be trivially done in Go today.

Sidebar: How strings.ToUpper() et al avoid unnecessary copies

All of the active transformation functions like ToUpper() and ToTitle() are implemented using strings.Map() and functions from the unicode package. Map() is smart enough to not start making a new string until the mapping function returns a different rune than the existing one. As a result, any similar direct use of Map() that your code has will get this behavior for free.

programming/GoStringsMemoryHolding written at 00:45:35; Add Comment

Page tools: See As Normal.
Login: Password:
Atom Syndication: Recent Pages, Recent Comments.

This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.