Wandering Thoughts archives

2014-08-18

An example of a subtle over-broad try in Python

Today I wrote some code to winnow a list of users to 'real' users with live home directories that looks roughly like the following:

import os
import stat

for uname, hdir in userlist:
    try:
        st = os.stat(hdir)
        if not stat.S_ISDIR(st.st_mode) or \
           stat.S_IMODE(st.st_mode) == 0:
            continue
        # looks good:
        print uname
    except EnvironmentError:
        # accept missing homedir; might be a
        # temporarily missing NFS mount, we
        # can't tell.
        print uname

This code has a relatively subtle flaw because I've accidentally written an over-broad exception catcher here.

As suggested by the comment, when I wrote this code I intended the try block to catch the case where the os.stat() failed. The flaw is that print itself does IO (of course) and so can raise an IO exception. Since I have the print inside my try block, a print-raised IO exception will get caught by it too. You might think this is harmless because the except will re-do the print and thus presumably immediately have the exception raised again. That reasoning contains two assumptions: that the exception will be raised again, and that if it isn't, the output is in a good state (as opposed to, say, having written only partial output before the error happened). Neither is an entirely sure thing, and anyway we shouldn't be relying on this sort of thing when it's really easy to fix. Since both branches of the exception end up at the same print, all we have to do is move it outside the try: block entirely (the except case then becomes just 'pass').
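Concretely, the rearranged loop looks something like the following. This is a sketch, shown as a standalone function in Python 3 syntax; real_users and the (username, homedir) pair interface are made up here for illustration:

```python
import os
import stat

def real_users(userlist):
    # userlist is a sequence of (username, homedir) pairs.
    for uname, hdir in userlist:
        try:
            st = os.stat(hdir)
            # skip accounts whose 'home directory' is not actually a
            # directory or has an all-zero permission mode.
            if not stat.S_ISDIR(st.st_mode) or \
               stat.S_IMODE(st.st_mode) == 0:
                continue
        except EnvironmentError:
            # accept a missing homedir; it might be a temporarily
            # missing NFS mount, we can't tell.
            pass
        # both the 'stat succeeded and looks good' case and the
        # 'stat failed' case end up here, safely outside the try block.
        print(uname)
```

Now a failing print dies with an ordinary uncaught exception instead of being silently swallowed (and retried) by the except clause.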

(My view is that print failing is unusual enough that I'm willing to have the program die with a stack backtrace, partly because this is an internal tool. If that's not okay you'd need to put the print in its own try block and then do something if it failed, or have an overall try block around the entire operation to catch otherwise unexpected EnvironmentError exceptions.)

The root cause here is that I wasn't thinking of print as something that does IO that can throw exceptions. Basic printing is sufficiently magical that it feels different and more ordinary, so it's easy to forget that this is a possibility. It's especially easy to overlook because it's extremely uncommon for print to fail in most situations (although there are exceptions, especially in Python 3). You can also attribute this to a failure to minimize what's done inside try blocks to only things that absolutely have to be there, as opposed to things that are just kind of convenient for the flow of code.

As a side note, one of the things that led to this particular case is that I changed my mind about what should happen when the os.stat() failed because I realized that failure might have legitimate causes instead of being a sign of significant problems with an account that should cause it to be skipped. When I changed my mind I just did a quick change to what the except block did instead of totally revising the overall code, partly because this is a small quick program instead of a big system.

SubtleBroadTry written at 22:34:55

2014-07-09

What the differences are between Python bools and ints

I mentioned in the previous entry that Python's bool class is actually a subclass of int (and the bool docstring will tell you this if you bother to read it with help() before, say, diving into the CPython source code like a system programmer). Since I was just looking at this, I might as well write down the low-level differences between ints and bools. Bools have:

  • a custom __repr__ that reports True or False instead of the numeric value; this is also used as the custom __str__ for bool.

    (The code is careful to intern these strings so that no matter how many times you repr() or str() a boolean, only one copy of the literal 'True' or 'False' string will exist.)

  • a __new__ that returns either the global True object or the global False object depending on the truth value of what it's given.

  • custom functions for &, |, and ^ that implement boolean algebra instead of the standard bitwise operations if both arguments are either True or False. Note that eg 'True & 1' results in a bitwise operation and an int object, even though 1 compares equal to True.

That's it.
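All of these differences are easy to check interactively; here's a quick sketch (in Python 3 syntax, though Python 2 behaves the same way):

```python
# bool is a subclass of int with a custom repr/str:
assert isinstance(True, int)
assert repr(True) == "True" and str(False) == "False"

# bool's __new__ always hands back one of the two global objects:
assert bool(5) is True and bool("") is False

# &, | and ^ do boolean algebra when both operands are bools...
assert (True & False) is False
assert (True | False) is True
assert (True ^ True) is False

# ...but fall back to ordinary bitwise int operations otherwise,
# even though 1 == True:
assert (True & 1) == 1
assert type(True & 1) is int
```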

I'm not quite sure how bool blocks being subclassed and I'm not curious enough right now to work it out.
Update: see the comments for the explanation.

The global True and False objects are of course distinct from what is in effect the global 0 and 1 objects that are all but certain to exist. This means that their id() is different (at least in CPython), since the id() is the memory address of their C-level object struct.

(In modern versions of both CPython 2 and CPython 3 it turns out that global 0 and 1 objects are guaranteed to exist, because 'small integers' from -5 through 256 are actually preallocated as the interpreter is initializing itself.)

BoolVsInt written at 00:27:09

2014-07-08

Exploring a surprise with equality in Python

Taken from @hackedy's tweet, here's an interesting Python surprise:

>>> {1: "one", True: "two"}
{1: 'two'}
>>> {0: "one", False: "two"}
{0: 'two'}

There are two things happening here to create this surprise. The starting point is this:

>>> print hash(1), hash(True)
1 1

At one level, Python has made True have the same hash value as 1. Actually that's not quite right, so let me show you the real story:

>>> isinstance(True, int)
True

Python has literally made bool, the type that True and False are instances of, be a subclass of int. They do not merely look like numbers; they are numbers. As numbers their hash identity is their literal value of 1 or 0, and of course they also compare equal to literal 1 or 0. Since they hash to the same identity and compare equal, we run into the issue with 'custom' equalities and hashes in dictionaries where Python considers the two different objects to be the same key and everything gets confusing.

(That True and False hash to the same thing as 1 and 0 is probably not a deliberate choice. The internal bool type doesn't have a custom hash function; it just tacitly inherits the hash function of its parent, ie int. I believe that Python could change this if it wanted to, which would make the surprise here go away.)

The other thing is what happens when you create a dictionary with literal syntax, which is that Python generates bytecode that stores each initial value into the dictionary one after another in the order that you wrote them. It turns out that when you do a redundant store into a dictionary (ie you store something for a key that already exists), Python only replaces the value, not both the key and the value. This is why the result is not '{True: 'two'}'; only the value got overwritten in the second store.

(This decision is a sensible one because it may avoid object churn and the overhead associated with it. If Python replaced the key as well it would at least be changing more reference counts on the key objects. And under normal circumstances you're never going to notice the difference unless you're going out of your way to look.)
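Both halves of this can be checked directly (Python 3 syntax shown; Python 2 behaves the same way):

```python
d = {1: "one", True: "two"}
assert d == {1: "two"}

# the value was overwritten by the second store, but the surviving
# key is the originally stored int 1, not the bool True:
key = list(d)[0]
assert key == 1
assert type(key) is int
```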

PS: It turns out that @hackedy beat me to discovering that bools are ints. Also the class documentation for bool says this explicitly (and notes that bool is one of the rare Python classes that can't be subclassed).

EqualityDictSurprise written at 23:57:23

2014-07-05

Another reason to use frameworks like Django

The traditional reason to use web app frameworks like Django is that doing so saves you time and perhaps gives you a more solid and polished result, possibly with useful extra features like Django's admin interface. But it has recently struck me that in many situations there is another interesting reason for using frameworks (or a defence of doing so instead of writing your own code).

Let's start by assuming that your application really needs at least some of the functionality you're using from the framework. For example, perhaps you're using the ORM and database functionality because that's what the framework makes easiest (this is our reason) but you really need the URL routing and HTML form handling and validation. Regardless of whether or not you used a framework, your application needs some code somewhere that does all of this necessary work. With a framework, the code mostly lives in the framework and you call the framework; without a framework, you would have to write your own code for it (and you use it directly). The practical reality is that the code for the functionality your application genuinely needs has to come from somewhere, either from a framework (if you use one) or from your own work and code.

If you write your own code, what are the odds that it will be as well documented and as solid as the code in a framework? Which will likely be easier for a co-worker to pick up later, custom code that you wrote from scratch or code that calls a standard framework in a standard or relatively standard way? If you only need a little bit of functionality and thus only need to write a little bit of code, this can certainly work out. But if you need a lot of functionality, so much that you're duplicating a lot of what a framework does, well, I am not so optimistic, because in effect what you're really doing is creating a custom one-off framework.

This suggests an obvious way to balance out whether or not to use a framework (or from some perspectives, to inflict either a framework or your own collection of code on your co-workers). To maximize the benefits of using a framework you should be writing as little of your own code as possible, talking to the framework in its standard way, and the framework needs to be well documented, because all of this plays to the strengths of the framework over your own code. If the framework is hard to pick up, your code to deal with it is complex, and replacing it would only be a modest amount of custom code, well, the case for your own code is strong.

(I'm not sure this way of thinking has anything to say about the ever popular arguments over minimal frameworks versus big frameworks with 'batteries included' and good PR. A big framework might be worse because it requires you to learn more before you can start using the corner of it you need, or it might be better because you need less custom code to connect various minimal components together. It certainly feels like how much of the framework you need ought to matter, but I'm not sure this intuition is correct.)

FrameworkUsageReason written at 01:31:39

2014-06-23

Python 3 has already succeeded in the long run

Somewhat recently there has been a certain amount of commotion and strong articles about what Python needs to do about the Python 2 versus Python 3 split in order for one or the other or both to grow and be successful. People's prescriptions have been all over the map; I think I've even seen articles advocating discarding Python 3 as a failed experiment and going back to Python 2 development as the way forward. Given that I'm strongly down on Python 3, you might expect me to agree with the people calling for some sort of major course correction in overall Python development (even if I don't necessarily agree with their suggested changes). As it happens, I don't. My view is that Python development has no need to change in order to 'save' Python or to make Python 3 a success.

The reality is that Python 3 has already succeeded, like it or not. It passed critical mass a long time ago, and while it may be slow moving today it's become inevitable in the long run. More and more major packages are being ported to Python 3, more and more coding is being done on Python 3 instead of Python 2, more and more people are advocating for Python 3 and its advantages, Python 3 is in more and more places (even if it's sometimes slow), and in the long run more and more things are going to only support Python 3.

If Python 3 was going to fail we would have already seen that by now. We would have seen things like major Python packages announcing that they would never be porting to Python 3, no matter what; they would maintain and develop Python 2 versions only. Instead we've seen just about the opposite, where almost everyone has some sort of Python 3 porting story going on even if they haven't delivered it yet.

Ultimately all of this has happened because the Python developers have made it clear that it is their way (ie Python 3 as it is today) or the highway. No one has stepped up to present a serious alternative to this (ie to further develop Python 2) and given that Python 3 is still broadly Python and is a reasonably viable option most people are not going to take off for the highway. They may be writing Python 2 code today (and perhaps griping about the situation, as I do) but the odds are quite good that in five or ten years they will have switched to Python 3 for new work just because, well, they might as well, it works okay, it's not that different, and it's better supported and shinier than Python 2.

(Old code may or may not ever get ported from Python 2 to Python 3, but then a lot of code dies out over a span of five or ten years anyways.)

Oh, sure, some people will take off for the highway. Some of them will be vocal and thus disproportionately visible. I may even wind up being one of them, if I decide that Go is really what I want to be coding everything in. But I think that most people are going to stay with Python and that means that most of them will move to Python 3 sooner or later. The question is no longer 'if' but 'when', and that means Python 3 has already won even if it takes ten years for it to be pervasive.

Note that this is very different from whether or not I think that Python development should change. I definitely wish that it would change and I think that relatively modest changes could make a gradual switch to Python 3 noticeably easier. But the Python developers are not listening to me and I think that the current state of affairs demonstrates that they have no need to do so.

PS: this doesn't mean that I plan to switch to writing new code in Python 3 any time soon. Everything I wrote before about that still applies today. I don't expect to start writing Python 3 code until doing so is clearly easier and better than trying to do the same thing in Python 2 (due to nice features in Python 3, better package support, and so on, all of which I expect to happen someday but not any time soon for the work I do).

Python3HasSucceeded written at 21:53:47

2014-06-20

What Python versions I can use (June 2014 edition)

Because I like depressing myself and being harsh on Python 3, and also because Ubuntu 14.04 LTS and Red Hat Enterprise 7 were both released relatively recently, I'm going to inventory what versions of Python (both 2 and 3) are available to me on the various machines that I use and care about. Let's start with the two recently released major Linux distributions, because there's some terrible news for Python 3.

As I write this, the current versions of Python are 2.7.7, released May 31 2014, and 3.4.1, released May 19 2014 (Python release dates and release notes are all nicely accessible here).

Ubuntu 14.04 LTS ships with Python 2.7.6 and Python 3.4.0. These are almost current; in fact Python 3.4.0 was released at most a month before 14.04 itself was (14.04 was released in mid-April) and at the time of its release 14.04 shipped with the latest available versions of Python 2 and 3. Unfortunately but typically, Ubuntu probably won't update either over the lifetime of 14.04 LTS. Ubuntu LTS people are stuck with these versions for the next two years.

Red Hat Enterprise 7 ships with Python 2.7.5 and as far as I can tell no version of Python 3 at all in the standard version; possibly you can install a version of Python 3.3 in /opt through the RHEL/CentOS 'SCL' system. The upshot is that unless you go significantly out of your way you're not going to be using Python 3 on a RHEL 7 machine. RHEL and CentOS people are likely stuck with this whole situation for three or four years until the next version of RHEL is released.

We're still actively using Ubuntu 12.04 LTS and 10.04 LTS; these have Python 2.7.3 and Python 2.6.5 respectively. For Python 3 they have 3.2.3 and 3.1.2. Our remaining 10.04 machines will go away over the next year, bringing me up to some version of 2.7 on all our Ubuntu machines. Python 3 usage on either machine is probably hopeless, especially on 10.04; my impression is that if you're going to code to Python 3 you should be using 3.4 if at all possible and certainly some relatively recent version.

The current version of OmniOS ships with Python 2.6.8 and no version of Python 3. According to the OmniOS people you are not supposed to use their system version of Python but instead build your own at whatever version you want; however, their system version is camping on the /usr/bin/python name so in practice that version is what we'll use on OmniOS for script portability and so on (we have lots of scripts that start '#!/usr/bin/python' and we have very little interest in changing that). This means that I can't use 2.7 only features in portable system management Python scripts (or in scripts intended to run on our fileservers).

My Fedora 20 machines have Python 2.7.5 and 3.3.2 (which is the same version as is on Fedora 19). Based on evidence so far, Fedora probably won't update either version, so for more recent versions I get to wait for Fedora 21 (expected in the fall).

Wandering Thoughts is currently hosted on a FreeBSD 9.2 machine with Python 2.7.5 and no version of Python 3 installed. I don't know FreeBSD well enough to figure out from their website which version or versions FreeBSD 10 comes with, and it's probably not relevant to me anyways; the current host probably won't upgrade any time soon (and I'll probably never port DWiki to Python 3).

We don't currently have any Debian machines, but we might in the future if the Debian LTS effort catches on. Debian 6 'squeeze' (the only current LTS release) has Python 2.6.6 and Python 3.1.3, but we're unlikely to adopt that since it was released back in 2011. Debian 7 'wheezy' has Python 2.7.3 and unfortunately only Python 3.2.3.

I no longer care about the state of Python on Solaris; we're replacing our Solaris fileservers with new OmniOS fileservers.

The good news is that I can assume 2.6 everywhere and 2.7 in most places. The bad news for fans of Python 3 is that the Python 3 situation is still a disaster, alternating between outdated versions and being entirely missing. The odds of getting a good, up to date version of Python 3 on a machine when you type python3 are still rather low.

(See also the 2012 version of this. It's a bit striking how little has changed in two years, although some of that is that Python 2 is not exactly moving fast.)

MyPythonVersions2014-06 written at 02:33:57

2014-05-22

Why Python uses bytecode (well, probably): it's simpler

A while back I read Why are there so many Pythons?, which in passing talks about Python's internal use of bytecode and says:

In very brief terms: machine code is much faster, but bytecode is more portable and secure.

If you replace 'is' with 'can be', this is true. But it's not the reason that the main implementation of Python (hereafter 'CPython') uses bytecode. One clue that this isn't the case is that the .pyc files of bytecodes are not all that portable; they can differ between Python versions and possibly even different types of machines with the same version of CPython.

Put simply, CPython almost certainly uses bytecode because creating and then interpreting bytecode is a common implementation technique for writing reasonably complex interpreters. All interpreters need to parse their source language and then interpret (and execute) something, but it's often simpler (and faster) to transform the initial source language into some simpler format before interpreting it. A common 'simpler format' is some form of abstract bytecode, often extremely specialized to the language being interpreted and also how data is stored inside the interpreter.

(On modern CPUs, another advantage of transforming things from an abstract syntax tree of the parsed code to bytecode is that the bytecode can be made linear in memory and thus more cache friendly. Modern CPUs really don't like bouncing around all over to follow pointers; the less you do it the better.)

CPython's bytecode is just this sort of abstract bytecode. In theory you could describe it as a simple stack machine, but in practice the stack is only used for storing temporary values while computing expressions and so on. Actual Python variables and so on are accessed through a whole series of specialized bytecode instructions. The bytecode also has special instructions for things like creating instances of standard types like lists and dealing with iterators, none of which can be described as general-purpose outside of Python or as anything like a real machine's instruction set.
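You can look at this specialized bytecode for yourself with the standard dis module (the trivial function here is just an example; the exact opcodes printed vary between CPython versions):

```python
import dis

def add_one(x):
    return x + 1

# prints the stack-machine style bytecode for the function, with
# Python-specific opcodes like LOAD_FAST for local variable access
dis.dis(add_one)
```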

(And sometimes the exact details of this bytecode matter.)

As for security, Python bytecode is not necessarily all that secure by itself. While it doesn't allow you to perform random machine operations, I wouldn't be surprised if hand-generating crazy instruction sequences could do things like crash CPython (in fact, I'm pretty confident that doing this is relatively trivial) and lead to arbitrary code execution. The CPython bytecode interpreter is not intended as a general interpreter but instead as an interpreter for bytecode generated by CPython itself, which is guaranteed to obey the rules and not do things like attempt to get or set nonexistent function local variables.

Or to put it directly: it is not safe at all to have CPython run untrusted bytecode, even in a theoretically relatively captive environment. This is completely independent of what access the bytecode might have (or be able to contrive) to standard library functions like file and network access. Untrusted bytecode doesn't need access to stuff like that to wreak havoc.

(I can't be absolutely sure that this is why CPython uses bytecode because I haven't asked the Python developers about it, but I would be truly surprised if it was any other reason. Compiling to bytecode and interpreting the bytecode is a classic and standard interpreter implementation technique and CPython itself is a pretty classic implementation of it.)

WhyCPythonBytecode written at 02:27:53

2014-05-09

Some uses for Python's 'named' form of string formatting

I expect that every Python programmer is familiar with Python's normal way of formatting strings with % and 'printf' style format specifications. Let's call this normal way of formatting things a 'positional' way, because it's based on the position of the arguments given to be formatted. But as experienced Python programmers know, this is not the only way you can set up your formatting strings; you can also set them up so that they pick out what to format where based on name instead of argument position. Of course to do this you need to somehow attach names to the arguments, which is done by giving % a dictionary instead of its usual tuple.

Here's what this looks like, for people who haven't seen it before:

print "%(fred)d %(barney)d" % {'fred': 1, 'bob': 2, 'barney': 3}

Note that not all keys in the dictionary need to be used in the format string, unlike with positional arguments.

There are two general uses for named string format specifications, both of which usually start in a situation where the format specification itself is variable. The simple and straightforward use is rearranging the order of what gets printed, which can really come in handy for things like translating messages into different languages (this is apparently a sufficiently common need that it got its own feature in Python 3's new string formatting stuff). The more complex use is to print only a subset of information from a larger collection of available information. Effectively this makes '%' string formatting into a little templating system.

My uses of this have tended to be towards full blown templating where the person configuring my program is trusted to write the formatting strings (note that this can at least throw exceptions if they get it wrong). I can see uses for this in simpler setups, for example to log a number of different messages with somewhat different information depending on some combination of things. Rather than write full blown and repetitive code to explicitly emit N variations of the same logging call, you could just select different name-based formatting strings based on the specific circumstances.
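As a sketch of that last idea (the message levels, field names, and report function here are all invented for the example):

```python
# pick different name-based format strings depending on circumstances;
# every format string draws on the same dictionary of information and
# each one uses only the subset of keys it cares about.
msgs = {
    "terse": "%(user)s: %(count)d problems",
    "verbose": "%(user)s on %(host)s: %(count)d problems found",
}

def report(level, info):
    return msgs[level] % info

info = {"user": "fred", "host": "apps1", "count": 3}
```

Here report("terse", info) uses only user and count, while the verbose version pulls in host as well, with no repetitive per-message code.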

(I'll have to remember to experiment with this idea the next time I have this need. It feels like this might be an interesting new approach to deal with the whole issue of verbosity and including or not including certain bits of information and so on, which can otherwise clutter up the code something awful and be annoying to program.)

PS: Python 3's string formatting does this differently. Following my current policy on Python 3 I'm not thinking about it at all.

NamedFormattingUses written at 23:34:26

2014-04-27

Thoughts about Python classes as structures and optimization

I recently watched yet another video of a talk on getting good performance out of Python. One of the things it talked about was the standard issue of 'dictionary abuse', in this case in the context of creating structures. If you want a collection of data, the equivalent of a C struct, things that speed up Python will do much better if you say what you mean by representing it as a class:

class AStruct(object):
  def __init__(self, a, b, c):
    self.a = a
    self.b = b
    self.c = c

Even though Python is a dynamic language and AStruct instances could in theory be rearranged in many ways, in practice they generally aren't and when they aren't we know a lot of ways to speed them up and make them use minimal amounts of memory. If you instead just throw them into a dictionary, much less optimization is (currently) done.

(I suspect that many of these dynamic language optimizations could be applied to dictionary usage as well, it's just that people are hoping to avoid it for various reasons.)

My problem with this is that even small bits of extra typing tempt me into unwise ways of reducing it. In this early example I skipped having an __init__ function entirely and just directly assigned attributes on new instances, and I wrote a generic function to do it (this has a better version). This is all well and good in ordinary CPython, but now I have to wonder how far one can go before the various optimizers and JIT engines throw up their hands and give up on clever things.

(I suspect that the straightforward __init__ version is easiest for optimizers to handle, partly because it's a common pattern that attributes aren't added to an instance after __init__ finishes.)

It's tempting to ask for standard library support for simple structures in the form of something that makes them easy to declare. You could do something like 'AStruct = structs.create('a', 'b', 'c')' and then everything would work as expected (and optimizers would have a good hook to latch on to). Unfortunately such a function is hard to create today in Python, especially in a form that optimizers like PyPy are likely to recognize and accelerate. Probably this is too petty and limited a wish.

PS: of course the simplest and easiest to optimize version today is just a class that just has a __slots__ and no __init__. PyPy et al are guaranteed that no other attributes will ever be set on instances, so they can pack things as densely as they want.
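A minimal sketch of that version (AStruct is just the example name from above; this runs the same in Python 2 and 3):

```python
class AStruct(object):
    # __slots__ fixes the complete set of instance attributes up
    # front; there's no __init__, attributes are simply assigned
    # afterwards.
    __slots__ = ('a', 'b', 'c')

astr = AStruct()
astr.a, astr.b, astr.c = 1, 2, 3

# any attribute outside __slots__ is rejected outright:
try:
    astr.d = 4
except AttributeError:
    pass
```

Since instances have no __dict__ at all, nothing can sneak extra attributes onto them later.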

StructPerformanceThoughts written at 03:33:58

2014-04-14

My reactions to Python's warnings module

A commentator on my entry on the warnings problem pointed out the existence of the warnings module as a possible solution to my issue. I've now played around with it and I don't think it fits my needs here, for two somewhat related reasons.

The first reason is that it simply makes me nervous to use or even take over the same infrastructure that Python itself uses for things like deprecation warnings. Warnings produced about Python code and warnings that my code produces are completely separate things and I don't like mingling them together, partly because they have significantly different needs.

The second reason is that the default formatting that the warnings module uses is completely wrong for the 'warnings produced from my program' case. I want my program warnings to produce standard Unix format (warning) messages and to, for example, not include the Python code snippet that generated them. Based on playing around with the warnings module briefly it's fairly clear that I would have to significantly reformat standard warnings to do what I want. At that point I'm not getting much out of the warnings module itself.
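You can see the default format directly with warnings.formatwarning(), without even raising a warning (the message, file name, and line number here are invented):

```python
import warnings

# the default rendering is 'file:line: Category: message', optionally
# followed by the offending source line; nothing like a Unix-style
# 'program: warning: message' line.
msg = warnings.formatwarning("disk nearly full", UserWarning,
                             "checker.py", 42)
print(msg)
```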

All of this is a sign of a fundamental decision in the warnings module: the warnings module is only designed to produce warnings about Python code. This core design purpose is reflected in many ways throughout the module, such as in the various sorts of filtering it offers and how you can't actually change the output format as far as I can see. I think that this makes it a bad fit for anything except that core purpose.

In short, if I want to log warnings I'm better off using general logging and general log filtering to control what warnings get printed. What features I want there are another entry.

WarningsModuleReactions written at 01:19:06

