Wandering Thoughts archives

2015-04-26

My complicated feelings on abandoning old but good code

Yesterday I wrote about some twelve-year-old Python code I have that's still running unmodified, and commented both that it's a bit sad that Python 2 has changed so little that this code doesn't even emit warnings and that this is because Python as a whole has moved on to Python 3. This raises the obvious question: am I going to move portnanny (this code) to Python 3? My current answer is that I'm not planning to, because portnanny is basically end of life and abandoned.

I don't consider the program end of life because it's bad code or code that I would do differently if I were rewriting it today. It's EOL for a simpler reason, namely that I don't really have any use for what it does any more. This particular code was first written to sit in front of an NNTP server for Usenet and then actually mostly used as the SMTP frontend of a peculiar MTA. I haven't had any NNTP servers for a long time now, and the MTA that portnanny sits in front of is obsolete and should really be replaced (it lingers on only because it's still simpler to leave it alone). If and when portnanny or the MTA break, it probably won't be worth fixing them; instead, that will be my cue to replace the whole system with a modern one that works better.

All of this makes me sad, partly because portnanny handles what used to be an interesting problem but mostly because I think that portnanny is actually some of the best Python code I've written. It's certainly the best-tested Python code I've written; nothing else comes close. When I wrote it, I tried hard to do a good job in both structure and implementation and to follow the mantras of test-driven development for Python, and I'll probably never again write Python code that is as well done. Turning my back on the best Python code I may ever write isn't entirely a great feeling, and there's certainly part of me that doesn't want to do it, however silly that is.

(It's theoretically possible I'll write Python code this good in the future, but something significant would have to change in my job or my life in order to make writing high-quality Python an important part of it.)

There's a part of me that now wants to move portnanny to Python 3 just because. But I've never been able to get really enthused about programming projects without a clear use and need, and this would be just such a 'no clear use' make-work project. Maybe I'll do it anyway someday, but the odds are not good.

(Part of the reason that the odds are not good is that I think the world has moved on from tcpwrappers-style user-level filtering for access control, especially when it's embodied in external programs. So not only do I not really have a use for portnanny any more, I don't think anyone else would even if they knew about it.)

AbandoningOldGoodCode written at 01:50:03

2015-04-25

I'm still running a twelve-year-old Python program

I've been thinking for a while about the interesting and perhaps somewhat remarkable fact that I'm still running a twelve-year-old Python program (although not as intensively as I used to). When I say twelve years old, I don't mean that the program was first written twelve years ago and is still running; I mean that the code hasn't been touched since 2004. For bonus points, it uses a compiled module whose source hasn't been changed since 2004 either (although I have periodically rebuilt it, including moving it from 32-bit to 64-bit).

(I was going to say that the tests all still pass, but it turns out that I burned a now-obsolete IP address into one of them. Oops. Other than that, all the tests still pass.)

While the code uses old idioms (it entirely uses the two-argument form of raise, for example), none of them are so old and questionable that Python 2 emits deprecation warnings for them. I'm actually a little bit surprised by that; even back in 2004 I was probably writing old-fashioned code. Apparently it's still not too old-fashioned.
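(For illustration, this is roughly what that old idiom looks like next to the modern spelling; the exception and message here are invented:

    # Python 2 only: the old two-argument form of raise.
    raise ValueError, "bad port configuration"

    # The modern equivalent, which works in both Python 2 and Python 3:
    raise ValueError("bad port configuration")

The old form is a syntax error in Python 3, which is part of why code like this can only stay behind on Python 2.)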

Some of the long life of this old code can be attributed to the fact that Python 2 has been moving slowly. In 2004 I was writing against some version of Python 2.3, and the Python 2 line stopped really changing no later than Python 2.7 in 2010. Really, I doubt anyone was in a mood to deprecate very much in Python 2 after 2007 or so, and maybe earlier (I don't know when serious Python 3 development started; 3.0 was released at the end of 2008).

(With that said, Python 2.6 did deprecate some things, and there were changes in Python 2.4 and 2.5 as well.)

My impression is that this is a reasonably surprising lifespan for an unchanged set of source code, especially in an interpreted language. Even in a compiled language like C, I'd expect to have to update some function prototypes and stuff (never mind a move from 32 bits to 64 bits). While it's certainly been convenient for me in that I haven't had to pay any attention to this program and it just worked and worked even as I upgraded my system, I find myself a little bit sad that Python 2 has moved so slowly that twelve years later my code doesn't even get a single deprecation warning.

(The flip side argument is that my code would get plenty of warnings and explosions if I ran it on Python 3. In this view the language of Python as a whole has definitely moved on and I have just chosen to stay behind in a frozen pocket of oldness that was left for people like me, people who had old code they didn't want to bother touching.)

PS: It turns out that the Github repo is somewhat ahead of the state of the code I have running on my system. Apparently I did some updates when I set up the Github repo without actually updating the installed version. Such are the hazards of having any number of copies of code directories.

TwelveYearOldPythonProgram written at 01:15:28

2015-04-16

Are Python dictionaries necessarily constant-time data structures?

The general view of all forms of hash tables, Python's dictionaries included, is that they are essentially constant-time data structures under normal circumstances. This is not quite true under sufficiently perverse conditions where you have a high degree of collisions in the hashes of the keys, but let's assume that you don't have that for the moment. Ignoring hash collisions, can you treat dictionaries as fast constant-time data structures?
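(As a concrete aside before we set collisions entirely to the side, here is a deliberately perverse, made-up key class that shows the problem; every instance hashes to the same value, so inserting N keys does O(N^2) work in total:

    import time

    class Collider(object):
        """Worst-case key: every instance has the same hash."""
        def __hash__(self):
            return 0                # all instances collide
        def __eq__(self, other):
            return self is other    # but no two instances are equal

    # Each doubling of n roughly quadruples the total insertion time.
    for n in (1000, 2000, 4000):
        start = time.time()
        d = {}
        for i in range(n):
            d[Collider()] = i
        print("%d keys: %.3f seconds" % (n, time.time() - start))

Each new insertion has to probe past every previously inserted colliding key, comparing as it goes.)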

The answer to that question is 'not always', and the path to it has some interesting and perhaps surprising consequences. The potential problem is custom hash functions. If the objects you're using as dictionary keys have a __hash__ method, that method must be called to obtain each object's hash. This is Python code, so it's not necessarily going to be fast by comparison with regular dictionary operations. It may also take a visibly non-constant amount of time, depending on just how it computes the hash.

(For instance, hashing even Python strings is actually not a constant-time operation; it's linear in the length of the string. It's just that all of the code is in C, so by normal standards you're never going to notice the time differences.)
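(Here is a minimal, hypothetical sketch of a key class with a pure Python __hash__; the hashing work grows with the size of the object and runs as ordinary bytecode on every dictionary operation:

    class Path(object):
        """Hypothetical key whose hash is computed in Python code."""
        def __init__(self, components):
            self.components = tuple(components)

        def __hash__(self):
            # Pure Python: linear in the number of components, and
            # much slower per step than a C-level hash would be.
            h = 0
            for c in self.components:
                h = (h * 31 + hash(c)) & 0xffffffff
            return h

        def __eq__(self, other):
            return (isinstance(other, Path) and
                    self.components == other.components)

    d = {Path(["usr", "local", "bin"]): True}
    # This lookup re-runs the Python-level __hash__ (and then __eq__).
    print(d[Path(["usr", "local", "bin"])])

Every dictionary or set operation involving such a key pays the full Python-level cost again.)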

One of the things that you may want to consider as a result of this is memoizing the results of any expensive __hash__ method. This is what Python strings do; if you call hash() on a string, the actual hash computation is done only once and afterwards the already computed (and saved) hash value is just repeated back to you. This only works for things with immutable hash values, but then if your objects have hash values at all they should be immutable ones.
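(A minimal sketch of that memoization, using an invented class; it assumes the underlying data never changes, which is what makes caching the hash safe:

    class Fingerprint(object):
        """Hypothetical immutable key that caches its expensive hash."""
        def __init__(self, data):
            self._data = tuple(data)
            self._hash = None

        def __hash__(self):
            # Compute the (expensive) hash only once, then reuse it.
            if self._hash is None:
                h = 0
                for item in self._data:
                    h = (h * 1000003) ^ hash(item)
                self._hash = h
            return self._hash

        def __eq__(self, other):
            return (isinstance(other, Fingerprint) and
                    self._data == other._data)

    fp = Fingerprint(range(10000))
    hash(fp)   # does the real work
    hash(fp)   # just returns the saved value

Using None as the 'not yet computed' sentinel, rather than relying on a falsy value, means a legitimately computed hash of 0 is cached correctly too.)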

The real answer is that all of this is relatively theoretical. I'm pretty sure that almost no one uses complex custom __hash__ functions for objects defined in Python, although it seems relatively common to define simple ones that just delegate to the hash of another object (probably mostly or always a primitive object with a fast C level hash function). And if you do have objects with complex __hash__ functions that take noticeable amounts of time, you're probably not going to be using them as dictionary keys or set members very often because if you do, well, you'll notice.

On the other hand, the amount of work that the standard library's decimal.Decimal does in its __hash__ function is a little alarming (especially in Python 2.7). Having looked, I wouldn't encourage using them as dictionary keys or set members any time soon, at least not in high-volume dictionaries or sets. The Python 3 version of datetime is another potentially interesting case, since it does a certain amount of grinding away in Python __hash__ functions.

(In Python 2.7, datetime is a C-level module so all of its hashing operations presumably go really fast in general.)
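(If you want to see the cost on your own interpreter, a quick timeit comparison along these lines will do it; the particular values being hashed here are arbitrary:

    import timeit

    # How does hashing a Decimal compare to hashing a plain int?
    dec = timeit.timeit('hash(d)',
                        setup='from decimal import Decimal; '
                              'd = Decimal("12345.6789")')
    num = timeit.timeit('hash(i)', setup='i = 123456789')
    print("Decimal: %f  int: %f" % (dec, num))

The gap you see will depend heavily on the Python version, for the reasons above.)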

Sidebar: Custom hashes and the Global Interpreter Lock

Let's ask another question: is adding a new key and value to a dictionary an atomic operation that's inherently protected by the GIL? After all, the key might have a custom __hash__ function that runs Python code (and thus bytecode) during any dictionary operation. As far as I can tell from peering at the CPython code, the answer is more or less yes. Although dictionary or set operations may require calling Python code for __hash__ (and for that matter for custom __eq__ methods as well), this is all done conceptually 'before' the actual dictionary modification takes place. The actual modification happens all at once, so you'll never see a dictionary with eg a key set but not its corresponding value.
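(You can watch this happen with a made-up key class that announces when the interpreter calls back into Python code during a dictionary operation:

    class NoisyKey(object):
        """Hypothetical key that reports its __hash__/__eq__ calls."""
        def __init__(self, val):
            self.val = val
        def __hash__(self):
            print("__hash__ running")
            return hash(self.val)
        def __eq__(self, other):
            print("__eq__ running")
            return isinstance(other, NoisyKey) and self.val == other.val

    d = {}
    d[NoisyKey(1)] = "value"   # __hash__ runs, then the store happens at once
    d[NoisyKey(1)] = "again"   # __hash__, then __eq__ against the existing key

All of that Python code runs before the dictionary itself is touched.)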

This does mean that writing 'dct[ky] = val' may involve much more Python bytecode running than you expect (and thus a much higher chance that Python switches to another thread before the new key and value are added to the dictionary). But it's always been the case that Python might switch to another thread at almost any bytecode, so this hasn't created a new race, just widened the time window of an existing one you already had.
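(One way to see both halves of this is to disassemble the assignment; on CPython the store itself is the single STORE_SUBSCR opcode at the end, and any custom __hash__ or __eq__ bytecode runs from inside it. The exact opcodes vary by version:

    import dis

    def assign(dct, ky, val):
        dct[ky] = val

    # Shows three loads and then one STORE_SUBSCR; the dictionary
    # modification itself happens inside that single opcode.
    dis.dis(assign)

)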

DictHashingComplexity written at 00:28:20

