2018-07-23
Some notes on lifting Python 2 code into Python 3 code
We have a set of Python programs that are the core of our ZFS spares handling system. The production versions are written in Python 2 and run on OmniOS on our ZFS fileservers, but we're moving to ZFS-based Linux fileservers, so this code needed a tune-up to cope with the change in environment. As part of our decision to use Python 3 for future tools, I decided to change this code over to Python 3 (partly because I needed to write some completely new Python code to handle Linux device names).
This is not a rewrite or even a port; instead, let's call it lifting
code from Python 2 up to Python 3. Mechanically what I did is similar
to the first time I did this sort of shift,
which is that I changed the '#!/usr/bin/python' at the start of
the programs to '#!/usr/bin/python3' and then worked to fix
everything that Python 3 complained about. For this code, there have
only been a few significant things so far:
- changing all tabs to spaces, which I did with expand (and I think I overdid it, since I didn't use 'expand -i').
- changing print statements into print() calls. I learned the hard way to not overlook bare 'print' statements; in Python 2 that produces a newline, while in Python 3 it's still valid but does nothing.
- converting 'except CLS, VAR:' statements to the modern form, as this code was old enough to have a number of my old Python 2 code habits.
- taking .sort()s that used comparison functions and figuring out how to creatively generate sort keys that gave the same results. This opened my mind up a bit, although there are still nuances that using sort keys can't easily capture.
- immediately list()-ifying most calls of adict.keys(), because that particular assumption was all over my code. There were a couple of cases that perhaps I could have deferred the list-ification to later (if at all), but this 'lifting' is intended to be brute force.

  (I didn't list-ify cases where I was clearly immediately iterating, such as 'for ... in d.keys()' or 'avar = [x for ... in d.keys()]'. But any time I assigned .keys() to a name or returned it, it got list-ified.)
- replacing uses of optparse with argparse. This wasn't strictly necessary (Python 3 still has optparse), but argparse is the future so I figured I'd fix things while I was working on the code anyway.
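To make the changes in the list above concrete, here is a minimal sketch of what the Python 3 side of each one can look like. Every name in it (disks, cmp_disks, parse_size, pools, the command line options) is made up for illustration and is not from the actual spares handling code. The functools.cmp_to_key() wrapper is shown as an alternative to hand-written sort keys, not as what was done here.

```python
import argparse
import functools

# Hypothetical data: (device name, size in GB) pairs.
disks = [("sdc", 500), ("sda", 1000), ("sdb", 500)]

# A Python 2 style comparison function: biggest disks first, then by name.
# In Python 2 this would have been used as 'disks.sort(cmp_disks)'.
def cmp_disks(a, b):
    if a[1] != b[1]:
        return b[1] - a[1]
    return (a[0] > b[0]) - (a[0] < b[0])

# Option 1: re-express the ordering as a sort key.
by_key = sorted(disks, key=lambda d: (-d[1], d[0]))

# Option 2: keep the old comparison function and wrap it for Python 3.
by_cmp = sorted(disks, key=functools.cmp_to_key(cmp_disks))

# 'except CLS, VAR:' becomes 'except CLS as VAR:'.
def parse_size(s):
    try:
        return int(s)
    except ValueError as e:
        print("bad size:", e)
        return None

# A bare 'print' is a silent no-op expression in Python 3;
# producing an empty line needs an actual call.
print()

# .keys() is a view in Python 3, so code that wants to sort it in
# place or index it needs a real list.
pools = {"tank": 3, "scratch": 1}
names = list(pools.keys())
names.sort()

# optparse's parser.add_option() calls map fairly directly onto
# argparse's parser.add_argument(); positional arguments are declared
# instead of being left over in a separate 'args' list.
parser = argparse.ArgumentParser(description="hypothetical spares tool")
parser.add_argument("-n", "--dry-run", action="store_true")
parser.add_argument("pools", nargs="*")
args = parser.parse_args(["-n", "tank"])
```

The two sorts give the same order here, but the wrapped comparison function is the safer choice when the old ordering has nuances that are awkward to express as a key.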
Although these tools do have a certain amount of IO, I could get away with relying on Python 3's default character set conversion rules; in practice they should only ever be dealing with ASCII input and output, and if they aren't something has probably gone terribly wrong (eg our ZFS status reporting program has decided to start spraying out binary garbage). This is fairly typical of internal-use system tools but not necessarily of other things, which can expose interesting character set conversion questions.
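For contrast with relying on the defaults, here is a small sketch of what being explicit about the ASCII-only assumption could look like. The decode_status() function is hypothetical; the actual tools do not do this.

```python
def decode_status(raw):
    """Decode tool output that is expected to be pure ASCII.

    Hypothetical helper: it fails loudly on anything else instead of
    letting a default encoding quietly accept or mangle it.
    """
    try:
        return raw.decode("ascii")
    except UnicodeDecodeError as e:
        raise ValueError(
            "non-ASCII byte in status output at offset %d" % e.start
        ) from e
```

With something like this, binary garbage in the status output becomes an immediate, clearly labeled error at the point where the output is read.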
(My somewhat uninformed view is that character set conversion issues are where moving from Python 2 to Python 3 gets exciting. If you can mostly ignore them, as I could here, you have a much easier time. If you have to consider them, it's probably going to be more porting than just casually lifting the code into Python 3.)
For the most part this 2-to-3 lifting went well and was straightforward. It would have gone better if I had meaningful tests for this code, but I've always had problems writing tests for command line programs (and some of this code is unusually complex to test). I used pyflakes to try to help find Python 3 issues that I'd overlooked; it found some issues but not all of them, and it at least feels less thorough than pychecker used to be. What I would really like is something that's designed to look for lingering Python 2-isms that either definitely don't work in Python 3 or that might be signs of problems, but I suspect that no such tool exists.
(I tried pylint very briefly, but stopped when it had an explosion of gripes with no obvious way to turn off most of them. I don't care about style 'issues' in this code; I want to know about actual problems.)
I'm a bit concerned that there are lingering problems in the code,
but this is basically the tradeoff I get to make for taking the
approach of 'lifting' instead of 'porting'. Lifting is less work
if everything is straightforward and goes well, but it's not as
thorough as carefully reading through everything and porting it
piece by carefully considered piece (or using tests on everything).
I had to stumble over a few .sort()s with comparison functions
and un-listified .keys(), especially early on, which has made me
conscious that there could be other 2-to-3 issues I just haven't
hit in my test usage of the programs. That's one reason I'd like a
scanner; it would know what to look for (probably better than I do
right now) and as a program, it would look in all of the code's
corners.
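A short sketch of why these particular issues only turn up in use: both of the hypothetical functions below are perfectly valid Python 3 syntax, so nothing complains until the code path is actually run.

```python
pools = {"tank": 3, "scratch": 1}

def sorted_names_py2_style(d):
    names = d.keys()
    names.sort()          # Python 3: dict views have no .sort() method
    return names

def sort_with_cmp_py2_style(items):
    # Python 3: list.sort() no longer accepts a positional comparison function
    items.sort(lambda a, b: len(a) - len(b))
    return items

# Defining the functions raises nothing; calling them does.
errors = []
for func, arg in ((sorted_names_py2_style, pools),
                  (sort_with_cmp_py2_style, ["bb", "a"])):
    try:
        func(arg)
    except (AttributeError, TypeError) as e:
        errors.append(type(e).__name__)
```

The first call fails with an AttributeError and the second with a TypeError, but only if test usage happens to reach them.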
PS: I remember having a so-so experience with 2to3 many years in
the past, but writing this entry got me to see what it did to the
Python 2 versions. For the most part it was an okay starting point,
but it didn't even flag uses of .sort() with a comparison function
and it did significant overkill on list-ifying adict.keys().
Still, reading its proposed diffs just now was interesting. Probably
not interesting enough to get me to use it in the future, though.
The irritatingly many executable formats of Windows
So I tweeted:
It's impressive how many different executable file formats Windows has.
(I care because our email system wants to reject top-level attachments that are Windows 'executables' and boy is the list getting long.)
I put 'executables' into quotes in this tweet because many of these file formats (or more exactly file types) are not binaries; instead they're text files that Windows will feed to various things that will interpret them in ways that you don't want. Typical extensions that we see as top level attachments (and reject at SMTP time) include .lnk, .js, .bat, .com, .exe, .vbs, and .vbe. Some of these are encoded binaries, while others are text.
We mostly do this checking and rejection based on the file extensions of MIME attachments, partly because it's easiest. Also, for the ones that are text (and at least some of the ones that are encoded binaries), my understanding is that what makes them dangerous on a Windows machine is their file extension. A suitable text file with the extension ".txt" will be opened harmlessly in some editor, while the same file with the extension ".js" will generally be run if you try to open it.
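As an illustration of extension-based checking, here is a minimal sketch using Python's standard email package. It is not how the real check is implemented (that happens in the mail system at SMTP time); the function name and the sample message are made up, and the extension set is just the list mentioned above.

```python
import email
import os.path
from email import policy
from email.message import EmailMessage

# The extensions named in this entry; a real list would be longer.
BLOCKED_EXTENSIONS = {".lnk", ".js", ".bat", ".com", ".exe", ".vbs", ".vbe"}

def blocked_attachments(raw_message):
    """Return the filenames of top-level parts with a blocked extension."""
    msg = email.message_from_bytes(raw_message, policy=policy.default)
    # Only look at the top level: the message itself if it is not
    # multipart, otherwise its immediate subparts.
    parts = msg.iter_parts() if msg.is_multipart() else [msg]
    bad = []
    for part in parts:
        fname = part.get_filename()
        if fname is None:
            continue
        # Extensions are compared case-insensitively.
        ext = os.path.splitext(fname)[1].lower()
        if ext in BLOCKED_EXTENSIONS:
            bad.append(fname)
    return bad

# Build a hypothetical test message with one bad and one harmless attachment.
sample = EmailMessage()
sample["Subject"] = "Your invoice"
sample.set_content("Please see attached.")
sample.add_attachment(b"MZ\x90\x00", maintype="application",
                      subtype="octet-stream", filename="Invoice.EXE")
sample.add_attachment("just some notes", filename="notes.txt")

found = blocked_attachments(sample.as_bytes())
```

Because only the immediate subparts are examined, files inside nested containers such as forwarded messages are deliberately not checked in this sketch.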
(We do some file content sniffing to look for and reject unlabeled Windows executables, ie things which libmagic will report as type 'application/x-dosexec'. As you can see here, there are actually a lot of (sub)formats that map to this.)
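To give a rough idea of what content sniffing involves, here is a much cruder check than libmagic's, written as a sketch rather than a description of the real setup. It assumes only the well-known 'MZ' signature at the start of DOS and Windows executables and the 'PE' signature whose offset is stored at byte 0x3C of the header; the function name is hypothetical.

```python
import struct

def classify_executable(data):
    """Return "PE", "MZ", or None for the start of a file's contents."""
    if data[:2] != b"MZ":
        return None
    if len(data) >= 0x40:
        # The 4-byte little-endian value at 0x3C points at the PE header.
        (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
        if data[pe_offset:pe_offset + 4] == b"PE\0\0":
            return "PE"
    # Starts with MZ but has no PE header: a plain DOS executable
    # or some other subformat.
    return "MZ"
```

Even this two-way split hints at why a single reported type covers many subformats.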
We've historically added extensions one at a time as we run into them, usually when our commercial anti-spam system rejects one of them as being a virus (this time, several .pif files being rejected as 'W32/Mytob-C'). Possibly this is the wrong approach and we should find a master list somewhere to get almost all of this over with at once (perhaps starting from GMail's list of blocked file types). On the other hand, there's some benefit to passing up rejections, especially if you don't actually seem to need them. If we never see a file type, well, why block it?
(I'm not completely convinced by this logic, by the way. But I'm lazy and also very aware that I could spend all my time building intricate anti-spam precautions of dubious actual benefit.)