Wandering Thoughts

2018-11-16

Resisting the temptation to rely on Ubuntu for Django 1.11

One of the things that is on my mind is what to do about our Django web application as far as Python 3 goes. Right now it's Python 2, and even apart from people trying to get rid of Python 2 in general, the Django people have been quite explicit that Django 1.11 is the last version that will support Python 2 and that support for it will end in 2020 (probably 'at the start of 2020' in practice). Converting it to Python 3 is getting more and more urgent, but at the same time this is going to be a bunch of grinding work (I still haven't added any tests to it, for example).

The host that our Django web app runs on was recently upgraded to Ubuntu 18.04 LTS, so the other day I idly checked the version of Django that 18.04 packages; this turns out to be Django 1.11 (for both Python 2 and Python 3; Django 2.0 for Python 3 might just have missed Ubuntu's cutoff point, since it was only released at the end of 2017). Ubuntu 18.04 LTS will be supported for five years and Ubuntu never does the sort of major version updates that going from 1.11 to 2.x would be, so for a brief moment I had the great temptation to switch over to the Ubuntu 18.04 packaged version of Django 1.11 and then forget about the problem until 2022 or so.

Then I came to my senses, because Ubuntu barely fixes bugs and security issues at the best of times. To my surprise, Ubuntu actually has Django in their 'main' repo, which is theoretically fully supported, but in practice I don't really believe that Canonical will spend very much effort to keep Django 1.11 secure after the upstream Django developers drop support for it. No later than 2020, the Ubuntu 18.04 LTS version of Django 1.11 is very likely to become, effectively, abandonware. Unless we feel very confident that Django 1.11 will be completely secure at that point in our configuration, we should not keep running it (especially since a small portion of the application is exposed to the Internet).

(I wouldn't be surprised if Canonical backported at least some easy security fixes from 2.x to 1.11 after 2020. But I would be surprised to see them do any significant programming work for code that's significantly different between 1.11 and the current 2.x or for 1.11-specific issues.)

However much I'd like to ignore the issue for as long as possible or let myself believe that it can be someone else's issue, dealing with this is in my relatively immediate future. We just have to move our Django web app to Python 3 and Django 2.x, even though it's going to be at least a bit of a grind. Probably I should try to do it bit by bit, for example by spending even just an hour or a half hour a week adding a test or two to the current code.

(Part of why I feel so un-motivated is that we're going to have to invest a bunch of effort to wind up exactly where we are currently. The app works perfectly well as it is and we don't want anything that's in newer Django versions; we're upgrading purely to stay within the version coverage of security fixes. This is, sadly, a bunch of make-work.)

DjangoUbuntuLTSBadIdea written at 22:46:42; Add Comment

2018-11-12

What Python 3 versions I can use (November 2018 edition)

Back several years ago, I did a couple of surveys of what Python versions I could use for both Python 2 and Python 3, based on what was available on the platforms that we (and I) use. What Python 2 versions are available is almost irrelevant to me now; everything I still care about has a sufficiently recent version of 2.7, and anyway I'm moving to Python 3 for new code both personally and for work. So the much more interesting question is what versions of Python 3 are out there, or at least what major versions. Having gone through this exercise, my overall impression is that the Python 3 version landscape has stabilized for the uses that we currently make of Python 3.

At this point, a quick look at the release dates of various Python 3 versions is relevant. Python 3.4 was released March 16, 2014; 3.5 was released September 13, 2015; 3.6 was released December 23, 2016; 3.7 was only released on June 27, 2018. Right now, anyone using 3.7 on Unix is either using a relatively leading edge Unix distribution or built it themselves (I think it just got into Fedora 29 as the default 'Python 3', for example). However, I suspect that 3.6 is the usual baseline people developing Python 3 packages assume and target, perhaps with some people still supporting 3.5.

At work, we mostly have a mixture of Ubuntu LTS versions. The oldest one is Ubuntu 14.04; it's almost gone but we still have two last 14.04 servers for a couple more months and I actually did write some new Python 3 code for them recently. The current 14.04 Python 3 is 3.4.3, which is close enough to modern Python 3 that I didn't run into any problems in my simple code, but I wouldn't want to write anything significant or tricky that had to run in Python 3 on those machines.

(When I started writing the code, I actually asked myself if I wanted to fall back to Python 2 because of how old these machines were. I decided to see if Python 3 would still work well enough, and it did.)

We have a bunch of Ubuntu 16.04 machines that will be staying like that until 2020 or so, when 16.04 starts falling out of support. Ubuntu 16.04 currently has 3.5.2, and the big feature it doesn't have that I'm likely to run into is probably literal string interpolation; I can avoid it in my own code, but not necessarily in any third party modules I want to use. Until recently, the 16.04 Python 3.5 was the Python 3 that I developed to and most actively used, so it's certainly a completely usable base for our Python 3 code.
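
(For reference, 'literal string interpolation' is the f-string feature added in Python 3.6. A quick illustration with made-up values:)

>>> user, count = "cks", 3
>>> f"{user}: {count} messages"               # needs Python 3.6 or later
'cks: 3 messages'
>>> "{0}: {1} messages".format(user, count)   # the 3.5-compatible spelling
'cks: 3 messages'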

Ubuntu 18.04 has Python 3.6.6; 18.04 itself came out a few months before 3.7 was released. I honestly don't see very much in the 3.7 release notes that I expect to actively miss, although a good part of this is because we don't have any substantial Python programs (Python 3 or otherwise). If we used asyncio, for instance, I think we'd care a lot more about not having 3.7.

We have one CentOS 6 machine, but it's turning into a CentOS 7 machine some time in the next year and we're not likely to run much new Python code on it. However, just as back in 2014, CentOS 7 continues to have no version of Python 3 in the core package set. Fortunately we don't need to run any of our new Python 3 programs on our CentOS machines. EPEL has Python 3.4.9 and Python 3.6.6 if we turn out to need a version of Python 3 (CentOS maintains a wiki page on additional repositories).

My own workstation runs Fedora, which is generally current or almost current (depending on when Fedora releases happen and when Python releases happen). I'm currently still on Fedora 28 as I'm waiting for Fedora 29 to get some more bugs fixed. I have Python 3.6.6 by default and I could get Python 3.7 if I wanted it, and my default Python 3 will become 3.7 when I move to Fedora 29.

The machine currently hosting Wandering Thoughts is running FreeBSD 10.4 at the moment, which seems to have Python 3.6.2 available through the Ports system. However, moving DWiki (the Python software behind the blog) to Python 3 isn't something that I plan to do soon (although the time is closer than it was back in 2015). My most likely course of action with DWiki is to see what the landscape looks like for Python 2 starting in 2020, when it's formally no longer supported (and also what the landscape looks like for Python 3, for example if there are prospects of significant changes or if things appear to have quieted down).

(Perhaps I should start planning seriously for a Python 3 version of DWiki, though. 2020 is not that far away now and I don't necessarily move very fast with personal projects these days, although as usual I expect Python 2 to be viable and perfectly good for well beyond then. I probably won't want to write code in Python 2 any more by then, but then I'm not exactly modifying DWiki much right now.)

MyPython3Versions2018-11 written at 22:56:09; Add Comment

2018-10-28

The obviousness of inheritance blinded me to the right solution

This is a Python programming war story.

I recently wrote a program to generate things to drive low disk space alerts for our ZFS filesystems in our in-progress Prometheus monitoring system. ZFS filesystems are grouped together into ZFS pools, and in our environment it makes sense to alert on low free space in either or both (ZFS filesystems can run out of space without their pool running out of space). Since we have a lot of filesystems and many fewer pools, it also makes sense to be able to set a default filesystem alert level on a per-pool basis (and then perhaps override it for specific filesystems). The actual data that drives Prometheus must be on a per-object basis, so one thing the program has to do is expand those default alert levels out to be specific alerts for every filesystem in the pool without a specific alert level.

When I began coding the Python to parse the configuration file and turn it into a data representation, I started by thinking about the data representation. It seemed intuitively clear and obvious that a ZFS pool and a ZFS filesystem are almost the same thing, except that a ZFS pool has a bit more information, and therefore they should be in a general inheritance relationship with a fundamental base class (written here using attrs):

import attr

@attr.s
class AlertObj:
  name = attr.ib()
  level = attr.ib()
  email = attr.ib()

# A filesystem needs nothing beyond the common attributes.
class FSystem(AlertObj):
  pass

# A pool adds a default alert level for its filesystems.
@attr.s
class Pool(AlertObj):
  fs_level = attr.ib()

I wrote the code and it worked, but the more code I wrote, the more awkward things felt. As I got further and further in, I wound up adding ispool() methods and calling them here and there, and there was a tangle of things operating on this and that. It all just felt messy. Something was wrong but I couldn't really see what at the time.

For unrelated reasons, we wound up wanting to significantly revise how we drove low disk space alerts and rather than modify my first program, I opted to start over from scratch. One reason for this was because with the benefit of a little bit of distance from my own code, I could see that inheritance was the wrong data model for my situation. The right natural data representation was to have two completely separate sets of objects, one set for directly set alert levels, which lists both pools and filesystems, and one for default alert levels (which only contains pools because they're the only thing that creates default alert levels). The objects all have the same attributes (they only need name, level, and email).

This made the processing logic much simpler. Parsing the configuration file returns both sets of objects, the direct set and the defaultable set. Then we go through the second set and for each pool entry in it, we look up all of the filesystems in that pool and add them to the first set if they aren't already there. There is no Python inheritance in sight and everything is obviously right and straightforward.
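
To make that concrete, here's a minimal sketch of the expansion step under assumed names (my reconstruction, not the actual program; the pool-to-filesystems lookup is a hypothetical helper):

import attr

@attr.s
class AlertObj:
    name = attr.ib()
    level = attr.ib()
    email = attr.ib()

def expand_defaults(direct, defaultable, pool_filesystems):
    # direct: AlertObjs with explicitly set levels (pools and filesystems).
    # defaultable: AlertObjs for pools that set default filesystem levels.
    # pool_filesystems: maps a pool name to the filesystems in that pool.
    known = set(obj.name for obj in direct)
    for pool in defaultable:
        for fs in pool_filesystems[pool.name]:
            if fs not in known:
                direct.append(AlertObj(name=fs, level=pool.level, email=pool.email))
                known.add(fs)
    return direct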

In the new approach, it would also be relatively easy to add default alert levels that are driven by other sorts of things, for instance an idea of who owns a particular entity (pools are often owned collectively by groups, but individual filesystems may be 'owned' and used by specific people, some of whom may not care unless their filesystems are right out of space). The first version's inheritance-based approach would have just fallen over in the face of this; a default alert level based on ownership has no 'is-sort-of-a' relationship with ZFS filesystems or pools at all.

I've always known that inheritance wasn't always the right answer, partly because I have the jaundiced C programmer's view of object orientation; OO's fundamental purpose is to make my code simpler, and if it doesn't do that I don't use it. In theory this should have made me skip inheritance here; in practice, inheritance was such an obvious and shiny hammer that once I saw some of it, I proceeded to hit all of my code with it no matter what.

(If nothing else, the whole experience serves me as a useful learning experience. Maybe the next time around I will more readily listen to the feeling that my code is awkward and maybe something is wrong.)

BlindedByInheritance written at 00:49:22; Add Comment

2018-10-25

I should always give my Python classes a __str__ method

I have been going back and forth between Python and Go lately, and as part of that I have (re-)learned a sharp edged lesson about working in Python because of something that Go has built in that Python doesn't.

I do much of my debugging via print() statements or the equivalent. One of the conveniences of Go is that its formatted output package has built-in support for dumping structures. If you have a structure, and usually you do because they're often the Go equivalent of instances of classes, you can just tell fmt.Printf() to print the whole thing out for you with all the values and even the field names.

If you try this trick with a plain ordinary Python class that you've knocked together, what you get is of course:

>>> f = SomeClass("abc", 10)
>>> print(f)
<__main__.SomeClass object at 0x7f4b1f3c7fd0>

To do better, I need to implement a __str__ method. When I'm just putting together first round code to develop my approach to the problem and prove my ideas, it's often been very easy for me to skip this step; after all, I don't need that __str__ method to get my code working. Then I go to debug my code or, more often, explore how it's working in the Python interpreter and I discover that I really could use the ability to just see the insides of my objects without fishing around with dir() and direct field access and so on.

By the time I'm resorting to dir() and direct field access in the Python REPL, I'm not exactly doing print-based debugging any more. Running into this during exploration is especially annoying; I'll call a routine I've just written and I'm now testing, and I'll get back some almost opaque blobs. I could peer inside them, but it's especially annoying because I know I've done this to myself.

As the result of writing some Python both today and yesterday, today's Python resolution is that I'll write basic __str__ methods for all of my little data-holding classes. It only takes a minute or two and it will make my life significantly better.

(If I'm smart I'll make that __str__ follow some useful standard form instead of being clever and making up a format that is specific to the type of thing that I'm putting in a class. There are some times when I want a class-specific __str__ format, but in most cases I think I can at least live with a basically standard format. Probably I should copy what attrs does.)
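
For example, a basic generic __str__ in that spirit might look like this (my sketch, imitating the attrs output format for the SomeClass example above):

class SomeClass:
    def __init__(self, barn, fred):
        self.barn = barn
        self.fred = fred

    def __str__(self):
        # 'ClassName(field=value, ...)', in the style of attrs.
        args = ", ".join("%s=%r" % kv for kv in self.__dict__.items())
        return "%s(%s)" % (self.__class__.__name__, args)

print(SomeClass("abc", 10))   # SomeClass(barn='abc', fred=10)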

PS: collections.namedtuple() is generally not what I want for various reasons, including that I'm often going to mutate the fields of my instance objects after they've been created.

Sidebar: Solving this problem with attrs

If I was or am willing to use attrs (which I have pragmatic concerns with for some code), it will solve this problem for me with no fuss or muss:

>>> import attr
>>> @attr.s
... class SomeClass:
...    barn = attr.ib()
...    fred = attr.ib()
... 
>>> f = SomeClass("abc", 10)
>>> print(f)
SomeClass(barn='abc', fred=10)

I'm not quite sure that this will get me to use attrs all by itself, but I admit that it's certainly tempting. Attrs is even available as a standard package in Ubuntu 18.04 (with what is a relatively current version right now, 17.4.0 from the end of 2017).

I confess that I now really wish attrs was in the Python standard library so that I could use it without qualms as part of 'standard Python', just as I feel free to use things like urllib and json.

GivingClassesAStr written at 23:43:15; Add Comment

2018-10-16

Quickly bashing together little utilities with Python is nice

One of the reasons that I love Python, and one of the things that I love using it for, is its ability to quickly and easily bash together little utility programs. By their nature, these things don't get talked about very much (they're so small and often so one-off), but this time I have a couple of examples to talk about.

As part of writing yesterday's entry on the external email delivery delays we see, I found myself wanting to turn the human-friendly format that Exim reports time durations in into an integer number of seconds, so that I could sort them, and then later on I wanted to convert the other way, so that I could see what the few large second counts I was getting looked like in human-readable terms.

Exim's queue delay format looks like '1d19h54m13s'. The guts of my little Python program to convert these into seconds looks like this:

import re
import sys

rexp = re.compile("([a-z])")
timemap = {'s': 1,
           'm': 60,
           'h': 60*60,
           'd': 24*60*60,
           }

def process():
    for l in sys.stdin:
        sr = rexp.split(l)
        # The minimum split should be '1', 's', '\n'.
        if len(sr) < 3:
            continue
        secs = 0
        for i in range((len(sr)-1) // 2):
            o = i*2
            secs += int(sr[o]) * timemap[sr[o+1]]
        print(secs)

The core trick is to use a Python regexp to split '1d19h54m13s' into ['1', 'd', '19', 'h', '54', 'm', '13', 's'] (plus a trailing newline in this case). We can then take pairs of these things, turn the first into a number, and multiply it by the unit conversion determined by the second.
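
Here's what that split looks like in the interpreter (with a trailing '' instead of '\n' because there's no newline in the literal):

>>> import re
>>> re.compile("([a-z])").split("1d19h54m13s")
['1', 'd', '19', 'h', '54', 'm', '13', 's', '']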

Going the other direction looks surprisingly similar (for example, I literally copied the timemap over):

timemap = {'s': 1,
           'm': 60,
           'h': 60*60,
           'd': 24*60*60,
           }

def tsstring(secs):
    o = []
    for i in ('d', 'h', 'm', 's'):
        if secs >= timemap[i] or o:
            n = secs // timemap[i]
            o.append("%d%s" % (n, i))
            secs %= timemap[i]
    return "".join(o)

There are probably other, more clever ways to do both conversions, as well as little idioms that could make these shorter and perhaps more efficient. But one of the great quiet virtues of Python is that I didn't need to reach for any special idioms to bash these together. The only trick is the regular expression split and subsequent pairing in the first program. Everything else I just wrote out as the obvious thing to do.

Neither of these programs worked quite right the first time I wrote them, but in both cases they were simple enough that I could realize my oversight by staring at things for a while (and it didn't take very long). Neither needed a big infrastructure around them, and with both I could explore their behavior and the behavior of pieces of them interactively in the Python interpreter.

(Exploring things in the Python interpreter was where I discovered that the .split() was going to give me an odd number of elements no matter what, so I realized that I didn't need to .strip() the input line. I'm sure that people who work with RE .split() in Python know this off the top of their head, but I'm an infrequent user at this point.)

Neither of these programs have any error handling, but neither would really be improved in practice by having to include it. They would be more technically correct, but I should never feed these bad input and if I do, Python will give me an exception despite me skipping over the entire issue. That's another area where Python has an excellent tradeoff for quickly putting things together; I don't have to spend any time on error handling, but at the same time I can't blow my foot off (in the sense of quietly getting incorrect results) if an error does happen.

(I almost started writing the Exim format to seconds conversion program in Go, except when I looked it up time.ParseDuration() doesn't accept 'd' as a unit. I think writing it in Python was easier overall.)

Sidebar: Code bumming and not thinking things through

I knew I would be processing a decent number of Exim delay strings in my program, so I was worried about it being too slow. This led me to write the core loop in the current not quite obvious form with integer indexing, when it would be more straightforward to iterate through the split list itself, pulling elements off the front until there wasn't enough left. There are alternate approaches with an explicit index.

Looking at this now the day afterward, I see that I could have made my life simpler by remembering that range() takes an iteration step. That would make the core loop:

        for i in range(0, len(sr)-1, 2):
            secs += int(sr[i]) * timemap[sr[i+1]]

This would also perform better. Such are the perils of paying too close attention to one thing and not looking around to think about if there's a better, more direct way.

PythonQuickUtilsNice written at 21:51:42; Add Comment

2018-09-17

Python 3 supports not churning memory on IO

I am probably late to this particular party, just as I am late to many Python 3 things, but today (in the course of research for another entry) I discovered the pleasant fact that Python 3 now supports read and write IO to and from appropriate pre-created byte buffers. This is supported at the low level and also at the high level with file objects (as covered in the io module).

In Python 2, one of the drawbacks of Python for relatively high performance IO-related code was that reading data always required allocating a new string to hold it, and changing what you were writing also required new strings (you could write the same byte string over and over again without memory allocation, although not necessarily a Unicode string). Python 3's introduction of mutable bytestring objects (aka 'read-write bytes-like objects') means that we can bypass both issues now. With reading data, you can read data into an existing mutable bytearray (or a suitable memoryview), or a set of them. For writing data, you can write a mutable bytestring and then mutate it in place to write different data a second time. This probably doesn't help much if you're generating entirely new data (unless you can do it piece by piece), but is great if you only need to change a bit of the data to write a new chunk of stuff.
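
As a minimal illustration of the reading side (my sketch, with a hypothetical file and a hypothetical process_chunk() consumer):

buf = bytearray(64 * 1024)
view = memoryview(buf)
with open("/some/large/file", "rb") as f:
    while True:
        n = f.readinto(buf)        # fills buf in place, returns bytes read
        if n == 0:
            break
        process_chunk(view[:n])    # hand off just what was read, no copy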

One obvious question here is how you limit how much data you read. Python modules in the standard library appear to have taken two different approaches to this. The os module and the io module use the total size of the pre-allocated buffer or buffers you've provided as the only limit. The socket module defaults to the size of the buffer you provide, but allows you to further limit the amount of data read to below that. This initially struck me as odd, but then I realized that network protocols often have situations where you know you want only a few more bytes in order to complete some element of a protocol. Limiting the amount of data read below the native buffer size means that you can have a single maximum-sized buffer while still doing short reads if you only want the next N bytes.
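
Here's a small self-contained sketch of that (my own example): one big buffer, but recv_into() is told to read at most the next four bytes of it.

import socket

a, b = socket.socketpair()
a.sendall(b"HEADBODYBODYBODY")
buf = bytearray(4096)
n = b.recv_into(buf, 4)        # short read: just the four-byte 'header'
print(n, bytes(buf[:n]))       # -> 4 b'HEAD'
a.close(); b.close()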

(If I'm understanding things right, you could do this with a memoryview of explicitly limited size. But this would still require a new memoryview object, and they actually take up a not tiny amount of space; sys.getsizeof() on a 64-bit Linux machine says they're 192 bytes each. A bytearray's fixed size is actually smaller, apparently coming in at 56 bytes for an empty one and 58 bytes for one with a single byte in it.)

Sidebar: Subset memoryviews

Suppose you have a big bytearray object, and you want a memoryview of the first N bytes of it. As far as I can see, you actually need to make two memoryviews:

>>> b = bytearray(200)
>>> b[0:4]
bytearray(b'\x00\x00\x00\x00')
>>> m = memoryview(b)
>>> ms = m[0:30]
>>> ms[0:4] = b'1234'
>>> b[0:4]
bytearray(b'1234')

It is tempting to do 'memoryview(b[0:30])', but that creates a copy of the bytearray that you then get a memoryview of, so your change doesn't actually change the original bytearray (and you're churning memory). Of course if you intend to do this regularly, you'd create the initial memoryview up front and keep it around for the lifetime of the bytearray itself.

I'm a little bit surprised that memoryview objects don't have support for creating subset views from the start, although I'm sure there are good reasons for it.

Python3MutableBufferIO written at 23:32:23; Add Comment

2018-09-16

CPython has a fairly strongly predictable runtime, which can be handy

I recently needed a program to test and explore some Linux NFS client behavior (namely, our recent NFS issue). Because this behavior depended on what user-level operations the kernel saw, I needed to be very specific about what system calls my test setup made, in what order, and so on. I also wanted something that I could rapidly put together and easily revise and alter for experiments, to see just what sequence of (system call) operations were necessary to cause our issues. In a way the obvious language to write this in would be C, but instead I immediately turned to Python.

Beyond the speed of writing things in Python, the obvious advantage of Python here is that the os module provides more or less direct access to all of the system calls I wanted (ultimately mixed with the fcntl module in order to get flock()). Although Python normally works with file objects, which are abstracted, the os module gives you almost raw access to Unix file descriptors and the common operations on them, which map closely to system calls.

That latter bit is important, and leads to the subtle thing. Although the os module's documentation doesn't quite promise it directly, the operations it exposes translate almost completely directly to Unix system calls, and CPython's interpreter runtime doesn't alter them or add others intermixed into them (well, not others related to the files and so on that you're working on; it may do operations like request more memory, although probably not for simple test code). This means that you can write a fair amount of code using the os module (and fcntl, and a few others) that deal with raw Unix file descriptors (fds) and be pretty confident that Python is doing exactly what you asked it to and nothing else.
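
As an illustration (my own snippet, not the actual test program), each call here corresponds almost one to one to a system call, which keeps strace output easy to follow:

import os, fcntl

fd = os.open("/tmp/testfile", os.O_CREAT | os.O_RDWR, 0o644)   # open()
fcntl.flock(fd, fcntl.LOCK_EX)     # flock(fd, LOCK_EX)
os.write(fd, b"hello\n")           # write()
os.lseek(fd, 0, os.SEEK_SET)       # lseek()
data = os.read(fd, 6)              # read()
fcntl.flock(fd, fcntl.LOCK_UN)     # flock(fd, LOCK_UN)
os.close(fd)                       # close()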

This is something you get with C, of course, but it's not something you can always say about other language runtimes. For test programs like what I needed, it can be a quite handy sort of behavior. I already knew CPython behaved like this from previous work, which is why I was willing to immediately turn to it for my test program here.

(If you're sufficiently cautious, you'll want to verify the behavior with a system call tracer, such as strace on Linux. If you do, it becomes very useful that the CPython runtime makes relatively few system calls that you didn't ask it to make, so it's easy to find and follow the system calls produced by your test code. Again, some language runtime environments are different here; they may have a churn of their own system calls that are used to maintain background activities, which clutter up strace output and so on.)

CPythonPredictableSyscalls written at 01:03:59; Add Comment

2018-08-24

Incremental development in Python versus actual tests

After writing yesterday's entry on how I need to add tests to our Django app, I've been thinking about why it doesn't have them already. One way to describe the situation is that I didn't bother to write any tests when I wrote the app, but another view is that I didn't write tests because I didn't need to. So let me explain what I mean by that.

When I ran into the startup overhead of small Python programs, my eventual solution was to write a second implementation in Go, which was kind of an odd experience (as noted). One of the interesting differences between the two versions is that the Go version has a fair number of tests and the Python one doesn't have any. There are a number of reasons for this, but one of them is that in Go, tests are often how you interact with your code. I don't mean that philosophically; I mean that concretely.

In Python, if you've written some code and you want to try it out to see if it at least sort of works, you fire up the Python interpreter, do 'import whatever' (even for your main program), and start poking away. In Go, you have no REPL, so often the easiest way to poke at some new code is to open up your editor, write some minimal code that you officially call a 'test', and run 'go test' to invoke it (and everything else you have as a test). This is more work than running an interactive Python session and it's much slower to iterate on 'what happens if ...' questions about the code, but it has the quiet advantage that it's naturally persistent (since it's already in a file).

This is the sense in which I didn't need tests to write our Django app. As I was coding, I could use the Python REPL and then, later, Django's ready-to-go framework itself to see if my code worked. I didn't have to actually write tests in order to test my code, not in the way that you can really need to in Go. In Python, incremental development can easily be done with all of your 'tests' being ad hoc manual work that isn't captured or repeatable.

(Even in Go, my testing often trails off dramatically the moment I have enough code written that I can start running a command to exercise things. In the Go version of my Python program, basically all of the tests are for low-level things and I 'tested' to see if higher level things worked by running the program.)

PS: Django helps this incremental development along by making it easy to connect bits of your code up to a 'user interface' in the form of a web page. You need somewhat more than a function call in the REPL but not much more, and then you can use 'manage runserver ...' to give you a URL you can use to poke your code, both to see things rendered and to test form handling. And sometimes you can check various pieces of code out just through the Django admin interface.

PPS: Of course it's better to do incremental development by writing actual tests. But it takes longer, especially if you don't already know how to test things in the framework you're using, as I didn't when I was putting the app together (cf).

PythonREPLAndTests written at 01:11:54; Add Comment

2018-08-23

It's time for me to buckle down and add tests to our Django web app

Our Django web app is the Python 2 code that I'm most concerned about in a world where people are trying to get rid of Python 2, for two reasons. First, not only do we have to worry about Python 2 itself being available, but the Django people have been quite explicit that Django 1.11 is the last version that supports Python 2 and that they will stop supporting it in 2020. We probably don't want to be using an unsupported web framework. Second, it's probably the program that's most exposed to character set conversion issues, simply because that seems to be in the nature of things that deal with the web, databases, and so on. In short, we've got to convert it to Python 3 sometime, probably relatively soon, and it's likely to be more challenging than other conversions we've done.

One of the things that would make a Python 3 conversion less challenging is if we had (automated) tests for the app, ideally fairly comprehensive ones. Having solid tests for your code is best practices for a Python 3 conversion for good reasons, and they'd also probably help with things like our Django upgrade pains. Unfortunately we've never had them, which was something I regretted in 2014 and hasn't gotten any better since then, because there's never been a time when adding tests was either a high enough priority or something that I was at all enthused about doing.

(One of my reasons for not feeling enthusiastic is that I suspect that trying to test the current code would lead me to have to do significant revisions on it in order to make it decently testable.)

Looking at our situation, I've wound up feeling that it's time for this to change. Our path forward with the Django app should start with adding tests, which will make both Python 3 and future Django upgrades (including to 1.11 itself) less risky, less work, and less tedious (since right now I do all testing by hand).

(Hopefully adding tests will have other benefits for future development and so on, but some of these are contingent on additional factors beyond the scope of this entry.)

Unfortunately, adding tests to this code is likely to feel like make-work to me, and in a sense it is; the code already works (yes, as far as we know), so all that tests do is confirm that it does. I have no features to add, so I can't add tests to cover the new features as I add them; instead, this is going to have to be just grinding out tests for existing code. Still, I think it needs to be done, and the first step for doing it is for me to learn how to test Django code, starting by reading the documentation.
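
For my own reference, the basic shape of a Django test looks something like this (a minimal sketch based on the Django testing documentation; 'frontpage' is a hypothetical URL name, not one from our actual app):

from django.test import TestCase
from django.urls import reverse

class FrontPageTest(TestCase):
    def test_frontpage_renders(self):
        resp = self.client.get(reverse("frontpage"))
        self.assertEqual(resp.status_code, 200)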

(This entry is one of the ones that I'm writing in large part as a marker in the ground for myself, to make it more likely that I'll actually carry through on things. This doesn't always work; I still haven't actually studied dpkg and apt, despite declaring I was going to two years ago and having various tabs open on documentation since then. I've even read bits of the documentation from time to time, and then all of the stuff I've read quietly falls out of my mind again. The useful bits of dpkg I've picked up since then have come despite this documentation, not because of it. Generally they come from me having some problem and stumbling over a way to fix it. Unfortunately, our problems with our Django app, while real, are also big and diffuse and not immediate, so it's easy to put them off.)

DjangoWeNeedTests written at 01:38:11; Add Comment

2018-07-23

Some notes on lifting Python 2 code into Python 3 code

We have a set of Python programs that are the core of our ZFS spares handling system. The production versions are written in Python 2 and run on OmniOS on our ZFS fileservers, but we're moving to ZFS-based Linux fileservers, so this code needed a tune-up to cope with the change in environment. As part of our decision to use Python 3 for future tools, I decided to change this code over to Python 3 (partly because I needed to write some completely new Python code to handle Linux device names).

This is not a rewrite or even a port; instead, let's call it lifting code from Python 2 up to Python 3. Mechanically what I did is similar to the first time I did this sort of shift, which is that I changed the '#!/usr/bin/python' at the start of the programs to '#!/usr/bin/python3' and then worked to fix everything that Python 3 complained about. For this code, there have only been a few significant things so far:

  • changing all tabs to spaces, which I did with expand (and I think I overdid it, since I didn't use 'expand -i').

  • changing print statements into print() calls. I learned the hard way to not overlook bare 'print' statements; in Python 2 that produces a newline, while in Python 3 it's still valid but does nothing.

  • converting 'except CLS, VAR:' statements to the modern form, as this code was old enough to have a number of my old Python 2 code habits.

  • taking .sort()s that used comparison functions and figuring out how to creatively generate sort keys that gave the same results (see the sketch after this list). This opened my mind up a bit, although there are still nuances that using sort keys can't easily capture.

  • immediately list()-ifying most calls of adict.keys(), because that particular assumption was all over my code. There were a couple of cases that perhaps I could have deferred the list-ification to later (if at all), but this 'lifting' is intended to be brute force.

    (I didn't list-ify cases where I was clearly immediately iterating, such as 'for ... in d.keys()' or 'avar = [x for ... in d.keys()]'. But any time I assigned .keys() to a name or returned it, it got list-ified.)

  • replace use of optparse with argparse. This wasn't strictly necessary (Python 3 still has optparse), but argparse is the future so I figured I'd fix things while I was working on the code anyway.
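
To make the .sort() point concrete, here's a small sketch (my own example, not the actual spares code) of converting a comparison-function sort into a key-based one:

from collections import namedtuple

Entry = namedtuple("Entry", ["name", "size"])
entries = [Entry("a", 10), Entry("b", 30), Entry("c", 30)]

# Python 2 version, sorting by descending size and then ascending name:
#   entries.sort(lambda x, y: cmp(y.size, x.size) or cmp(x.name, y.name))
# Python 3 version, creatively generating a key with the same ordering:
entries.sort(key=lambda e: (-e.size, e.name))
print(entries)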

Although these tools do have a certain amount of IO, I could get away with relying on Python 3's default character set conversion rules; in practice they should only ever be dealing with ASCII input and output, and if they aren't something has probably gone terribly wrong (eg our ZFS status reporting program has decided to start spraying out binary garbage). This is fairly typical of internal-use system tools but not necessarily of other things, which can expose interesting character set conversion questions.

(My somewhat uninformed view is that character set conversion issues are where moving from Python 2 to Python 3 gets exciting. If you can mostly ignore them, as I could here, you have a much easier time. If you have to consider them, it's probably going to be more porting than just casually lifting the code into Python 3.)

For the most part this 2-to-3 lifting went well and was straightforward. It would have gone better if I had meaningful tests for this code, but I've always had problems writing tests for command line programs (and some of this code is unusually complex to test). I used pyflakes to try to help find Python 3 issues that I'd overlooked; it found some issues but not all of them, and it at least feels less thorough than pychecker used to be. What I would really like is something that's designed to look for lingering Python 2-isms that either definitely don't work in Python 3 or that might be signs of problems, but I suspect that no such tool exists.

(I tried pylint very briefly, but stopped when it had an explosion of gripes with no obvious way to turn off most of them. I don't care about style 'issues' in this code; I want to know about actual problems.)

I'm a bit concerned that there are lingering problems in the code, but this is basically the tradeoff I get to make for taking the approach of 'lifting' instead of 'porting'. Lifting is less work if everything is straightforward and goes well, but it's not as thorough as carefully reading through everything and porting it piece by carefully considered piece (or using tests on everything). I had to stumble over a few .sort()s with comparison functions and un-listified .keys(), especially early on, which has made me conscious that there could be other 2-to-3 issues I just haven't hit in my test usage of the programs. That's one reason I'd like a scanner; it would know what to look for (probably better than I do right now) and as a program, it would look in all of the code's corners.

PS: I remember having a so-so experience with 2to3 many years in the past, but writing this entry got me to see what it did to the Python 2 versions. For the most part it was an okay starting point, but it didn't even flag uses of .sort() with a comparison function and it did significant overkill on list-ifying adict.keys(). Still, reading its proposed diffs just now was interesting. Probably not interesting enough to get me to use it in the future, though.

LiftingPython2ToPython3 written at 23:56:44; Add Comment
