Wandering Thoughts


How to handle Unicode character decoding errors depends on your goals

In a comment on my entry mulling over DWiki's Python 3 Unicode issues and what I plan to do about them, Sean A. asked a very good question about how I'm planning to handle errors when decoding things from theoretical UTF-8 input:

Out of curiosity, why use backslashreplace instead of surrogateescape? (I ask because it seems to me that surrogateescape also loses no information, is guaranteed to work with any binary input, and is designed for reading unknown encodings.)

Oh. And is trivial to convert back into the original binary data.

The reason I think I want Python's 'backslashreplace' error handling instead of 'surrogateescape' is that my ultimate goal is not to reproduce the input (in all its binary glory) in my output, but to produce valid UTF-8 output (for HTML, Atom syndication feeds, and so on) even if some of the input isn't valid.

(Another option is to abort processing if the input isn't valid, which is not what I want. It would be the most conservative and safe choice in some situations.)

Given that I'm going to produce valid UTF-8 no matter what, the choice comes down to what generates more useful results for the person reading what was invalid UTF-8 input. You can certainly do this with 'surrogateescape' by just encoding to straight UTF-8 using the 'surrogatepass' handler, but the resulting directly encoded surrogate characters are not going to show up as anything useful and might produce outright errors from some things (and possibly be misinterpreted under some circumstances).

(With 'surrogateescape', each undecodable byte is mapped to a code point in the range U+DC80 to U+DCFF, which lies in the 'low' part of the Unicode surrogates range. As Wikipedia notes, 'isolated surrogate code points have no general interpretation', and certainly they don't have a distinct visual representation.)
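
As a minimal sketch of what this looks like in practice (the sample bytes are made up for illustration), here is 'surrogateescape' mapping undecodable bytes into that range, and what happens when the result is turned back into bytes in various ways:

```python
# Sample input with two bytes that are not valid UTF-8 (illustrative only).
raw = b"abc\xff\x80"

# Each undecodable byte 0xNN becomes the lone surrogate U+DCNN.
text = raw.decode("utf-8", errors="surrogateescape")
print(ascii(text))          # 'abc\udcff\udc80'

# Encoding with the same handler recovers the original bytes exactly.
roundtrip = text.encode("utf-8", errors="surrogateescape")

# A strict encode refuses lone surrogates, so the str is not directly
# usable as output.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# 'surrogatepass' encodes the surrogates directly; a strict UTF-8 decoder
# (Python's own included) then rejects the resulting bytes.
passed = text.encode("utf-8", errors="surrogatepass")
try:
    passed.decode("utf-8")
    passed_is_valid = True
except UnicodeDecodeError:
    passed_is_valid = False
```

The round trip is what makes 'surrogateescape' attractive for reproducing input; the last step is why it is awkward when the goal is output that everything else will accept.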

Out of all of Python's available codecs error handlers that can be used when decoding from UTF-8 to Unicode, 'backslashreplace' is the one that preserves the most information in a visually clear manner while still allowing you to easily produce valid UTF-8 output that everyone is going to accept. The 'replace' handler has the drawback of making all invalid characters look the same and so leaves you with no clues as to what they look like in the input, and 'ignore' just tosses them away entirely, leaving everyone oblivious to the fact that bad characters were there in the first place.
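
To make the comparison concrete, here is a small sketch that decodes the same made-up bad input with each of the three handlers. The sample bytes are an assumption chosen for illustration (Latin-1 text mislabeled as UTF-8):

```python
# Latin-1 bytes that claim to be UTF-8 (made-up sample input).
bad = b"caf\xe9 au lait"

results = {
    handler: bad.decode("utf-8", errors=handler)
    for handler in ("backslashreplace", "replace", "ignore")
}

for handler, text in results.items():
    # Every one of these results encodes to strictly valid UTF-8.
    text.encode("utf-8")
    print("%-16s %s" % (handler, ascii(text)))
```

Only the 'backslashreplace' result still says which byte was bad; 'replace' leaves a generic U+FFFD marker and 'ignore' leaves nothing at all.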

(In some situations this makes 'ignore' the right choice, because you may not want to give people any marker that something is wrong; such a marker might only confuse them about something they can't do anything about. But since I'm going to be looking at the rendered HTML and so on myself, I want to have at least a chance to know that DWiki is seeing bad input. And 'replace' has the advantage that it's visible but is less peculiar and noisy than 'backslashreplace'; you might use it when you want some visual marker present that things are a bit off, but don't want to dump a bucket of weird backslashes on people.)

PS: This does mean that my choice here is a bit focused on what's useful for me. For me, having some representation of the actual bad characters visible in what I see gives me some idea of what to look for in the page source and what I'm going to have to fix. For other people, it's probably going to be more noise than anything else.

UnicodeDecodeErrorChoice written at 01:19:32


Two annoyances I have with Python's imaplib module

As I mentioned yesterday, I recently wrote some code that uses the imaplib module. In the process of doing this, I wound up experiencing some annoyances, one of them a traditional one and one a new one that I've only come to appreciate recently.

The traditional annoyance is that the imaplib module doesn't wrap errors from other modules that it uses. This leaves you with at least two problems. The first is that you get to try to catch a bunch of exception classes to handle errors:

try:
  c = ssl.create_default_context()
  m = imaplib.IMAP4_SSL(host=host, ssl_context=c)
except (imaplib.IMAP4.error, ssl.SSLError, OSError) as e:
  ...  # report or handle the failure here

The second is that, well, I'm not sure I'm actually catching all of the errors that calling the imaplib module can raise. The module doesn't document them, and so this list is merely the ones that I've been able to provoke in testing. This is the fundamental flaw of not wrapping exceptions that I wrote about many years ago; by not wrapping exceptions, you make what modules you call an implicit part of your API. Then you usually don't document it.
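
One way to keep this from leaking into the rest of a program is to do the wrapping yourself at the point of use. The following is only a sketch under assumptions: `IMAPCheckError` and `imap_connect()` are hypothetical names, the `connector` parameter exists purely so the function can be exercised without a real server, and the exception list is just the one given above, which may well be incomplete.

```python
import imaplib
import ssl


class IMAPCheckError(Exception):
    """Hypothetical single error type exposed to the rest of the program."""


def imap_connect(host, connector=imaplib.IMAP4_SSL):
    """Connect to host, turning every known failure into IMAPCheckError.

    'connector' defaults to imaplib.IMAP4_SSL; it is a parameter only so
    that tests can substitute something that fails on demand.
    """
    try:
        ctx = ssl.create_default_context()
        return connector(host=host, ssl_context=ctx)
    except (imaplib.IMAP4.error, ssl.SSLError, OSError) as e:
        # 'from e' keeps the original exception available as __cause__.
        raise IMAPCheckError("IMAP connection to %s failed: %s" % (host, e)) from e
```

Callers then only need `except IMAPCheckError`, and any new exception type that turns up in testing gets added in exactly one place.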

I award the imaplib module bonus points for having its error exception class accessed via an attribute on another class. I'm sure there's a historical reason for this, but I really wish it had been cleaned up as part of the Python 3 migration. In the current Python 3 source, these exception classes are actually literally classes inside the IMAP4 class:

class IMAP4:
  class error(Exception): pass
  class abort(error): pass
  class readonly(abort): pass
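
A quick sketch that checks how these nested classes relate to each other and to OSError, using only the standard library:

```python
import imaplib

# The exception classes hang off IMAP4 and form a simple chain.
assert issubclass(imaplib.IMAP4.abort, imaplib.IMAP4.error)
assert issubclass(imaplib.IMAP4.readonly, imaplib.IMAP4.abort)

# IMAP4_SSL inherits from IMAP4, so it exposes the very same classes.
assert imaplib.IMAP4_SSL.error is imaplib.IMAP4.error

# None of them derive from OSError, which is why socket and TLS failures
# have to be listed separately in an except clause.
assert not issubclass(imaplib.IMAP4.error, OSError)
```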

The other annoyance is that the imaplib module doesn't implement any sort of timeouts, either on individual operations or on a whole sequence of them. If you aren't prepared to wait for potentially very long amounts of time (if the IMAP server has something go wrong with it), you need to add some sort of timeout yourself through means outside of imaplib, either something like signal.setitimer() with a SIGALRM handler or through manipulating the underlying socket to set timeouts on it (although I've read that this causes problems, and anyway you're normally going to be trying to work through SSL as well). For my own program I opted to go the SIGALRM route, but I have the advantage that the only thing I'm doing is IMAP. A more sophisticated program might not want to blow itself up with a SIGALRM just because the IMAP side of things was too slow.
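
As a rough sketch of the SIGALRM approach (not necessarily what my program does), a small context manager can impose an overall time limit on a block of code. `OperationTimeout` and `time_limit()` are hypothetical names, and this only works on Unix and only in the main thread, because that is where Python delivers signals:

```python
import contextlib
import signal
import time


class OperationTimeout(Exception):
    """Hypothetical exception raised when the time limit expires."""


@contextlib.contextmanager
def time_limit(seconds):
    def _on_alarm(signum, frame):
        raise OperationTimeout("no result after %s seconds" % seconds)

    old_handler = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)      # cancel any pending alarm
        signal.signal(signal.SIGALRM, old_handler)   # restore the old handler


try:
    with time_limit(0.2):
        time.sleep(5)         # stand-in for an imaplib call that is stuck
    finished = True
except OperationTimeout:
    finished = False
```

Because the handler raises an exception in whatever code happens to be running, this is only comfortable in a program where the timed block is the only thing going on.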

Timeouts aren't something that I used to think about when I wrote programs that were mostly run interactively and did only one thing, where the timeout is most sensibly imposed by the user hitting Ctrl-C to kill the entire program. Automated testing programs and other, similar things care a lot about timeouts, because they don't want to hang if something goes wrong with the server. And in fact it is possible to cause imaplib to hang for quite a long time in a very simple way:

m = imaplib.IMAP4_SSL(host=host, port=443)

You don't even need something that actually responds and gets as far as establishing a TLS session; it's enough for the TCP connection to be accepted. This is reasonably dangerous, because 'accept the connection and then hang' is more or less the expected behavior for a system under sufficiently high load (accepting the connection is handled in the kernel, and then the system is too loaded for the IMAP server to run).
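
The 'connection accepted by the kernel, nothing ever answers' situation is easy to reproduce locally. This sketch uses plain sockets rather than imaplib, and the loopback listener and the 0.5 second timeout are arbitrary choices for illustration:

```python
import socket

# A listening socket whose owner never calls accept(). The kernel still
# completes the TCP handshake for connections waiting in the backlog.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))       # port 0: let the OS pick a free port
server.listen(1)
port = server.getsockname()[1]

# The connect succeeds even though no program is servicing the connection.
client = socket.create_connection(("127.0.0.1", port), timeout=0.5)
try:
    client.recv(1)                  # without the timeout this blocks forever
    timed_out = False
except socket.timeout:
    timed_out = True
finally:
    client.close()
    server.close()
```

A client waiting for an IMAP greeting or a TLS handshake is in the same position as the `recv()` here, just without the timeout.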

Overall I've wound up feeling that the imaplib module is okay for simple, straightforward uses but it's not really a solid base for anything more. Sure, you can probably use it, but you're also probably going to be patching things and working around issues. For us, using imaplib and papering over these issues is the easiest way forward, but if I wanted to do more I'd probably look for a third party module (or think about switching languages).

ImaplibTwoAnnoyances written at 00:33:00


A few notes on using SSL in Python 3 client programs

I was recently writing a Python program to check whether a test account could log into our IMAP servers and to time how long it took (as part of our new Prometheus monitoring). I used Python because it's one of our standard languages and because it includes the imaplib module, which did all of the hard work for me. As is my usual habit, I read as little of the detailed module documentation as possible and used brute force, which means that my first code looked kind of like this:

try:
  m = imaplib.IMAP4_SSL(host=host)
  m.login(user, pw)
except ....:

When I tried out this code, I discovered that it was perfectly willing to connect to our IMAP servers using the wrong host name. At one level this is sort of okay (we're verifying that the IMAP TLS certificates are good through other checks), but at another it's wrong. So I went and read the module documentation with a bit more care, where it pointed me to the ssl module's "Security considerations" section, which told me that in modern Python, you want to supply a SSL context and you should normally get that context from ssl.create_default_context().

The default SSL context is good for a client connecting to a server. It does certificate verification, including hostname verification, and has officially reasonable defaults, some of which you can see in ctx.options of a created context, and also ctx.get_ciphers() (although the latter is rather verbose). Based on the module documentation, Python 3 is not entirely relying on the defaults of the underlying TLS library. However the underlying TLS library (and its version) affects what module features are available; you need OpenSSL 1.1.0g or later to get SSLContext.minimum_version, for example.
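
A short sketch for inspecting what the default context gives you. The exact option flags and cipher list depend on the Python version and the underlying TLS library, so only the verification settings are commented with their values:

```python
import ssl

ctx = ssl.create_default_context()

# Certificate and hostname verification are both on by default.
print(ctx.check_hostname)                      # True
print(ctx.verify_mode == ssl.CERT_REQUIRED)    # True

# These vary with the Python and TLS library versions in use.
print(ctx.options)
cipher_names = [c["name"] for c in ctx.get_ciphers()]
print(len(cipher_names), "ciphers enabled")
```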

It's good that people who care can carefully select ciphers, TLS versions, and so on, but it's better that this seems to have good defaults (especially if we want to move away from the server dictating cipher order). I considered explicitly disabling TLSv1 in my checker, but decided that I didn't care enough to tune the settings here (and especially to keep them tuned). Note that explicitly setting a minimum version is a dangerous operation over the long term, because it means that someday you're lowering the minimum version instead of raising it.

(Today, for example, you might set the minimum version to TLS v1.2 and increase your security over the defaults. Then in five years, the default version could change to TLS v1.3 and now your unchanged code is worse than the defaults. Fortunately the TLS version constants do compare properly so far, so you can write code that uses max() to do it more or less right.)
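
A sketch of the max() approach, assuming Python 3.7 or later (for ssl.TLSVersion) and a TLS library new enough to support SSLContext.minimum_version. The choice of TLS v1.2 as the floor is just the example from above:

```python
import ssl

ctx = ssl.create_default_context()

# Our own floor; the default wins if it is already stricter than this.
floor = ssl.TLSVersion.TLSv1_2

# Real protocol versions compare in order. The MINIMUM_SUPPORTED sentinel
# sorts below all of them, so max() does the right thing for it too; the
# MAXIMUM_SUPPORTED sentinel would not compare sensibly, but it is not a
# value you would expect to find as a default minimum.
ctx.minimum_version = max(ctx.minimum_version, floor)

print(ctx.minimum_version)
```

Written this way, the code never lowers the minimum version when a future Python raises its default.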

Python 2.7 also has SSL contexts and ssl.create_default_context(), starting in 2.7.9. However, use of SSL contexts is less widespread than it is in Python 3 (for instance the Python 2 imaplib doesn't seem to support them), so I think it's clear you want to use Python 3 here if you have a choice.

(It seems a little bit odd to still be thinking about Python 2 now that it's less than a year to it being officially unsupported by the Python developers, but it's not going away any time soon and there are probably people writing new code in it.)

Python3SSLInClients written at 01:53:36


I have somewhat mixed feelings about Python 3's socket module errors

Many years ago I wrote about some things that irritated me about Python 2's socket module. One of my complaints was that many instances of socket.error should actually be IOError or OSError instead of a separate type, because that's what they really were. Today I was reading AdamW’s Debugging Adventures: Python 3 Porting 201 (via), where I discovered in a passing mention that in Python 3, socket.error is a deprecated alias of OSError.

(Well, from Python 3.3 onwards, due to PEP 3151.)

On the one hand, this is a change that I cautiously approve of. Many socket errors are just operating system errors, especially on Unix. On the other hand, in some ways this makes socket.herror and socket.gaierror feel worse. Both of these violate the rule of leaving IOError and OSError alone, because they are subclasses of OSError that do not have authentic errno values and are not quite genuine OS errors in the same way (they are errors from the C library, but they don't come from errno). They do have errno and strerror fields, which is something, but then I think all subclasses of OSError do these days.
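
These relationships are easy to check directly. In the sketch below the gaierror instance is constructed by hand purely for illustration (normally socket.getaddrinfo() raises it), and the message text is made up:

```python
import socket

# In Python 3.3+ this is the same class object, not a subclass.
assert socket.error is OSError

# The resolver errors are OSError subclasses too.
assert issubclass(socket.gaierror, OSError)
assert issubclass(socket.herror, OSError)

# Hand-built instance for illustration; the message is invented.
err = socket.gaierror(socket.EAI_NONAME, "name lookup failed")

# It has .errno and .strerror like any OSError, but .errno holds an EAI_*
# resolver code rather than a genuine errno value.
print(err.errno == socket.EAI_NONAME)    # True
print(err.strerror)                      # name lookup failed
```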

Somewhat to my surprise, when I looked at the Python 2 socket module I discovered that socket.error is now a subclass of IOError (since Python 2.6, which in practice means 'on any system with Python 2 that you actually want to use'). Python 2 also has the same issue where socket.herror and socket.gaierror are subclasses of socket.error but are not real operating system errors.

Unfortunately for my feelings about leaving OSError alone, the current situation in the socket module is probably the best pragmatic tradeoff. Since the module has high level interfaces that can fail in multiple ways that result in different types of errors, in practice people want to be able to just catch one overall error and be done with it, which means that socket.gaierror really needs to be a subclass of socket.error. When you combine this with socket.error really being some form of OSError, you arrive at the current state of affairs.

I've decided that I don't have a strong opinion on socket.error changing from being a subclass of IOError/OSError to being an alias for it. I can imagine Python code that might want to use try at a high level, call both socket functions and other OS functions within that high level try, and distinguish between the two sources of errors, which is now impossible in Python 3, but I'm not sure that this is a desirable pattern. I don't think I have anything like this in my own Python code, but it's something that I should keep an eye out for as I convert things over to Python 3.
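
A small demonstration of why the two sources of errors can no longer be told apart by exception class: an `except socket.error` clause in Python 3 catches a failure that has nothing to do with sockets. The temporary directory is there only to guarantee that the file is missing:

```python
import os
import socket
import tempfile

with tempfile.TemporaryDirectory() as d:
    missing = os.path.join(d, "no-such-file")
    try:
        open(missing)
        caught = None
    except socket.error as e:     # the same thing as 'except OSError'
        caught = type(e).__name__

print(caught)    # FileNotFoundError
```

Code that wants the distinction has to use separate try blocks around the socket calls and the other OS calls.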

(I do have some Python 2 code that catches both socket.error and EnvironmentError, but fortunately it treats them the same.)

Python3SocketErrors written at 00:27:21


Thinking about DWiki's Python 3 Unicode issues

DWiki (the code behind this blog) is currently Python 2, and it has to move to Python 3 someday, even if I'm in no hurry to make that move. The end of 2018, with only a year of official Python 2 support remaining, seems like a good time to take stock of what I expect to be the biggest aspect of that move, which is character set and Unicode issues (this is also the big issue I ignored when I got DWiki tentatively running under Python 3 a few years ago).

The current Python 2 version of DWiki basically ignores encoding issues. It allows you to specify the character set the HTML will say, but it pretty much treats everything as bytes and makes no attempts to validate that your content is actually valid in the character set you've claimed. This is not viable in Python 3 for various reasons, including that it's not how the Python 3 version of WSGI works (as covered in PEP 3333). Considering Unicode issues for a Python 3 version of DWiki means thinking about everywhere that DWiki reads and writes data from, and deciding what encoding that data is in (and then properly inserting error checks to handle when that data is not actually properly encoded).

The primary source of text data for DWiki is the text of pages and comments. Here in 2018, the only sensible encoding for these is UTF-8, and I should probably just hardcode that assumption into reading them from the filesystem (and writing comments out to the filesystem). Relying on Python's system encoding setting, whatever it is, does not seem like a good idea, and I don't think this should be settable in DWiki's configuration file. UTF-8 also has the advantage for writing things out that it's a universal encoder; you can encode any Unicode str to UTF-8, which isn't true of all character encodings.

Another source of text data is the names of files and directories in the directory hierarchy that DWiki serves content from; these will generally appear in links and various other places. Again, I think the only sensible decision in 2018 is to declare that all filenames have to be UTF-8 and undefined things happen if they aren't. DWiki will do its best to do something sensible, but it can only do so much. Since these names propagate through to links and so on, I will have to make sure that UTF-8 in links is properly encoded.

(In general, I probably want to use the 'backslashreplace' error handling option when decoding to Unicode, because that's the option that both produces correct results and preserves as much information as possible. Since this introduces extra backslashes, I'll have to make sure they're all handled properly.)
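
As a sketch of what hardcoding this might look like when reading a page from disk, with `read_page()` being a hypothetical helper and not DWiki's actual code:

```python
import os
import tempfile


def read_page(path):
    """Hypothetical helper: read a page as UTF-8 no matter what the locale
    says, turning undecodable bytes into visible backslash escapes."""
    with open(path, encoding="utf-8", errors="backslashreplace") as f:
        return f.read()


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "page")
    with open(path, "wb") as f:
        # Valid UTF-8 ("naïve") followed by a stray Latin-1 byte.
        f.write(b"na\xc3\xafve caf\xe9\n")
    page_text = read_page(path)

print(ascii(page_text))    # 'na\xefve caf\\xe9\n'
```

Passing `encoding` explicitly is what removes the dependence on the system encoding setting; the result can always be encoded back to valid UTF-8.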

For HTML output, once again the only sensible encoding is UTF-8. I'll take out the current configuration file option and just hard-code it, so the internal Unicode HTML content that's produced by rendering DWikiText to HTML will be encoded to UTF-8 bytestrings. I'll have to make sure that I consistently calculate my ETag values from the same version of the content, probably the bytestring version (the current code calculates the ETag hash very late in the process).

DWiki interacts with the HTTP world through WSGI, although it's all my own WSGI implementation in a normal setup. PEP 3333 clarifies WSGI for Python 3, and it specifies two sides of things here; what types are used where, and some information on header encoding. For output, generally my header values will be in ISO-8859-1; however, for some redirections, the Location: header might include UTF-8 derived from filenames, and I'll need to encode it properly. Handling incoming HTTP headers and bodies is going to be more annoying and perhaps more challenging; people and programs may well send me incorrectly formed headers that aren't properly encoded, and for POST requests (for example, for comments) there may be various encodings in use and also the possibility that the data is not correctly encoded (eg it claims to be UTF-8 but doesn't decode properly). In theory I might be able to force people to use UTF-8 on comment submissions, and probably most browsers would accept that.
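
Two pieces of this can be sketched with the standard library. Both function names are hypothetical. The first relies on PEP 3333's rule that WSGI strings are 'native strings' whose code points stand for bytes as if decoded from ISO-8859-1; the second percent-encodes UTF-8 so that a Location: value is plain ASCII:

```python
from urllib.parse import quote


def wsgi_str_to_text(value):
    """Recover real text from a PEP 3333 native string.

    A conforming server only hands over code points U+0000 to U+00FF, so
    the encode step should not fail; the decode step might meet bad UTF-8,
    hence the error handler.
    """
    raw = value.encode("iso-8859-1")
    return raw.decode("utf-8", errors="backslashreplace")


def location_for(path):
    """Percent-encode a path for use in a Location: header.

    quote() encodes non-ASCII characters as UTF-8 by default and leaves
    '/' alone.
    """
    return quote(path)


print(location_for("/blog/caf\u00e9"))    # /blog/caf%C3%A9
```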

Since I don't actually know what happens in the wild here, probably a sensible first pass Python 3 implementation should log and reject with a HTTP error any comment submission that is not in UTF-8, or any HTTP request with headers that don't properly decode. If I see any significant quantity of them that appears legitimate, I can add code that tries to handle the situation.
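
The 'reject anything that does not decode' first pass might be sketched like this. `decode_comment_body()` is a hypothetical function, the choice of a 400 status is an assumption, and where the logging would go is only marked with a comment:

```python
def decode_comment_body(body):
    """Return (text, None) on success, or (None, status) for bad input."""
    try:
        # Strict decoding: any invalid byte sequence raises.
        return body.decode("utf-8"), None
    except UnicodeDecodeError as e:
        # A real version would log e.start, e.end and e.reason here so
        # that legitimate-looking failures can be investigated later.
        return None, "400 Bad Request"


good = decode_comment_body(b"a nice comment")
bad = decode_comment_body(b"caf\xe9")
```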

(Possibly I should start by adding code to the current Python 2 version of DWiki that looks for this situation and logs information about it. That would give me a year or two of data at a minimum. I should also add an accept-charset attribute to the current comment form.)

DWiki has on-disk caches of data created with Python's pickle module. I'll have to make sure that the code reads and writes these objects using bytestrings and in binary mode, without trying to encode or decode it (in my current code, I read and write the pickled data myself, not through the pickle module).
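
A minimal sketch of that pattern, with hypothetical helper names; the point is just that pickled data is bytes in Python 3 and the files have to be opened in binary mode:

```python
import os
import pickle
import tempfile


def write_cache(path, obj):
    data = pickle.dumps(obj)          # always a bytes object in Python 3
    with open(path, "wb") as f:       # binary mode: no encoding step
        f.write(data)


def read_cache(path):
    with open(path, "rb") as f:       # binary mode: no decoding step
        return pickle.loads(f.read())


with tempfile.TemporaryDirectory() as d:
    cache_file = os.path.join(d, "cache")
    write_cache(cache_file, {"title": "caf\u00e9", "hits": 3})
    restored = read_cache(cache_file)
```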

The current DWiki code does some escaping of bad characters in text, because at one point control characters kept creeping in and blowing up my Atom feeds. This escaping should stay in a Python 3 Unicode world, where it will become more correct and reliable (currently it really operates on bytes, which has various issues).
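
For illustration only, here is one way such escaping could work on Unicode strs. This is not DWiki's actual escaping; the set of characters (the C0 controls that XML 1.0 forbids) and the backslash output format are both assumptions:

```python
import re

# C0 control characters other than tab, newline and carriage return.
# XML 1.0 does not allow these in a document at all.
_BAD_CONTROLS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def escape_controls(text):
    """Replace forbidden control characters with visible \\xNN escapes."""
    return _BAD_CONTROLS.sub(lambda m: "\\x%02x" % ord(m.group()), text)


print(escape_controls("bell\x07 and escape\x1b"))   # bell\x07 and escape\x1b
```

Operating on str means a multi-byte UTF-8 character can never be mistaken for a control byte, which is one of the issues with doing this on raw bytes.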

Since in real life most things are properly encoded and even mostly ASCII, mistakes in all of this might lurk undetected for some time. To deal with this, I should set up two torture test environments for DWiki, one where there is UTF-8 everywhere I can think of (including in file and directory names) and one where there is incorrectly encoded UTF-8 everywhere I can think of (or things just not encoded as UTF-8, but instead Latin-1 or something). Running DWiki against both of these would smoke out many problems and areas I've missed. I should also put together some HTTP tests with badly encoded headers and comment POST bodies and so on, although I'm not sure what tools are available to create deliberately incorrect HTTP requests like that.

All of this is clearly going to be a long term project and I've probably missed some areas, but at least I'm starting to think about it a bit. Also, I now have some preliminary steps I can take while DWiki is still a Python 2 program (although whether I'll get around to them is another question, as it always is these days with work on DWiki's code).

PS: Rereading my old entry has also reminded me that there's DWiki's logging messages as well. I'll just declare those to be UTF-8 and be done with it, since I can turn any Unicode into UTF-8. The rest of the log file may or may not be UTF-8, but I really don't care. Fortunately DWiki doesn't use syslog (although I've already wrestled with that issue).

Sidebar: DWiki's rendering templates and static file serving

DWiki has an entire home-grown template system that's used as part of the processing model. These templates should be declared to be UTF-8 and loaded as such, with it being a fatal internal error if they fail to decode properly.

DWiki can also be configured to serve static files. In Python 3, these static files should be loaded uninterpreted as (binary mode) bytestrings and served back out that way, especially since they can be used for things like images (which are binary data to start with). Unfortunately this is going to require some code changes in DWiki's storage layer, because right now these static files are loaded from disk with the same code that is also used to load DWikiText pages, which have to be decoded to Unicode as they're loaded.

DWikiPython3UnicodeIssues written at 01:01:29


Python 3's approach to filenames and arguments is pragmatically right

A while back I read John Goerzen's The Python Unicode Mess, which decries the Python 3 mess of dealing with filenames and command line arguments on Unix that are not encoded in the program's assumed encoding. As Goerzen notes:

So if you want to actually handle Unix filenames properly in Python, you:

  • Must have a processing path that fully avoids Python strings.
  • Must use sys.{stdin,stdout}.buffer instead of just sys.stdin/stdout
  • Must supply filenames as bytes to various functions. See PEP 0471 for this comment: “Like the other functions in the os module, scandir() accepts either a bytes or str object for the path parameter, and returns the DirEntry.name and DirEntry.path attributes with the same type as path. However, it is strongly recommended to use the str type, as this ensures cross-platform support for Unicode filenames. (On Windows, bytes filenames have been deprecated since Python 3.3).” So if you want to be cross-platform, it’s even worse, because you can’t use str on Unix nor bytes on Windows.

Back in the days when it was new, Python 3 used to be very determined that Unix was Unicode/UTF-8. Years ago this was a big reason that I said you should avoid it from the perspective of a Unix sysadmin. These days things are better; we have things like os.environb and a relatively well defined way of handling sys.argv. This ultimately comes from PEP 383, which gave us the 'surrogateescape' error handler (see the codecs module).
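
A brief sketch of the tools involved, assuming a Unix system. The sample filename is made up, and how the decoded name looks depends on the filesystem encoding, so only the round trip is relied on here:

```python
import os

# A filename containing a byte that is not valid UTF-8 (made-up sample).
raw_name = b"caf\xe9.txt"

# os.fsdecode() applies the filesystem encoding with 'surrogateescape',
# so undecodable bytes survive as lone surrogates inside the str.
name = os.fsdecode(raw_name)
print(ascii(name))

# os.fsencode() reverses it and gives back the original bytes.
back = os.fsencode(name)

# os.environb (the bytes view of the environment) exists only where
# this is true, which in practice means Unix.
print(os.supports_bytes_environ)
```

sys.argv entries and str filenames from functions like os.listdir() are decoded the same way, which is why os.fsencode() can recover the raw bytes from them.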

All of this is irritating and unpleasant. Unfortunately, it's also the pragmatically right answer for reasons that PEP 383 alludes to, although it doesn't describe them the way that I would. PEP 383 says:

On the other hand, Microsoft Windows NT has corrected the original design limitation of Unix, and made it explicit in its system interfaces that these data (file names, environment variables, command line arguments) are indeed character data, by providing a Unicode-based API [...]

Let me translate this: filenames, command line arguments, and so on are no longer portable abstractions. They fundamentally mean different things on Unix and on Windows. On Windows, they are 'Unicode' (actually UTF-16) and may include characters not representable as single bytes, while on Unix they are and remain bytes and may include any byte value or sequence except 0. These are two incompatible types, especially once people start encoding non-ASCII filenames or command line arguments on Unix and want their programs to understand the decoded forms in Unicode.

(Or, if you prefer to flip this around, when people start using non-ASCII filenames and command line arguments and so on on Windows and want their programs to understand those as Unicode strings and characters.)

This is a hard problem and modern Python 3 has made the pragmatic choice that it's not going to pretend that things are portable when they aren't (early Python 3 tried to some extent and that blew up in its face). If you are working in the happy path on Unix where you're dealing with properly encoded data, you can ignore this by letting Python 3 automatically decode things to Unicode strs; otherwise, you must work with the raw (Unix) values, and Python 3 will provide them if you ask (and will surface at least some of them by default).
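
Asking for the raw values is mostly a matter of passing bytes in. A small sketch with os.listdir(), using a temporary directory and an ordinary filename purely for illustration:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "example.txt"), "w").close()

    # A str path gives back str names, decoded by Python.
    as_str = os.listdir(d)

    # A bytes path gives back the raw bytes names, untouched.
    as_bytes = os.listdir(os.fsencode(d))

print(as_str)      # ['example.txt']
print(as_bytes)    # [b'example.txt']
```

The return type follows the argument type, not the platform, so the same code yields the same types on Unix and on Windows.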

(There are other possible answers but I think that they're all worse than Python 3's current ones for reasons beyond the scope of this entry. For instance, I think that having os.listdir() return a different type on Windows than on Unix would be a bad mistake.)

I'll note that Python 2 is not magically better than Python 3 here. It's just that Python 2 chose to implicitly prioritize Unix over Windows by deciding that filenames, command line arguments, and so on were bytestrings instead of Unicode strings. I rather suspect that this caused Windows people using Python a certain amount of heartburn; we probably just didn't hear as much from them for various reasons.

(You can argue about whether or not Python 3 should have made Unicode the fundamental string type, but that decision was never a pragmatic one and it was made by Python developers very early on. Arguably it's the single decision that created 'Python 3' instead of an ongoing evolution of Python 2.)

PS: This probably counts as me partially or completely changing my mind about things I've said in the past. So be it; time changes us all, and I certainly have different and more positive views on Python 3 now.

Python3PragmaticFilenames written at 00:58:48


Resisting the temptation to rely on Ubuntu for Django 1.11

One of the things that is on my mind is what to do about our Django web application as far as Python 3 goes. Right now it's Python 2, and even apart from people trying to get rid of Python 2 in general, the Django people have been quite explicit that Django 1.11 is the last version that will support Python 2 and that support for it will end in 2020 (probably 'at the start of 2020' in practice). Converting it to Python 3 is getting more and more urgent, but at the same time this is going to be a bunch of grinding work (I still haven't added any tests to it, for example).

The host that our Django web app runs on was recently upgraded to Ubuntu 18.04 LTS, so the other day I idly checked the version of Django that 18.04 packages; this turns out to be Django 1.11 (for both Python 2 and Python 3; Django 2.0 for Python 3 might just have missed Ubuntu's cutoff point, since it was only released at the end of 2017). Ubuntu 18.04 LTS will be supported for five years and Ubuntu never does the sort of major version updates that going from 1.11 to 2.x would be, so for a brief moment I had the great temptation to switch over to the Ubuntu 18.04 packaged version of Django 1.11 and then forget about the problem until 2022 or so.

Then I came to my senses, because Ubuntu barely fixes bugs and security issues at the best of times. To my surprise, Ubuntu actually has Django in their 'main' repo, which is theoretically fully supported, but in practice I don't really believe that Canonical will really be spending very much effort to keep Django 1.11 secure after the upstream Django developers drop support for it. No later than 2020, the Ubuntu 18.04 LTS version of Django 1.11 is very likely to become, effectively, abandonware. Unless we feel very confident that Django 1.11 will be completely secure at that point in our configuration, we should not keep running it (especially since a small portion of the application is exposed to the Internet).

(I wouldn't be surprised if Canonical backported at least some easy security fixes from 2.x to 1.11 after 2020. But I would be surprised to see them do any significant programming work for code that's significantly different between 1.11 and the current 2.x or for 1.11-specific issues.)

However much I'd like to ignore the issue for as long as possible or let myself believe that it can be someone else's issue, dealing with this is in my relatively immediate future. We just have to move our Django web app to Python 3 and Django 2.x, even though it's going to be at least a bit of a grind. Probably I should try to do it bit by bit, for example by spending even just an hour or a half hour a week adding a test or two to the current code.

(Part of why I feel so un-motivated is that we're going to have to invest a bunch of effort to wind up exactly where we are currently. The app works perfectly well as it is and we don't want anything that's in newer Django versions; we're upgrading purely to stay within the version coverage of security fixes. This is, sadly, a bunch of make-work.)

DjangoUbuntuLTSBadIdea written at 22:46:42


What Python 3 versions I can use (November 2018 edition)

Back several years ago, I did a couple of surveys of what Python versions I could use for both Python 2 and Python 3, based on what was available on the platforms that we (and I) use. What Python 2 versions are available is almost irrelevant to me now; everything I still care about has a sufficiently recent version of 2.7, and anyway I'm moving to Python 3 for new code both personally and for work. So the much more interesting question is what versions of Python 3 are out there, or at least what major versions. Having gone through this exercise, my overall impression is that the Python 3 version landscape has stabilized for the uses that we currently make of Python 3.

At this point, a quick look at the release dates of various Python 3 versions is relevant. Python 3.4 was released March 16, 2014; 3.5 was released September 13, 2015; 3.6 was released December 23, 2016; 3.7 was only released this year, on June 27, 2018. At this point, anyone using 3.7 on Unix is either using a relatively leading edge Unix distribution or built it themselves (I think it just got into Fedora 29 as the default 'Python 3', for example). However, I suspect that 3.6 is the usual baseline people developing Python 3 packages assume and target, perhaps with some people still supporting 3.5.

At work, we mostly have a mixture of Ubuntu LTS versions. The oldest one is Ubuntu 14.04; it's almost gone but we still have two last 14.04 servers for a couple more months and I actually did write some new Python 3 code for them recently. The current 14.04 Python 3 is 3.4.3, which is close enough to modern Python 3 that I didn't run into any problems in my simple code, but I wouldn't want to write anything significant or tricky that had to run in Python 3 on those machines.

(When I started writing the code, I actually asked myself if I wanted to fall back to Python 2 because of how old these machines were. I decided to see if Python 3 would still work well enough, and it did.)

We have a bunch of Ubuntu 16.04 machines that will be staying like that until 2020 or so, when 16.04 starts falling out of support. Ubuntu 16.04 currently has 3.5.2, and the big feature it doesn't have that I'm likely to run into is probably literal string interpolation; I can avoid it in my own code, but not necessarily in any third party modules I want to use. Until recently, the 16.04 Python 3.5 was the Python 3 that I developed to and most actively used, so it's certainly a completely usable base for our Python 3 code.

Ubuntu 18.04 has Python 3.6.6, having been released a few months before 3.7. I honestly don't see very much in the 3.7 release notes that I expect to actively miss, although a good part of this is because we don't have any substantial Python programs (Python 3 or otherwise). If we used asyncio, for instance, I think we'd care a lot more about not having 3.7.

We have one CentOS 6 machine, but it's turning into a CentOS 7 machine some time in the next year and we're not likely to run much new Python code on it. However, just as back in 2014, CentOS 7 continues to have no version of Python 3 in the core package set. Fortunately we don't need to run any of our new Python 3 programs on our CentOS machines. EPEL has Python 3.4.9 and Python 3.6.6 if we turn out to need a version of Python 3 (CentOS maintains a wiki page on additional repositories).

My own workstation runs Fedora, which is generally current or almost current (depending on when Fedora releases happen and when Python releases happen). I'm currently still on Fedora 28 as I'm waiting for Fedora 29 to get some more bugs fixed. I have Python 3.6.6 by default and I could get Python 3.7 if I wanted it, and my default Python 3 will become 3.7 when I move to Fedora 29.

The machine currently hosting Wandering Thoughts is running FreeBSD 10.4 at the moment, which seems to have Python 3.6.2 available through the Ports system. However, moving DWiki (the Python software behind the blog) to Python 3 isn't something that I plan to do soon (although the time is closer than it was back in 2015). My most likely course of action with DWiki is to see what the landscape looks like for Python 2 starting in 2020, when it's formally no longer supported (and also what the landscape looks like for Python 3, for example if there are prospects of significant changes or if things appear to have quieted down).

(Perhaps I should start planning seriously for a Python 3 version of DWiki, though. 2020 is not that far away now and I don't necessarily move very fast with personal projects these days, although as usual I expect Python 2 to be viable and perfectly good for well beyond then. I probably won't want to write code in Python 2 any more by then, but then I'm not exactly modifying DWiki much right now.)

MyPython3Versions2018-11 written at 22:56:09


The obviousness of inheritance blinded me to the right solution

This is a Python programming war story.

I recently wrote a program to generate things to drive low disk space alerts for our ZFS filesystems in our in-progress Prometheus monitoring system. ZFS filesystems are grouped together into ZFS pools, and in our environment it makes sense to alert on low free space in either or both (ZFS filesystems can run out of space without their pool running out of space). Since we have a lot of filesystems and many fewer pools, it also makes sense to be able to set a default filesystem alert level on a per-pool basis (and then perhaps override it for specific filesystems). The actual data that drives Prometheus must be on a per-object basis, so one thing the program has to do is expand those default alert levels out to be specific alerts for every filesystem in the pool without a specific alert level.

When I began coding the Python to parse the configuration file and turn it into a data representation, I started by thinking about the data representation. It seemed intuitively clear and obvious that a ZFS pool and a ZFS filesystem are almost the same thing, except that a ZFS pool has a bit more information, and therefore they should be in a general inheritance relationship with a fundamental base class (written here using attrs):

import attr

@attr.s
class AlertObj:
  name = attr.ib()
  level = attr.ib()
  email = attr.ib()

@attr.s
class FSystem(AlertObj):
  pass

@attr.s
class Pool(AlertObj):
  fs_level = attr.ib()

I wrote the code and it worked, but the more code I wrote, the more awkward things felt. As I got further and further in, I wound up adding ispool() methods and calling them here and there, and there was a tangle of things operating on this and that. It all just felt messy. Something was wrong but I couldn't really see what at the time.

For unrelated reasons, we wound up wanting to significantly revise how we drove low disk space alerts and rather than modify my first program, I opted to start over from scratch. One reason for this was because with the benefit of a little bit of distance from my own code, I could see that inheritance was the wrong data model for my situation. The right natural data representation was to have two completely separate sets of objects, one set for directly set alert levels, which lists both pools and filesystems, and one for default alert levels (which only contains pools because they're the only thing that creates default alert levels). The objects all have the same attributes (they only need name, level, and email).

This made the processing logic much simpler. Parsing the configuration file returns both sets of objects, the direct set and the defaultable set. Then we go through the second set and for each pool entry in it, we look up all of the filesystems in that pool and add them to the first set if they aren't already there. There is no Python inheritance in sight and everything is obviously right and straightforward.
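
The expansion step described above can be sketched roughly as follows. This is not the actual program; the Alert class, the expand_defaults() function, the filesystems_in callable, and all of the sample names and levels are assumptions made up for illustration, and plain classes stand in for attrs so the sketch has no dependencies.

```python
class Alert:
    """One alert setting; pools and filesystems share the same shape."""
    def __init__(self, name, level, email):
        self.name = name
        self.level = level
        self.email = email


def expand_defaults(direct, defaults, filesystems_in):
    """Return the direct alerts plus alerts filled in from pool defaults.

    direct:         dict of object name -> Alert (pools and filesystems)
    defaults:       dict of pool name -> Alert holding that pool's default
    filesystems_in: hypothetical callable, pool name -> filesystem names
    """
    result = dict(direct)
    for pool, default in defaults.items():
        for fs in filesystems_in(pool):
            # A directly set alert always wins over the pool default.
            if fs not in result:
                result[fs] = Alert(fs, default.level, default.email)
    return result


# Illustrative data only; the levels and addresses are invented.
direct = {
    "tank": Alert("tank", 90, "ops@example.org"),
    "tank/db": Alert("tank/db", 95, "dba@example.org"),
}
defaults = {"tank": Alert("tank", 85, "ops@example.org")}
pool_fs = {"tank": ["tank/db", "tank/home"]}

alerts = expand_defaults(direct, defaults, lambda pool: pool_fs.get(pool, []))
```

Here tank/db keeps its directly set alert, while tank/home picks up the pool's default.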

In the new approach, it would also be relatively easy to add default alert levels that are driven by other sorts of things, for instance an idea of who owns a particular entity (pools are often owned collectively by groups, but individual filesystems may be 'owned' and used by specific people, some of whom may not care unless their filesystems are right out of space). The first version's inheritance-based approach would have just fallen over in the face of this; a default alert level based on ownership has no 'is-sort-of-a' relationship with ZFS filesystems or pools at all.

I've always known that inheritance wasn't always the right answer, partly because I have the jaundiced C programmer's view of object orientation; the entire purpose of OO is to make my code simpler, and if it doesn't do that I don't use it. In theory this should have made me skip inheritance here; in practice, inheritance was such an obvious and shiny hammer that once I saw some of it, I proceeded to hit all of my code with it no matter what.

(If nothing else, the whole experience serves me as a useful learning experience. Maybe the next time around I will more readily listen to the feeling that my code is awkward and maybe something is wrong.)

BlindedByInheritance written at 00:49:22


I should always give my Python classes a __str__ method

I have been going back and forth between Python and Go lately, and as part of that I have (re-)learned a sharp edged lesson about working in Python because of something that Go has built in that Python doesn't.

I do much of my debugging via print() statements or the equivalent. One of the conveniences of Go is that its formatted output package has built-in support for dumping structures. If you have a structure, and usually you do because they're often the Go equivalent of instances of classes, you can just tell fmt.Printf() to print the whole thing out for you with all the values and even the field names.

If you try this trick with a plain ordinary Python class that you've knocked together, what you get is of course:

>>> f = SomeClass("abc", 10)
>>> print(f)
<__main__.SomeClass object at 0x7f4b1f3c7fd0>

To do better, I need to implement a __str__ method. When I'm just putting together first round code to develop my approach to the problem and prove my ideas, it's often been very easy for me to skip this step; after all, I don't need that __str__ method to get my code working. Then I go to debug my code or, more often, explore how it's working in the Python interpreter and I discover that I really could use the ability to just see the insides of my objects without fishing around with dir() and direct field access and so on.

By the time I'm resorting to dir() and direct field access in the Python REPL, I'm not exactly doing print-based debugging any more. Running into this during exploration is especially annoying; I'll call a routine I've just written and I'm now testing, and I'll get back some almost opaque blobs. I could peer inside them, but it's all the more irritating because I know I've done this to myself.

As the result of writing some Python both today and yesterday, today's Python resolution is that I'll write basic __str__ methods for all of my little data-holding classes. It only takes a minute or two and it will make my life significantly better.

(If I'm smart I'll make that __str__ follow some useful standard form instead of being clever and making up a format that is specific to the type of thing that I'm putting in a class. There are some times when I want a class-specific __str__ format, but in most cases I think I can at least live with a basically standard format. Probably I should copy what attrs does.)
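
A minimal sketch of what such a standard-form __str__ could look like, imitating the attrs-style output shown in the sidebar below. The class and field names are just the example ones from this entry. The sort is my own choice so that the output is stable on older Python 3 versions, where instance dictionaries don't keep insertion order; it means fields appear alphabetically rather than in definition order.

```python
class SomeClass:
    def __init__(self, barn, fred):
        self.barn = barn
        self.fred = fred

    def __str__(self):
        # Report every instance attribute as name=repr(value),
        # sorted by name so the result doesn't depend on dict ordering.
        fields = ", ".join(
            "%s=%r" % (name, value)
            for name, value in sorted(vars(self).items())
        )
        return "%s(%s)" % (type(self).__name__, fields)


f = SomeClass("abc", 10)
shown = str(f)   # what print(f) would display
f.fred = 11      # fields stay mutable, unlike a namedtuple
shown_after = str(f)
```

Because it works from vars(self), the same method body can be pasted into any small data-holding class without changes.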

PS: collections.namedtuple() is generally not what I want for various reasons, including that I'm often going to mutate the fields of my instance objects after they've been created.
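
To illustrate the mutation issue (a small sketch; the Rec type and its fields are made up for the example), assigning to a namedtuple field raises AttributeError, and the only way to 'change' a field is to build a new object with _replace():

```python
from collections import namedtuple

# Hypothetical record type, reusing the field names from this entry.
Rec = namedtuple("Rec", ["barn", "fred"])
r = Rec("abc", 10)

try:
    r.fred = 11          # namedtuple fields are read-only
    mutated = True
except AttributeError:
    mutated = False

# _replace() returns a new tuple; the original is left untouched.
r2 = r._replace(fred=11)
```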

Sidebar: Solving this problem with attrs

If I was or am willing to use attrs (which I have pragmatic concerns with for some code), it will solve this problem for me with no fuss or muss:

>>> @attr.s
... class SomeClass:
...    barn = attr.ib()
...    fred = attr.ib()
>>> f = SomeClass("abc", 10)
>>> print(f)
SomeClass(barn='abc', fred=10)

I'm not quite sure that this will get me to use attrs all by itself, but I admit that it's certainly tempting. Attrs is even available as a standard package in Ubuntu 18.04 (with what is a relatively current version right now, 17.4.0 from the end of 2017).

I confess that I now really wish attrs was in the Python standard library so that I could use it without qualms as part of 'standard Python', just as I feel free to use things like urllib and json.

GivingClassesAStr written at 23:43:15
