Wandering Thoughts

2020-03-29

I set up Python program options and arguments in a separate function

Pretty much every programming language worth using has a standard library or package for parsing command line options and arguments, and Python is no exception; the standard for doing it is argparse. Argparse handles a lot of the hard work for you, but you still have to tell it what your command line options are, provide help text for things, and so on. In my own Python programs, I almost always do this setup in a separate function that returns a fully configured argparse.ArgumentParser instance.

My standard way of writing all of it looks like this:

def setup():
  p = argparse.ArgumentParser(usage="...",
                              ....)
  p.add_argument(...)
  p.add_argument(...)

  return p

def main():
  p = setup()
  opts = p.parse_args()
  ...

I don't like putting all of this directly in my main() because in most programs I write, this setup work is long and verbose enough to obscure the rest of what main() is doing. The actual top level processing and argument handling is the important thing in main(), not the setup of options, so I want all of the setup elsewhere where it's easy to skip over. In theory I could put it at the module level, not in a function, but I have a strong aversion to running code at import time. Among other issues, if I got something wrong I would much rather have the stack trace clearly say that it's happening in setup() than something more mysterious.

Putting it in a function that's run explicitly can have some advantages in specialized situations. For instance, it's much more natural to use complex logic (or run other functions) to determine the default arguments for some command line options. For people who want to write tests for this sort of thing, having all of the logic in a function also makes it possible to run the function repeatedly and inspect the resulting ArgumentParser object.
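As a minimal sketch of both points, here is one way such a setup() could look. The options and the default_jobs() helper are invented for illustration and are not taken from any real program.

```python
import argparse
import os


def default_jobs():
    # Hypothetical helper: the default is worked out when setup() runs,
    # not at import time.
    return os.cpu_count() or 1


def setup():
    # Build and return a fully configured parser; nothing is parsed here.
    p = argparse.ArgumentParser(description="Example program (illustrative).")
    p.add_argument("-j", "--jobs", type=int, default=default_jobs(),
                   help="number of parallel jobs")
    p.add_argument("-v", "--verbose", action="store_true",
                   help="report more about what is happening")
    p.add_argument("files", nargs="*", help="files to process")
    return p


def main(argv=None):
    p = setup()
    opts = p.parse_args(argv)
    return opts
```

A test can then call setup() as often as it likes and feed the parser an explicit argument list, for example setup().parse_args(["-v", "a.txt"]), without touching sys.argv.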

(I think it's widely accepted that you shouldn't run much or any code at import time by putting it in the top level. But setting up an ArgumentParser may look very much like setting up a simple Python data structure like a map or a list, even though it's not really.)

ArgparseSetupWhere written at 00:22:07

2020-03-03

One impact of the dropping of Python 2 from Linux distributions

Due to uncertainty over the future of the Python 2 interpreter in future Linux distributions, I've been looking at some of our Python 2 code, especially the larger programs. This caused me to express some views over on Twitter, which came out long enough that I'm recycling them here with additional commentary:

Everyone's insistence on getting rid of Python 2 is magically transforming all of this perfectly functional and useful Python 2 code we have from an asset to a liability. You can imagine how I feel about that.

Functioning code that you don't have to maintain and that just works is an asset; it sits there, doing a valuable job, and requires no work. Code that you have to do significant work on just so that it doesn't break (not to add any features) is a liability; you have to do work and inject risk and you get nothing for it.

Some code is straightforward to lift to Python 3 because it doesn't do anything complicated. Some code is not like that:

Today's 'what am I going to do about this' Python 2 code is my client implementation of the Sendmail milter protocol, which is all about manipulating strings as binary over a network connection. I guess I shotgun b"..." and then start guessing.

My milter implementation has been completely stable since written in Python 2 in 2011. Now I have to destabilize it because people are taking Python 2 away.

(I do not have tests. Tests would require another milter implementation that was known to be correct.)

(What I meant by the end of the first tweet is making various strings into bytestrings, especially protocol literals, and trying to push that through the protocol handling.)

As a side note, testing protocol implementations is hard when you don't have some sort of reference version that you can embed in your tests, even if you implement both the client and the server side. Talking to yourself doesn't ensure that you haven't made some mistake, either in the initial implementation or in a translation into the Python 3 world of bytestrings and Unicode strings and trying to handle network IO in that world and so on.

(For instance, since UTF-8 can encode every codepoint you can put into a Unicode string, including control characters and so on, you could write an encoder and decoder that actually operated on Unicode strings without you realizing, then have Python 3's magic string handling convert them to UTF-8 over the wire as you sent them back and forth between yourself during tests. Your implementation would talk to itself, but not to any outside version that did not UTF-8 encode what were supposed to be raw bytes. You could even pass tests against golden pre-encoded protocol messages if they were embedded in your Python test code and you forgot that you needed to turn them into bytestrings.)
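A small sketch of this failure mode, assuming the accidental conversion is a plain UTF-8 encode of a str that was supposed to be raw bytes. The variable names are illustrative only.

```python
# A raw protocol byte that is not valid ASCII.
raw = b"\x80"

# The mistake: holding the "byte" as a Unicode code point in a str.
as_text = "\x80"

# Encoding that str as UTF-8 produces two bytes, not the one raw byte.
wire = as_text.encode("utf-8")
print(wire)                              # b'\xc2\x80'
print(wire == raw)                       # False

# Talking to yourself still works, because decoding undoes the encoding.
print(wire.decode("utf-8") == as_text)   # True
```

The round trip succeeds even though what went over the wire is not what an outside implementation would expect to see.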

I also had an opinion on the idea that we've known this for a while and it's just a cost of using Python:

Python 2 is only legacy through fiat (multiple fiats, both the main CPython developers and then OS distributions). Otherwise it is perfectly functional and almost certainly completely secure, and would keep running fine for a great deal longer.

Just because software is not being updated doesn't mean that it stops working. If people would leave Python 2 alone (and keep it available in Linux distributions as a low-support or unsupported package, like so many others), it would likely keep going on fine for years, but because they won't, our Python 2 code is steadily being converted from an asset to a liability. Of course, part of the fun is that we don't even know for sure if people will be getting rid of the Python 2 interpreter itself, much less a timetable for it.

(Maybe the current statements from Debian and Ubuntu are supposed to answer that question, but if so they're not clear to me and they certainly don't give a timeline for when the Python 2 interpreter itself will be gone.)

PS: All of this is completely separate from the virtues of Python 3 for new code, where I default to it over some other options in our environment.

Python2DroppingImpact written at 21:36:15

2020-02-04

What 'is' translates to in CPython bytecode

The main implementation of Python, usually called CPython, translates Python source code into bytecode before interpreting it. How this translation happens can make some things fast, such as how local variables are implemented. When I wrote in yesterday's entry that having 'is' as a keyword can make it faster than if it was a built-in function because as a keyword it doesn't have to be looked up all the time just in case you changed it, I wondered how CPython actually translated 'a is b' to bytecode. The answer turns out to be somewhat more interesting than I expected.

(Bytecode can be most conveniently inspected with the dis module, and the module's documentation helpfully explains a fair bit about what the disassembled representation means.)

Let's define a little function:

def f(a):
   return a is 10

Now we can disassemble this with 'dis.dis(f.__code__)' and get:

2   0 LOAD_FAST      0 (a)
    2 LOAD_CONST     1 (10)
    4 COMPARE_OP     8 (is)
    6 RETURN_VALUE

CPython bytecodes can have an auxiliary value associated with them (shown here as the rightmost column, along with their meaning for the particular bytecode operation). Rather than have separate bytecodes for different comparison operators, all comparisons are implemented with a single bytecode, COMPARE_OP, that picks which comparison to do based on the auxiliary value. The 'is' comparison is just the same as any other; if we used 'return a > 10' in our function, the only difference in the bytecode would be the auxiliary value for COMPARE_OP (it would become 4 instead of 8).

The next obvious question to ask is how 'is not' is implemented, and the answer is that it's another comparison type. If we change our function to use 'is not', the only change is this:

    4 COMPARE_OP     9 (is not)

CPython has one last trick up its sleeve. If we write 'not a is 10', CPython specifically recognizes this and rather than translating it as a COMPARE_OP followed by a UNARY_NOT, translates it straight into the 'is not' comparison. This isn't a general transformation, for various reasons; 'return not a > 10' won't be similarly translated to the bytecode equivalent of 'return a <= 10'.

(CPython does go the extra distance to translate 'not a is not 10' into 'a is 10'. I'm a little bit surprised, since I wouldn't expect people to write that very often.)
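To look at these cases side by side, something like the following sketch can be used. The opnames() helper is a made-up name, and the expressions compare two names rather than a name and the literal 10, since Python 3.8 and later emit a SyntaxWarning for 'is' with a literal. What gets printed depends on the CPython version; in particular, CPython 3.9 and later use a dedicated IS_OP bytecode instead of COMPARE_OP for 'is' and 'is not', so no particular output is shown here.

```python
import dis


def opnames(expr):
    # Compile a single expression and list the names of its bytecode
    # operations, in order.
    code = compile(expr, "<example>", "eval")
    return [ins.opname for ins in dis.get_instructions(code)]


for expr in ("a is b", "a is not b", "not a is b", "not a > b"):
    print(expr, "->", opnames(expr))
```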

PS: One advantage of 'is' being a keyword is that it allows CPython to do this transformation, since CPython always knows what 'is' does here. It wouldn't be safe to transform a hypothetical 'not isidentity(a, 10)' in the same way, since what isidentity does could always be changed by rebinding the name.

IsCPythonBytecode written at 21:08:03

The place of the 'is' syntax in Python

Over on Twitter, I said:

A Python cold take (given how long it's taken me to arrive at it): 'is' should not be a keyword, it should be a built-in function that you're discouraged from using unless you really know what you're doing. As a keyword it's too tempting.

Python has two versions of equality: '==', which is plain equality, and 'is', which is object identity; 'a is b' is true if and only if a and b refer to the same object. Since the distinction between names and values is fundamental to Python, we definitely need a way of testing this (for example, to explore a puzzling mistake I once made). However, I'm not so sure it should be a language keyword.

The issue with 'is' as a language keyword is that it makes using object identity temptingly easy; after all, there's a keyword for it, part of the language syntax. It's as if you're supposed to use it. The first problem with this is simply that object identity is a relatively advanced Python concept, one that's a bit tricky to get your head around. Python code that genuinely needs to use 'is' instead of '==' is almost invariably doing something tricky, and we should generally avoid inviting people to routinely write code that at least looks like tricky code. The second problem is that in practice object identity can be tricky because Python implementations (especially CPython) can quietly make objects be the same thing (and thus 'a is b' will be true) when you didn't expect them to be. It's possible to write safe code that uses 'is', but you need to know a fair bit about what you're doing; perfectly sensible looking code can conceal subtle bugs.

(When Python will give you the same object for two apparently different things depends on the specific version of (C)Python and also sometimes the exact way that you created the objects. It can get quite weird and involved.)
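A short sketch of the difference between equality and identity. The list results are guaranteed by the language; the integer case at the end is exactly the kind of thing that varies, so no result is claimed for it.

```python
a = [1, 2, 3]
b = [1, 2, 3]   # a separate list with equal contents
c = a           # a second name for the same list

print(a == b)   # True: the contents are equal
print(a is b)   # False: two distinct objects
print(a is c)   # True: one object with two names

# Whether these two integers are the same object is an implementation
# detail; no particular answer should be relied on.
x = int("1000")
y = int("1000")
print(x == y)   # True
print(x is y)   # depends on the Python implementation and version
```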

There are at least two reasons I can think of to still have 'is' as a keyword. The first is that as a keyword, what it does is guaranteed by the language and is not subject to being modified by people who play games with namespaces in the way that, say, isinstance() can be changed. Changing what isinstance() does by defining your own version is probably a terrible idea, but you can do it if you feel the urge. Meanwhile, 'is' is beyond the reach of anything but bytecode rewriting. The second is that because 'is' is part of the language and isn't subject to being changed, it can be implemented in a way that makes it faster than a built-in function. Built-in functions need to go through a global name lookup when they're used, just in case, while 'is' can be just done directly since it's part of the language.

(Local variables are fast because they avoid this lookup.)
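To illustrate the namespace games mentioned above, here is a sketch that shadows isinstance() for one function. It runs the function definition in its own dictionary, standing in for a module's globals, so that the real built-in is left alone everywhere else; the names source, ns and check are illustrative.

```python
source = """
def check(x):
    return isinstance(x, int)
"""

# Run the definition in its own namespace, standing in for a module.
ns = {}
exec(source, ns)
check = ns["check"]

before = check("not an int")               # finds the real built-in
ns["isinstance"] = lambda obj, cls: True   # shadow the built-in name
after = check("not an int")                # now finds the replacement first

print(before, after)   # False True
```

The function's code never changed; only the result of looking up the name 'isinstance' did. There is no equivalent name to rebind for 'is'.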

PS: Of course by now all of this is entirely theoretical. It's entirely too late for Python to drop 'is' as a keyword, and even thinking about it is a bit silly. But I apparently twitch a bit when I see 'is' casually used in code examples, and that's sort of what inspired the tweet that led to this entry.

IsSyntaxPlace written at 00:28:49

2020-01-30

Some notes on Python's email.header.decode_header()

I've recently been investigating some oddly encoded MIME Content-Disposition headers that were turned up by our Python-based system for recording email attachment type information. As part of this I wanted to decode those RFC 2047 encoded-words, obviously using Python because that's what we were using to start with.

Under normal circumstances, you're apparently supposed to read in a whole email message into an email.message.EmailMessage and then dig through it. I did not have a whole email message; I didn't even have a whole isolated MIME header. I just had a chunk of RFC 2047 encoded data to decode. The first thing to know is that if you care about good handling of RFC 2047 encoded things, you should be using Python 3. I had an existing old Python 2 program using email.header.decode_header(), and it turned out to mis-decode a header value that Python 3 handled fine using the same function.

Now that I've actually read all of the documentation for email.header, how you should use it to generate a decoded form is probably to take advantage of all of its convenience functions, by explicitly decoding the header, then making an email.header.Header instance, then getting the string form of it:

dcd = email.header.decode_header(headerstr)
hdr = email.header.make_header(dcd)
return str(hdr)

(This omits error checking. As is documented in the docstring for decode_header but not in the module's documentation, it can raise at least email.errors.HeaderParseError in some situations, such as a base64 decoding problem.)
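A sketch of the same steps with some error checking added. The function name is hypothetical, and returning None on failure is just one possible policy. Catching LookupError and UnicodeError alongside HeaderParseError is an assumption about what the decoding steps can raise (an unknown character set name, or bytes that are invalid in the claimed character set), not something the module's documentation promises.

```python
import email.errors
import email.header


def decode_header_value(headerstr):
    # Hypothetical helper: returns the decoded string, or None if the
    # header could not be decoded.
    try:
        dcd = email.header.decode_header(headerstr)
        hdr = email.header.make_header(dcd)
        return str(hdr)
    except (email.errors.HeaderParseError, LookupError, UnicodeError):
        # HeaderParseError: a malformed encoded-word (per the docstring).
        # LookupError / UnicodeError: assumed failure modes, see above.
        return None


print(decode_header_value("=?utf-8?b?aGVsbG8=?="))   # hello
```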

This makes the module do all the hard work of decoding the somewhat arcane results of calling decode_header. But let's assume that you first wrote your program to directly interpret and use those results, and you'd like to know what you get (in Python 3, which is different from Python 2). What you get back from decode_header is a list of tuples:

[(data1, charset1), (data2, charset2), ...]

Often the list will have only one tuple for various reasons beyond the scope of this entry, but it's always possible to get multiple ones (and in different character sets). There are three main cases of what the tuples can be:

  • the data is a bytestring and the character set is a non-blank normal character set (as a Python 3 string). To produce Unicode, you need to do 'data1.decode(charset1)'. The error handling policy you want to use on decoding is up to you.

  • the data is a Python string and the character set is None. This is what you get back if the entire header is not encoded at all, and probably in some other cases. You can use the data as is, since it's already a string.

  • the data is a bytestring and the character set is None. This is what you get back for a non-encoded portion of a header with some encoded portion (and possibly in other circumstances). In theory this is pure ASCII, but don't hold your breath; you probably want to decode this to a string as UTF-8, perhaps with some liberal error handling policy.

If the RFC 2047 encoding is sufficiently mangled in the right way, you may get back a tuple with a character set of '' (a blank string) instead of the exception that you may have been expecting. On the one hand this will make .decode fail; on the other hand, it fails with an 'unknown encoding' error and you can get that if people just claim their header is in some weird character encoding Python has never heard of before, so you already need to handle it.
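For anyone who does interpret the tuples by hand, the three cases and the blank character set could be handled roughly like this. This is a sketch with a made-up function name; the 'replace' error policy, the UTF-8 fallbacks, and simply concatenating the parts (ignoring any whitespace subtleties between them) are all simplifying assumptions.

```python
import email.header


def tuples_to_str(headerstr):
    # Hypothetical helper: interpret decode_header() results directly.
    parts = []
    for data, charset in email.header.decode_header(headerstr):
        if isinstance(data, str):
            # Already a string: the header was not encoded at all.
            parts.append(data)
        elif charset:
            # Bytes with a named character set.
            try:
                parts.append(data.decode(charset, errors="replace"))
            except LookupError:
                # Unknown character set name; fall back to UTF-8
                # (an assumption, pick your own policy).
                parts.append(data.decode("utf-8", errors="replace"))
        else:
            # Bytes with a character set of None, or a blank '' one.
            parts.append(data.decode("utf-8", errors="replace"))
    return "".join(parts)
```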

All of this is a mess. I suggest that you just call make_header, because then you get to file bugs with the Python people if it doesn't work (and doesn't raise a clear error exception), as opposed to patching your own code for yet more special cases.

In general, unfortunately, the email.header module is probably not designed to deal well with arbitrary input from the general Internet; I suspect that it tacitly assumes that it's mostly dealing with well-formed email. There are a lot of mail-generating programs out there with bugs and generous interpretations of what they can get away with, especially if you have to deal with spam and malware (which are often generated by programs with more than the usual number of bugs).

DecodeEmailHeaderNotes written at 23:35:10

2020-01-19

Python 2, Apache's mod_wsgi, and its future in Linux distributions

Sometimes I have small questions about our future with Python 2, instead of big ones. Our Django web application currently runs in Apache using mod_wsgi, and the last time we attempted a Django upgrade (which is a necessary step in an upgrade to Python 3), it didn't go well. This means that we may wind up caring quite a bit about how long Ubuntu and other Linux distributions will package a version of mod_wsgi that still supports Python 2, instead of just Python 3 (assuming that the Linux distribution even provides Python 2 at all).

Fedora 31 currently still provides a Python 2 sub-package of mod_wsgi, but this should be gone in Fedora 32 since it naturally depends on Python 2 and all such (sub-)packages are supposed to be purged. Debian's 'unstable' also currently seems to have the Python 2 version of mod_wsgi, but it's included in Debian's list of Python 2 related packages to be removed (via), so I suspect it will be gone from the next stable Debian release.

(Debian is also getting rid of Python 2 support for uwsgi, which could be another way of running our WSGI application under Apache.)

What Ubuntu 20.04 will look like is an interesting question. Right now, the in-progress state of Ubuntu 'focal' (what will be 20.04) includes a libapache2-mod-wsgi package using Python 2. However, this package is listed in Ubuntu's list of Python 2 related packages to remove (via). Ubuntu could still remove the package (along with others), or it could now be too close to the release of 20.04 for the removal to be carried through by then.

(I believe that Ubuntu usually freezes their package set a decent amount of time before the actual release in order to allow for testing, and perhaps especially for LTS releases. I may be wrong about this, because the Ubuntu Focal Fossa release schedule lists the Debian import freeze as quite late, at the end of February.)

Even if the Python 2 version of mod_wsgi manages to stay in Ubuntu 20.04 LTS (perhaps along with other Python 2 WSGI gateways), it will definitely be gone by the time of Ubuntu 22.04, which is when we'd normally upgrade the server that currently hosts our Django web app. So by 2022, we need to have some solution for our Python 2 problem with the app, whatever it is.

Python2ApacheWsgiFuture written at 22:17:05

2020-01-17

The question of how long Python 2 will be available in Linux distributions

In theory Python 2 is now dead (sort of). In practice we have a significant number of Python 2 scripts and programs (including a Django web app), probably like many other places. Converting these to Python 3 is a make-work project we want to avoid, especially since it risks breaking important things that are working fine. One big and obvious issue for keeping on using Python 2 is that we use Linux (primarily Ubuntu) and normally use the system version of Python 2 in /usr/bin (although we're starting to shift how we invoke it). This obviously only works as long as there is a packaged version installed in /usr/bin, which raises the question of how long this will be available in Linux distributions.

Linux distributions like Debian, Ubuntu, and Fedora want to move away from officially supporting Python 2 because there is no upstream support (RHEL 8 will be supporting their version through 2024 or so). Of these, Debian and by extension Ubuntu have an idea of less-core packages that are built and maintained by interested parties, and I suspect that there will be people interested in packaging Python 2 even past when it stops being a fully supported package. Failing that being accepted, Ubuntu has the idea of PPAs, which people can use to distribute easily usable packages for Ubuntu (we get Certbot through a PPA, for example). Fedora doesn't quite have the packaging split that Debian does, but it has a PPA-like thing in COPR (also). I suspect that there is sufficient interest in Python 2 that people will provide PPAs and COPR repos for it.

(At the extreme end of things, we can and will build our own package of Python 2 if necessary by just re-building the last available version from a previous distribution release. We wouldn't get the ecology of additional Python 2 .debs or RPMs, but we don't need those.)

As far as I can tell, the current state of Python 2 in Fedora 32 is that Python 2.7 has become a legacy package as part of Fedora's plans to retire Python 2. The current Fedora plans have no mention of removing the 2.7 legacy Python package, but given how Fedora moves I wouldn't be surprised to see calls for that to happen in a few years (which would be inconvenient for me; a few years is quite fast here). Alternately, it might linger quietly for half a decade or more if it turns out to require no real work on anyone's part.

I expect Debian and Ubuntu to move more slowly than this but in the same direction. Ubuntu 20.04 may not be able to drop all packages that depend on Python 2.7, but by Ubuntu 22.04 I expect that work to be done and so Python 2 itself could be and probably will be demoted to a similar legacy status. Since 2022 is only two years away and Debian is not the fastest moving organization when it comes to controversial things like removing Python 2 entirely, I suspect that discussion of removing Python 2 itself will start no earlier than for Ubuntu 24.04. However I can't find a Debian or Ubuntu web page that talks about their future plans for Python 2 itself in any detail, so we may get surprised.

PS: In our environment, the issue with Python 2 going away (or /usr/bin/python changing which Python version it points to) isn't just our own system maintenance programs, but whatever Python programs our users may have written and be running. We likely have no real way to chase those down and notify the affected users, so any such shift would be very disruptive, especially because we run multiple versions of Ubuntu at once. With different Ubuntu versions on different machines, what /usr/bin/python gets you could vary from one machine to another. At that point we might be better off removing the name entirely; at least things with '#!/usr/bin/python' would fail immediately and clearly.

Python2InLinuxHowLong written at 01:29:12

2020-01-12

Sorting out the dates of Python 2's 'end of life'

Probably like many people, I've been hearing for years now that January of 2020 was the end of life for Python 2, specifically January 1st. As a result, I was rather surprised to hear that there will be another release of Python 2 in April, although I could have read the actual details in PEP 373 and avoided this.

The official dates, from PEP 373, are:

Planned future release dates:

  • 2.7.18 code freeze January, 2020
  • 2.7.18 release candidate early April, 2020
  • 2.7.18 mid-April, 2020

What this actually means is not clear to me, given the four month delay between the code freeze (now) and the planned release of even a 2.7.18 release candidate (April). At a minimum, I assume that the code freeze blocks new features, should anyone want to submit any. I suspect that the Python people would not accept fixes for new bugs or for existing bugs that did not have some version of a fix accepted before the code freeze. I assume that Python developers will still accept fixes for accepted bugfixes, if testing shows that any have problems.

(If Python isn't going to accept changes into what will be released as 2.7.18 for any reason at all, they might as well release tomorrow instead of in four months.)

Although the details are set out in PEP 373, this way of describing Python 2's end of life is a little bit unusual and different from what I (and likely other people) expected from an 'End of Life' date. The usual practice with EOL dates is that absolutely nothing will be released beyond that point, not that main development stops and then a final release will be made some time later.

(This is what Linux distributions do, for example; the EOL date for a distribution release is when the last package updates will come out. I believe it's similar for the BSD Unixes.)

It's very unclear to me how Linux distributions (and the BSDs) are likely to handle Python 2 versions in light of this. At least some of them will still be packaging Python 2 in versions released beyond April of 2020. They might freeze their Python 2 version on the current 2.7.17 (or whatever they already have), or upgrade to 2.7.18 as one last Python 2 (re-)packaging.

Python2EOLDates written at 22:18:50

2019-12-22

Filenames and paths should be a unique type and not a form of strings

I recently read John Goerzen's The Fundamental Problem in Python 3, which talks about Python 3's issues in environments where filenames (and other things) are not in a uniform and predictable encoding. As part of this, he says:

[...]. Critically, most of the Python standard library treats a filename as a String – that is, a sequence of valid Unicode code points, which is a subset of the valid POSIX filenames.

[...]

From a POSIX standpoint, the correct action would have been to use the bytes type for filenames; this would mandate proper encode/decode calls by the user, but it would have been quite clear. [...]

This is correct only from a POSIX standpoint, and then only sort of (it's correct in traditional Unix filesystems but not necessarily all current ones; some current Unix filesystems can restrict filenames to properly encoded UTF-8). The reality of modern life for a language that wants to work on Windows as well as Unix is that filenames must be presented as a unique type, not any form of strings or bytes.

How filenames and paths are represented depends on the operating system, which means that for portability filenames and paths need to be an opaque type that you have to explicitly insert string-like information into and extract string-like information out of, specifying the encoding if you don't want an opaque byte sequence of unpredictable contents. As with all encoding related operations, this can fail in both directions under some circumstances.
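As a rough sketch of what such an opaque type might look like, here is a toy class. OSPath is entirely hypothetical and is not a real Python API; storing raw bytes internally is a Unix-flavoured assumption made only to keep the example short.

```python
class OSPath:
    """Hypothetical opaque filename type; not a real Python API."""

    def __init__(self, raw):
        # The OS-level representation, kept private to the type.
        self._raw = bytes(raw)

    @classmethod
    def from_text(cls, text, encoding):
        # Inserting string-like information requires naming an encoding;
        # this may raise UnicodeEncodeError.
        return cls(text.encode(encoding))

    def to_text(self, encoding):
        # Extracting string-like information can fail too, with
        # UnicodeDecodeError.
        return self._raw.decode(encoding)

    def to_bytes(self):
        # The opaque byte sequence, contents unpredictable.
        return self._raw


p = OSPath.from_text("r\u00e9sum\u00e9.txt", "utf-8")
print(p.to_text("utf-8"))

odd = OSPath(b"caf\xe9.txt")   # Latin-1 bytes, not valid UTF-8
print(odd.to_text("latin-1"))
try:
    odd.to_text("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8")
```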

Of course this is not the Python 3 way. The Python 3 way is to pretend that everything is fine and that the world is all UTF-8 and Unicode. This is pretty much the pragmatically correct choice, at least if you want to have Windows as a first class citizen of your world, but it is not really the correct way. As with all aspects of its handling of strings and Unicode, Python 3 chose convenience over reality and correctness, and has been patching up the resulting mess on Unix since its initial release.

If Python was going to do this correctly, Python 3 would have been the time to do it; since it was breaking things in general, it could have introduced a distinct type and required that everything involving file names change to taking and returning that type. But that would have made porting Python 2 code harder and would have made it less likely that Python 3 was accepted by Python programmers, which is probably one reason it wasn't done.

(I don't think it was the only one; early Python 3 shows distinct signs that the Python developers had more or less decided to only support Unix systems where everything was proper UTF-8. This turned out to not be a viable position for them to maintain, so modern Python 3 is somewhat more accommodating of messy reality.)

FilenamesUniqueType written at 01:46:30

2019-12-14

It's unfortunately time to move away from using '/usr/bin/python'

For a long time, the way to make Python programs runnable on Unix has been to start them with '#!/usr/bin/python' or sometimes '#!/usr/bin/env python' (and then chmod them executable, of course; this makes them scripts). Unfortunately this is no longer a good idea for general Python programs, for the simple reason that current Unixes now disagree on what version of Python is '/usr/bin/python'. Instead, we all need to start explicitly specifying what version of Python we want by using '/usr/bin/python3' or '/usr/bin/python2' (or by having env explicitly run python3 or python2).
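As an illustration, a script that names its Python version explicitly could start like this. The version check is an optional extra and an assumption of this sketch, not something the '#!' line requires.

```python
#!/usr/bin/python3
# The '#!' line above asks for Python 3 explicitly instead of relying on
# whatever '/usr/bin/python' happens to be on this system.
import sys


def require_python3():
    # Optional extra check, in case the script is run some other way
    # (for example as 'python script.py').
    if sys.version_info[0] < 3:
        sys.exit("this program needs Python 3")


require_python3()
print("running under Python %d.%d" % sys.version_info[:2])
```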

For a long time, even after Python 3 came out, it seemed like /usr/bin/python would stay being Python 2 in many environments (ones where you had Python 2 and Python 3 installed side by side). I expected a deprecation of /usr/bin/python as Python 2 to take years after Python 2 itself was no longer supported, for the simple reason that there are a lot of programs and instructions out there that expect their '#!/usr/bin/python' or 'python' to run Python 2. Changing what that meant seemed reasonably disruptive, even if it was the theoretically correct and pure way.

In reality, as I recently found out, Fedora 31 switched what /usr/bin/python means, and apparently Arch Linux did it several years ago. In theory PEP 394 describes the behavior here and this behavior is PEP-acceptable. In practice, before early July of 2019, PEP 394 said that 'python' should be Python 2 unless the user had explicitly changed it or a virtual environment was active. Then, well, there was a revision that basically threw up its hands and said that people could do whatever they wanted to with /usr/bin/python (via).

(This makes PEP 394 a documentation standard. As with all documentation standards, it needs to describe reality to be useful, and the reality is that /usr/bin/python is now completely unpredictable.)

Since Fedora and Arch Linux have led the way here, other Linux distributions will probably follow. In particular, since Red Hat Enterprise is more or less based on Fedora, I wouldn't be surprised to see RHEL 9 have /usr/bin/python be Python 3. I don't think Debian and thus Ubuntu will be quite this aggressive just yet, but I wouldn't be surprised if in a couple of years /usr/bin/python at least defaults to Python 3 on Ubuntu. (Hopefully Python 2 will still be available as a package.)

UsrBinPythonNoMore written at 00:55:20
