Wandering Thoughts


In Python, zero is zero regardless of the number type

I recently saw a Fediverse post by Mike Samuel with a Python pop quiz that tripped me up:

@shriramk Since I know you appreciate Python pop quizzes:

my_heterogeneous_map = {
    (  0.0): "positive zero",
    ( -0.0): "negative zero",
    (    0): "integer zero",
}

print("my_heterogeneous_map=%r\n" % my_heterogeneous_map)

del my_heterogeneous_map[False]

print("my_heterogeneous_map=%r\n" % my_heterogeneous_map)

Before I actually tried it, I expected the dict to start out with either two or three entries and end up with one or two, given that boolean True and False are actually ints with False being the same as zero. In fact the dict starts out with one entry and ends up with none, because in Python all three of these zeros are equal to each other:

>>> 0.0 == -0.0 == 0
True

(This is sort of the inversion of how NaNs behave as keys in dictionaries.)

In fact this goes further. A complex number zero is equal to plain zero:

>>> complex(0,0) == 0.0
True
>>> complex(0,-0.0) == 0.0
True
>>> complex(-0.0,-0.0) == 0.0
True

(All three of those are different complex numbers, as you can see by printing them all, although they all compare equal to each other.)

However, this is simply one instance of the general way Python has chosen to treat complex numbers (as well as comparisons between integers and floats):

>>> complex(1,0) == 1
True
>>> complex(20,0) == 20
True

This particular behavior for complex numbers doesn't seem to be explicitly described in the specification. The Numeric Types — int, float, complex section says this about arithmetic operators and comparisons on mixed types:

Python fully supports mixed arithmetic: when a binary arithmetic operator has operands of different numeric types, the operand with the “narrower” type is widened to that of the other, where integer is narrower than floating point, which is narrower than complex. A comparison between numbers of different types behaves as though the exact values of those numbers were being compared.

I suppose that Python would say that the 'exact value' of a complex number with a 0 imaginary component is its real component. The equality comparison for complex numbers does at least make sense given that '20 + complex(0,0)' is '(20+0j)', or to put it another way, '20 - complex(20,0)' is (0j) and Python would probably like that to compare equal to the other versions of zero. If 'a - b == 0' but 'a != b', it would feel at least a little bit odd.
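
We can check those arithmetic claims directly in plain Python:

```python
# Verifying the arithmetic above: adding or subtracting a complex zero
# gives a complex result, but one that still compares equal to zero.
a = 20 + complex(0, 0)
b = 20 - complex(20, 0)
print(a)                      # (20+0j)
print(b)                      # 0j
print(b == 0 == 0.0 == -0.0)  # True
```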

(Of course you can get such a situation with floating point numbers, but floating point numbers do odd and counter-intuitive things that regularly trip people up.)

This explanation of comparison, including equality, makes sense for 0.0 being equal to 0 (and in fact for all floating point integral values, like 20.0, being equal to their integer version; the exact value of '20.0' is the same as the exact value of '20'). As for -0.0, it turns out that the IEEE 754 floating point standard says that it should compare equal to 0.0 (positive zero), which by extension means it has the same 'exact value' as 0.0 and thus is equal to 0.

(This comes from Wikipedia's page on Signed zero.)

PS: I think the only way to detect a negative zero in Python may be with math.copysign(); there doesn't appear to be an explicit function for it, the way we have math.isinf() and math.isnan().
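
For instance, a little helper along these lines (the function name is my invention; there's no standard library version of it):

```python
import math

def is_negative_zero(x):
    # Equality can't tell the two zeros apart, but the sign bit that
    # math.copysign() reports can.
    return x == 0.0 and math.copysign(1.0, x) < 0
```

Here is_negative_zero(-0.0) is true, while is_negative_zero(0.0) and is_negative_zero(-1.0) are both false.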

ZeroIsZeroAcrossNumberTypes written at 22:19:12


Debian has removed Python 2 from its next version

The news of the time interval is that Debian's development version has removed even the 'minimal' version of Python 2.7 (via). Among other things, this includes the 'python2-minimal' and 'python2.7-minimal' packages, both of which are gone from Debian's 'testing' pseudo-distribution as well as 'unstable'. In future Debian releases, people who want Python 2 will have to build it themselves in some way (for example, copying the binary package from the current 'bullseye' release, or copying the source package and rebuilding). We've been expecting this to happen for some time, but the exact timing was uncertain until now.

Since Ubuntu generally follows Debian for things like this, I expect that the next Ubuntu LTS release (which would normally be Ubuntu 24.04 in April of 2024) won't include Python 2 either. As I write this, the in-development Ubuntu 'lunar' still contains the python2-minimal package (this is 'Lunar Lobster', expected to be 23.04, cf). With four months to go before the expected release (and less time before a package freeze), I don't know if Canonical will follow Debian and remove the python2-minimal package. I wouldn't be surprised either way.

Both Canonical and Debian keep source packages around for quite a while, so people have plenty of time to grab the source .deb for python2-minimal. Pragmatically, we might as well wait to see if Canonical or Debian release additional patch updates, although that seems pretty unlikely at this point. We're very likely to keep a /usr/bin/python2 around for our users, although who knows.

Fedora currently has a python2.7 package, but I suspect that Debian's action has started the clock ticking on its remaining lifetime. However, I haven't yet spotted a Fedora Bugzilla tracking bug about this (there are a few open bugs against their Python 2.7 package). Since I still have old Python 2 programs on my Fedora desktops that I use and don't feel like rewriting, I will probably grab the Fedora source and binary RPMs at some point to avoid having to take more drastic actions.

(This means that my guess two years ago that Fedora would move before Debian turned out to be wrong.)

DebianNoMorePython2 written at 21:40:22


Sometimes an Ubuntu package of a Python module is probably good enough

Recently I ran across a Python program we're interested in, and discovered that it required prometheus-client. Normally this would mean creating a virtual environment and installing the module into it, possibly along with the program (although you don't have to put the program inside the venv it uses). But when I looked out of curiosity, I saw that Ubuntu packages this module, which got me to thinking.

I'm generally a sceptic of relying on the Ubuntu packaged version of a Python module (or any OS's packaged version); I wrote about this years ago in the context of Django. Linux distribution packaging of Python modules is famously out of date, and Ubuntu makes it worse by barely fixing bugs at the best of times. However, this feels like a somewhat different situation. The program isn't doing anything much with the prometheus-client module, and the module itself isn't very demanding; probably quite a lot of versions will do, and there's unlikely to be a bug that affects us. Indeed, some quick testing of the program with the Ubuntu version suggests that it works fine.

(Although now that I look, the Ubuntu version is rather out of date. Ubuntu 22.04 LTS packages 0.9.0, from 2020-11-26, and right now according to the module's releases page it's up to 0.15.0, with quite a few changes.)

Provided that Ubuntu's version of the module works, which it seems to, using the Ubuntu packaged version is the easy path. It's not an ideal situation, but for something with simple needs (and which isn't a high priority), it's rather tempting to say that it's okay. And if the Ubuntu version proves unsatisfactory, changing over to the latest version in a virtual environment is (at one level) only a matter of changing the path to Python 3 in the program's '#!' line.

(We have another program that requires a Python module, pyserial, that we get from Ubuntu, but I didn't think about it much at the time. This time around I first built a scratch venv for the program to test it, then discovered the Ubuntu package.)

UbuntuPackagesGoodEnough written at 21:20:16


Python version upgrades and deprecations

Recently I read Itamar Turner-Trauring's It’s time to stop using Python 3.7 (via). On the one hand, this is pragmatic advice, because as the article mentions Python 3.7 is reaching its end of life as of June 2023. On the other hand it gives me feelings, and one of the feelings is that the Python developers are not making upgrades any easier by slowly deprecating various standard library modules. Some of these modules are basically obsolete now, but some are not and have no straightforward replacement, such as the cgi module.

The Python developers can do whatever they want to do (that's the power of open source), and they clearly want to move Python forward (as they see it) by cleaning up the standard library. But this means that they are perfectly willing to break backward compatibility in Python 3, at least for the standard library.

One of the things that makes upgrading versions of anything easy is the new version being a drop-in replacement for the old one. Deprecating and eventually removing things in new versions means that new versions are not drop-in replacements, which makes upgrading harder. When upgrading is harder, (more) people put it off or don't do it. This happens regardless of what the software authors like or think, because people are people.

I doubt this is a direct factor in people still using Python 3.7. But I can't help but think that the Python developers' general attitude toward backward compatibility doesn't help.

(Python virtual environments make different versions of Python not exactly a drop-in replacement; in practice you're going to want to rebuild the venv. But my impression is that pretty much everyone who is seriously using Python with venvs has the tools and experience to do that relatively easily, because their venvs are automatically built from specifications. Someday I need to master doing that myself, because sooner or later we're going to need to use venvs and be able to migrate between Python versions as part of an OS upgrade.)

PythonUpgradesAndDeprectation written at 23:09:28


Python dictionaries and floating point NaNs as keys

Like Go, Python's floating point numbers support NaNs with the usual IEEE-754 semantics, including not comparing equal to each other. Since Python will conveniently produce them for us, we can easily demonstrate this:

>>> k = float('nan')
>>> k == k
False
>>> k is k
True

Yesterday, I discovered that Go couldn't delete 'NaN' keys from maps (the Go version of dicts). If you initially try this in Python, it may look like it works:

>>> d = {k: "Help"}
>>> d
{nan: 'Help'}
>>> d[k]
'Help'
>>> del d[k]

However, all is not what it seems:

>>> d = {k: "Help", float('nan'): "Me"}
>>> d
{nan: 'Help', nan: 'Me'}
>>> d[float('nan')]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: nan

What's going on here is that Python dict indexing has a fast path for object identity, which comes into play when you look up something using exactly the same object that you used to set an entry. When you set a dict entry, Python saves the object you used as the key. If you ask a dict to look up an entry using that exact object, Python doesn't even bother calling the object's equality operation (what would be used for an '==' check); it just returns the value. This means that floating point NaNs have no chance to object that they're never equal to each other, and lookup will succeed. However, if you use a different object that is also a NaN, the lookup will fail because two NaNs never compare equal to each other.

This use of object identity in dict lookups does mean that the Python equivalent of iterating a Go map will always work:

>>> for k in d.keys():
...   d[k]
...
'Help'
'Me'

When you ask a dictionary for its keys, you of course get the literal Python objects that are the keys, which can always be used to look up the corresponding entry in the dict even if they're NaNs or otherwise uncomparable or inequal under normal circumstances.
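
This also gives us a way to get rid of NaN keys in a dict that someone else built: iterate the dict to recover the exact key objects, then delete using those. A small sketch:

```python
import math

# Two distinct NaN objects create two separate dict entries.
d = {float('nan'): "Help", float('nan'): "Me", 1: "One"}

# Collect the actual key objects that are NaNs, then delete with them.
# Deletion works because each k is the very object stored in the dict.
nan_keys = [k for k in d if isinstance(k, float) and math.isnan(k)]
for k in nan_keys:
    del d[k]
```

Afterward d is left with just the 1: 'One' entry.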

One of the other things that this starts to show us is that Python is not making any attempt to intern NaNs, unlike things like True, False, and small integers. Let's show that more thoroughly:

>>> k2 = float('nan')
>>> k is k2
False
>>> import math
>>> k is math.nan
False

It might be hard to make all NaNs generated through floating point operations be the same interned object, but it would be relatively straightforward to make 'float("nan")' always produce the same Python object and for that Python object to also be math.nan. But Python doesn't do either of those; every NaN is a unique object. Personally I think that this is the right choice (whether or not it's deliberate); NaNs are supposed to all be different from each other anyway, so using separate objects is slightly better.

(I suspect that Python doesn't intern any floating point numbers, but I haven't checked the source code. On a quick check it doesn't intern 0.0 or +Inf; I didn't try any others. In general, I expect that interning floating point numbers makes much less sense and would result in much less object reuse than interning small integers and so on does.)

DictsAndNaNKeys written at 22:12:29


Importing a Python program that doesn't have a .py extension

In Python, there are a bunch of reasons for having your main program be importable. However, this normally requires that your program have a .py extension, and you don't always want that. You can always make a copy of the program (or use a symlink) to add the extension when you're working on it, but that can be annoying. Due to writing my entry on why programs should skip having extensions if possible, I wound up wondering if it was possible to do this in Python using things exposed in, for example, the standard library's importlib.

The answer turns out to be yes, but you have to go out of your way. The importlib documentation has an example on Importing a source file directly, but it only works for files with a .py extension. The reason for this is covered in the GitHub gist Importing Python source code from a script without the .py extension; importlib basically hardcodes that source files use a .py extension. However, as covered in the gist, you can work around this without too much work.

Suppose that you have a Python program called "machines", and you want to import the production version to poke around some bits of it. Then:

>>> from importlib.machinery import SourceFileLoader
>>> import importlib.util
>>> import sys
>>> loader = SourceFileLoader("machines", "/opt/local/bin/machines")
>>> spec = importlib.util.spec_from_loader("machines", loader)
>>> machines = importlib.util.module_from_spec(spec)
>>> sys.modules["machines"] = machines
>>> spec.loader.exec_module(machines)

Here I use 'machines' for what the example calls 'module' so that I can avoid an extra 'import machines' to get the program visible under its customary name in my interactive session.

While I'm happy to know this is possible and how to do it in case I ever really need it, this is tedious enough that I'll probably only ever use it infrequently. If I found myself needing to do it somewhat regularly, I'd probably create a personal module (call it, say, 'impfile') that wrapped this up in a function, so I could just do 'import impfile; impfile.get("/opt/local/bin/machines"); import machines'. This personal module could go in either ~/.local/ or in a custom virtual environment that I'd use when I wanted to do this.

(Neither would work if the program itself used a virtual environment, but that's unlikely for the kind of programs I'd want to use this on.)
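
A sketch of what such a personal 'impfile' module could look like (the module name and the get() function are just the names imagined above):

```python
# impfile.py: wrap up the SourceFileLoader dance in one function.
import os
import sys
import importlib.util
from importlib.machinery import SourceFileLoader

def get(path, name=None):
    """Import the (possibly extensionless) Python file at path,
    registering it in sys.modules under name."""
    if name is None:
        name = os.path.basename(path)
    loader = SourceFileLoader(name, path)
    spec = importlib.util.spec_from_loader(name, loader)
    mod = importlib.util.module_from_spec(spec)
    sys.modules[name] = mod   # lets a later 'import name' find it
    spec.loader.exec_module(mod)
    return mod
```

With this, 'impfile.get("/opt/local/bin/machines")' is enough to make a subsequent 'import machines' work.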

PS: I think I understand why importlib doesn't have a convenience function to do this or something like it, and I agree with not providing one. In the standard library, it would be too much of an attractive nuisance; it would practically invite people to use it.

(Then they would be sad when a bunch of tools for working with Python code choked on their special files, because the tools are always going to require .py files.)

ImportABareProgram written at 22:49:33


Python virtual environments can usually or often be moved around

Python virtual environments are magical in various ways. They get transparently added to sys.path and programs can be outside of them as long as they use the venv's Python (which is normally a symlink to some system version of Python), for two examples. All of this magic is triggered by the presence of a pyvenv.cfg file at the root of the venv (cf). The contents of this pyvenv.cfg are very minimal and in particular they don't name the location of the venv's root.

For example, here's a pyvenv.cfg:

home = /usr/bin
include-system-site-packages = false
version = 3.10.5

In fact this is the pyvenv.cfg of no less than six venvs from my Fedora 36 desktop (these are being managed through pipx, but pipx creates normal venvs). All of the pyvenv.cfg files have the same contents because they're all using the same Python version and general settings.

Since pyvenv.cfg and the rest of the virtual environment don't contain any absolute paths to themselves and so don't 'know' where they're supposed to be, it's possible to move venvs around on the filesystem. As a corollary of this it's possible to copy a venv to a different system (in a different filesystem location or the same), provided that the system has the same version of Python, which is often the case if you're using the same Linux distribution version on both. This doesn't seem to be explicitly documented in the venv module documentation and it's possible that some Python modules you may install do require absolute paths and aren't movable, but it seems to be generally true.

If you use pipx there's a caution here, because pipx writes a pipx_shared.pth file into the venv's site-packages directory that does contain the absolute path to its shared collection of Python stuff. I believe this is part of the underlying cause of pipx's problem with Python version upgrades, which is fixed by removing this shared area and having pipx rebuild it.

Another caution comes from systems like Django which may create standard programs or files as part of their project setup. If you create a Django venv and start a Django project from it, Django will create a 'manage.py' executable for the project that has the venv's (current) absolute path to its Python interpreter burned into its '#!' line. If you then move this venv, your manage.py will (still) try to use the Python from the old venv's location, which will either not work or get you the wrong site-packages path.

On the one hand, it's convenient that this works in general, and that there's nothing in the general design of virtual environments that blocks it. On the other hand, it's clear that you can have various corner cases (as shown with pipx and Django), so it's probably best to create your venvs in their final location if you can. If you do have to move venvs (for example they have to be built in one directory and deployed under another), you probably want to test the result and scan for things with the absolute path burned into them.
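
A crude way to do that scan might be something like this (the function name and the brute force 'read everything as bytes' approach are mine, not anything standard):

```python
import os

def files_mentioning(root_dir, old_path):
    """Return the files under root_dir whose contents mention old_path."""
    needle = old_path.encode()
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for fname in filenames:
            fn = os.path.join(dirpath, fname)
            try:
                with open(fn, "rb") as f:
                    if needle in f.read():
                        hits.append(fn)
            except OSError:
                pass
    return hits
```

Running this over a moved venv with its old location as old_path would turn up things like pipx's pipx_shared.pth.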

(I noticed this pyvenv.cfg behavior when I first looked at venvs and sys.path, but I didn't look very much into it at the time. As usual, writing an entry about this has left me better informed than before I started.)

VenvsCanUsuallyBeMoved written at 22:28:01


Python is my default choice for scripts that process text

Every so often I wind up writing something that needs to do something more complicated than can be readily handled in some Bourne shell, awk, or other basic Unix scripting tools. When this happens, the language I most often turn to is Python, and especially Python is my default choice when the work I'm doing involves processing text in some way (or often if I need to generate text). For example, if I want to analyze the output of some command and generate Prometheus metrics from it, Python is often my choice. These days, this is Python 3, even with its warts with handling non-Unicode input (which usually don't come up in this context).

(What a lot of these programs do could be summarized as string processing with logic.)

In theory there's no obvious reason that my language of choice couldn't be, say, Go. But in practice, Python has much less friction than something like Go while still having enough structure and capabilities to be better than a much more limited tool like awk. One part of this is Python's casualness about typing, especially typing in dicts. In Python, you can shove anything you want into a dict and it's completely routine to have dicts with heterogeneous values (usually your keys are homogeneous, eg all strings). This might be madness in a large program, but for small, quickly written things it's a great speedup.

(Some of the need for this can be lessened with dataclasses or attrs. And Python lets you scale up from basic dicts to those, or to basic classes used as little more than records, as you decide they make your code simpler.)
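
As a toy illustration of that casualness (all of the fields here are invented):

```python
# String keys, values of whatever type is handy; routine in small scripts.
host = {
    "name": "apps0",
    "load": 1.25,
    "users": ["fred", "barney"],
    "alive": True,
}

summary = "%s: load %.2f, %d users" % (host["name"], host["load"], len(host["users"]))
print(summary)   # apps0: load 1.25, 2 users
```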

Another area where Python reduces friction is in the lack of explicit error handling while still not hiding errors; exceptions ensure that while you may not deal with errors well, you will deal with them one way or another. Again this isn't necessarily what you want in a bigger, more structured program, but in the small it's quite handy to not have to ornament every 'int(...)' or whatever with some sort of error check.

In general, Python is (surprisingly) good at pulling strings apart, shuffling them around, and putting them back together, while still staying structured enough to let me follow what the code does even when I come back to it later. Compact, low ceremony inline string formatting is often quite useful (I use '%' because I'm old fashioned).

Python certainly isn't the only language that can be used in this way; Perl and Ruby are two other obvious examples, and more modern people would probably reach for Javascript. But Python is the one that I've wound up latching on to and sticking with.

I do find it a bit amusing and ironic that despite all of the issues in Python 3 with Unicode and IO (and my gripes surrounding that), it's what I normally use for processing text. In theory, I risk explosions; in practice, it works because I'm in a UTF-8 capable environment with well formed input (often just plain ASCII, which is the most common case for log files and command output).

PythonForStringHandling written at 22:09:30


Humanizing numbers in Python through a regexp substitution function

Recently I was looking at files that contained a bunch of sizes in bytes with very widely varying magnitudes, something like this:

file 10361909248
percpu 315360
inactive_file 8666644480
active_file 1695264768
slab_reclaimable 194324760
slab 194324760

(This is from Linux cgroup memory accounting.)

I find it hard to look at these numbers and have any feel for how big they are in absolute or relative terms, especially if I don't want to spend a lot of time thinking about it. It's much easier for me to read these numbers if they're humanized into things like '9.7G', '308.0K', and '185.3M'. To make these files more readable, I wrote a Python filter program to replace these raw byte counts with their humanized versions.

One reason I used Python for this filter is that it's my default choice for Unix text processing that requires more than sed or a light veneer of awk. Another reason is that I knew that Python's re module had a feature that made this filter very easy, which is that re.sub() can take a function as the replacement instead of a string.

Using a replacement function meant that I could write a simple function that took a match object that was guaranteed to be all decimal digits and turned it into a humanized number (in string form). Then the main loop was just:

import re
import sys

rxp = re.compile(r"\d+")

def process():
  for line in sys.stdin:
    l = line.strip()
    l = rxp.sub(humanize_number, l)
    print(l)

The regular expression substitution does all the work of splitting the line apart and reassembling it afterward. I only need to feed lines in and dump them out afterward.

(My regular expression here is a bit inefficient; I could make it skip all one, two, and three digit numbers, for example. That would also keep it from matching numbers in identifiers, eg if a file had a line like 'fred1 100000'. For my purposes I don't need to be more precise right now, but a production version might want to be more careful.)
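
A pickier version of the pattern might look like this (I haven't needed it in practice, so treat it as a sketch):

```python
import re

# Only match runs of four or more digits, and not digits embedded in
# identifiers ('fred1') or in decimal numbers ('3.14159').
rxp = re.compile(r"(?<![\w.])\d{4,}(?![\w.])")
```

With this, a line like 'fred1 100000' leaves 'fred1' alone while still matching the 100000.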

Python's regular expression function substitution is a handy and powerful way to do certain sorts of very generalized text substitution in a low hassle manner. The one caution to it is that you probably don't want to use it in a performance sensitive situation, because it does require a Python function call and various other things for each substitution. The last time I looked, pure text substitutions ran much faster if you could use them. Here, not only is the situation not performance sensitive but there's no way out of running the Python code one way or another, because we can't do the work with just text substitution (at least not if we want powers of two humanized numbers, as I do).

Sidebar: The humanization function

I started out writing the obvious brute force if based version and then realized that I could get much simpler code by being a bit more clever. The end result is:

KB = 1024
MB = KB * 1024
GB = MB * 1024
TB = GB * 1024

seq = ((TB, 'T'), (GB, 'G'), (MB, 'M'), (KB, 'K'))

def humanize_number(mtch):
  n = int(mtch.group())
  for sz, ch in seq:
    if n >= sz:
      return '%.1f%s' % (n / sz, ch)
  return str(n)

The seq tuple needs to be ordered from the largest unit to the smallest, because we take the first unit that the input is equal to or larger than.

RegexpFunctionSubstitutionWin written at 21:42:00


What is our Python 2 endgame going to be?

Every so often I think about the issue of what our eventual Python 2 endgame is going to be at work. We're going to reach some sort of endgame situation sooner or later; for example, Ubuntu has already removed support for /usr/bin/python being Python 2, although you can still do it by hand. Someday they (and other people) may mandate that /usr/bin/python is Python 3, or remove Python 2 packages entirely, or both. What are we going to do when things reach that state?

There are two sides of this: what we're going to do about our own scripts that are still using Python 2, and what will happen with our users and their scripts. For our own scripts, they could be rewritten to Python 3 or changed to use a different Python interpreter path in their #! line, including PyPy. Since we're in control of them and the timing of any use of an operating system without Python 2, we're at least not going to be blindsided. My tentative guess at our endgame for our own scripts is that we'd probably use PyPy, although we might opt to move them to Python 3 instead.

(There's very little chance that our remaining Python 2 scripts will all conveniently be obsolete by the time CPython 2 is disappearing from Ubuntu and other operating systems. Making them obsolete would probably take a completely rebuilt from scratch new infrastructure.)

For our users, there is both good news and bad news. The good news is that as a university department, we have a certain natural degree of turnover in user population; when someone graduates and leaves, they mostly stop caring about the Python 2 scripts they had here (or move on to a different postdoc position, or any number of other things). The bad news is that we seem to have a reasonably significant current use of '/usr/bin/python' and we haven't even looked for people who are running '/usr/bin/python2' or some other alias. Some of that usage is probably automated (in cron jobs and the like), and some of it is probably from people who will be around for years to come. In addition, not all usage of Python 2 will be in regularly run scripts (that we can catch through mechanisms like Linux's auditing framework); some of it is probably in scripts that are only run once in a while.

Unless we get lucky and things are deferred for a significant amount of time, changing /usr/bin/python (to remove it or to be Python 3) or removing Python 2 seems likely to catch a number of our users out. We probably can't find all of them in advance, or get all of them to change things even if we do find them and notify them. Some number of them will probably have long-standing scripts blow up. To reduce problems here we should probably start moving now to discourage use of Python 2 (and identify people using it).

If it's possible, the least disruptive endgame would be to continue having /usr/bin/python and CPython 2 (in the usual places), even if we provide it ourselves. However, keeping the '/usr/bin/python' name working may hamper efforts to herd people away from Python 2; at some point in the endgame, we may want to remove it or let it become Python 3. While we can use PyPy 2 for our own scripts, it's not a drop-in replacement for CPython and some programs definitely fail with PyPy when they'd work with CPython.

(Also, I'm not absolutely sure that PyPy will still have a Python 2 version in, say, ten years. Yes I am considering that far into the future.)

A more disruptive endgame would be Ubuntu insisting that /usr/bin/python be Python 3 and no longer supplying Python 2 at all. If we have relatively few people using an explicit '/usr/bin/python2', we might drop our official support for CPython 2 entirely. Hopefully Ubuntu would still supply a PyPy 2, so people would have some option other than migrating their scripts to Python 3.

A third endgame would be the 'excise the remnants' option. When Ubuntu drops Python 2 entirely, we would as well regardless of the remaining use; we wouldn't hand build CPython 2 ourselves or anything. We would handle our own scripts in some way, and other people would be left on their own, with at best us installing the Ubuntu version of PyPy 2 if one existed. This endgame is the most disruptive to people but in some way the most coherent and least work for us in the long run.

PS: Fedora forced /usr/bin/python to be Python 3 a while back, and honestly it's been a good thing overall for me. I had to change some scripts in a hurry, but after that it's nice that running 'python' gets me the version I want and so on. And it's a good way to push me to use Python 3 instead of Python 2.

ConsideringOurPython2Endgame written at 22:36:56
