Wandering Thoughts


Os.walk, the temptation of hammers, and the paralysis of choice

I have a shell script to give me a hierarchical, du-like report of memory usage broken down by Linux cgroup. Even back when I wrote it, it really needed to be something other than a shell script, and a recent addition made it quite clear that the time had come (the shell script version is now both slow and inflexible). So as is my habit, I opened up a 'memdu.py' in my editor and started typing. Some initial functions were easy, until I got to here:

def walkcgroup(top):
  for dirpath, dirnames, filenames in os.walk(top, topdown=True):
    if memstatfile not in filenames:
      dirnames[:] = []

Then I stopped typing because I realized I had a pile of choices to make about exactly how this program was going to be structured, and maybe I didn't want to use os.walk(), as shown by the very first thing I wrote inside the for loop.

The reason I started writing code with os.walk() is because it's the obvious hammer to use when you want to walk all over a directory tree, such as /sys/fs/cgroup. But on the other hand, I realized that I'm not just visiting each directory; I'm in theory constructing a hierarchical tree of memory usage information. What os.walk() gives you is basically a linear walk, so if you want a tree, reconstructing it is up to you. It's also more awkward to cut off walking down the tree if various conditions are met (or not met), especially if one of the conditions is 'my memory usage is the same as my parent's memory usage'. If what I want is really a tree, then I should probably walk the directory hierarchy myself (and pass each step its parent node, already loaded with memory information, and so on).

On the third hand, the actual way this information will be printed out is as a (sorted) linear list, so if I build a tree I'll have to linearize it later. Using os.walk() linearizes it for me in advance, and I can then readily sort it into some order. I do need to know certain information about parents, but I could put that in a dict that maps (dir)paths to their corresponding data object (since I'm walking top down I know that the parent will always be visited before the children).
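A minimal sketch of the os.walk() plus dict approach described above. The 'memory.current' file name, the record layout, and the function names are assumptions made for illustration; this is not the actual memdu.py.

```python
import os

MEMFILE = "memory.current"  # assumed name of the per-cgroup usage file


def collect(top):
    """Walk top-down and return {dirpath: record}, linking each record to its parent's."""
    top = os.path.normpath(top)
    nodes = {}
    for dirpath, dirnames, filenames in os.walk(top, topdown=True):
        if MEMFILE not in filenames:
            dirnames[:] = []  # prune in place so os.walk() doesn't descend further
            continue
        with open(os.path.join(dirpath, MEMFILE)) as f:
            usage = int(f.read().strip())
        # Top-down order means the parent (if any) has already been recorded.
        parent = nodes.get(os.path.dirname(dirpath))
        nodes[dirpath] = {"usage": usage, "parent": parent}
    return nodes


def report(nodes):
    """Linearize to (path, usage) pairs, biggest usage first."""
    return sorted(((path, node["usage"]) for path, node in nodes.items()),
                  key=lambda item: (-item[1], item[0]))
```

Because each record carries a reference to its parent's record, a later pass could drop children whose usage matches their parent's without building a separate tree.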

A lot of these choices come down to what will be more convenient to code up, and these choices exist at all because of the hammer of os.walk(). Given the hammer, I saw the problem as a nail even though maybe it's a screw, and now I've realized I can't see what I have. Probably the only way to do so is to write one or the other version of the code and see how it goes. Why haven't I done that, and instead set aside the whole of memdu.py? That's because I don't want to 'waste time' by writing the 'wrong' version, which is irrational. But here I am.

(Having written this entry I've probably talked myself into going ahead with the os.walk version. If the code starts feeling awkward, much of what I've built will probably be reusable for a tree version.)

PS: This isn't the first time I've been blinded by a Python feature.

OsWalkChoiceParalysis written at 22:46:00


Getting a Python 2 virtual environment (in 2023's twilight of Python 2)

Suppose, not entirely hypothetically, that you need to create a new Python 2 virtual environment today; perhaps you need to install some package to see how its old Python 2 version behaves. With Python 3, creating a virtual environment is really easy; it's just 'python3 -m venv /tmp/pytest'. With Python 2 today, you have two complications. First, Python 2 doesn't have a venv module (instead it uses a 'virtualenv' command), and second, your installed Python 2 environment may not have all of the necessary infrastructure already set up since people are deprecating Python 2 and cutting down any OS provided version of it to the bare minimum.

First, you need a Python 2 version of pip. Hopefully you have one already; if not, you want the 2.7 version of get-pip.py, but don't count on that URL lasting forever, as the URL in my 2021 entry on this didn't. I haven't tested this latest version, so cross your fingers. If you still care at all about Python 2, you probably really want to make sure you have a pip2 at this point.

Once you have a pip2 in one way or another, you want to do a user install of 'virtualenv', with 'pip2 install --user virtualenv'. This will give you a ~/.local/bin/virtualenv command, which you may want to rename to 'virtualenv2'. You can then use this to create your virtual environment, 'virtualenv2 /tmp/pytest'. The result should normally have everything you need to use the virtualenv, including a pip2, and you can then use this virtualenv pip2 to install the package or packages you need to poke at.

Incidentally, if you just want to get a copy of the Python 2 version of a particular package and not specifically install it somewhere, you can just use pip2 to download it, with 'pip2 download <whatever>'. I'm not sure that the result is necessarily immediately usable and you'll have to decode it yourself ('file' may be your friend), but depending on what you want this may be good enough.

(I took a quick look to see if there was an easier way to find out the last supported Python 2 version of a package than 'pip2 download <whatever>', but as far as I can see there isn't.)

(This is one of the entries that I write for myself so that I have this information if I ever need it again, although I certainly hope not to.)

PS: Another option is to use the Python 2.7 version of PyPy, which I believe comes pre-set with its own pip2, although not its own already installed virtualenv. Depending on how concerned you are about differences in behavior between CPython 2.7 and PyPy 2.7, this might not be a good option.

Python2VirtualEnvIn2023 written at 22:26:57


A bit on unspecified unique objects in Python

In Why Aren't Programming Language Specifications Comprehensive? (via), Laurence Tratt shows the following example of a difference in behavior between CPython and PyPy:

$ cat diffs.py
print(str(0) is str(0))
$ python3 diffs.py
False
$ pypy diffs.py
True

Tratt notes that Python's language specification doesn't specify the behavior here, so both implementations are correct. Python does this to preserve the ability of implementations to make different choices, and Tratt goes on to use the example of __del__ destructors. This might leave a reader who is willing to accept the difference in destructor behavior to wonder why Python doesn't standardize object identity here.

Since this code uses 'is', the underlying reason for the difference in behavior is whether two invocations of 'str(0)' in one expression result in the same actual object. In CPython 3, they don't; in PyPy, they do. On the one hand, making these two invocations create the same object is an obvious win, since you're creating fewer objects and thus less garbage. A Python implementation could do this by knowing that using str() on a constant results in a constant result so it only needs one object, or it could intern the results of expressions like 'str(0)' so that they always return the same object regardless of where they're invoked. So allowing this behavior is good for Python environments that want to be nicely optimized, as PyPy does.

On the other hand, doing either of these things (or some combination of them) is extra work and complexity in an implementation. Depending on the path taken to this optimization, you have to either decide what to intern and when, then keep track of it all, or build in knowledge about the behavior of the built in str() and then verify at execution time that you're using the builtin instead of some clever person's other version of str(). Creating a different str() function or class here would be unusual but it's allowed in Python, so an implementation has to support it. You can do this analysis, but it's extra work. So not requiring this behavior is good for implementations that don't want to have the code and take the (extra) time to carefully do this analysis.

This is of course an example of a general case. Languages often want to allow but not require optimizations, even when these optimizations can change the observed behavior of programs (as they do here). To allow this, careful language specifications set up explicit areas where the behavior isn't fixed, as Python does with is (see the footnote). In fact, famously CPython doesn't even treat all types of objects the same:

$ cat diff2.py
print(int('0') is int('0'))
$ python3 diff2.py
True
$ pypy diff2.py
True

Simply changing the type of object changes the behavior of CPython. For that matter, how we create the object can change the behavior too:

$ cat diff3.py
print(chr(48) == str(0))
print(chr(48) is chr(48))
print(chr(48) is str(0))
$ python3 diff3.py
True
True
False

Both 'chr(48)' and 'str(0)' create the same string value, but only one of them results in the same object being returned by multiple calls. All of this is due to CPython's choices about what it optimizes and what it doesn't. These choices are implementation specific and also can change over time, as the implementation's views change (which is to say as the views of CPython's developers change).
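Since these results depend on the implementation and its version, one way to see what a given Python does is to probe it directly. The following is only a sketch; the probe() and survey() helpers and the CASES table are made-up names for illustration. Only the equality result is guaranteed; the identity result is whatever your implementation chooses to do.

```python
import platform


def probe(make):
    """Call make() twice; report whether the results are equal and whether they're one object."""
    first, second = make(), make()
    return first == second, first is second


# The same expressions used in the examples above.
CASES = {
    "str(0)": lambda: str(0),
    "int('0')": lambda: int("0"),
    "chr(48)": lambda: chr(48),
}


def survey():
    """Return {label: (equal, same_object)} for every case."""
    return {label: probe(make) for label, make in CASES.items()}


if __name__ == "__main__":
    print(platform.python_implementation(), platform.python_version())
    for label, (equal, same) in survey().items():
        print("%-10s equal=%s same object=%s" % (label, equal, same))
```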

UnspecifiedUniqueObjects written at 23:16:25


In Python, zero is zero regardless of the number type

I recently saw a Fediverse post by Mike Samuel with a Python pop quiz that tripped me up:

@shriramk Since I know you appreciate Python pop quizzes:

my_heterogeneous_map = {
    (  0.0): "positive zero",
    ( -0.0): "negative zero",
    (    0): "integer zero",
}

print("my_heterogeneous_map=%r\n" % my_heterogeneous_map)

del my_heterogeneous_map[False]

print("my_heterogeneous_map=%r\n" % my_heterogeneous_map)

Before I actually tried it, I expected the dict to start out with either two or three entries and end up with one or two, given that boolean True and False are actually ints with False being the same as zero. In fact the dict starts out with one entry and ends up with none, because in Python all three of these zeros are equal to each other:

>>> 0.0 == -0.0 == 0
True

(This is sort of the inversion of how NaNs behave as keys in dictionaries.)
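To spell out the mechanics: dict keys that compare equal must also hash equal, and all of these zeros do both. A small sketch (the variable names are just for illustration):

```python
zeros = [0.0, -0.0, 0, False]

# Objects that compare equal must have equal hashes; every zero hashes to 0.
hashes = {hash(z) for z in zeros}
print(hashes)   # {0}

m = {0.0: "positive zero", -0.0: "negative zero", 0: "integer zero"}
# The first key object inserted is kept; later equal keys only replace the value.
print(m)        # {0.0: 'integer zero'}

del m[False]    # False == 0 == 0.0, so this removes the only entry
print(m)        # {}
```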

In fact this goes further. A complex number zero is equal to plain zero:

>>> complex(0,0) == 0.0
True
>>> complex(0,-0.0) == 0.0
True
>>> complex(-0.0,-0.0) == 0.0
True

(All three of those are different complex numbers, as you can see by printing them all, although they all compare equal to each other.)

However this is simply one instance of a general case with how Python has chosen to treat complex numbers (as well as comparisons between integers and floats):

>>> complex(1,0) == 1
True
>>> complex(20,0) == 20
True

This particular behavior for complex numbers doesn't seem to be explicitly described in the specification. Numeric Types — int, float, complex says about arithmetic operators and comparisons on mixed types:

Python fully supports mixed arithmetic: when a binary arithmetic operator has operands of different numeric types, the operand with the “narrower” type is widened to that of the other, where integer is narrower than floating point, which is narrower than complex. A comparison between numbers of different types behaves as though the exact values of those numbers were being compared.

I suppose that Python would say that the 'exact value' of a complex number with a 0 imaginary component is its real component. The equality comparison for complex numbers does at least make sense given that '20 + complex(0,0)' is '(20+0j)', or to put it another way, '20 - complex(20,0)' is (0j) and Python would probably like that to compare equal to the other versions of zero. If 'a - b == 0' but 'a != b', it would feel at least a little bit odd.

(Of course you can get such a situation with floating point numbers, but floating point numbers do odd and counter-intuitive things that regularly trip people up.)

This explanation of comparison, including equality, makes sense for 0.0 being equal to 0 (and in fact for all floating point integral values, like 20.0, being equal to their integer version; the exact value of '20.0' is the same as the exact value of '20'). As for -0.0, it turns out that the IEEE 754 floating point standard says that it should compare equal to 0.0 (positive zero), which by extension means it has the same 'exact value' as 0.0 and thus is equal to 0.
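The difference between the two sentences of the quoted documentation (arithmetic widens, comparison uses exact values) can be seen with integers too large to be represented exactly as floats. This is a small illustrative sketch, not something from the documentation itself:

```python
big = 2**53  # above this, not every integer is representable as a float

# Arithmetic widens the int to a float first, so the +1 is lost to rounding...
widened_sum = float(big) + 1
print(widened_sum == float(big))     # True

# ...but comparison uses exact values, so an int that would merely round
# to the same float is still unequal to that float.
print(big + 1 == float(big))         # False

# Integral values compare equal across all three numeric types.
print(20 == 20.0 == complex(20, 0))  # True
```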

(This comes from Wikipedia's page on Signed zero.)

PS: I think the only way to detect a negative zero in Python may be with math.copysign(); there doesn't appear to be an explicit function for it, the way we have math.isinf() and math.isnan().
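A sketch of the math.copysign() approach; is_negative_zero() is a made-up helper name, not something in the standard library:

```python
import math


def is_negative_zero(x):
    """True only for a zero whose sign bit is set (that is, float -0.0)."""
    # copysign(1.0, x) returns 1.0 carrying the sign of x, so it exposes
    # the sign that '==' deliberately ignores.
    return x == 0 and math.copysign(1.0, x) < 0


print(is_negative_zero(-0.0))  # True
print(is_negative_zero(0.0))   # False
print(is_negative_zero(0))     # False
```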

ZeroIsZeroAcrossNumberTypes written at 22:19:12


Debian has removed Python 2 from its next version

The news of the time interval is that Debian's development version has removed even the 'minimal' version of Python 2.7 (via). Among other things, this includes the 'python2-minimal' and 'python2.7-minimal' packages, both of which are gone from Debian's 'testing' pseudo-distribution as well as 'unstable'. In future Debian releases, people who want Python 2 will have to build it themselves in some way (for example, copying the binary package from the current 'bullseye' release, or copying the source package and rebuilding). We've been expecting this to happen for some time, but the exact timing was uncertain until now.

Since Ubuntu generally follows Debian for things like this, I expect that the next Ubuntu LTS release (which would normally be Ubuntu 24.04 in April of 2024) won't include Python 2 either. As I write this, the in development Ubuntu 'lunar' still contains the python2-minimal package (this is 'Lunar Lobster', expected to be 23.04, cf). With four months to go before the expected release (and less time before a package freeze), I don't know if Canonical will follow Debian and remove the python2-minimal package. I wouldn't be surprised either way.

Both Canonical and Debian keep source packages around for quite a while, so people have plenty of time to grab the source .deb for python2-minimal. Pragmatically, we might as well wait to see if Canonical or Debian release additional patch updates, although that seems pretty unlikely at this point. We're very likely to keep a /usr/bin/python2 around for our users, although who knows.

Fedora currently has a python2.7 package, but I suspect that Debian's action has started the clock ticking on its remaining lifetime. However, I haven't yet spotted a Fedora Bugzilla tracking bug about this (there are a few open bugs against their Python 2.7 package). Since I still have old Python 2 programs on my Fedora desktops that I use and don't feel like rewriting, I will probably grab the Fedora source and binary RPMs at some point to avoid having to take more drastic actions.

(This means that my guess two years ago that Fedora would move before Debian turned out to be wrong.)

DebianNoMorePython2 written at 21:40:22


Sometimes an Ubuntu package of a Python module is probably good enough

Recently I ran across a Python program we're interested in, and discovered that it required prometheus-client. Normally this would mean creating a virtual environment and installing the module into it, possibly along with the program (although you don't have to put the program inside the venv it uses). But when I looked out of curiosity, I saw that Ubuntu packages this module, which got me to thinking.

I'm generally a sceptic of relying on the Ubuntu packaged version of a Python module (or any OS's packaged version); I wrote about this years ago in the context of Django. Linux distribution packaging of Python modules is famously out of date, and Ubuntu makes it worse by barely fixing bugs at the best of times. However, this feels like a somewhat different situation. The program isn't doing anything much with the prometheus-client module, and the module itself isn't very demanding; probably quite a lot of versions will do, and there's unlikely to be a bug that affects us. Indeed, some quick testing of the program with the Ubuntu version suggests that it works fine.

(Although now that I look, the Ubuntu version is rather out of date. Ubuntu 22.04 LTS packages 0.9.0, from 2020-11-26, and right now according to the module's releases page it's up to 0.15.0, with quite a few changes.)

Provided that Ubuntu's version of the module works, which it seems to, using the Ubuntu packaged version is the easy path. It's not an ideal situation, but for something with simple needs (and which isn't a high priority), it's rather tempting to say that it's okay. And if the Ubuntu version proves unsatisfactory, changing over to the latest version in a virtual environment is (at one level) only a matter of changing the path to Python 3 in the program's '#!' line.

(We have another program that requires a Python module, pyserial, that we get from Ubuntu, but I didn't think about it much at the time. This time around I first built a scratch venv for the program to test it, then discovered the Ubuntu package.)

UbuntuPackagesGoodEnough written at 21:20:16


Python version upgrades and deprecations

Recently I read Itamar Turner-Trauring's It’s time to stop using Python 3.7 (via). On the one hand, this is pragmatic advice, because as the article mentions Python 3.7 is reaching its end of life as of June 2023. On the other hand it gives me feelings, and one of the feelings is that the Python developers are not making upgrades any easier by slowly deprecating various standard library modules. Some of these modules are basically obsolete now, but some are not and have no straightforward replacement, such as the cgi module.

The Python developers can do whatever they want to do (that's the power of open source), and they clearly want to move Python forward (as they see it) by cleaning up the standard library. But this means that they are perfectly willing to break backward compatibility in Python 3, at least for the standard library.

One of the things that make upgrading versions of anything easy is if the new version is a drop in replacement for the old one. Deprecating and eventually removing things in new versions means that new versions are not drop in replacements, which means that it makes upgrading harder. When upgrading is harder, (more) people put it off or don't do it. This happens regardless of what the software authors like or think, because people are people.

I doubt this is a direct factor in people still using Python 3.7. But I can't help but think that the Python developers' general attitude toward backward compatibility doesn't help.

(Python virtual environments make different versions of Python not exactly a drop in replacement; in practice you're going to want to rebuild the venv. But my impression is that pretty much everyone who is seriously using Python with venvs has the tools and experience to do that relatively easily, because their venvs are automatically built from specifications. Someday I need to master doing that myself, because sooner or later we're going to need to use venvs and be able to migrate between Python versions as part of an OS upgrade.)

PythonUpgradesAndDeprectation written at 23:09:28


Python dictionaries and floating point NaNs as keys

Like Go, Python's floating point numbers support NaNs with the usual IEEE-754 semantics, including not comparing equal to each other. Since Python will conveniently produce them for us, we can easily demonstrate this:

>>> k = float('nan')
>>> k == k
False
>>> k is k
True

Yesterday, I discovered that Go couldn't delete 'NaN' keys from maps (the Go version of dicts). If you initially try this in Python, it may look like it works:

>>> d = {k: "Help"}
>>> d
{nan: 'Help'}
>>> d[k]
'Help'
>>> del d[k]

However, all is not what it seems:

>>> d = {k: "Help", float('nan'): "Me"}
>>> d
{nan: 'Help', nan: 'Me'}
>>> d[float('nan')]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: nan

What's going on here is that Python dict indexing has a fast path for object identity, which comes into play when you look up something using exactly the same object that you used to set an entry. When you set a dict entry, Python saves the object you used as the key. If you ask a dict to look up an entry using that exact object, Python doesn't even bother calling the object's equality operation (what would be used for an '==' check); it just returns the value. This means that floating point NaNs have no chance to object that they're never equal to each other, and lookup will succeed. However, if you use a different object that is also a NaN, the lookup will fail because two NaNs never compare equal to each other.
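The identity fast path can be seen without NaNs by using a key whose equality check always fails and counts how often it gets called. This is a sketch against CPython's behavior; NeverEqual is a made-up class for illustration, and both instances share one hash so that a lookup with the second one has to consider the first one's entry.

```python
class NeverEqual:
    """A key that, like a NaN, claims to be unequal to everything, itself included."""

    def __init__(self):
        self.eq_calls = 0

    def __hash__(self):
        return 1  # same hash for every instance, so lookups find the same entry

    def __eq__(self, other):
        self.eq_calls += 1
        return False


def lookup_fails(d, key):
    """Return True if d[key] raises KeyError."""
    try:
        d[key]
    except KeyError:
        return True
    return False


k = NeverEqual()
d = {k: "Help"}

found = d[k]                  # same object: identity matches, __eq__ is skipped
calls_after_hit = k.eq_calls  # 0 on CPython

other = NeverEqual()
missed = lookup_fails(d, other)  # different object: falls back to __eq__
```

After this runs, 'found' is 'Help' and 'missed' is True, which is the same pattern as the NaN example above.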

This use of object identity in dict lookups does mean that the Python equivalent of iterating a Go map will always work:

>>> for k in d.keys():
...   d[k]
...
'Help'
'Me'

When you ask a dictionary for its keys, you of course get the literal Python objects that are the keys, which can always be used to look up the corresponding entry in the dict even if they're NaNs or otherwise uncomparable or inequal under normal circumstances.

One of the other things that this starts to show us is that Python is not making any attempt to intern NaNs, unlike things like True, False, and small integers. Let's show that more thoroughly:

>>> import math
>>> k2 = float('nan')
>>> k is k2
False
>>> k is math.nan
False

It might be hard to make all NaNs generated through floating point operations be the same interned object, but it would be relatively straightforward to make 'float("nan")' always produce the same Python object and for that Python object to also be math.nan. But Python doesn't do either of those; every NaN is a unique object. Personally I think that this is the right choice (whether or not it's deliberate); NaNs are supposed to all be different from each other anyway, so using separate objects is slightly better.

(I suspect that Python doesn't intern any floating point numbers, but I haven't checked the source code. On a quick check it doesn't intern 0.0 or +Inf; I didn't try any others. In general, I expect that interning floating point numbers makes much less sense and would result in much less object reuse than interning small integers and so on does.)

DictsAndNaNKeys written at 22:12:29


Importing a Python program that doesn't have a .py extension

In Python, there are a bunch of reasons for having your main program be importable. However, this normally requires that your program have a .py extension and you don't always want to do this. You can always make a copy of the program (or use a symlink) to add the extension when you're working on it, but that can be annoying. Due to writing my entry on why programs should skip having extensions if possible, I wound up wondering if it was possible to do this in Python using things exposed in, for example, the standard library's importlib.

The answer turns out to be yes but you have to go out of your way. The importlib documentation has an example on Importing a source file directly, but it only works for files with a .py extension. The reason for this is covered in the Github gist Importing Python source code from a script without the .py extension; importlib basically hardcodes that source files use a .py extension. However, as covered in the gist, you can work around this without too much work.

Suppose that you have a Python program called "machines", and you want to import the production version to poke around some bits of it. Then:

>>> from importlib.machinery import SourceFileLoader
>>> import importlib.util
>>> import sys
>>> loader = SourceFileLoader("machines", "/opt/local/bin/machines")
>>> spec = importlib.util.spec_from_loader("machines", loader)
>>> machines = importlib.util.module_from_spec(spec)
>>> sys.modules["machines"] = machines
>>> spec.loader.exec_module(machines)

Here I use 'machines' for what the example calls 'module' so that I can avoid an extra 'import machines' to get the program visible under its customary name in my interactive session.

While I'm happy to know this is possible and how to do it in case I ever really need it, this is tedious enough that I'll probably only ever use it infrequently. If I found myself needing to do it somewhat regularly, I'd probably create a personal module (call it, say, 'impfile') that wrapped this up in a function, so I could just do 'import impfile; impfile.get("/opt/local/bin/machines"); import machines'. This personal module could go in either ~/.local/ or in a custom virtual environment that I'd use when I wanted to do this.
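A sketch of what such a hypothetical 'impfile' module might contain. The get() name comes from the usage above; the optional 'name' argument and the cleanup on failure are assumptions added for illustration.

```python
import importlib.util
import os
import sys
from importlib.machinery import SourceFileLoader


def get(path, name=None):
    """Import the Python source file at 'path' (no .py needed) as module 'name'.

    The module is registered in sys.modules, so a later 'import <name>'
    finds it. By default the module name is the file's base name.
    """
    if name is None:
        name = os.path.basename(path)
    loader = SourceFileLoader(name, path)
    spec = importlib.util.spec_from_loader(name, loader)
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    try:
        spec.loader.exec_module(module)
    except BaseException:
        # Don't leave a half-initialized module behind if the program fails to load.
        del sys.modules[name]
        raise
    return module
```

Returning the module as well means 'machines = impfile.get("/opt/local/bin/machines")' would also work, without the separate import.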

(Neither would work if the program itself used a virtual environment, but that's unlikely for the kind of programs I'd want to use this on.)

PS: I think I understand why importlib doesn't have a convenience function to do this or something like it, and I agree with not providing one. In the standard library, it would be too much of an attractive nuisance; it would practically invite people to use it.

(Then they would be sad when a bunch of tools for working with Python code choked on their special files, because the tools are always going to require .py files.)

ImportABareProgram written at 22:49:33


Python virtual environments can usually or often be moved around

Python virtual environments are magical in various ways. They get transparently added to sys.path and programs can be outside of them as long as they use the venv's Python (which is normally a symlink to some system version of Python), for two examples. All of this magic is triggered by the presence of a pyvenv.cfg file at the root of the venv (cf). The contents of this pyvenv.cfg are very minimal and in particular they don't name the location of the venv's root.

For example, here's a pyvenv.cfg:

home = /usr/bin
include-system-site-packages = false
version = 3.10.5

In fact this is the pyvenv.cfg of no less than six venvs from my Fedora 36 desktop (these are being managed through pipx, but pipx creates normal venvs). All of the pyvenv.cfg files have the same contents because they're all using the same Python version and general settings.

Since pyvenv.cfg and the rest of the virtual environment don't contain any absolute paths to themselves and so don't 'know' where they're supposed to be, it's possible to move venvs around on the filesystem. As a corollary of this it's possible to copy a venv to a different system (in a different filesystem location or the same), provided that the system has the same version of Python, which is often the case if you're using the same Linux distribution version on both. This doesn't seem to be explicitly documented in the venv module documentation and it's possible that some Python modules you may install do require absolute paths and aren't movable, but it seems to be generally true.

If you use pipx there's a caution here, because pipx writes a pipx_shared.pth file into the venv's site-packages directory that does contain the absolute path to its shared collection of Python stuff. I believe this is part of the underlying cause of pipx's problem with Python version upgrades, which is fixed by removing this shared area and having pipx rebuild it.

Another caution comes from systems like Django which may create standard programs or files as part of their project setup. If you create a Django venv and start a Django project from it, Django will create a 'manage.py' executable for the project that has the venv's (current) absolute path to its Python interpreter burned into its '#!' line. If you then move this venv, your manage.py will (still) try to use the Python from the old venv's location, which will either not work or get you the wrong site-packages path.

On the one hand, it's convenient that this works in general, and that there's nothing in the general design of virtual environments that blocks it. On the other hand, it's clear that you can have various corner cases (as shown with pipx and Django), so it's probably best to create your venvs in their final location if you can. If you do have to move venvs (for example they have to be built in one directory and deployed under another), you probably want to test the result and scan for things with the absolute path burned into them.
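One way to do that scan is to look for the venv's own absolute path inside its files before moving it. This is a rough sketch; find_burned_in_paths() is a made-up name, it reads every file whole (fine for typical venvs), and it only finds the path as a literal byte string. Files that live outside the venv, such as a Django project's manage.py, would need the same check run over their own directories.

```python
import os


def find_burned_in_paths(venv_root):
    """Return the relative paths of files under venv_root that mention its absolute path."""
    root = os.path.abspath(venv_root)
    needle = os.fsencode(root)
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        for fname in filenames:
            path = os.path.join(dirpath, fname)
            if os.path.islink(path):
                continue  # e.g. the venv's python, a symlink to the system one
            try:
                with open(path, "rb") as f:
                    if needle in f.read():
                        hits.append(os.path.relpath(path, root))
            except OSError:
                continue  # unreadable file; skip it rather than abort the scan
    return sorted(hits)
```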

(I noticed this pyvenv.cfg behavior when I first looked at venvs and sys.path, but I didn't look very much into it at the time. As usual, writing an entry about this has left me better informed than before I started.)

VenvsCanUsuallyBeMoved written at 22:28:01


This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.