2023-04-30
os.walk, the temptation of hammers, and the paralysis of choice
I have a shell script to give me a hierarchical, du-like report of memory usage broken down by Linux cgroup. Even back when I wrote it, it really needed to be something other than a shell script, and a recent addition made it quite clear that the time had come (the shell script version is now both slow and inflexible). So as is my habit, I opened up a 'memdu.py' in my editor and started typing. Some initial functions were easy, until I got to here:
def walkcgroup(top):
    for dirpath, dirnames, filenames in os.walk(top, topdown=True):
        if memstatfile not in filenames:
            dirnames[:] = []
            continue
Then I stopped typing because I realized I had a pile of choices to make about exactly how this program was going to be structured, and maybe I didn't want to use os.walk(), as shown by the very first thing I wrote inside the for loop.
The reason I started writing code with os.walk() is that it's the obvious hammer to use when you want to walk all over a directory tree, such as /sys/fs/cgroup. But on the other hand, I realized that I'm not just visiting each directory; I'm in theory constructing a hierarchical tree of memory usage information. What os.walk() gives you is basically a linear walk, so if you want a tree, reconstructing it is up to you. It's also more awkward to cut off walking down the tree if various conditions are met (or not met), especially if one of the conditions is 'my memory usage is the same as my parent's memory usage'. If what I want is really a tree, then I should probably walk the directory hierarchy myself (and pass each step its parent node, already loaded with memory information, and so on).
On the third hand, the actual way this information will be printed out is as a (sorted) linear list, so if I build a tree I'll have to linearize it later. Using os.walk() linearizes it for me in advance, and I can then readily sort it into some order. I do need to know certain information about parents, but I could put that in a dict that maps (dir)paths to their corresponding data object (since I'm walking top down I know that the parent will always be visited before the children).
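As a rough sketch of the os.walk() plus dict approach described above (not the actual memdu.py code): the function name walk_usage, the statfile parameter, and its 'memory.current' default are assumptions made for illustration, and the file is assumed to hold a single integer.

```python
import os

def walk_usage(top, statfile="memory.current"):
    """Map each directory under top to (usage, parent_usage).

    Hypothetical helper: statfile is assumed to contain one integer.
    """
    top = os.path.normpath(top)
    usage = {}
    for dirpath, dirnames, filenames in os.walk(top, topdown=True):
        if statfile not in filenames:
            # Prune: do not descend into a directory without a stat file.
            dirnames[:] = []
            continue
        with open(os.path.join(dirpath, statfile)) as f:
            mine = int(f.read().strip())
        # Top-down walking means the parent was recorded before its children.
        parent = usage.get(os.path.dirname(dirpath))
        parent_usage = parent[0] if parent is not None else None
        usage[dirpath] = (mine, parent_usage)
    return usage
```

The resulting dict is already a flat collection that can be sorted for output, and a check like 'same usage as my parent' becomes a comparison of the two tuple elements.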
A lot of these choices come down to what will be more convenient to code up, and these choices exist at all because of the hammer of os.walk(). Given the hammer, I saw the problem as a nail even though maybe it's a screw, and now I've realized I can't see what I have. Probably the only way to do so is to write one or the other version of the code and see how it goes. Why haven't I done that, and instead set aside the whole of memdu.py? That's because I don't want to 'waste time' by writing the 'wrong' version, which is irrational. But here I am.
(Having written this entry I've probably talked myself into going ahead with the os.walk version. If the code starts feeling awkward, much of what I've built will probably be reusable for a tree version.)
PS: This isn't the first time I've been blinded by a Python feature.
2023-03-12
Getting a Python 2 virtual environment (in 2023's twilight of Python 2)
Suppose, not entirely hypothetically, that you need to create a new
Python 2 virtual environment today; perhaps you need to install
some package to see how its old Python 2 version behaves. With
Python 3, creating a virtual environment is really easy; it's just
'python3 -m venv /tmp/pytest'. With Python 2 today, you have two
complications. First, Python 2 doesn't have a venv module (instead
it uses a 'virtualenv' command), and second, your installed Python
2 environment may not have all of the necessary infrastructure
already set up since people are deprecating Python 2 and cutting
down any OS provided version of it to the bare minimum.
First, you need a Python 2 version of pip. Hopefully you have one
already; if not, you want the 2.7 version of get-pip.py, but don't
count on that URL
lasting forever, as the URL in my 2021 entry on this didn't. I haven't tested this latest version,
so cross your fingers. If you still care at all about Python 2, you
probably really want to make sure you have a pip2 at this point.
Once you have a pip2 in one way or another, you want to do a user
install of 'virtualenv', with 'pip2 install --user virtualenv'. This
will give you a ~/.local/bin/virtualenv command, which you may want to
rename to 'virtualenv2'. You can then use this to create your virtual
environment, 'virtualenv2 /tmp/pytest'. The result should normally
have everything you need to use the virtualenv, including a pip2, and
you can then use this virtualenv pip2 to install the package or packages
you need to poke at.
Incidentally, if you just want to get a copy of the Python 2 version of
a particular package and not specifically install it somewhere, you can
just use pip2 to download it, with 'pip2 download <whatever>'. I'm
not sure that the result is necessarily immediately usable and you'll
have to decode it yourself ('file' may be your friend), but depending
on what you want this may be good enough.
(I took a quick look to see if there was an easier way to find out
the last supported Python 2 version of a package than 'pip2 download
<whatever>', but as far as I can see there isn't.)
(This is one of the entries that I write for myself so that I have this information if I ever need it again, although I certainly hope not to.)
PS: Another option is to use the Python 2.7 version of PyPy, which I believe comes pre-set with its own pip2, although not its own already installed virtualenv. Depending on how concerned you are about differences in behavior between CPython 2.7 and PyPy 2.7, this might not be a good option.
2023-02-20
A bit on unspecified unique objects in Python
In Why Aren't Programming Language Specifications Comprehensive? (via), Laurence Tratt shows the following example of a difference in behavior between CPython and PyPy:
$ cat diffs.py
print(str(0) is str(0))
$ python3 diffs.py
False
$ pypy diffs.py
True
Tratt notes that Python's language specification doesn't specify
the behavior here, so both implementations are correct. Python does
this to preserve the ability of implementations to make different
choices, and Tratt goes on to use the example of __del__ destructors.
This might leave a reader who is willing to accept the difference in
destructor behavior to wonder why Python doesn't standardize object
identity here.
Since this code uses 'is', the underlying reason for the difference
in behavior is whether two invocations of 'str(0)' in one expression
result in the same actual object. In CPython 3, they don't; in PyPy,
they do. On the one hand, making these two invocations create the
same object is an obvious win, since you're creating fewer objects
and thus less garbage. A Python implementation could do this by
knowing that using str() on a constant results in a constant result
so it only needs one object, or it could intern the
results of expressions like 'str(0)' so that they always return the
same object regardless of where they're invoked. So allowing this
behavior is good for Python environments that want to be nicely
optimized, as PyPy does.
On the other hand, doing either of these things (or some combination of them) is extra work and complexity in an implementation. Depending on the path taken to this optimization, you have to either decide what to intern and when, then keep track of it all, or build in knowledge about the behavior of the built in str() and then verify at execution time that you're using the builtin instead of some clever person's other version of str(). Creating a different str() function or class here would be unusual but it's allowed in Python, so an implementation has to support it. You can do this analysis, but it's extra work. So not requiring this behavior is good for implementations that don't want to have the code and take the (extra) time to carefully do this analysis.
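To illustrate why an implementation can't simply assume that 'str(0)' means the builtin, here is a small sketch. The helper name run_with_namespace and the replacement function are made up for the example. The same source text resolves 'str' differently depending on the namespace it runs in.

```python
def run_with_namespace(extra_globals):
    """Evaluate the same source text, 'str(0)', in a given global namespace."""
    namespace = dict(extra_globals)
    exec("result = str(0)", namespace)
    return namespace["result"]

# No override: the name 'str' falls through to the builtin.
plain = run_with_namespace({})

# A global named 'str' shadows the builtin for the very same source text.
shadowed = run_with_namespace({"str": lambda obj: "not the builtin"})

print(repr(plain), repr(shadowed))
```

An optimizer that wanted to treat 'str(0)' as a constant would have to establish which of these two situations it is in before it could safely reuse an object.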
This is of course an example of a general case. Languages often
want to allow but not require optimizations, even when these
optimizations can change the observed behavior of programs (as they
do here). To allow this, careful language specifications set up
explicit areas where the behavior isn't fixed, as Python does
with 'is' (see the footnote).
In fact, famously CPython doesn't even treat all types of objects
the same:
$ cat diff2.py
print(int('0') is int('0'))
$ python3 diff2.py
True
$ pypy diff2.py
True
Simply changing the type of object changes the behavior of CPython. For that matter, how we create the object can change the behavior too:
$ cat diff3.py
print(chr(48) == str(0))
print(chr(48) is chr(48))
print(chr(48) is str(0))
$ python3 diff3.py
True
True
False
Both 'chr(48)' and 'str(0)' create the same string value, but only one of them results in the same object being returned by multiple calls. All of this is due to CPython's choices about what it optimizes and what it doesn't. These choices are implementation specific and also can change over time, as the implementation's views change (which is to say as the views of CPython's developers change).
2023-01-09
In Python, zero is zero regardless of the number type
I recently saw a Fediverse post by Mike Samuel with a Python pop quiz that tripped me up:
@shriramk Since I know you appreciate Python pop quizzes:
my_heterogeneous_map = {
    ( 0.0): "positive zero",
    ( -0.0): "negative zero",
    ( 0): "integer zero",
}
print("my_heterogeneous_map=%r\n" % my_heterogeneous_map)
del my_heterogeneous_map[False]
print("my_heterogeneous_map=%r\n" % my_heterogeneous_map)
Before I actually tried it, I expected the dict to start out with
either two or three entries and end up with one or two, given that
boolean True and False are actually ints with False being the same
as zero. In fact the dict starts out with
one entry and ends up with none, because in Python all three of
these zeros are equal to each other:
>>> 0.0 == -0.0 == 0
True
(This is sort of the inversion of how NaNs behave as keys in dictionaries.)
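A short sketch of why the dict collapses to a single entry: objects that compare equal must also hash equally, so all of these zeros are the same dictionary key. When a dict literal repeats an equal key, the first key object is kept and the value is overwritten by the last one. The variable names here are just for illustration.

```python
zeros = [0.0, -0.0, 0, False]

# Equal objects hash equally, so there is only one distinct hash value.
distinct_hashes = {hash(z) for z in zeros}

merged = {0.0: "positive zero", -0.0: "negative zero", 0: "integer zero"}
only_key, only_value = next(iter(merged.items()))

# The first key (0.0) survives, paired with the last value assigned.
print(len(distinct_hashes), len(merged), repr(only_key), only_value)
# prints: 1 1 0.0 integer zero
```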
In fact this goes further. A complex number zero is equal to plain zero:
>>> complex(0,0) == 0.0
True
>>> complex(0,-0.0) == 0.0
True
>>> complex(-0.0,-0.0) == 0.0
True
(All three of those are different complex numbers, as you can see by printing them all, although they all compare equal to each other.)
However this is simply one instance of a general case with how Python has chosen to treat complex numbers (as well as comparisons between integers and floats):
>>> complex(1,0) == 1
True
>>> complex(20,0) == 20
True
This particular behavior for complex numbers doesn't seem to be explicitly described in the specification. Numeric Types — int, float, complex says about arithmetic operators and comparisons on mixed types:
Python fully supports mixed arithmetic: when a binary arithmetic operator has operands of different numeric types, the operand with the “narrower” type is widened to that of the other, where integer is narrower than floating point, which is narrower than complex. A comparison between numbers of different types behaves as though the exact values of those numbers were being compared.
I suppose that Python would say that the 'exact value' of a complex number with a 0 imaginary component is its real component. The equality comparison for complex numbers does at least make sense given that '20 + complex(0,0)' is '(20+0j)', or to put it another way, '20 - complex(20,0)' is (0j) and Python would probably like that to compare equal to the other versions of zero. If 'a - b == 0' but 'a != b', it would feel at least a little bit odd.
(Of course you can get such a situation with floating point numbers, but floating point numbers do odd and counter-intuitive things that regularly trip people up.)
This explanation of comparison, including equality, makes sense for 0.0 being equal to 0 (and in fact for all floating point integral values, like 20.0, being equal to their integer version; the exact value of '20.0' is the same as the exact value of '20'). As for -0.0, it turns out that the IEEE 754 floating point standard says that it should compare equal to 0.0 (positive zero), which by extension means it has the same 'exact value' as 0.0 and thus is equal to 0.
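One way to see that comparison really uses exact values, instead of just widening the integer to a float first, is to pick an integer that a float can't represent. This is a sketch with made-up variable names; it relies on 2**53 + 1 not being exactly representable as an IEEE 754 double.

```python
big = 2**53 + 1          # too precise for a double
as_float = float(big)    # conversion rounds to the nearest double, 2**53

# Integral floats compare equal to their integer versions.
small_equal = (20 == 20.0)

# If '==' merely widened the int to a float, this would be True.
# Comparing exact values, the int is one larger than the float.
exact_equal = (big == as_float)

print(small_equal, exact_equal, int(as_float) == 2**53)
# prints: True False True
```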
(This comes from Wikipedia's page on Signed zero.)
PS: I think the only way to detect a negative zero in Python may
be with math.copysign(); there doesn't appear to be an explicit
function for it, the way we have math.isinf() and math.isnan().
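A minimal sketch of the math.copysign() approach. The helper name is_negative_zero is hypothetical, not something in the standard library. Since -0.0 == 0.0, the check has to look at the sign separately from the value.

```python
import math

def is_negative_zero(x):
    """True only for a zero whose sign is negative (hypothetical helper)."""
    # copysign(1.0, x) transfers x's sign onto 1.0, giving -1.0 for -0.0.
    return x == 0 and math.copysign(1.0, x) < 0

z = complex(0, -0.0)
print(is_negative_zero(-0.0), is_negative_zero(0.0), is_negative_zero(0))
print(is_negative_zero(z.real), is_negative_zero(z.imag))
```

The same check can be applied to the .real and .imag parts of a complex number to tell apart the complex zeros that all compare equal.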
2023-01-02
Debian has removed Python 2 from its next version
The news of the time interval is that Debian's development version has removed even the 'minimal' version of Python 2.7 (via). Among other things, this includes the 'python2-minimal' and 'python2.7-minimal' packages, both of which are gone from Debian's 'testing' pseudo-distribution as well as 'unstable'. In future Debian releases, people who want Python 2 will have to build it themselves in some way (for example, copying the binary package from the current 'bullseye' release, or copying the source package and rebuilding). We've been expecting this to happen for some time, but the exact timing was uncertain until now.
Since Ubuntu generally follows Debian for things like this, I expect that the next Ubuntu LTS release (which would normally be Ubuntu 24.04 in April of 2024) won't include Python 2 either. As I write this, the in development Ubuntu 'lunar' still contains the python2-minimal package (this is 'Lunar Lobster', expected to be 23.04, cf). With four months to go before the expected release (and less time before a package freeze), I don't know if Canonical will follow Debian and remove the python2-minimal package. I wouldn't be surprised either way.
Both Canonical and Debian keep source packages around for quite a while, so people have plenty of time to grab the source .deb for python2-minimal. Pragmatically, we might as well wait to see if Canonical or Debian release additional patch updates, although that seems pretty unlikely at this point. We're very likely to keep a /usr/bin/python2 around for our users, although who knows.
Fedora currently has a python2.7 package, but I suspect that Debian's action has started the clock ticking on its remaining lifetime. However, I haven't yet spotted a Fedora Bugzilla tracking bug about this (there are a few open bugs against their Python 2.7 package). Since I still have old Python 2 programs on my Fedora desktops that I use and don't feel like rewriting, I will probably grab the Fedora source and binary RPMs at some point to avoid having to take more drastic actions.
(This means that my guess two years ago that Fedora would move before Debian turned out to be wrong.)