2023-08-14
A brief brush with writing and using Python type hints
I was recently nerd sniped into writing a Python version of a
simple although real exercise. As part
of that nerd snipe, I decided to write my Python using type hints
(which I've been tempted by for some time).
This is my first time really trying to use type hints, and I did
it without the benefit of reading any 'quick introduction to Python
type hints' articles; I worked from vague memories of seeing the
syntax and reading the documentation for the standard library's
typing
module.
I checked my type hints with mypy, without doing anything particularly
fancy.
Looking at what I wrote now, I see I missed one trick through ignorance, which is how to declare attributes of objects. I wrote:
  class X:
      def __init__(self) -> None:
          self.known_tests: list[str] = []
The idiomatic way of doing this is apparently:
  class X:
      known_tests: list[str]

      def __init__(self) -> None:
          self.known_tests = []
I believe that mypy can handle either approach but the second is what I've seen in some recent Python articles I've read.
The declaration for '__init__' is another thing that I had to
stumble over. Initially I didn't put any type annotations on
'__init__' because I couldn't see anything obvious to put there,
but then mypy reported that it was a method without type annotations.
Marking it explicitly as returning None caused mypy to be happy.
While writing the code, as short and trivial as it is, I know that I made at least one absent-minded mistake that mypy's type checking would have caught. I believe I made the mistake before I fully filled out the types, so it's possible that simply filling them out would have jogged my mind about things so I didn't slip into the mistake. In either case, having to think about types enough to write them down feels useful, on top of the type checking itself.
At the same time, typing out the types felt both bureaucratic and verbose. Some of this is because my code involves several layers of nested containers; I have tuples inside lists and being returned by a generator. However, I don't think this is too unusual, so I'd expect to want to define a layer of intermediate types in basically anything sophisticated, like this:
  logEntryType = tuple[str, typing.Any]
This name exists only to make type hints happy (or, to put it the
other way, to make them less onerous to write). It's not present
in the code or used by it. Possibly this is a sign that in type
hint heavy code I'd wind up wanting to define a bunch of small
data-only dataclasses, simply so I could use these names outside
of type hints. This makes me wonder if retrofitting type hints to
already written code will be somewhat awkward, because I'd wind up
wanting to structure the data differently. In code without type
hints, slinging around tuples and lists is easy, and 'bag of various
things' is a perfectly okay data structure. In code with type hints,
I suspect all of that may get awkward in the same way this
'logEntryType' is.
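For illustration, here is the sort of small data-only dataclass I have in mind, with all of the names invented for the example rather than taken from my code:

  import dataclasses
  import typing

  @dataclasses.dataclass
  class LogEntry:
      # What was a bare tuple[str, typing.Any] becomes named fields.
      action: str
      data: typing.Any

  def entries(lines: list[str]) -> typing.Iterator[LogEntry]:
      # A generator can then yield LogEntry objects instead of
      # anonymous tuples.
      for ln in lines:
          action, _, rest = ln.partition(" ")
          yield LogEntry(action, rest)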
Despite having gone through this exercise, I'm not sure how I feel about using type hints in Python. I suspect that I need to write something more substantial with type hints, or try to retrofit some of our existing code with them, or both, before I can have a really solid view. But at the very least they didn't make me dislike the experience.
2023-07-03
Our Python fileserver management code has been quite durable over the years
At this point we've been running our ZFS based NFS fileserver environment for about fifteen years, starting with Solaris 10 and evolving over time to the current Ubuntu 22.04 based servers. Over that time we've managed the various iterations of the same basic thing primarily through a local set of programs (all with names starting in 'san', despite the fact that we don't have a SAN any more). These programs have always been written in Python. They started out as Python 2 programs on Solaris and then OmniOS, and were moved to Python 3 when we moved to our Linux based fileservers. Naturally, we have version control history for the Python code of these tools that goes all the way back to the first versions in 2008.
(For reasons, the Solaris and the Linux versions are in different source repositories.)
I was recently working on these programs,
which made me curious to see how much the current ones have changed
from the very first versions. The answer turns out to be not very
much, and only in two real areas. The first is that in the change
from Python 2 to Python 3, we stopped using pychecker annotations
and the optparse
module, switching to argparse
(and making a
few other Python 3 changes). The second is that when we moved from
the OmniOS fileserver generation
to the Linux fileserver generation,
we moved from using iSCSI disks that came from iSCSI backends (and
Solaris/OmniOS device names) to using locally attached disks with,
naturally, Linux device names. Otherwise, the code is almost entirely
the same. Well, for features that have always existed, since we
added features to the tools over time. But even there, most of the
features were present by the end of the OmniOS era, and their code
mostly hasn't changed between then and now.
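For illustration, the mechanical shape of such an optparse to argparse conversion is roughly the following; this is invented example code, not our actual tools:

  import argparse

  def parse_args():
      p = argparse.ArgumentParser(description="report on fileserver disks")
      # optparse's p.add_option(...) becomes p.add_argument(...).
      p.add_argument("-v", "--verbose", action="store_true",
                     help="be more verbose")
      # optparse returned an (options, args) pair; argparse folds
      # positional arguments into the same namespace instead.
      p.add_argument("pool", help="pool to report on")
      return p.parse_args()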
(In some programs, more comments changed than code did. This has
left some vaguely amusing artifacts behind, like a set of local
variables cryptically called 'bh', 'bd', and 'bl', which were
originally short for 'backend host/disk/lun'. We no longer have
hosts or LUNs, but we still have things that fill in those slots
and I never renamed the local variables.)
On the one hand, this is what you'd want in a programming language; when and if there's change, it's because you're changing what the code does and how it does it, not because the language environment has changed or works differently on different systems. On the other hand, these days it feels like some programming environments exist in a constant state of churn, with old code either directly obsolete or functionally obsolete within a few years due to changes around it. Python hasn't been without such changes (see Python 2 to Python 3), but in practice a lot of code really has carried on basically as-is. This is something we rather appreciate in our local tools, because our real goal isn't to write and maintain tools, it's to do things with them.
2023-06-27
Belatedly remembering to use the two expression form of Python's assert
Today I confessed on the Fediverse that I had somehow mentally
overwritten what I once knew about Python's assert with a C-like
version that I wrote as 'assert(expression)' (which I apparently
started doing more than a decade ago). What caused me to notice
this was that I was revising some Python code to cope with a new
situation, and I decided I wanted to fail in some way if an impossible
thing turned out to not be as impossible as I thought. This wasn't
an error that should be returned normally, and it wasn't really
something I wanted to raise as an explicit exception, so adding an
assert was the easy way.
So at first I wrote 'assert(2 <= n <= 23)', and then in my usual
way deliberately forced the assert to fail to test things. This
caused me to change the variable name to make the assert slightly
more informative, as 'assert(2 <= disknum <= 23)'. This gave a
better clue about what the assert was about, but it didn't say what
was wrong. Thinking about how to fix that caused a dim flickering
light to appear over my head and sent me off to read the specification
of assert, which told me about the two expression version and also
reminded me that assert is a statement, not a C-like function call.
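The two expression form is simply 'assert expr, message', so my check winds up something like this, with the message text invented here for illustration:

  assert 2 <= disknum <= 23, "impossible disk number: %r" % disknum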
(My new use of assert in my code hopefully includes enough information
about the surrounding context that I can see what went wrong, if
something does. It won't give me everything but these are quick,
low-effort checks that I don't expect to ever trigger.)
Now that I've re-discovered this full form of assert, my goal is
to use it more often for "this is never expected to happen" safety
checks in my code. Putting in a single line of an assert can convert
an otherwise mysterious failure (like the famous 'NoneType object
has no attribute ...' error) into a more explicit one, and prevent
my code going off the rails in cases where it might not fail
immediately.
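As a sketch of the kind of check I mean, with all of the names here invented for illustration:

  def disk_path(disks, name):
      # 'disks' maps disk names to device paths; .get() quietly
      # returns None on a miss.
      d = disks.get(name)
      # Fail immediately and explicitly here, rather than letting a
      # None escape to produce a mysterious AttributeError later.
      assert d is not None, "impossible: no disk entry for %r" % name
      return d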
(I know, CPython will strip out these assert statements if we ever
run with optimization enabled. We're unlikely to ever do that for
these Python programs.)
As a side note, in general Python's syntax allows for both putting
unnecessary ()'s around expressions and not having a space between
a statement and an expression. This allows what would normally be
'assert expr' to be transformed into 'assert(expr)', so that it
looked like a function call to me. Fortunately there are only a few
simple statements that can even be potentially confused this way,
and I suspect I'm not likely to imagine 'raise' or 'yield' could
be function calls (or 'return').
(You can write some complex statements this way, such as 'if(expr):',
but then the ':' makes it clear that you have a statement, not a
function call.)
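A related hazard of the function call illusion: if you add a message to the 'call' form, you actually assert a two-element tuple, which is always truthy, so the assert can never fail. Modern CPython at least warns about this:

  >>> assert(1 == 2, "this never fires")
  <stdin>:1: SyntaxWarning: assertion is always true, perhaps remove parentheses?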
2023-04-30
os.walk, the temptation of hammers, and the paralysis of choice
I have a shell script to give me a hierarchical, du-like report of memory usage broken down by Linux cgroup. Even back when I wrote it, it really needed to be something other than a shell script, and a recent addition made it quite clear that the time had come (the shell script version is now both slow and inflexible). So as is my habit, I opened up a 'memdu.py' in my editor and started typing. Some initial functions were easy, until I got to here:
  def walkcgroup(top):
      for dirpath, dirnames, filenames in os.walk(top, topdown=True):
          if memstatfile not in filenames:
              dirnames[:] = []
              continue
Then I stopped typing because I realized I had a pile of choices to make about exactly how this program was going to be structured, and maybe I didn't want to use os.walk(), as shown by the very first thing I wrote inside the for loop.
The reason I started writing code with os.walk() is because it's the obvious hammer to use when you want to walk all over a directory tree, such as /sys/fs/cgroup. But on the other hand, I realized that I'm not just visiting each directory; I'm in theory constructing a hierarchical tree of memory usage information. What os.walk() gives you is basically a linear walk, so if you want a tree, reconstructing it is up to you. It's also more awkward to cut off walking down the tree if various conditions are met (or not met), especially if one of the conditions is 'my memory usage is the same as my parent's memory usage'. If what I want is really a tree, then I should probably walk the directory hierarchy myself (and pass each step its parent node, already loaded with memory information, and so on).
On the third hand, the actual way this information will be printed out is as a (sorted) linear list, so if I build a tree I'll have to linearize it later. Using os.walk() linearizes it for me in advance, and I can then readily sort it into some order. I do need to know certain information about parents, but I could put that in a dict that maps (dir)paths to their corresponding data object (since I'm walking top down I know that the parent will always be visited before the children).
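To make that dict-of-paths approach concrete, here is roughly the shape I have in mind; memstatfile's value and load_meminfo() are invented stand-ins for the real parsing:

  import os

  memstatfile = "memory.stat"   # assumed file name, for illustration

  def load_meminfo(dirpath, parent):
      # Hypothetical stand-in for reading and parsing the memory
      # statistics file; here we only record the path and parent link.
      return {"path": dirpath, "parent": parent}

  def walkcgroup(top):
      # Map each directory path to its data object. Because os.walk()
      # is top-down, a parent is always present in 'nodes' before its
      # children, so the os.path.dirname() lookup below finds it.
      nodes = {}
      for dirpath, dirnames, filenames in os.walk(top, topdown=True):
          if memstatfile not in filenames:
              dirnames[:] = []    # prune the walk below this point
              continue
          parent = nodes.get(os.path.dirname(dirpath))
          nodes[dirpath] = load_meminfo(dirpath, parent)
      return nodes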
A lot of these choices come down to what will be more convenient to code up, and these choices exist at all because of the hammer of os.walk(). Given the hammer, I saw the problem as a nail even though maybe it's a screw, and now I've realized I can't see what I have. Probably the only way to do so is to write one or the other version of the code and see how it goes. Why haven't I done that, and instead set aside the whole of memdu.py? That's because I don't want to 'waste time' by writing the 'wrong' version, which is irrational. But here I am.
(Having written this entry I've probably talked myself into going ahead with the os.walk version. If the code starts feeling awkward, much of what I've built will probably be reusable for a tree version.)
PS: This isn't the first time I've been blinded by a Python feature.
2023-03-12
Getting a Python 2 virtual environment (in 2023's twilight of Python 2)
Suppose, not entirely hypothetically, that you need to create a new
Python 2 virtual environment today; perhaps you need to install
some package to see how its old Python 2 version behaves. With
Python 3, creating a virtual environment is really easy; it's just
'python3 -m venv /tmp/pytest'. With Python 2 today, you have two
complications. First, Python 2 doesn't have a venv module (instead
it uses a 'virtualenv' command), and second, your installed Python
2 environment may not have all of the necessary infrastructure
already set up, since people are deprecating Python 2 and cutting
down any OS provided version of it to the bare minimum.
First, you need a Python 2 version of pip. Hopefully you have one
already; if not, you want the 2.7 version of get-pip.py, but don't
count on that URL lasting forever, as the URL in my 2021 entry on
this didn't. I haven't tested this latest version, so cross your
fingers. If you still care at all about Python 2, you probably
really want to make sure you have a pip2 at this point.
Once you have a pip2 in one way or another, you want to do a user
install of 'virtualenv', with 'pip2 install --user virtualenv'.
This will give you a ~/.local/bin/virtualenv command, which you
may want to rename to 'virtualenv2'. You can then use this to create
your virtual environment, 'virtualenv2 /tmp/pytest'. The result
should normally have everything you need to use the virtualenv,
including a pip2, and you can then use this virtualenv's pip2 to
install the package or packages you need to poke at.
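Putting the steps together, and assuming that ~/.local/bin is on your $PATH, the whole dance looks something like this:

  $ python2 get-pip.py --user
  $ pip2 install --user virtualenv
  $ mv ~/.local/bin/virtualenv ~/.local/bin/virtualenv2
  $ virtualenv2 /tmp/pytest
  $ /tmp/pytest/bin/pip2 install <package>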
Incidentally, if you just want to get a copy of the Python 2 version
of a particular package and not specifically install it somewhere,
you can just use pip2 to download it, with 'pip2 download <whatever>'.
I'm not sure that the result is necessarily immediately usable and
you'll have to decode it yourself ('file' may be your friend), but
depending on what you want this may be good enough.

(I took a quick look to see if there was an easier way to find out
the last supported Python 2 version of a package than 'pip2 download
<whatever>', but as far as I can see there isn't.)
(This is one of the entries that I write for myself so that I have this information if I ever need it again, although I certainly hope not to.)
PS: Another option is to use the Python 2.7 version of PyPy, which I believe comes pre-set with its own pip2, although not its own already installed virtualenv. Depending on how concerned you are about differences in behavior between CPython 2.7 and PyPy 2.7, this might not be a good option.
2023-02-20
A bit on unspecified unique objects in Python
In Why Aren't Programming Language Specifications Comprehensive? (via), Laurence Tratt shows the following example of a difference in behavior between CPython and PyPy:
  $ cat diffs.py
  print(str(0) is str(0))
  $ python3 diffs.py
  False
  $ pypy diffs.py
  True
Tratt notes that Python's language specification doesn't specify
the behavior here, so both implementations are correct. Python does
this to preserve the ability of implementations to make different
choices, and Tratt goes on to use the example of __del__ destructors.
This might leave a reader who is willing to accept the difference
in destructor behavior wondering why Python doesn't standardize
object identity here.
Since this code uses 'is', the underlying reason for the difference
in behavior is whether two invocations of 'str(0)' in one expression
result in the same actual object. In CPython 3, they don't; in
PyPy, they do. On the one hand, making these two invocations create
the same object is an obvious win, since you're creating fewer
objects and thus less garbage. A Python implementation could do
this by knowing that using str() on a constant results in a constant
result so it only needs one object, or it could intern the results
of expressions like 'str(0)' so that they always return the same
object regardless of where they're invoked. So allowing this behavior
is good for Python environments that want to be nicely optimized,
as PyPy does.
On the other hand, doing either of these things (or some combination of them) is extra work and complexity in an implementation. Depending on the path taken to this optimization, you have to either decide what to intern and when, then keep track of it all, or build in knowledge about the behavior of the built-in str() and then verify at execution time that you're using the builtin instead of some clever person's other version of str(). Creating a different str() function or class here would be unusual but it's allowed in Python, so an implementation has to support it. You can do this analysis, but it's extra work. So not requiring this behavior is good for implementations that don't want to have the code and take the (extra) time to carefully do this analysis.
This is of course an example of a general case. Languages often
want to allow but not require optimizations, even when these
optimizations can change the observed behavior of programs (as they
do here). To allow this, careful language specifications set up
explicit areas where the behavior isn't fixed, as Python does with
is (see the footnote).
In fact, famously CPython doesn't even treat all types of objects
the same:
  $ cat diff2.py
  print(int('0') is int('0'))
  $ python3 diff2.py
  True
  $ pypy diff2.py
  True
Simply changing the type of object changes the behavior of CPython. For that matter, how we create the object can change the behavior too:
  $ cat diff3.py
  print(chr(48) == str(0))
  print(chr(48) is chr(48))
  print(chr(48) is str(0))
  $ python3 diff3.py
  True
  True
  False
Both 'chr(48)' and 'str(0)' create the same string value, but only one of them results in the same object being returned by multiple calls. All of this is due to CPython's choices about what it optimizes and what it doesn't. These choices are implementation specific and also can change over time, as the implementation's views change (which is to say as the views of CPython's developers change).
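As far as I know, the specific CPython choices at work are that integers from -5 through 256 are pre-created shared singletons and single-character strings are cached, which covers both int('0') and chr(48), while str(0) must build a new string object each time. The edge of the small integer cache is easy to see interactively (entering each line separately, so the compiler can't share constants within one statement):

  >>> a = 256
  >>> b = 256
  >>> a is b
  True
  >>> a = 257
  >>> b = 257
  >>> a is b
  False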
2023-01-09
In Python, zero is zero regardless of the number type
I recently saw a Fediverse post by Mike Samuel with a Python pop quiz that tripped me up:
@shriramk Since I know you appreciate Python pop quizzes:
  my_heterogeneous_map = {
      ( 0.0): "positive zero",
      ( -0.0): "negative zero",
      ( 0): "integer zero",
  }
  print("my_heterogeneous_map=%r\n" % my_heterogeneous_map)
  del my_heterogeneous_map[False]
  print("my_heterogeneous_map=%r\n" % my_heterogeneous_map)
Before I actually tried it, I expected the dict to start out with
either two or three entries and end up with one or two, given that
boolean True and False are actually ints, with False being the same
as zero. In fact the dict starts out with one entry and ends up
with none, because in Python all three of these zeros are equal to
each other:
  >>> 0.0 == -0.0 == 0
  True
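Dict lookup goes by hash first and equality second, and objects that compare equal are required to hash the same, so all of these zeros, False included, collide on a single key:

  >>> hash(0) == hash(0.0) == hash(-0.0) == hash(False)
  True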
(This is sort of the inversion of how NaNs behave as keys in dictionaries.)
In fact this goes further. A complex number zero is equal to plain zero:
  >>> complex(0,0) == 0.0
  True
  >>> complex(0,-0.0) == 0.0
  True
  >>> complex(-0.0,-0.0) == 0.0
  True
(All three of those are different complex numbers, as you can see by printing them all, although they all compare equal to each other.)
However this is simply one instance of a general case with how Python has chosen to treat complex numbers (as well as comparisons between integers and floats):
  >>> complex(1,0) == 1
  True
  >>> complex(20,0) == 20
  True
This particular behavior for complex numbers doesn't seem to be explicitly described in the specification. Numeric Types — int, float, complex says about arithmetic operators and comparisons on mixed types:
Python fully supports mixed arithmetic: when a binary arithmetic operator has operands of different numeric types, the operand with the “narrower” type is widened to that of the other, where integer is narrower than floating point, which is narrower than complex. A comparison between numbers of different types behaves as though the exact values of those numbers were being compared.
I suppose that Python would say that the 'exact value' of a complex number with a 0 imaginary component is its real component. The equality comparison for complex numbers does at least make sense given that '20 + complex(0,0)' is '(20+0j)', or to put it another way, '20 - complex(20,0)' is (0j) and Python would probably like that to compare equal to the other versions of zero. If 'a - b == 0' but 'a != b', it would feel at least a little bit odd.
(Of course you can get such a situation with floating point numbers, but floating point numbers do odd and counter-intuitive things that regularly trip people up.)
This explanation of comparison, including equality, makes sense for 0.0 being equal to 0 (and in fact for all floating point integral values, like 20.0, being equal to their integer version; the exact value of '20.0' is the same as the exact value of '20'). As for -0.0, it turns out that the IEEE 754 floating point standard says that it should compare equal to 0.0 (positive zero), which by extension means it has the same 'exact value' as 0.0 and thus is equal to 0.
(This comes from Wikipedia's page on Signed zero).)
PS: I think the only way to detect a negative zero in Python may
be with math.copysign(); there doesn't appear to be an explicit
function for it, the way we have math.isinf() and math.isnan().
2023-01-02
Debian has removed Python 2 from its next version
The news of the time interval is that Debian's development version has removed even the 'minimal' version of Python 2.7 (via). Among other things, this includes the 'python2-minimal' and 'python2.7-minimal' packages, both of which are gone from Debian's 'testing' pseudo-distribution as well as 'unstable'. In future Debian releases, people who want Python 2 will have to build it themselves in some way (for example, copying the binary package from the current 'bullseye' release, or copying the source package and rebuilding). We've been expecting this to happen for some time, but the exact timing was uncertain until now.
Since Ubuntu generally follows Debian for things like this, I expect that the next Ubuntu LTS release (which would normally be Ubuntu 24.04 in April of 2024) won't include Python 2 either. As I write this, the in development Ubuntu 'lunar' still contains the python2-minimal package (this is 'Lunar Lobster', expected to be 23.04, cf). With four months to go before the expected release (and less time before a package freeze), I don't know if Canonical will follow Debian and remove the python2-minimal package. I wouldn't be surprised either way.
Both Canonical and Debian keep source packages around for quite a while, so people have plenty of time to grab the source .deb for python2-minimal. Pragmatically, we might as well wait to see if Canonical or Debian release additional patch updates, although that seems pretty unlikely at this point. We're very likely to keep a /usr/bin/python2 around for our users, although who knows.
Fedora currently has a python2.7 package, but I suspect that Debian's action has started the clock ticking on its remaining lifetime. However, I haven't yet spotted a Fedora Bugzilla tracking bug about this (there are a few open bugs against their Python 2.7 package). Since I still have old Python 2 programs on my Fedora desktops that I use and don't feel like rewriting, I will probably grab the Fedora source and binary RPMs at some point to avoid having to take more drastic actions.
(This means that my guess two years ago that Fedora would move before Debian turned out to be wrong.)
2022-12-09
Sometimes an Ubuntu package of a Python module is probably good enough
Recently I ran across a Python program we're interested in, and discovered that it required prometheus-client. Normally this would mean creating a virtual environment and installing the module into it, possibly along with the program (although you don't have to put the program inside the venv it uses). But when I looked out of curiosity, I saw that Ubuntu packages this module, which got me to thinking.
I'm generally a sceptic of relying on the Ubuntu packaged version of a Python module (or any OS's packaged version); I wrote about this years ago in the context of Django. Linux distribution packaging of Python modules is famously out of date, and Ubuntu makes it worse by barely fixing bugs at the best of times. However, this feels like a somewhat different situation. The program isn't doing anything much with the prometheus-client module, and the module itself isn't very demanding; probably quite a lot of versions will do, and there's unlikely to be a bug that affects us. Indeed, some quick testing of the program with the Ubuntu version suggests that it works fine.
(Although now that I look, the Ubuntu version is rather out of date. Ubuntu 22.04 LTS packages 0.9.0, from 2020-11-26, and right now according to the module's releases page it's up to 0.15.0, with quite a few changes.)
Provided that Ubuntu's version of the module works, which it seems to, using the Ubuntu packaged version is the easy path. It's not an ideal situation, but for something with simple needs (and which isn't a high priority), it's rather tempting to say that it's okay. And if the Ubuntu version proves unsatisfactory, changing over to the latest version in a virtual environment is (at one level) only a matter of changing the path to Python 3 in the program's '#!' line.
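For instance, that would hypothetically be swapping the program's first line from:

  #!/usr/bin/python3

to something like:

  #!/opt/venvs/ourprog/bin/python3

(with '/opt/venvs/ourprog' standing in for wherever the venv actually lives).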
(We have another program that requires a Python module, pyserial, that we get from Ubuntu, but I didn't think about it much at the time. This time around I first built a scratch venv for the program to test it, then discovered the Ubuntu package.)
2022-12-08
Python version upgrades and deprecations
Recently I read Itamar Turner-Trauring's It's time to stop using
Python 3.7 (via). On the one hand, this is pragmatic advice, because
as the article mentions, Python 3.7 is reaching its end of life as
of June 2023. On the other hand it gives me feelings, and one of
the feelings is that the Python developers are not making upgrades
any easier by slowly deprecating various standard library modules.
Some of these modules are basically obsolete now, but some are not
and have no straightforward replacement, such as the cgi module.
The Python developers can do whatever they want to do (that's the power of open source), and they clearly want to move Python forward (as they see it) by cleaning up the standard library. But this means that they are perfectly willing to break backward compatibility in Python 3, at least for the standard library.
One of the things that make upgrading versions of anything easy is if the new version is a drop-in replacement for the old one. Deprecating and eventually removing things in new versions means that new versions are not drop-in replacements, which makes upgrading harder. When upgrading is harder, (more) people put it off or don't do it. This happens regardless of what the software authors like or think, because people are people.
I doubt this is a direct factor in people still using Python 3.7. But I can't help but think that the Python developers' general attitude toward backward compatibility doesn't help.
(Python virtual environments make different versions of Python not exactly a drop-in replacement; in practice you're going to want to rebuild the venv. But my impression is that pretty much everyone who is seriously using Python with venvs has the tools and experience to do that relatively easily, because their venvs are automatically built from specifications. Someday I need to master doing that myself, because sooner or later we're going to need to use venvs and be able to migrate between Python versions as part of an OS upgrade.)
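The common form of 'built from specifications' is a requirements.txt file or similar, so that rebuilding against a new Python version is roughly the following, assuming such a file exists ('/opt/venvs/ourprog' is an invented path):

  $ python3 -m venv /opt/venvs/ourprog
  $ /opt/venvs/ourprog/bin/pip install -r requirements.txt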
2022-11-19
Python dictionaries and floating point NaNs as keys
Like Go, Python's floating point numbers support NaNs with the usual IEEE-754 semantics, including not comparing equal to each other. Since Python will conveniently produce them for us, we can easily demonstrate this:
  >>> k = float('nan')
  >>> k == k
  False
  >>> k is k
  True
Yesterday, I discovered that Go couldn't delete 'NaN' keys from maps (the Go version of dicts). If you initially try this in Python, it may look like it works:
>>> d = {k: "Help"} >>> d {nan: 'Help'} >>> d[k] 'Help' >>> del d[k] >>>
However, all is not what it seems:
>>> d = {k: "Help", float('nan'): "Me"} >>> d {nan: 'Help', nan: 'Me'} >>> d[float('nan')] Traceback (most recent call last): File "<stdin>", line 1, in <module> KeyError: nan
What's going on here is that Python dict indexing has a fast path for object identity, which comes into play when you look up something using exactly the same object that you used to set an entry. When you set a dict entry, Python saves the object you used as the key. If you ask a dict to look up an entry using that exact object, Python doesn't even bother calling the object's equality operation (what would be used for an '==' check); it just returns the value. This means that floating point NaNs have no chance to object that they're never equal to each other, and lookup will succeed. However, if you use a different object that is also a NaN, the lookup will fail because two NaNs never compare equal to each other.
This use of object identity in dict lookups does mean that the Python equivalent of iterating a Go map will always work:
  >>> for k in d.keys():
  ...     d[k]
  ...
  'Help'
  'Me'
When you ask a dictionary for its keys, you of course get the literal Python objects that are the keys, which can always be used to look up the corresponding entry in the dict even if they're NaNs or otherwise uncomparable or inequal under normal circumstances.
One of the other things that this starts to show us is that Python is not making any attempt to intern NaNs, unlike things like True, False, and small integers. Let's show that more thoroughly:
  >>> import math
  >>> k2 = float('nan')
  >>> k is k2
  False
  >>> k is math.nan
  False
It might be hard to make all NaNs generated through floating point
operations be the same interned object, but it would be relatively
straightforward to make 'float("nan")' always produce the same
Python object and for that Python object to also be math.nan. But
Python doesn't do either of those; every NaN is a unique object.
Personally I think that this is the right choice (whether or not
it's deliberate); NaNs are supposed to all be different from each
other anyway, so using separate objects is slightly better.
(I suspect that Python doesn't intern any floating point numbers, but I haven't checked the source code. On a quick check it doesn't intern 0.0 or +Inf; I didn't try any others. In general, I expect that interning floating point numbers makes much less sense and would result in much less object reuse than interning small integers and so on does.)
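The quick check has to avoid literals, since CPython's compiler can fold identical literals in the same chunk of code into one shared constant object; going through float() sidesteps that:

  >>> float("0.0") is float("0.0")
  False
  >>> float("inf") is float("inf")
  False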