Python programs as wrappers versus filters of other Unix programs
Sometimes I wind up in a situation, such as using smartctl's JSON output, where I want to use a Python program to process and transform the output from another Unix command. In a situation like this, there are two ways of structuring things. I can have the Python program run the other command as a subprocess, capture its output, and process it, or I can have a surrounding script run the other command and pipe its output to the Python program, with the Python program acting as a (Unix) filter. I've written programs in both approaches depending on the situation.
Which raises the question: what sort of situation makes me choose one option or the other? One reason for choosing the wrapper approach is the ease of copying the result around; a Python wrapper is only one self-contained thing to copy to our systems, while a shell script that runs a Python filter is at least two things (and then the shell script has to know where to find the Python program). And in general, a Python wrapper program makes the whole thing feel like there are fewer moving parts (that it runs another Unix command as the program's starting point is sort of an implementation detail that people don't have to think about).
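As an illustration, a minimal wrapper along these lines might look like the following sketch (the smartctl JSON field names used here are assumptions, not verified against any particular smartctl version):

```python
#!/usr/bin/env python3
# Sketch of the wrapper approach: the Python program itself runs the
# other Unix command (here smartctl) and consumes its JSON output.
import json
import subprocess
import sys

def summarize(report):
    # The field names here are assumptions about smartctl's JSON output.
    model = report.get("model_name", "unknown model")
    temp = report.get("temperature", {}).get("current", "?")
    return "%s: %s" % (model, temp)

def main():
    for disk in sys.argv[1:]:
        # 'smartctl -j -a' asks for JSON output (needs a modern smartctl).
        proc = subprocess.run(["smartctl", "-j", "-a", disk],
                              capture_output=True, text=True)
        print(disk, summarize(json.loads(proc.stdout)))

if __name__ == "__main__":
    main()
```

The point of the wrapper shape is that running smartctl is an internal detail; whoever uses this program just copies one file around.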
(The self contained nature of wrappers pushes me toward wrappers for things that I expect to copy to systems only on an 'as needed' basis, instead of having them installed as part of system setup or the like.)
One reason I reach for the filter approach is if I have a certain amount of logic that's most easily expressed in a shell script, for example selecting what disks to report SMART data on and then iterating over them. Shell scripts make expanding file name glob patterns very easy; Python requires more work for this. I have to admit that how the idea evolved also plays a role; if I started out thinking I had a simple job of reformatting output that could be done entirely in a shell script, I'm most likely to write the Python as a filter that drops into that script, rather than throw the shell script away and write a Python wrapper. Things that are clearly complex from the start are more likely to become a Python wrapper instead of a filter used by a shell script.
(The corollary of this is if I'm running the other command once with more or less constant arguments, I'm much more likely to write a wrapper program instead of a filter.)
I believe that there are (third party) Python packages that are intended to make it easy to write shell script like things in Python (and I think I was even pointed at one once, although I can't find the reference now). In theory I could use these and native Python facilities to write more Python programs as wrappers; in practice, I'm probably going to take the path of least resistance and continue to do a variety of things as shell scripts with Python programs as filters.
I don't know if writing this entry is going to get me to be more systematic and conscious about making this choice between a wrapper and a filter, but I can hope so.
PS: Another aspect of the choice is that it feels easier (and better known) to adjust the settings of a shell script by changing commented environment variables at the top of the script than making the equivalent changes to global variables in the Python program. I suspect that this is mostly a cultural issue; if we were more into Python, it would probably feel completely natural to us to do this to Python programs (and we'd have lots of experience with it).
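For illustration, the Python analog of 'settings at the top of the shell script' might look like this (the setting names and defaults here are made up):

```python
# A hypothetical 'settings block' at the top of a Python program, playing
# the same role as adjustable, commented environment variables at the top
# of a shell script. Edit these to change the program's behavior.

# Which disks to report SMART data on.
DISKS = ["/dev/sda", "/dev/sdb"]

# Whether to include temperature information in the report.
REPORT_TEMP = True
```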
The state of Python (both 2 and 3) in Ubuntu 22.04 LTS
Ubuntu 22.04 LTS has just been released and is on our minds, because we have a lot of Ubuntu 18.04 machines to upgrade to 22.04 in the next year. Since both we and our users use Python, I've been looking into the state of it on 22.04 and it's time for an entry on it, with various bits of news.
On the Python 2 front, 22.04 still provides a package for it but takes some visible (to us) steps towards eventually not having Python 2 at all. First, there is no official package provided to make /usr/bin/python point to Python 2. You can install a package (python-is-python3) to make /usr/bin/python point to Python 3, or you can have nothing official. Ubuntu is not currently forcing /usr/bin/python to be anything, so you can create your own symlink by hand (or with your own package) if you want to. We probably will because it turns out we have a reasonable number of users using '/usr/bin/python' and currently they're getting Python 2.
(If we were very energetic we would try to identify all of the users with the Linux auditing system and then go nag them. But probably not.)
Ubuntu 22.04 also drops support for the Python 2 version of Apache's mod_wsgi, which is still relevant for our Django application. It's possible that 22.04 still provides enough Python 2 support that you could build it yourself, but I haven't looked into this. Theoretically there is a Python 2 version of pip available, as 'python-pip'; in practice this conflicts with the Python 3 version, which is much more important now. If you need the Python 2 version of pip, you're going to have to install it yourself somehow (I haven't checked to see if the information from my entry on this issue works on 22.04).
Python 3 on Ubuntu 22.04 is currently in great shape. Ubuntu 22.04 comes with Python 3.10.4, which is the most current version as of right now and, impressively, was only released a month ago. Someone pushed hard to get that into 22.04 (the actual binary says it was built on April 2nd). It also packages the Python 3.8 version of PyPy 7.3.9 (as well as a Python 2 version). This is also the current version as of writing this entry (a day after Ubuntu 22.04's official release). How current both PyPy and Python 3 are is a pleasant surprise; they may drift out of date in time in the usual Ubuntu LTS way, but at least they're starting out in the best state possible.
(Ubuntu 22.04 also packages Django 3.2.12, the current Django LTS release; as I write this, Django 4.0.4 is the latest non-LTS release. I happen to think that relying on Ubuntu's Django is probably a bad idea. Based on the Django project support timeline here, 3.2 will only be supported by the Django project for two more years, until April of 2024; after that, Canonical is on its own to keep up with security issues for the remaining three years of 22.04 LTS. The package appears to be in the 'main' repository that Canonical says they support, but what that means in practice I don't know.)
Helpfully, Ubuntu 22.04 has a current version of pipx, which is now my favorite tool for handling third party Python programs that either aren't packaged by Ubuntu at all or where you don't want to be stuck with the Ubuntu versions. However, pipx has some challenges when moving from Python version to Python version, for example if you're upgrading your version of Ubuntu.
Fixing Pipx when you upgrade your system Python version
If you use your system's Python for pipx and then upgrade your system and its version of Python, pipx can have a bad problem that renders your pipx managed virtual environments more or less unrecoverable if you do the wrong thing. Fortunately there turns out to be a way around it, which I tested as part of upgrading my office desktop to Fedora 35 today.
Pipx's problem is that it stashes a bunch of stuff in a ~/.local/pipx/shared virtual environment that depends on the Python version. If this virtual environment exists but doesn't work in the new version of Python that pipx is now running with, pipx fails badly. However, pipx will rebuild this virtual environment any time it needs it, and once rebuilt, the new virtual environment works.
So the workaround is to delete the virtual environment, run a pipx command to get pipx to rebuild it, and then tell pipx to reinstall all your pipx environments. You need to do this after you've upgraded your system (or your Python version). What you do is more or less:
    # get rid of the shared venv
    rm -rf ~/.local/pipx/shared
    # get pipx to re-create it
    pipx list
    # have pipx fix all of your venvs
    pipx reinstall-all
Perhaps there is an easier way to fix up all of your pipx managed virtual environments other than 'pipx reinstall-all', but that's what I went with after my Fedora 35 upgrade and it worked. In any case, I feel that it's not a bad idea to recreate pipx managed virtual environments from scratch every so often just to clean out any lingering cruft.
(It also seems unlikely that there is any better way in general. In one way or another, all of the Python packages have to get reinstalled under the new version of Python. Sometimes you can do this by just renaming files, but any package with a compiled component may need (much) more work. Actually doing the pip installation all over again ensures that all of this gets done right, with no hacks that might fail.)
Some problems that Python's cgi.FieldStorage has
In my entry on our limited use of the cgi module, I mentioned cgi.FieldStorage as a nice simple way to write Python CGIs that deal with parameters, including POST forms. Unfortunately there are some dark sides to cgi.FieldStorage (apart from any bugs it may have), and in fairness I should discuss them. Overall, cgi.FieldStorage is probably safe for internal usage, but I would be a bit wary of exposing it to the Internet in hostile circumstances. The ultimate problem is that in the name of convenience and just working, cgi.FieldStorage is pretty trusting of its input, and on the general web one of the big rules of security is that your input is entirely under the control of an attacker.
So here are some of the problems that cgi.FieldStorage has if you expose it to hostile parties. The first broad issue is that FieldStorage doesn't have any limits:
- it allows people to upload files to you, whether or not you expected this; the files are written to the local filesystem. Modern versions of FieldStorage do at least delete the files when the Python garbage collector destroys the FieldStorage object.
- it has no limits on how large a POST body it will accept or how long it will wait to read a POST body in (or how long it will wait to upload files). Some web server CGI environments may impose their own limits on these, especially time, but an attacker can probably at least flood your memory.
(The FieldStorage init function does have some parameters that could be used to engineer some limits, with additional work like wrapping standard input in a file-like thing that imposes size and time limits. For size limits you can also pre-check the Content-Length.)
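As a sketch of the Content-Length pre-check idea (the size limit here is an arbitrary example value):

```python
# Reject an oversized POST before handing anything to cgi.FieldStorage,
# by checking the Content-Length that the web server passed to the CGI.
import os

MAX_POST = 1024 * 1024  # 1 MByte; an arbitrary illustrative limit

def post_too_large(environ=os.environ):
    try:
        clen = int(environ.get("CONTENT_LENGTH", "0"))
    except ValueError:
        # A malformed Content-Length is itself suspicious; reject it.
        return True
    return clen > MAX_POST
```

This doesn't protect you against a client that declares a small Content-Length and then trickles the body in slowly; time limits need separate handling.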
Then there is the general problem that form submissions are not actually really like a Python dict (or any language's form of it). All dictionary like things require unique keys, but attackers are free to feed you duplicate ones in their requests. FieldStorage's behavior here is not well defined, but it probably takes the last version of any given parameter as the true one. If something else in your software stack has a different interpretation of duplicate parameters, your CGI and that other component are actually seeing two different requests. This is a classic way to get security problems.
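The ambiguity is easy to demonstrate with the standard library's urllib.parse, which at least preserves all of the duplicate values so you can decide for yourself:

```python
# Two components that parse the same query string can disagree about
# what 'user' is if one takes the first duplicate and the other the last.
from urllib.parse import parse_qs

qs = "user=alice&user=mallory"
values = parse_qs(qs)["user"]   # ['alice', 'mallory']
first, last = values[0], values[-1]
# 'first' and 'last' are different people; which one did you authorize?
```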
(FieldStorage also has liberal parsing by default, although you can change this with an init function parameter. Incidentally, none of the init function parameters are covered in the cgi documentation; you have to resort to help() or the cgi.py source.)
Broadly speaking, cgi.FieldStorage feels like a product of an earlier age of web programming, one where CGIs were very much a thing and the web was a smaller and ostensibly friendlier place. For a more or less intranet application that only has to deal with friendly input sent from properly programmed browsers, it's still perfectly good and is unlikely to blow up. For general modern Internet usage, well, not so much, even if you're still using CGIs.
(Wandering Thoughts is still a CGI, although with a lot of work involved. So it can be done.)
Our limited use of Python's cgi module
The news of the time interval is that Python is going to remove some standard library modules (via). This news caught my eye because two of the modules to be removed are cgi and its closely related companion cgitb. We have a number of little CGIs in our environment for internal use, and many of them are written in Python, so I expected to find us using these modules all over the place. When I actually looked, our usage was much lower than I expected, except for one thing.
Some of our CGIs are purely informational; they present some dynamic information on a web page, and don't take any parameters or otherwise particularly interact with people. These CGIs tend to use cgitb so that if they have bugs, we have some hope of catching things. When these CGIs were written, cgitb was the easy way to do something, but these days I would log tracebacks to syslog using my good way to format them.
(It will probably surprise no one that in the twelve years since I wrote that entry, none of our internal CGIs were changed away from using cgitb. Inertia is an extremely powerful force.)
Others of our CGIs are interactive, such as the CGIs we use for our self-serve network access registration systems. These CGIs need to extract information from submitted forms, so of course they use the cgi.FieldStorage class. As far as I know there is and will be no standard library replacement for this, so in theory we will have to do something here. Since we don't want file uploads, it actually isn't that much work to read and parse a standard POST form body, or we could just keep our own copy of cgi.py and use it going forward.
(The real answer is that all of these CGIs are still Python 2 and are probably going to stay that way, with them running under PyPy if it becomes necessary because Ubuntu removes Python 2 entirely someday.)
PS: DWiki, the pile of Python that is rendering Wandering Thoughts for you to read, has its own code to handle POST forms, which is why I know that doing that isn't too much work. A very long time ago DWiki did use cgi.FieldStorage and I had some problems as a result, but that got entirely rewritten when I moved DWiki to being based on WSGI.
A Python program can be outside of a virtual environment it uses
A while ago I wrote about installing modules to a custom location, and in that entry one reason I said for not doing this with a virtual environment was that I didn't want to put the program involved into a virtual environment just to use some Python modules. Recently I realized that you don't have to, because of how virtual environments add themselves to sys.path. As long as you run your program using the virtual environment's Python, it gets to use all the modules you installed in the venv. It doesn't matter where the program is and you don't have to move it from its current location; you just have to change what 'python' it uses.
The full extended version of this is that if you have your program set up to run using '#!/usr/bin/env python3', you can change what Python and thus what virtual environment you use simply by changing the $PATH that it uses. The downside of this is that you can accidentally use a different Python than you intended because your $PATH isn't set up the way you thought it was, although in many cases this will result in immediate and visible problems because some modules you expected aren't there.
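One way to see which Python (and thus which virtual environment, if any) a program actually ended up with is to ask the interpreter itself:

```python
# A quick check of what interpreter and environment we're running under.
import sys

print(sys.executable)    # the Python binary actually being used
print(sys.prefix)        # points into the venv if we're inside one
# In Python 3, sys.prefix differs from sys.base_prefix inside a venv.
in_venv = sys.prefix != sys.base_prefix
print(in_venv)
```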
(One way this might happen is if you run the program using the system Python because you're starting it with a default $PATH; one classical way this can happen is running things from crontab.)
Another possible use for this, especially in the $PATH based version, is assembling a new virtual environment with new, updated versions of the modules you use in order to test your existing program with them. You can also use this to switch module versions back and forth in live usage just by changing the $PATH your program runs with (or by repeatedly editing its #! line, but that's more work).
Realizing this makes me much more likely in the future to just use virtual environments for third party modules. The one remaining irritation is that the virtual environment is specific to the Python version, but there are various ways of dealing with that. This is one of the cases where I think we're going to want to use 'pip freeze' (in advance) and then exactly reproduce our previous install in a new virtual environment. Or maybe we can get 'python3 -m venv --upgrade <venv-dir>' to work, although I'm not going to hold my breath on that one.
(A quick test suggests that upgrading the virtual environment doesn't work, at least for going from the Ubuntu 18.04 LTS Python 3 to the Ubuntu 20.04 LTS Python 3. This is more or less what I expected, given what would be involved, so building a new virtual environment from scratch it is. I can't say I'm particularly happy with this limitation of virtual environments, especially given that we always have at least two versions of Python 3 around because we always have two versions of Ubuntu LTS in service.)
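The 'pip freeze in advance' approach might look more or less like this (the /tmp paths are purely illustrative; in real use you'd point at your actual old venv and run the new Python to build the new one):

```shell
# Record what the old venv has, then rebuild from scratch with the new
# Python and reinstall exactly those module versions.
python3 -m venv /tmp/old-venv
/tmp/old-venv/bin/pip freeze > /tmp/requirements.txt
python3 -m venv /tmp/new-venv
/tmp/new-venv/bin/pip install -r /tmp/requirements.txt
```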
os.environ is surprisingly liberal in some ways
The way you access and modify Unix environment variables in Python programs is generally through os.environ; Python 3 being Python 3, sometimes you need os.environb. In Unix, what can go in the environment is somewhat fuzzy and while Python has some issues with character encodings, it's otherwise surprisingly liberal in a number of ways.
The first way that os.environ is liberal is that it allows environment variables to have blank values:
    >>> os.environ["FRED"] = ""
    >>> subprocess.run("printenv")
    [...]
    FRED=
    [...]
It's possible to do this with some Unix shells as well, but traditionally environment variables are generally assumed to have non-blank values. Quite a lot of code is likely to assume that a blank value is the same as the variable being unset, although in Python you can tell the difference, since os.environ raises a KeyError if the environment variable doesn't exist at all.
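A quick demonstration of the difference between a blank variable and an unset one:

```python
# A blank environment variable is present but empty; an unset one is
# genuinely absent, and indexing it raises KeyError.
import os

os.environ["BLANK"] = ""
os.environ.pop("UNSET", None)

present = os.environ.get("BLANK")   # "" -- set, but blank
absent = os.environ.get("UNSET")    # None -- not set at all
try:
    os.environ["UNSET"]
    raised = False
except KeyError:
    raised = True
```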
A bigger way that os.environ is liberal is that it will allow you to use non-traditional characters in the names of environment variables:
    >>> os.environ["FRED/BAR"] = "Yes"
    >>> subprocess.run("printenv")
    [...]
    FRED/BAR=Yes
On Unix, setting an environment variable uses setenv(), which generally only requires that you avoid '='. Python specifically checks for an '=' in your name so that it can generate a specific error, and otherwise passes things through.
Python itself doesn't particularly restrict environment variable names beyond that. As a result you can do all sorts of odd things with environment variable names, including putting spaces and Unicode into them (at least in a UTF-8 environment). Some or many of these environment variables won't be accessible to a shell program, but not everything that interprets the environment follows the shell's rules.
The case where this came up for me recently was in Dovecot post-login scripting, which in some cases can require you to create environment variables with '/' in their names. Typical shells disallow this, but I was quite happy to find that Python was perfectly willing to go ahead and everything worked fine.
Python's Global Interpreter Lock is not there for Python programmers
I recently read Evan Ovadia's Data Races in Python, Despite the Global Interpreter Lock (via), which discusses what its title says. Famously, the Global Interpreter Lock only covers the execution of individual Python bytecodes (more or less), and what this does and doesn't cover is tricky, subtle, and depends on the implementation details of Python code. For example, making a Python class better and more complete can reduce what's safe to do with it without explicit locking.
These days, I've come to feel that the Global Interpreter Lock is not really for Python programmers. Who the GIL is for is the authors of CPython packages that are written in C (or in general any compiled language). The GIL broadly allows authors of those packages to not implement any sort of locking in their own code, even when they're manipulating C level data structures, because they're guaranteed that their code will never be called concurrently or in parallel. This extends to the Python standard objects themselves, so that (in theory) Python dicts don't need any sort of internal locks in order to avoid your CPython process dumping core or otherwise malfunctioning spectacularly. Concurrency only enters into your CPython extension if you explicitly release the GIL, and the rules of the CPython API make you re-take the GIL before doing much with interpreter state.
(There are probably traps lurking even for C level extensions that allow calls back into Python code to do things like get attributes. Python code can come into the picture in all sorts of places. But for simple operations, you have a chance.)
Avoiding internal locks while looking into or manipulating objects matters a lot for single threaded performance (Python code looks into objects and updates object reference counts quite frequently). It also makes the life of C extensions simpler. I'm not sure when threading was added to Python (it was a very long time ago), but there might have been C extensions that predated it and which would have been broken in multi-threaded programs if CPython added a requirement for internal locking in C-level code.
The Global Interpreter Lock can be exploited by Python programmers; doing so is even fun. But we really shouldn't do it, because it's not designed for us and it doesn't necessarily work that well when we try to use it anyway. Python has a variety of explicit locking facilities available in the standard library threading module, and we should generally use them even if it's a bit more annoying.
(Honesty compels me to admit that I will probably never bother to use locking around 'simple' operations like appending to a list or adding an entry to a dict. I suspect at least some people would even see using explicit locks for that (in threaded code) to be un-Pythonic.)
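The classic illustration of the kind of data race the GIL doesn't save you from is an unlocked '+=' on a shared counter: it's a read-modify-write sequence spread over several bytecodes, so threads can interleave between them and lose updates (whether you actually observe losses depends on the CPython version and timing):

```python
# Increment a shared counter from several threads without a lock. With
# an explicit threading.Lock around the increment the result is always
# 400000; without it, updates can be lost because 'counter += 1' is not
# atomic under the GIL.
import threading

counter = 0

def bump(n):
    global counter
    for _ in range(n):
        counter += 1

threads = [threading.Thread(target=bump, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # may well be less than 400000
```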
Some things on Django's CSRF protection, sessions, and REMOTE_USER
We have a Django application where we've had mysterious CSRF problems in the past, which I've theorized was partly because we use it behind Apache HTTP Basic Authentication. As part of recovering my understanding of Django and Apache HTTP Basic Authentication, I've been digging into how Django's CSRF protection works and how it interacts with all of this.
Our starting point is Django's documentation on Cross Site Request Forgery protection. How it works is that Django sets a CSRF cookie and then embeds a hidden form field; on form submission, the two pieces of information must be present and match (everyone does something like this). The CSRF cookie and the form field are both derived from a shared secret to protect from BREACH attacks. The important thing about this shared secret in some situations is, well, let me quote the documentation:
For security reasons, the value of the secret is changed each time a user logs in.
In a Django environment with normal authentication, it's clear when a user logs in; it's when they go through the Django login process, providing Django with a clear moment to establish an authenticated session, rotate secrets, and so on. In an environment where Django is instead relying on external authentication via REMOTE_USER, it's not so clear. The documentation says only that RemoteUserMiddleware will detect the username to authenticate and auto-login that user. The answer to this turns out to involve Django sessions.
When you have sessions enabled in Django, which you normally do, all requests have an associated session (visible in request.session). To simplify, important sessions are identified and tracked by browser cookies, with one created on the fly if necessary (along with a new session). A session may be anonymous or may be for an authenticated user. If the session object for the current request lacks an authenticated user but the request has a REMOTE_USER, RemoteUserMiddleware 'logs in' the indicated user, which will rotate the CSRF secret.
(I'm not sure how Django handles CSRF secrets for anonymous, unauthenticated people. Some versions appear to set the CSRF browser cookie without any session cookie.)
In the default Django configuration, this creates an important split between when you think you've logged in and when Django thinks you've logged in. You think you're logging in any time you have to enter your login and password for HTTP Basic Authentication (which is normally only once, until you quit the browser). However, Django only thinks you're logging in if your session is unauthenticated, and the session cookie Django sets in your browser normally lasts for two weeks (cf). Before then you can quit your browser, start it up again, re-do HTTP Basic Authentication, and not log in from Django's perspective because your session is still fine. Equally, you can keep your browser running and authenticated for more than two weeks, at which point your session cookie will expire and Django will consider you to be logging back in again (with a CSRF secret rotation) even though you were never challenged for a password.
(If you use the relevant setting to tell Django to use a browser session cookie to identify the Django session, you at least more or less synchronize Django's view of you logging in with your view of it.)
The other wrinkle is that if RemoteUserMiddleware sees an authenticated session for a request without REMOTE_USER set, it logs the session out. This is half-documented by implication, but you have to remember (or know) that 'all authenticated requests' means 'all requests with a session that thinks it's authenticated' (and the documentation doesn't actually say that your session gets logged out). This matters if part of your application is generally accessible (for anyone to submit an account request) while part of it is protected by HTTP Basic Authentication (for authorized people to approve those requests for accounts). Suppose that you go to approve an account request, which involves a CSRF protected form, but then pause and in another window go look at the unprotected account request submission page. You're now invisibly logged out, and when you submit the form in your first window, you will be logged back in, which triggers CSRF secret rotation, which invalidates the CSRF secret that underlies both the cookie and the form you just submitted.
To get around this, I think you want to use PersistentRemoteUserMiddleware instead. Or tell people not to do this.
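In settings terms the change is just a different middleware class (the two middleware class names here are Django's; the rest of the MIDDLEWARE list is illustrative):

```python
MIDDLEWARE = [
    # ...
    "django.contrib.sessions.middleware.SessionMiddleware",
    "django.contrib.auth.middleware.AuthenticationMiddleware",
    # PersistentRemoteUserMiddleware, unlike RemoteUserMiddleware, doesn't
    # log the session out just because a particular request arrived
    # without REMOTE_USER set.
    "django.contrib.auth.middleware.PersistentRemoteUserMiddleware",
    # ...
]
```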
(Much or all of this goes back at least to Django 1.10 and I don't think it changed between 1.10 and 1.11, so all of this still doesn't really explain our CSRF issue in 1.11. But at least I can now probably make problems much less likely in any version of Django.)
PS: One thing that the sessions documentation tells you that I didn't previously know is that in the default configuration where sessions are saved in your database, you need to clear old expired ones out of it periodically with 'django-admin clearsessions'. We hadn't been doing that, and so had entries for ones going back to 2016. The saving grace is that I don't think sessions get written to the database until they really have something in them, like an authenticated user; otherwise we'd have a lot more of them in the database than we do.
Django and Apache HTTP Basic Authentication (and REMOTE_USER)
We have a Django application, and part of it exists behind Apache HTTP Basic Authentication. For reasons beyond the scope of this entry, I was recently rediscovering some things about how Django interacts with Apache HTTP Basic Authentication, and so I want to write them down for myself before I forget them again.
First, the starting point in the Django documentation for this is not to search for 'HTTP Basic Authentication' or anything like that, but for the howto on authenticating with REMOTE_USER, which is the environment variable that Apache injects when it's already authenticated something. I believe that if you search for 'Django' with 'Basic Authentication' on search engines, you tend to get information about making Django or Django-related things actually perform the server side of HTTP Basic authentication itself. This is fair enough but can be confusing.
Second, you only need to configure Django itself to authenticate via REMOTE_USER if you want to use Django's own authentication for something, such as access and authorization in its admin site. It's perfectly valid (although potentially annoying) to authenticate and limit access to your Django site (or parts of it) in your Apache configuration with Apache's HTTP Basic Authentication but have a separate Django login step to access the Django admin site or even parts of your application (which will then be tracked with cookies and so on). If you want to do this, you don't want to add Django's RemoteUserMiddleware and so on into your Django settings.
(You'll have to manage Apache users and Django users separately, passwords included, and they won't be the same thing. This might wind up being confusing.)
If you do have Django authenticating with REMOTE_USER, you need your Django database superuser to be something you can authenticate with through Apache. If you cleverly set your database superuser to 'admin' but you have no 'admin' in your Basic Auth database, you will be sad. It's possible to get yourself out of this in a couple of ways, but it's better to avoid it in the first place.
(When you do have Django authenticating this way, every person who uses your Django app through HTTP Basic Authentication will wind up with an entry in the Django 'User' table. Purging old logins that no longer exist is up to you, if you care. For people who you want to be able to use the Django admin site, you need to set them as at least 'Staff' in the Django User table. You can set them as database superusers too.)
It's not necessary to use Django's REMOTE_USER support in order to make use of the authentication information yourself, as long as Apache has HTTP Basic Authentication active. You can retrieve the login name from the $REMOTE_USER environment variable and look it up in your own 'User' table by hand, as we do. You may or may not want to automatically create new entries for new users, the way Django does by default. We don't, because new people require some additional configuration on our side.
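The direct approach can be as simple as this sketch (what you then do with the login name against your own User table is application-specific and left out here):

```python
# Use Apache's authentication result directly, without Django's
# RemoteUserMiddleware: Apache only sets $REMOTE_USER in the CGI/WSGI
# environment after HTTP Basic Authentication has succeeded.
import os

def current_login(environ=os.environ):
    # Returns the authenticated login name, or None if Apache didn't
    # authenticate this request.
    return environ.get("REMOTE_USER")
```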
The corollary to this is that you can use and test your entire site under Apache HTTP Basic Authentication without having Django properly wired up to use REMOTE_USER, without noticing. I believe that this potentially actually matters, because I believe that Django does some things with sessions differently when you have the RemoteUser* things enabled, and this interacts with Django's CSRF protections. Which we've had mysterious problems with (also).