2023-11-18
Using argparse in my Python programs encourages me to add options to them
Today, for reasons outside the scope of this entry, I updated one of my very old little Python utilities to add some features and move it to Python 3 in the process. This program was so old it was using getopt, so as part of updating it I switched it over to argparse, which is what I use in all of my modern programs. The change was positive in a variety of ways, but one of the things it did was immediately cause me to add some more command line options. This isn't due to anything specific to this program; over and over again I've had much the same experience, and my argparse based programs have more options (and often better structured ones).
In thinking about it I believe there are a couple of reasons for this. First, argparse makes it easy to add the basic support for an option in a single place, in an ArgumentParser.add_argument() call. Unlike with getopt's much more primitive argument handling, I don't have to modify code in several places just to have my program accept the option, handle it to set or record some value, and include it in usage and help. Even if the option does nothing, this is an easy first step.
Second, argparse generates an 'args' (or 'options') object with all of the parsed information on it, which often makes it easy to expose new options to the code that needs to look at them. My usual pattern with argparse is to pass the 'opts' object you get from ArgumentParser.parse_args() down to the functions that get called to do the work, rather than break out specific flags, options, and so on separately. This means that a new option is usually pervasively available through the code and doesn't have to be passed in specifically; I can just check 'opts.myNewOption' wherever I want to use it. By contrast, getopt didn't create a natural options object, so I tended to pass around options separately (or worse, set global variables because it was the easy, lazy way).
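(As a minimal sketch of this pattern, with a made-up option and program rather than anything I actually have:)

import argparse

def process(host: str, opts: argparse.Namespace) -> None:
    # Any new option is available here as 'opts.<name>' without
    # changing this function's arguments.
    if opts.verbose:
        print("processing", host)

def main() -> None:
    p = argparse.ArgumentParser(description="Do something to some hosts")
    # A single add_argument() call adds the option, its default value,
    # and its usage and help text.
    p.add_argument("-v", "--verbose", action="store_true",
                   help="Report what we're doing as we do it")
    p.add_argument("host", nargs="+", help="Host(s) to do it to")
    opts = p.parse_args()
    for h in opts.host:
        process(h, opts)

if __name__ == "__main__":
    main()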
This doesn't always work; sometimes I need the new option in a place where I haven't already passed in the opts object. But it works a lot of the time, and when it works it makes it that much easier to code up the use of the new option.
(This also means it's easy to reverse out of an option I decide that I don't want after all. I can delete the creation of it and the use of it without having to go through changing function arguments around. Similarly it's easy to change how an option works because that too doesn't touch many places in the code.)
2023-11-17
My first Django database migration worked fine
We have a long standing Django application, using SQLite as the database layer because that's easy and all we need at our tiny scale. For as long as we've had the application I've not so much as breathed on its database schema (which is to say its model), because the thought of trying to do any sort of database migration was a bit scary. For reasons outside the scope of this entry, we recently decided that it was time to add some fields to our application model, so I got to (or had to) try out Django's support for more or less automatic database migrations. The whole experience was pretty painless in my simple case, although I had some learning experiences.
The basics of Django migrations are well covered by Django's documentation. For straightforward changes, you change your model(s) and then run the 'makemigrations' Django subcommand, supplying the name you want for your migration; Django will write a new Python file to do the work. Once you've made the migration you can then apply it with the 'migrate' subcommand, and in at least some cases un-apply it again. Our changes added simple fields that could be empty, which is about the simplest case you can get; I don't know how Django handles more complicated cases, for example introducing a mandatory non-null field.
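(As a concrete sketch of that simple case, with hypothetical model and field names rather than our actual application's, adding an optional field is a small model change followed by the two commands:)

# models.py (hypothetical model and field names)
from django.db import models

class Request(models.Model):
    name = models.CharField(max_length=200)
    # New optional field; blank=True and null=True mean existing rows
    # don't need a value, which keeps the migration straightforward.
    notes = models.TextField(blank=True, null=True)

# Then:
#   python manage.py makemigrations --name add_request_notes
#   python manage.py migrate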
One learning experience is that Django will want to create additional new migrations if you tinker with the properties of your (new) model fields after you've created (and probably applied) the initial migration in your development environment. For me, this happened relatively naturally as I was writing Django code to use these new fields, now that we had them. You probably don't want to wind up with a whole series of iterating migrations, so you're going to want to somehow squash these down to a single final migration for the production change. Since we use SQLite, I wound up just repeatedly removing the migration file and reverting my development database to start again, rather than tinker around with un-applying the migration and trying to get Django to rewrite it.
(Now that I'm reading the documentation all the way through there's a section on Squashing migrations, but it seems complex and not necessarily quite right for this case.)
While it's not strictly a migration issue as such, one thing that I initially forgot to do when I added new model fields was to also update the settings for our Django admin interface to display them. Partly this happened because it's been so long since I touched this code that I'd somewhat forgotten how it all worked until I looked at the admin interface, didn't see these fields in the particular model object, and started digging.
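(A sketch of what I mean, again with hypothetical names; if the admin configuration lists fields explicitly, new model fields have to be added there too or they simply won't show up:)

# admin.py (hypothetical names)
from django.contrib import admin
from .models import Request

@admin.register(Request)
class RequestAdmin(admin.ModelAdmin):
    # The new 'notes' field must be listed here to be visible and
    # editable in the admin interface.
    fields = ["name", "notes"]
    list_display = ["name", "notes"]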
Although the Django migration documentation contains scary warnings about doing migrations with a SQLite database, I had no problem doing it in production with our generally barely used web application (we normally measure activity in 'hits per month'). Given how Django does database migrations on SQLite, a sufficiently active web application might have to shut down during the migration so that the database could be fiddled with without problems. In general, shutting down will avoid a situation where your running code is incompatible with the state of your database, which can definitely happen even in simple changes.
(Some people probably deploy such a database migration in multiple steps, but with a tiny application we did it the single step way, deploying a new version of the code that used the new fields at the same time as we performed the database migration on the production database.)
My overall experience is quite positive; changing our model was nowhere near as scary or involved as I expected it to be. I suspect I'm going to be much more willing to make model changes in the future, although I still don't think of it as a casual thing.
2023-10-11
The wisdom of being selective about python-lsp-server plugins
As far as I know, the most common Python server for the Language Server Protocol is still python-lsp-server (pylsp), although there are now some alternatives (see eg the Emacs lsp-mode page on this for all languages). Pylsp can use a number of plugins (what it calls 'providers') that provide additional checks, warnings, and so on; some are integrated into the pip installation process if you ask for them, and some require installing third party packages. One of these integrated ones is support for pycodestyle.
Although I may have started out not including pycodestyle in my pylsp setup, at some point I started adding it; perhaps I thought it wouldn't do any harm and wouldn't produce obnoxious warnings. Then I tried another Emacs LSP implementation and discovered that I actually hated pycodestyle's complaints about styles. The reason I hadn't noticed before is that lsp-mode appears to ask pylsp to disable it (the diagnostics you get from an LSP server depend both on your editor's setup and on your LSP server).
Although it would be nice if the LSP support in all editors made it straightforward to have an LSP server disable things, the larger lesson for me is that I should stop hitting myself in the face. If I don't want pycodestyle's copious picky complaints about how exactly I format my Python code, I shouldn't include it in my pylsp setup in the first place. The corollary to this is that before I include a linter or a checker in my pylsp setup, it might be a wise idea to try it out separately to see if I like it (before or after customization, as might be necessary for something like ruff). Testing a checker or linter as a standalone thing doesn't guarantee I'll get exactly the same results in pylsp (or any other LSP server), but it's at least something.
(Thus, I think you shouldn't install 'python-lsp-server[all]', because you may be getting a number of surprises. Even if you like what you get now, if you reinstall this later, pylsp may have changed its idea of what 'all' should include in the meantime.)
This also suggests that I should be minimal in my pylsp configuration, just in case. Since I'm going to see the LSP warnings all of the time, more or less, I should stick to things that provide information I almost always want. I can always run additional checkers by hand or by semi-automating them in my editor or even a Makefile.
(I continue to think that setting up Python LSP support in GNU Emacs is reasonably worth it. In Emacs 29.1 you don't even need any third party packages, since 29.1 includes Eglot.)
PS: It's nice to see that Python has picked up a surprising number of linters and checkers while I wasn't looking. A part of me is still way back in the Python 2 era where my option was basically pychecker. And pipx makes it easy to try them out.
2023-09-15
Ensuring that my URL server and client programs exit after problems
I recently wrote about my new simple system to open URLs on my desktop from remote machines, where a Python client (on the remote server) listens on a Unix domain socket for URLs that programs (like mail clients) want opened, and reports these URLs to the server on my desktop, which passes them to my browser. The server and client communicate over SSH; the server starts by SSH'ing to the remote machine and running the client. On my desktop, I run the server in a terminal window, because that's the easy approach.
Whenever I have a pair of communicating programs like this, one of my concerns is making sure that each end notices when the other goes away or the communication channel breaks, and cleans itself up. If the SSH connection is broken or the remote client exits for some reason, I don't want the server to hang around looking like it's still alive and functioning; similarly, if the server exits or the SSH connection is broken, I want the remote client to exit immediately, rather than hang around claiming to other parties that it can accept URLs and pass them to my desktop to be opened in a browser.
On the server this is relatively simple. I started with my standard stanza for Python programs that I want to die when there are problems:
signal.signal(signal.SIGINT, signal.SIG_DFL)
signal.signal(signal.SIGPIPE, signal.SIG_DFL)
signal.signal(signal.SIGHUP, signal.SIG_DFL)
If I was being serious I should check to see what SIGINT was initially set to, but this is a casual program, so I'll never run it with SIGINT deliberately masked. Setting SIGHUP isn't necessary today, but I didn't remember that until I checked, and Python could always change it.
Since all the server does is read from the SSH connection to the client, I can detect both client exit and SSH connection problems by looking for end of file, which is signalled by an empty read result:
def process(host: str) -> None:
    pd = remoteprocess(host)
    assert(pd.stdout)
    while True:
        line = pd.stdout.readline()
        if not line:
            break
        [...]
As far as I know, our SSH configurations use TCP keepalives, so if the connection between my machine and the server is broken, both ends will eventually notice.
Arranging for the remote client to exit at appropriate points is a bit harder and involves a hack. The client's sign that the server has gone away is that the SSH connection gets closed, and one sign of that is that the client's standard input gets closed. However, the client is normally parked in socket.accept() waiting for new connections over its Unix socket, not trying to read from the SSH connection. Rather than write more complicated Python code to try to listen for both a new socket connection and end of file on standard input (for example using select), I decided to use a second thread and brute force. The second thread tries to read from standard input and forces the entire program to exit if it sees end of file:
def reader() -> None:
    while True:
        try:
            s = sys.stdin.readline()
            if not s:
                os._exit(0)
        except EnvironmentError:
            os._exit(0)

[...]

def main() -> None:
    [the same signal setup as above]
    t = threading.Thread(target=reader, daemon=True)
    t.start()
    [rest of code]
In theory the server is not supposed to send anything to the client, but in practice I decided that I would rather have the client exit only on an explicit end of file indication. The use of os._exit() is a bit brute force, but at this point I want all of the client to exit immediately.
This threading approach is brute force but also quite easy, so I'm glad I opted for it rather than complicating my life a lot with select and similar options. These days maybe the proper easy way to do this sort of thing is asyncio with streams, but I haven't written any asyncio code.
(I may take this as a challenge and rewrite the client as a proper asyncio based program, just to see how difficult it is.)
All of this appears to work in casual testing. If I Ctrl-C the server in my terminal window, the remote client dutifully exits. If I manually kill the remote client, my local server exits. I haven't simulated having the network connection stop working and having SSH recognize this, but my network connections don't get broken very often (and if my network isn't working, I won't be logged in to work and trying to open URLs on my home desktop).
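(For the record, a rough and untested sketch of what an asyncio streams version of the client might look like; the socket path and the details here are assumptions for illustration, not the actual program:)

import asyncio
import os
import sys

async def watch_stdin() -> None:
    # If our standard input (the SSH connection) reaches end of file,
    # exit the whole client immediately, like the thread version does.
    loop = asyncio.get_running_loop()
    while True:
        data = await loop.run_in_executor(None, sys.stdin.readline)
        if not data:
            os._exit(0)

async def handle_conn(reader: asyncio.StreamReader,
                      writer: asyncio.StreamWriter) -> None:
    # Read a URL from a local connection and pass it up the SSH
    # connection via standard output.
    url = await reader.readline()
    if url:
        sys.stdout.write(url.decode())
        sys.stdout.flush()
    writer.close()
    await writer.wait_closed()

async def main() -> None:
    # Keep a reference to the watcher task so it isn't garbage collected.
    watcher = asyncio.create_task(watch_stdin())
    # Made-up socket path; the real one is whatever the client uses.
    server = await asyncio.start_unix_server(handle_conn,
                                             path="/tmp/url-opener.sock")
    async with server:
        await server.serve_forever()

asyncio.run(main())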
2023-08-14
A brief brush with writing and using Python type hints
I was recently nerd sniped into writing a Python version of a simple although real exercise. As part of that nerd snipe, I decided to write my Python using type hints (which I've been tempted by for some time).
This is my first time really trying to use type hints, and I did it without the benefit of reading any 'quick introduction to Python type hints' articles; I worked from vague memories of seeing the syntax and reading the documentation for the standard library's typing module. I checked my type hints with mypy, without doing anything particularly fancy.
Looking at what I wrote now, I see I missed one trick through ignorance, which is how to declare attributes of objects. I wrote:
class X:
    def __init__(self) -> None:
        self.known_tests: list[str] = []
The idiomatic way of doing this is apparently:
class X:
    known_tests: list[str]

    def __init__(self) -> None:
        self.known_tests = []
I believe that mypy can handle either approach but the second is what I've seen in some recent Python articles I've read.
The declaration for '__init__' is another thing that I had to stumble over. Initially I didn't put any type annotations on '__init__' because I couldn't see anything obvious to put there, but then mypy reported that it was a method without type annotations. Marking it explicitly as returning None caused mypy to be happy.
While writing the code, as short and trivial as it is, I know that I made at least one absent-minded mistake that mypy's type checking would have caught. I believe I made the mistake before I fully filled out the types, so it's possible that simply filling them out would have jogged my mind about things so I didn't slip into the mistake. In either case, having to think about types enough to write them down feels useful, on top of the type checking itself.
At the same time, typing out the types felt both bureaucratic and verbose. Some of this is because my code involves several layers of nested containers; I have tuples inside lists and being returned by a generator. However, I don't think this is too unusual, so I'd expect to want to define a layer of intermediate types in basically anything sophisticated, like this:
logEntryType = tuple[str, typing.Any]
This name exists only to make type hints happy (or, to put it the other way, to make them less onerous to write). It's not present in the code or used by it. Possibly this is a sign that in type hint heavy code I'd wind up wanting to define a bunch of small data only dataclasses, simply so I could use these names outside of type hints. This makes me wonder if retrofitting type hints to already written code will be somewhat awkward, because I'd wind up wanting to structure the data differently. In code without type hints, slinging around tuples and lists is easy, and 'bag of various things' is a perfectly okay data structure. In code with type hints, I suspect all of that may get awkward in the same way this 'logEntryType' is.
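(If I did go that way, the dataclass equivalent of the 'logEntryType' alias might look something like the following; the field names are guesses for illustration, not what the code actually slings around:)

import dataclasses
import typing

@dataclasses.dataclass
class LogEntry:
    # Hypothetical fields standing in for the tuple's real contents.
    what: str
    details: typing.Any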
Despite having gone through this exercise, I'm not sure how I feel about using type hints in Python. I suspect that I need to write something more substantial with type hints, or try to retrofit some of our existing code with them, or both, before I can have a really solid view. But at the very least they didn't make me dislike the experience.
2023-07-03
Our Python fileserver management code has been quite durable over the years
At this point we've been running our ZFS based NFS fileserver environment for about fifteen years, starting with Solaris 10 and evolving over time to the current Ubuntu 22.04 based servers. Over that time we've managed the various iterations of the same basic thing primarily through a local set of programs (all with names starting in 'san', despite the fact that we don't have a SAN any more). These programs have always been written in Python. They started out as Python 2 programs on Solaris and then OmniOS, and were moved to Python 3 when we moved to our Linux based fileservers. Naturally, we have version control history for the Python code of these tools that goes all the way back to the first versions in 2008.
(For reasons, the Solaris and the Linux versions are in different source repositories.)
I was recently working on these programs, which made me curious to see how much the current ones have changed from the very first versions. The answer turns out to be not very much, and only in two real areas. The first is that in the change from Python 2 to Python 3, we stopped using pychecker annotations and the optparse module, switching to argparse (and making a few other Python 3 changes). The second is that when we moved from the OmniOS fileserver generation to the Linux fileserver generation, we moved from using iSCSI disks that came from iSCSI backends (and Solaris/OmniOS device names) to using locally attached disks with, naturally, Linux device names. Otherwise, the code is almost entirely the same. Well, for features that have always existed, since we added features to the tools over time. But even there, most of the features were present by the end of the OmniOS era, and their code mostly hasn't changed between then and now.
(In some programs, more comments changed than code did. This has left some vaguely amusing artifacts behind, like a set of local variables cryptically called 'bh', 'bd', and 'bl', which were originally short for 'backend host/disk/lun'. We no longer have hosts or LUNs, but we still have things that fill in those slots and I never renamed the local variables.)
On the one hand, this is what you'd want in a programming language; when and if there's change, it's because you're changing what the code does and how it does it, not because the language environment has changed or works differently on different systems. On the other hand, these days it feels like some programming environments exist in a constant state of churn, with old code either directly obsolete or functionally obsolete within a few years due to changes around it. Python hasn't been without such changes (see Python 2 to Python 3), but in practice a lot of code really has carried on basically as-is. This is something we rather appreciate in our local tools, because our real goal isn't to write and maintain tools, it's to do things with them.
2023-06-27
Belatedly remembering to use the two expression form of Python's assert
Today I confessed on the Fediverse that I had somehow mentally overwritten what I once knew about Python's assert with a C-like version that I wrote as 'assert(expression)' (which I apparently started doing more than a decade ago). What caused me to notice this was that I was revising some Python code to cope with a new situation, and I decided I wanted to fail in some way if an impossible thing turned out to not be as impossible as I thought. This wasn't an error that should be returned normally, and it wasn't really something I wanted to raise an explicit exception for, so adding an assert was the easy way.
So at first I wrote 'assert(2 <= n <= 23)', and then in my usual way deliberately forced the assert to fail to test things. This caused me to change the variable name to make the assert slightly more informational, as 'assert(2 <= disknum <= 23)'. This gave a better clue about what the assert was about, but it didn't say what was wrong. Thinking about how to fix that caused a dim flickering light to appear over my head and sent me off to read the specification of assert, which told me about the two expression version and also reminded me that assert is a statement, not a C-like function call.
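(In the two expression form the second expression becomes the AssertionError's message, so the check can carry its own explanation; the message here is my illustration, not the actual code's:)

assert 2 <= disknum <= 23, "impossible disk number: %r" % disknum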
(My new use of assert in my code hopefully includes enough information about the surrounding context that I can see what went wrong, if something does. It won't give me everything but these are quick, low-effort checks that I don't expect to ever trigger.)
Now that I've re-discovered this full form of assert, my goal is to use it more often for "this is never expected to happen" safety checks in my code. Putting in a single line of an assert can convert an otherwise mysterious failure (like the famous 'NoneType object has no attribute ...' error) into a more explicit one, and prevent my code going off the rails in cases where it might not fail immediately.
(I know, CPython will strip out these assert statements if we ever run with optimization enabled. We're unlikely to ever do that for these Python programs.)
As a side note, in general Python's syntax allows for both putting unnecessary ()'s around expressions and then not having a space between a statement and an expression. This allows what would normally be 'assert expr' to be transformed into 'assert(expr)', so that it looked like a function call to me. Fortunately there are only a few simple statements that can even be potentially confused this way, and I suspect I'm not likely to imagine 'raise' or 'yield' could be function calls (or 'return').
(You can write some complex statements this way, such as 'if(expr):', but then the ':' makes it clear that you have a statement, not a function call.)
2023-04-30
Os.walk, the temptation of hammers, and the paralysis of choice
I have a shell script to give me a hierarchical, du-like report of memory usage broken down by Linux cgroup. Even back when I wrote it, it really needed to be something other than a shell script, and a recent addition made it quite clear that the time had come (the shell script version is now both slow and inflexible). So as is my habit, I opened up a 'memdu.py' in my editor and started typing. Some initial functions were easy, until I got to here:
def walkcgroup(top):
    for dirpath, dirnames, filenames in os.walk(top, topdown=True):
        if memstatfile not in filenames:
            dirnames[:] = []
            continue
Then I stopped typing because I realized I had a pile of choices to make about exactly how this program was going to be structured, and maybe I didn't want to use os.walk(), as shown by the very first thing I wrote inside the for loop.
The reason I started writing code with os.walk() is because it's the obvious hammer to use when you want to walk all over a directory tree, such as /sys/fs/cgroup. But on the other hand, I realized that I'm not just visiting each directory; I'm in theory constructing a hierarchical tree of memory usage information. What os.walk() gives you is basically a linear walk, so if you want a tree reconstructing it is up to you. It's also more awkward to cut off walking down the tree if various conditions are met (or not met), especially if one of the conditions is 'my memory usage is the same as my parent's memory usage'. If what I want is really a tree, then I should probably walk the directory hierarchy myself (and pass each step its parent node, already loaded with memory information, and so on).
On the third hand, the actual way this information will be printed out is as a (sorted) linear list, so if I build a tree I'll have to linearize it later. Using os.walk() linearizes it for me in advance, and I can then readily sort it into some order. I do need to know certain information about parents, but I could put that in a dict that maps (dir)paths to their corresponding data object (since I'm walking top down I know that the parent will always be visited before the children).
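(A rough sketch of that dict-based approach, with a made-up loader function and stat file name standing in for the real memory statistics handling:)

import os
import os.path

memstatfile = "memory.stat"   # assumed name, standing in for the real one

def loadmeminfo(dirpath):
    # Stand-in for whatever actually reads and parses the memory usage.
    with open(os.path.join(dirpath, memstatfile)) as f:
        return {"path": dirpath, "raw": f.read()}

def walkmemusage(top):
    byname = {}
    for dirpath, dirnames, filenames in os.walk(top, topdown=True):
        if memstatfile not in filenames:
            dirnames[:] = []    # don't descend any further
            continue
        info = loadmeminfo(dirpath)
        # Walking top down means the parent directory (if it had a
        # memory stat file) was already visited, so we can look it up.
        info["parent"] = byname.get(os.path.dirname(dirpath))
        byname[dirpath] = info
    return byname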
A lot of these choices come down to what will be more convenient to code up, and these choices exist at all because of the hammer of os.walk(). Given the hammer, I saw the problem as a nail even though maybe it's a screw, and now that I've realized this I can't tell which one I actually have. Probably the only way to tell is to write one or the other version of the code and see how it goes. Why haven't I done that, and instead set aside the whole of memdu.py? That's because I don't want to 'waste time' by writing the 'wrong' version, which is irrational. But here I am.
(Having written this entry I've probably talked myself into going ahead with the os.walk version. If the code starts feeling awkward, much of what I've built will probably be reusable for a tree version.)
PS: This isn't the first time I've been blinded by a Python feature.
2023-03-12
Getting a Python 2 virtual environment (in 2023's twilight of Python 2)
Suppose, not entirely hypothetically, that you need to create a new Python 2 virtual environment today; perhaps you need to install some package to see how its old Python 2 version behaves. With Python 3, creating a virtual environment is really easy; it's just 'python3 -m venv /tmp/pytest'. With Python 2 today, you have two complications. First, Python 2 doesn't have a venv module (instead it uses a 'virtualenv' command), and second, your installed Python 2 environment may not have all of the necessary infrastructure already set up since people are deprecating Python 2 and cutting down any OS provided version of it to the bare minimum.
First, you need a Python 2 version of pip. Hopefully you have one already; if not, you want the 2.7 version of get-pip.py, but don't count on that URL lasting forever, as the URL in my 2021 entry on this didn't. I haven't tested this latest version, so cross your fingers. If you still care at all about Python 2, you probably really want to make sure you have a pip2 at this point.
Once you have a pip2 in one way or another, you want to do a user install of 'virtualenv', with 'pip2 install --user virtualenv'. This will give you a ~/.local/bin/virtualenv command, which you may want to rename to 'virtualenv2'. You can then use this to create your virtual environment, 'virtualenv2 /tmp/pytest'. The result should normally have everything you need to use the virtualenv, including a pip2, and you can then use this virtualenv pip2 to install the package or packages you need to poke at.
Incidentally, if you just want to get a copy of the Python 2 version of a particular package and not specifically install it somewhere, you can just use pip2 to download it, with 'pip2 download <whatever>'. I'm not sure that the result is necessarily immediately usable and you'll have to decode it yourself ('file' may be your friend), but depending on what you want this may be good enough.
(I took a quick look to see if there was an easier way to find out the last supported Python 2 version of a package than 'pip2 download <whatever>', but as far as I can see there isn't.)
(This is one of the entries that I write for myself so that I have this information if I ever need it again, although I certainly hope not to.)
PS: Another option is to use the Python 2.7 version of PyPy, which I believe comes pre-set with its own pip2, although not its own already installed virtualenv. Depending on how concerned you are about differences in behavior between CPython 2.7 and PyPy 2.7, this might not be a good option.
2023-02-20
A bit on unspecified unique objects in Python
In Why Aren't Programming Language Specifications Comprehensive? (via), Laurence Tratt shows the following example of a difference in behavior between CPython and PyPy:
$ cat diffs.py
print(str(0) is str(0))
$ python3 diffs.py
False
$ pypy diffs.py
True
Tratt notes that Python's language specification doesn't specify the behavior here, so both implementations are correct. Python does this to preserve the ability of implementations to make different choices, and Tratt goes on to use the example of __del__ destructors. This might leave a reader who is willing to accept the difference in destructor behavior wondering why Python doesn't standardize object identity here.
Since this code uses 'is', the underlying reason for the difference in behavior is whether two invocations of 'str(0)' in one expression result in the same actual object. In CPython 3, they don't; in PyPy, they do. On the one hand, making these two invocations create the same object is an obvious win, since you're creating fewer objects and thus less garbage. A Python implementation could do this by knowing that using str() on a constant results in a constant result so it only needs one object, or it could intern the results of expressions like 'str(0)' so that they always return the same object regardless of where they're invoked. So allowing this behavior is good for Python environments that want to be nicely optimized, as PyPy does.
On the other hand, doing either of these things (or some combination of them) is extra work and complexity in an implementation. Depending on the path taken to this optimization, you have to either decide what to intern and when, then keep track of it all, or build in knowledge about the behavior of the built in str() and then verify at execution time that you're using the builtin instead of some clever person's other version of str(). Creating a different str() function or class here would be unusual but it's allowed in Python, so an implementation has to support it. You can do this analysis, but it's extra work. So not requiring this behavior is good for implementations that don't want to have the code and take the (extra) time to carefully do this analysis.
This is of course an example of a general case. Languages often want to allow but not require optimizations, even when these optimizations can change the observed behavior of programs (as they do here). To allow this, careful language specifications set up explicit areas where the behavior isn't fixed, as Python does with 'is' (see the footnote).
In fact, famously CPython doesn't even treat all types of objects the same:

$ cat diff2.py
print(int('0') is int('0'))
$ python3 diff2.py
True
$ pypy diff2.py
True
Simply changing the type of object changes the behavior of CPython. For that matter, how we create the object can change the behavior too:
$ cat diff3.py
print(chr(48) == str(0))
print(chr(48) is chr(48))
print(chr(48) is str(0))
$ python3 diff3.py
True
True
False

Both 'chr(48)' and 'str(0)' create the same string value, but only one of them results in the same object being returned by multiple calls. All of this is due to CPython's choices about what it optimizes and what it doesn't. These choices are implementation specific and also can change over time, as the implementation's views change (which is to say as the views of CPython's developers change).
Both 'chr(48)' and 'str(0)' create the same string value, but only one of them results in the same object being returned by multiple calls. All of this is due to CPython's choices about what it optimizes and what it doesn't. These choices are implementation specific and also can change over time, as the implementation's views change (which is to say as the views of CPython's developers change).