Wandering Thoughts


Python virtual environments can usually or often be moved around

Python virtual environments are magical in various ways. They get transparently added to sys.path and programs can be outside of them as long as they use the venv's Python (which is normally a symlink to some system version of Python), for two examples. All of this magic is triggered by the presence of a pyvenv.cfg file at the root of the venv (cf). The contents of this pyvenv.cfg are very minimal and in particular they don't name the location of the venv's root.

For example, here's a pyvenv.cfg:

home = /usr/bin
include-system-site-packages = false
version = 3.10.5

In fact this is the pyvenv.cfg of no less than six venvs from my Fedora 36 desktop (these are being managed through pipx, but pipx creates normal venvs). All of the pyvenv.cfg files have the same contents because they're all using the same Python version and general settings.

Since pyvenv.cfg and the rest of the virtual environment don't contain any absolute paths to themselves and so don't 'know' where they're supposed to be, it's possible to move venvs around on the filesystem. As a corollary of this it's possible to copy a venv to a different system (in a different filesystem location or the same), provided that the system has the same version of Python, which is often the case if you're using the same Linux distribution version on both. This doesn't seem to be explicitly documented in the venv module documentation and it's possible that some Python modules you may install do require absolute paths and aren't movable, but it seems to be generally true.

If you use pipx there's a caution here, because pipx writes a pipx_shared.pth file into the venv's site-packages directory that does contain the absolute path to its shared collection of Python stuff. I believe this is part of the underlying cause of pipx's problem with Python version upgrades, which is fixed by removing this shared area and having pipx rebuild it.

Another caution comes from systems like Django which may create standard programs or files as part of their project setup. If you create a Django venv and start a Django project from it, Django will create a 'manage.py' executable for the project that has the venv's (current) absolute path to its Python interpreter burned into its '#!' line. If you then move this venv, your manage.py will (still) try to use the Python from the old venv's location, which will either not work or get you the wrong site-packages path.

On the one hand, it's convenient that this works in general, and that there's nothing in the general design of virtual environments that blocks it. On the other hand, it's clear that you can have various corner cases (as shown with pipx and Django), so it's probably best to create your venvs in their final location if you can. If you do have to move venvs (for example they have to be built in one directory and deployed under another), you probably want to test the result and scan for things with the absolute path burned into them.
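As a minimal sketch of that last step, the following walks a venv and reports files that contain a given absolute path. The function name `find_burned_in_paths` and the approach of searching for the old root as a byte string are assumptions for illustration, not anything the venv module provides.

```python
import os

def find_burned_in_paths(venv_root, old_path):
    """Return sorted relative paths of files under venv_root containing old_path."""
    needle = os.fsencode(old_path)
    hits = []
    for dirpath, _dirnames, filenames in os.walk(venv_root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip symlinks (such as bin/python) so that only files that
            # actually live inside the venv are read.
            if os.path.islink(full):
                continue
            try:
                with open(full, "rb") as f:
                    data = f.read()
            except OSError:
                # Unreadable files are ignored in this sketch.
                continue
            if needle in data:
                hits.append(os.path.relpath(full, venv_root))
    return sorted(hits)
```

Running this against the moved venv with the old location as `old_path` gives a list of files to inspect by hand; it reads each file fully into memory, which is fine for a venv but not for arbitrary trees.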

(I noticed this pyvenv.cfg behavior when I first looked at venvs and sys.path, but I didn't look very much into it at the time. As usual, writing an entry about this has left me better informed than before I started.)

VenvsCanUsuallyBeMoved written at 22:28:01


Python is my default choice for scripts that process text

Every so often I wind up writing something that needs to do something more complicated than can be readily handled in Bourne shell, awk, or other basic Unix scripting tools. When this happens, the language I most often turn to is Python, and Python is especially my default choice when the work I'm doing involves processing text in some way (or often if I need to generate text). For example, if I want to analyze the output of some command and generate Prometheus metrics from it, Python is often my choice. These days, this is Python 3, even with its warts with handling non-Unicode input (which usually don't come up in this context).

(What a lot of these programs do could be summarized as string processing with logic.)

In theory there's no obvious reason that my language of choice couldn't be, say, Go. But in practice, Python has much less friction than something like Go while still having enough structure and capabilities to be better than a much more limited tool like awk. One part of this is Python's casualness about typing, especially typing in dicts. In Python, you can shove anything you want into a dict and it's completely routine to have dicts with heterogeneous values (usually your keys are homogeneous, eg all strings). This might be madness in a large program, but for small, quickly written things it's a great speedup.

(Some of the need for this can be lessened with dataclasses or attrs. And Python lets you scale up from basic dicts to those, or to basic classes used as little more than records, as you decide they make your code simpler.)
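To make the scaling-up concrete, here is a small sketch of the same record first as a casual dict and then as a dataclass. The `Disk` class, its fields, and the sample values are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

# The quick version: string keys, values of whatever types are handy.
disk = {"name": "sda", "size": 512110190592, "flags": ["ssd", "boot"], "temp": None}

# The more structured version, once the dict starts to feel too loose.
@dataclass
class Disk:
    name: str
    size: int
    flags: list = field(default_factory=list)
    temp: Optional[float] = None

# Because the keys match the field names, the dict converts directly.
record = Disk(**disk)
```

The conversion only works while the dict's keys stay in step with the dataclass fields, which is itself a mild form of the structure that the dataclass adds.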

Another area where Python reduces friction is in the lack of explicit error handling while still not hiding errors; exceptions ensure that while you may not deal with errors well, you will deal with them one way or another. Again this isn't necessarily what you want in a bigger, more structured program, but in the small it's quite handy to not have to ornament every 'int(...)' or whatever with some sort of error check.

In general, Python is (surprisingly) good at pulling strings apart, shuffling them around, and putting them back together, while still staying structured enough to let me follow what the code does even when I come back to it later. Compact, low ceremony inline string formatting is often quite useful (I use '%' because I'm old fashioned).
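A short sketch of this style, under the assumption of a hypothetical command whose output is lines of '<device> <temperature>'; the function name and the metric name are invented for the example. It uses '%' formatting and leaves 'int()' unguarded, so bad input surfaces as a ValueError instead of being silently ignored.

```python
def metrics_from_output(text, metric="disk_temperature_celsius"):
    """Turn '<device> <value>' lines into Prometheus-style metric lines."""
    lines = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            # Not a line we understand; skip it.
            continue
        device, value = fields
        # No try/except here: a non-numeric value raises ValueError.
        lines.append('%s{device="%s"} %d' % (metric, device, int(value)))
    return lines
```

For example, feeding it "sda 35" and "sdb 41" on separate lines produces one metric line per device, each labelled with the device name.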

Python certainly isn't the only language that can be used in this way; Perl and Ruby are two other obvious examples, and more modern people would probably reach for Javascript. But Python is the one that I've wound up latching on to and sticking with.

I do find it a bit amusing and ironic that despite all of the issues in Python 3 with Unicode and IO (and my gripes surrounding that), it's what I normally use for processing text. In theory, I risk explosions; in practice, it works because I'm in a UTF-8 capable environment with well formed input (often just plain ASCII, which is the most common case for log files and command output).

PythonForStringHandling written at 22:09:30


Humanizing numbers in Python through a regexp substitution function

Recently I was looking at files that contained a bunch of sizes in bytes with very widely varying magnitudes, something like this:

file 10361909248
percpu 315360
inactive_file 8666644480
active_file 1695264768
slab_reclaimable 194324760
slab 194324760

(This is from Linux cgroup memory accounting.)

I find it hard to look at these numbers and have any feel for how big they are in absolute or relative terms, especially if I don't want to spend a lot of time thinking about it. It's much easier for me to read these numbers if they're humanized into things like '9.7G', '308.0K', and '185.3M'. To make these files more readable, I wrote a Python filter program to replace these raw byte counts with their humanized versions.

One reason I used Python for this filter is that it's my default choice for Unix text processing that requires more than sed or a light veneer of awk. Another reason is that I knew that Python's re module had a feature that made this filter very easy, which is that re.sub() can take a function as the replacement instead of a string.

Using a replacement function meant that I could write a simple function that took a match object that was guaranteed to be all decimal digits and turn it into a humanized number (in string form). Then the main loop was just:

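import re, sys

# Match every run of decimal digits.
rxp = re.compile(r"\d+")
def process():
  for line in sys.stdin:
    l = line.strip()
    # humanize_number() is called once per match.
    l = rxp.sub(humanize_number, l)
    print(l)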

The regular expression substitution does all the work of splitting the line apart and reassembling it afterward. I only need to feed lines in and dump them out afterward.

(My regular expression here is a bit inefficient; I could make it skip all one, two, and three digit numbers, for example. That would also keep it from matching numbers in identifiers, eg if a file had a line like 'fred1 100000'. For my purposes I don't need to be more precise right now, but a production version might want to be more careful.)
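One possible shape for that more careful version is sketched below. The pattern and the names `big_number`, `UNITS`, `humanize` and `humanize_line` are assumptions for this example; `humanize` is a compact restatement of the humanization function from the sidebar.

```python
import re

# Runs of four or more digits that stand alone as a word. \b does not match
# between a letter and a digit, so the '1' in 'fred1' and the digits in
# 'abc12345' are left untouched.
big_number = re.compile(r"\b\d{4,}\b")

# Ordered from the largest unit to the smallest.
UNITS = ((1024 ** 4, "T"), (1024 ** 3, "G"), (1024 ** 2, "M"), (1024, "K"))

def humanize(match):
    n = int(match.group())
    for size, suffix in UNITS:
        if n >= size:
            return "%.1f%s" % (n / size, suffix)
    # Four digit numbers below 1024 still match, and come back unchanged.
    return str(n)

def humanize_line(line):
    return big_number.sub(humanize, line)
```

With this, `humanize_line("fred1 100000")` gives `'fred1 97.7K'`, and short numbers are never handed to the replacement function at all.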

Python's regular expression function substitution is a handy and powerful way to do certain sorts of very generalized text substitution in a low hassle manner. The one caution to it is that you probably don't want to use it in a performance sensitive situation, because it does require a Python function call and various other things for each substitution. The last time I looked, pure text substitutions ran much faster if you could use them. Here, not only is the situation not performance sensitive but there's no way out of running the Python code one way or another, because we can't do the work with just text substitution (at least not if we want powers of two humanized numbers, as I do).

Sidebar: The humanization function

I started out writing the obvious brute force if based version and then realized that I could get much simpler code by being a bit more clever. The end result is:

KB = 1024
MB = KB * 1024
GB = MB * 1024
TB = GB * 1024

seq = ((TB, 'T'), (GB, 'G'), (MB, 'M'), (KB, 'K'))

def humanize_number(mtch):
  n = int(mtch.group())
  for sz, ch in seq:
    if n >= sz:
      return '%.1f%s' % (n / sz, ch)
  return str(n)

The seq tuple needs to be ordered from the largest unit to the smallest, because we take the first unit that the input is equal to or larger than.

RegexpFunctionSubstitutionWin written at 21:42:00


What is our Python 2 endgame going to be?

Every so often I think about the issue of what our eventual Python 2 endgame is going to be at work. We're going to reach some sort of endgame situation sooner or later; for example, Ubuntu has already removed support for /usr/bin/python being Python 2, although you can still do it by hand. Someday they (and other people) may mandate that /usr/bin/python is Python 3, or remove Python 2 packages entirely, or both. What are we going to do when things reach that state?

There are two sides of this; what we're going to do about our own scripts that are still using Python 2, and what will happen with our users and their scripts. For our own scripts, they could be rewritten to Python 3 or changed to use a different Python interpreter path in their #! line, including PyPy. Since we're in control of them and the timing of any use of an operating system without Python 2, we're at least not going to be blindsided. My tentative guess at our endgame for our own scripts is that we'd probably use PyPy, although we might opt to move them to Python 3 instead.

(There's very little chance that our remaining Python 2 scripts will all conveniently be obsolete by the time CPython 2 is disappearing from Ubuntu and other operating systems. Making them obsolete would probably take a completely rebuilt from scratch new infrastructure.)

For our users, there is both good news and bad news. The good news is that as a university department, we have a certain natural degree of turnover in user population; when someone graduates and leaves (or moves on to a different postdoc position, or any number of other things), they mostly stop caring about the Python 2 scripts they had here. The bad news is that we seem to have a reasonably significant current use of '/usr/bin/python' and we haven't even looked for people who are running '/usr/bin/python2' or some other alias. Some of that usage is probably automated (in cron jobs and the like), and some of it is probably from people who will be around for years to come. In addition, not all usage of Python 2 will be in regularly run scripts (that we can catch through mechanisms like Linux's auditing framework); some of it is probably in scripts that are only run once in a while.

Unless we get lucky and things are deferred for a significant amount of time, changing /usr/bin/python (to remove it or to be Python 3) or removing Python 2 seems likely to catch a number of our users out. We probably can't find all of them in advance, or get all of them to change things even if we do find them and notify them. Some number of them will probably have long-standing scripts blow up. To reduce problems here we should probably start moving now to discourage use of Python 2 (and identify people using it).

If it's possible, the least disruptive endgame would be to continue having /usr/bin/python and CPython 2 (in the usual places), even if we provide it ourselves. However, keeping the '/usr/bin/python' name working may hamper efforts to herd people away from Python 2; at some point in the endgame, we may want to remove it or let it become Python 3. While we can use PyPy 2 for our own scripts, it's not a drop in replacement for CPython and some programs definitely fail with PyPy when they'd work with CPython.

(Also, I'm not absolutely sure that PyPy will still have a Python 2 version in, say, ten years. Yes I am considering that far into the future.)

A more disruptive endgame would be Ubuntu insisting that /usr/bin/python be Python 3 and no longer supplying Python 2 at all. If we have relatively few people using an explicit '/usr/bin/python2', we might drop our official support for CPython 2 entirely. Hopefully Ubuntu would still supply a PyPy 2, so people would have some option other than migrating their scripts to Python 3.

A third endgame would be the 'excise the remnants' option. When Ubuntu drops Python 2 entirely, we would as well regardless of the remaining use; we wouldn't hand build CPython 2 ourselves or anything. We would handle our own scripts in some way, and other people would be left on their own, with at best us installing the Ubuntu version of PyPy 2 if one existed. This endgame is the most disruptive to people but in some way the most coherent and least work for us in the long run.

PS: Fedora forced /usr/bin/python to be Python 3 a while back, and honestly it's been a good thing overall for me. I had to change some scripts in a hurry, but after that it's nice that running 'python' gets me the version I want and so on. And it's a good way to push me to use Python 3 instead of Python 2.

ConsideringOurPython2Endgame written at 22:36:56


Some notes on providing Python code as a command line argument

I've long known about CPython's '-c' argument, which (in the words of the manual page) lets you "specify the command [for CPython] to execute". Until recently, I thought it had to be a single statement, or at least a single line of Python code (which precluded a number of things). It turns out that this isn't the case; both CPython and PyPy will accept a command line argument for -c that contains embedded newlines, in the style of providing command line code to Unix tools like awk.

For example:

python -c 'import sys
if len(sys.argv) > 1:
   print("arguments:", sys.argv[1:])
else:
   print("no arguments")' "$@"

(For various reasons, you still might want to make this code importable, although I haven't done so here.)

If you're directly supplying the code on the command line, as I am here, you have a choice (in a Bourne shell script or environment). You can quote the entire code with single quotes and not use a literal single quote in the Python code, or you can quote with double quotes and carefully escape several special characters but get to use single quotes. If you want to avoid all of this, you need to put the code into a shell variable:

pyprog="$(cat <<'EOF'
...
EOF
)"
python -c "$pyprog" ...

As you'd expect, '__name__' in the command line code is the usual '__main__'. As the manual page covers, all further command line arguments are passed in sys.argv, with sys.argv[0] set to '-c'. Since the code doesn't have a file name (which is what would normally go in sys.argv[0]), this seems like a decent choice, and immediately passing further arguments to the code is convenient.
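These details can be checked from Python itself. The sketch below runs a multi-line program through '-c' in a subprocess and collects what it prints; the helper name `run_inline` and the sample arguments are made up for the example.

```python
import subprocess
import sys

# A multi-line program, passed as a single -c argument.
CODE = """import sys
print(__name__)
print(sys.argv[0])
print(sys.argv[1:])
"""

def run_inline(code, *args):
    """Run code with 'python -c' plus extra arguments; return its output lines."""
    result = subprocess.run(
        [sys.executable, "-c", code, *args],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()

print(run_inline(CODE, "one", "two"))
# prints ['__main__', '-c', "['one', 'two']"]
```

Passing the code as one element of an argument list sidesteps the shell quoting questions entirely, since no shell is involved.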

Although this makes it possible to have a Python program embedded into a shell script in the same way that you can do this with awk (and thus implicitly helps enable Python as a filter in a shell script), I personally don't find the idea too appealing, at least for Python code of any substance. The problem isn't the need to take extra care with embedding the Python code in your shell script, although that's not great. The real problem is that embedding Python code this way means you miss out on all sorts of tools that are in the Python programming ecology, because they only work on separate Python code.

(If I had to write something this way, I would be tempted to develop it in a separate file that the shell script invoked with 'python <filename>' instead of 'python -c', and then only embed the code into the shell script and switch to 'python -c ...' at the last moment.)

PS: Now that I know how to do this it's a little bit tempting to try out small amounts of Python code in places where awk doesn't quite have the functions and power I'd like (or at least doesn't make the functions as easy as Python does). On the other hand, awk doesn't make you think about character set conversion issues. Probably I wouldn't use this to parse and reformat smartctl's JSON, though. That's likely to be enough code that I'd want to use the usual Python tools on it.

CommandLinePrograms written at 22:31:44


Python programs as wrappers versus filters of other Unix programs

Sometimes I wind up in a situation, such as using smartctl's JSON output, where I want to use a Python program to process and transform the output from another Unix command. In a situation like this, there are two ways of structuring things. I can have the Python program run the other command as a subprocess, capture its output, and process it, or I can have a surrounding script run the other command and pipe its output to the Python program, with the Python program acting as a (Unix) filter. I've written programs in both approaches depending on the situation.

Which sort of begs the question, namely what sort of situation makes me choose one option or the other? One reason for choosing the wrapper approach is the ease of copying the result places; a Python wrapper is only one self-contained thing to copy around to our systems, while a shell script that runs a Python filter is at least two things (and then the shell script has to know where to find the Python program). And in general, a Python wrapper program makes the whole thing feel like there are fewer moving parts (that it runs another Unix command as the program's starting point is sort of an implementation detail that people don't have to think about).

(The self contained nature of wrappers pushes me toward wrappers for things that I expect to copy to systems only on an 'as needed' basis, instead of having them installed as part of system setup or the like.)

One reason I reach for the filter approach is if I have a certain amount of logic that's most easily expressed in a shell script, for example selecting what disks to report SMART data on and then iterating over them. Shell scripts make expanding file name glob patterns very easy; Python requires more work for this. I have to admit that how the idea evolved also plays a role; if I started out thinking I had a simple job of reformatting output that could be done entirely in a shell script, I'm most likely to write the Python as a filter that drops into it, rather than throw the shell script away and write a Python wrapper. Things that are clearly complex from the start are more likely to be a Python wrapper instead of a filter used by a shell script.

(The corollary of this is if I'm running the other command once with more or less constant arguments, I'm much more likely to write a wrapper program instead of a filter.)
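The two structures can share almost all of their code, as this minimal sketch shows. The function names are invented for the example, and `summarize` stands in for whatever real processing the program would do on the JSON.

```python
import json
import subprocess
import sys

def summarize(data):
    """The shared processing step; here it just lists the top-level keys."""
    return sorted(data)

def as_filter(stream):
    """Filter style: something else ran the command and piped it to us."""
    return summarize(json.load(stream))

def as_wrapper(command):
    """Wrapper style: run the command ourselves and capture its output."""
    # No check=True here: some commands report conditions through a non-zero
    # exit status while still producing usable output.
    result = subprocess.run(command, capture_output=True, text=True)
    return summarize(json.loads(result.stdout))
```

In filter form the program would call `as_filter(sys.stdin)`; in wrapper form it might call something like `as_wrapper(["smartctl", "-j", "-a", "/dev/sda"])`, where the device name is only an example.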

I believe that there are (third party) Python packages that are intended to make it easy to write shell script like things in Python (and I think I was even pointed at one once, although I can't find the reference now). In theory I could use these and native Python facilities to write more Python programs as wrappers; in practice, I'm probably going to take the path of least resistance and continue to do a variety of things as shell scripts with Python programs as filters.

I don't know if writing this entry is going to get me to be more systematic and conscious about making this choice between a wrapper and a filter, but I can hope so.

PS: Another aspect of the choice is that it feels easier (and better known) to adjust the settings of a shell script by changing commented environment variables at the top of the script than making the equivalent changes to global variables in the Python program. I suspect that this is mostly a cultural issue; if we were more into Python, it would probably feel completely natural to us to do this to Python programs (and we'd have lots of experience with it).

ProgramFilterVsWrapper written at 22:10:52


The state of Python (both 2 and 3) in Ubuntu 22.04 LTS

Ubuntu 22.04 LTS has just been released and is on our minds, because we have a lot of Ubuntu 18.04 machines to upgrade to 22.04 in the next year. Since both we and our users use Python, I've been looking into the state of it on 22.04 and it's time for an entry on it, with various bits of news.

On the Python 2 front, 22.04 still provides a package for it but takes some visible (to us) steps towards eventually not having Python 2 at all. First, there is no official package provided to make /usr/bin/python point to Python 2. You can install a package (python-is-python3) to make /usr/bin/python point to Python 3, or you can have nothing official. Ubuntu is not currently forcing /usr/bin/python to be anything, so you can create your own symlink by hand (or with your own package) if you want to. We probably will because it turns out we have a reasonable number of users using '/usr/bin/python' and currently they're getting Python 2.

(If we were very energetic we would try to identify all of the users with the Linux auditing system and then go nag them. But probably not.)

Ubuntu 22.04 also drops support for the Python 2 version of Apache's mod_wsgi, which is still relevant for our Django application. It's possible that 22.04 still provides enough Python 2 support that you could build it yourself, but I haven't looked into this. Theoretically there is a Python 2 version of pip available, as 'python-pip'; in practice this conflicts with the Python 3 version, which is much more important now. If you need the Python 2 version of pip, you're going to have to install it yourself somehow (I haven't checked to see if the information from my entry on this issue works on 22.04).

Python 3 on Ubuntu 22.04 is currently in great shape. Ubuntu 22.04 comes with Python 3.10.4, which is the most current version as of right now and, impressively, was only released a month ago. Someone pushed hard to get that into 22.04 (the actual binary says it was built on April 2nd). It also packages the Python 3.8 version of PyPy 7.3.9 (as well as a Python 2 version). This is also the current version as of writing this entry (a day after Ubuntu 22.04's official release). How current both PyPy and Python 3 are is a pleasant surprise; they may drift out of date in time in the usual Ubuntu LTS way, but at least they're starting out in the best state possible.

(Ubuntu 22.04 also packages Django 3.2.12, the current Django LTS release; as I write this, Django 4.0.4 is the latest non-LTS release. I happen to think that relying on Ubuntu's Django is probably a bad idea. Based on the Django project support timeline here, 3.2 will only be supported by the Django project for two more years, until April of 2024; after that, Canonical is on its own to keep up with security issues for the remaining three years of 22.04 LTS. The package appears to be in the 'main' repository that Canonical says they support, but what that means in practice I don't know.)

Helpfully, Ubuntu 22.04 has a current version of pipx, which is now my favorite tool for handling third party Python programs that either aren't packaged by Ubuntu at all or where you don't want to be stuck with the Ubuntu versions. However, pipx has some challenges when moving from Python version to Python version, for example if you're upgrading your version of Ubuntu.

Ubuntu2204PythonState written at 22:58:50


Fixing Pipx when you upgrade your system Python version

If you use your system's Python for pipx and then upgrade your system and its version of Python, pipx can have a bad problem that renders your pipx managed virtual environments more or less unrecoverable if you do the wrong thing. Fortunately there turns out to be a way around it, which I tested as part of upgrading my office desktop to Fedora 35 today.

Pipx's problem is that it stashes a bunch of stuff in a ~/.local/pipx/shared virtual environment that depends on the Python version. If this virtual environment exists but doesn't work in the new version of Python that pipx is now running with, pipx fails badly. However, pipx will rebuild this virtual environment any time it needs it, and once rebuilt, the new virtual environment works.

So the workaround is to delete the virtual environment, run a pipx command to get pipx to rebuild it, and then tell pipx to reinstall all your pipx environments. You need to do this after you've upgraded your system (or your Python version). What you do is more or less:

# get rid of the shared venv
rm -rf ~/.local/pipx/shared
# get pipx to re-create it
pipx list
# have pipx fix all of your venvs
pipx reinstall-all

Perhaps there is an easier way to fix up all of your pipx managed virtual environments other than 'pipx reinstall-all', but that's what I went with after my Fedora 35 upgrade and it worked. In any case, I feel that it's not a bad idea to recreate pipx managed virtual environments from scratch every so often just to clean out any lingering cruft.

(It also seems unlikely that there is any better way in general. In one way or another, all of the Python packages have to get reinstalled under the new version of Python. Sometimes you can do this by just renaming files, but any package with a compiled component may need (much) more work. Actually doing the pip installation all over again ensures that all of this gets done right, with no hacks that might fail.)

PipxFixingPythonVersion written at 21:50:58


Some problems that Python's cgi.FieldStorage has

In my entry on our limited use of the cgi module, I praised cgi.FieldStorage as a nice simple way to write Python CGIs that deal with parameters, especially for POST forms. Unfortunately there are some dark sides to cgi.FieldStorage (apart from any bugs it may have), and in fairness I should discuss them. Overall, cgi.FieldStorage is probably safe for internal usage, but I would be a bit wary of exposing it to the Internet in hostile circumstances. The ultimate problem is that in the name of convenience and just working, cgi.FieldStorage is pretty trusting of its input, and on the general web one of the big rules of security is that your input is entirely under the control of an attacker.

So here are some of the problems that cgi.FieldStorage has if you expose it to hostile parties. The first broad issue is that FieldStorage doesn't have any limits:

  • it allows people to upload files to you, whether or not you expected this; the files are written to the local filesystem. Modern versions of FieldStorage do at least delete the files when the Python garbage collector destroys the FieldStorage object.

  • it has no limits on how large a POST body it will accept or how long it will wait to read a POST body in (or how long it will wait to upload files). Some web server CGI environments may impose their own limits on these, especially time, but an attacker can probably at least flood your memory.

    (The FieldStorage init function does have some parameters that could be used to engineer some limits, with additional work like wrapping standard input in a file-like thing that imposes size and time limits. For size limits you can also pre-check the Content-Length.)
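A rough sketch of the size-limiting half of that additional work follows. `LimitedReader` and `content_length_ok` are hypothetical names; the wrapper has not been tested against FieldStorage itself (it would presumably be passed as the `fp` argument), and it does nothing about time limits.

```python
import io

class LimitedReader:
    """Wrap a binary file-like object; return at most max_bytes in total,
    then behave as if the input had reached end of file."""

    def __init__(self, raw, max_bytes):
        self.raw = raw
        self.remaining = max_bytes

    def _clamp(self, size):
        # Treat 'read everything' as 'read whatever is still allowed'.
        if size is None or size < 0 or size > self.remaining:
            return self.remaining
        return size

    def read(self, size=-1):
        data = self.raw.read(self._clamp(size))
        self.remaining -= len(data)
        return data

    def readline(self, size=-1):
        data = self.raw.readline(self._clamp(size))
        self.remaining -= len(data)
        return data

def content_length_ok(environ, max_bytes):
    """Pre-check the CGI CONTENT_LENGTH variable against a size limit."""
    try:
        length = int(environ.get("CONTENT_LENGTH") or 0)
    except ValueError:
        return False
    return 0 <= length <= max_bytes
```

Note that this truncates oversized input silently; a real CGI would more likely want to reject the request outright when the pre-check fails.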

Then there is the general problem that GET and POST parameters are not actually really like a Python dict (or any language's form of it). All dictionary like things require unique keys, but attackers are free to feed you duplicate ones in their requests. FieldStorage's behavior here is not well defined, but it probably takes the last version of any given parameter as the true one. If something else in your software stack has a different interpretation of duplicate parameters, your CGI and that other component are actually seeing two different requests. This is a classic way to get security vulnerabilities.
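One defensive option is to detect duplicate parameters before anything interprets them. The sketch below does this with the standard library's `urllib.parse.parse_qsl` rather than with FieldStorage; the function name and the sample query string are made up for the example.

```python
from urllib.parse import parse_qsl

def duplicated_keys(query_string):
    """Return the sorted parameter names that appear more than once."""
    seen = set()
    dups = set()
    # parse_qsl keeps every key/value pair in order, duplicates included.
    for key, _value in parse_qsl(query_string, keep_blank_values=True):
        if key in seen:
            dups.add(key)
        seen.add(key)
    return sorted(dups)

print(duplicated_keys("user=alice&role=guest&role=admin"))
# prints ['role']
```

A CGI that never expects repeated parameters could simply refuse any request where this list is non-empty, which removes the question of which copy wins.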

(FieldStorage also has liberal parsing by default, although you can change this with an init function parameter. Incidentally, none of the init function parameters are covered in the cgi documentation; you have to read help() or the cgi.py source.)

Broadly speaking, cgi.FieldStorage feels like a product of an earlier age of web programming, one where CGIs were very much a thing and the web was a smaller and ostensibly friendlier place. For a more or less intranet application that only has to deal with friendly input sent from properly programmed browsers, it's still perfectly good and is unlikely to blow up. For general modern Internet usage, well, not so much, even if you're still using CGIs.

(Wandering Thoughts is still a CGI, although with a lot of work involved. So it can be done.)

CGIFieldStorageIssues written at 21:47:21


Our limited use of Python's cgi module

The news of the time interval is that Python is going to remove some standard library modules (via). This news caught my eye because two of the modules to be removed are cgi and its closely related kin cgitb. We have a number of little CGIs in our environment for internal use, and many of them are written in Python, so I expected to find us using cgi all over the place. When I actually looked, our usage was much lower than I expected, except for one thing.

Some of our CGIs are purely informational; they present some dynamic information on a web page, and don't take any parameters or otherwise particularly interact with people. These CGIs tend to use cgitb so that if they have bugs, we have some hope of catching things. When these CGIs were written, cgitb was the easy way to do something, but these days I would log tracebacks to syslog using my good way to format them.

(It will probably surprise no one that in the twelve years since I wrote that entry, none of our internal CGIs were changed away from using cgitb. Inertia is an extremely powerful force.)
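For illustration only, logging a traceback to syslog one line at a time might look something like the sketch below. This is an assumed approach and not the formatting from the entry referred to above; the function names are invented, and `emit` is a parameter so that the syslog call can be swapped out.

```python
import traceback

def traceback_lines(exc):
    """Format an exception's traceback as a list of non-empty lines."""
    text = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    return [line for line in text.splitlines() if line.strip()]

def log_exception(exc, emit=None):
    """Send each traceback line to emit(), which defaults to syslog.syslog."""
    if emit is None:
        # Imported lazily because the syslog module only exists on Unix.
        import syslog
        emit = syslog.syslog
    for line in traceback_lines(exc):
        emit(line)
```

A CGI would wrap its main function in a try/except that calls `log_exception()` and then returns a generic error page, so that nothing about the failure leaks to the browser.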

Others of our CGIs are interactive, such as the CGIs we use for our self-serve network access registration systems. These CGIs need to extract information from submitted forms, so of course they use the ever-popular cgi.FieldStorage class. As far as I know there is and will be no standard library replacement for this, so in theory we will have to do something here. Since we don't want file uploads, it actually isn't that much work to read and parse a standard POST body, or we could just keep our own copy of cgi.py and use it in perpetuity.

(The real answer is that all of these CGIs are still Python 2 and are probably going to stay that way, with them running under PyPy if it becomes necessary because Ubuntu removes Python 2 entirely someday.)
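As a sketch of the 'read and parse a standard POST body' option, assuming a form submitted as application/x-www-form-urlencoded with no file uploads: the function name, the UTF-8 decoding and the default size limit are all assumptions for the example.

```python
import io
from urllib.parse import parse_qs

def read_post_form(environ, stream, max_bytes=65536):
    """Read a urlencoded POST body from stream and return a dict of lists."""
    if environ.get("REQUEST_METHOD") != "POST":
        raise ValueError("not a POST request")
    ctype = environ.get("CONTENT_TYPE", "").split(";")[0].strip().lower()
    if ctype != "application/x-www-form-urlencoded":
        raise ValueError("unsupported content type")
    # int() raises ValueError on a malformed Content-Length.
    length = int(environ.get("CONTENT_LENGTH") or 0)
    if not 0 <= length <= max_bytes:
        raise ValueError("bad content length")
    # stream is a binary file-like object, eg sys.stdin.buffer in a CGI.
    body = stream.read(length).decode("utf-8")
    return parse_qs(body, keep_blank_values=True)
```

In a CGI this would be called as `read_post_form(os.environ, sys.stdin.buffer)`; every value comes back as a list, so duplicate parameters are at least visible to the caller.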

PS: DWiki, the pile of Python that is rendering Wandering Thoughts for you to read, has its own code to handle GET parameters and POST forms, which is why I know that doing that isn't too much work. A very long time ago DWiki did use cgi.FieldStorage and I had some problems as a result, but that got entirely rewritten when I moved DWiki to being based on WSGI.

CGIModuleOurUsage written at 22:47:48


This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.