Python programs as wrappers versus filters of other Unix programs
Sometimes I wind up in a situation, such as using smartctl's JSON output, where I want to use a Python program to process and transform the output from another Unix command. In a situation like this, there are two ways of structuring things. I can have the Python program run the other command as a subprocess, capture its output, and process it, or I can have a surrounding script run the other command and pipe its output to the Python program, with the Python program acting as a (Unix) filter. I've written programs in both approaches depending on the situation.
Which raises the question: what sort of situation makes me choose one option or the other? One reason for choosing the wrapper approach is the ease of copying the result around; a Python wrapper is only one self-contained thing to copy to our systems, while a shell script that runs a Python filter is at least two things (and then the shell script has to know where to find the Python program). And in general, a Python wrapper program makes the whole thing feel like there are fewer moving parts (that it runs another Unix command as the program's starting point is sort of an implementation detail that people don't have to think about).
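As an illustration, a minimal wrapper along these lines might look like the following sketch (the smartctl JSON field names used here are assumptions, not verified against any particular smartctl version):

```python
#!/usr/bin/env python3
# Sketch of the wrapper approach: the Python program itself runs the
# other Unix command (here smartctl) and consumes its JSON output.
import json
import subprocess
import sys

def summarize(report):
    # The field names here are assumptions about smartctl's JSON output.
    model = report.get("model_name", "unknown model")
    temp = report.get("temperature", {}).get("current", "?")
    return "%s: %s" % (model, temp)

def main():
    for disk in sys.argv[1:]:
        # 'smartctl -j -a' asks for JSON output (needs a modern smartctl).
        proc = subprocess.run(["smartctl", "-j", "-a", disk],
                              capture_output=True, text=True)
        print(disk, summarize(json.loads(proc.stdout)))

if __name__ == "__main__":
    main()
```

The point of the wrapper shape is that running smartctl is an internal detail; whoever uses this program just copies one file around.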
(The self contained nature of wrappers pushes me toward wrappers for things that I expect to copy to systems only on an 'as needed' basis, instead of having them installed as part of system setup or the like.)
One reason I reach for the filter approach is if I have a certain amount of logic that's most easily expressed in a shell script, for example selecting what disks to report SMART data on and then iterating over them. Shell scripts make expanding file name glob patterns very easy; Python requires more work for this. I have to admit that how the idea evolved also plays a role; if I started out thinking I had a simple job of reformatting output that could be done entirely in a shell script, I'm most likely to write the Python as a filter that drops into that script, rather than throw the shell script away and write a Python wrapper. Things that are clearly complex from the start are more likely to become a Python wrapper instead of a filter used by a shell script.
(The corollary of this is if I'm running the other command once with more or less constant arguments, I'm much more likely to write a wrapper program instead of a filter.)
I believe that there are (third party) Python packages that are intended to make it easy to write shell script like things in Python (and I think I was even pointed at one once, although I can't find the reference now). In theory I could use these and native Python facilities to write more Python programs as wrappers; in practice, I'm probably going to take the path of least resistance and continue to do a variety of things as shell scripts with Python programs as filters.
I don't know if writing this entry is going to get me to be more systematic and conscious about making this choice between a wrapper and a filter, but I can hope so.
PS: Another aspect of the choice is that it feels easier (and better known) to adjust the settings of a shell script by changing commented environment variables at the top of the script than making the equivalent changes to global variables in the Python program. I suspect that this is mostly a cultural issue; if we were more into Python, it would probably feel completely natural to us to do this to Python programs (and we'd have lots of experience with it).
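For illustration, the Python analog of 'settings at the top of the shell script' might look like this (the setting names and defaults here are made up):

```python
# A hypothetical 'settings block' at the top of a Python program, playing
# the same role as adjustable, commented environment variables at the top
# of a shell script. Edit these to change the program's behavior.

# Which disks to report SMART data on.
DISKS = ["/dev/sda", "/dev/sdb"]

# Whether to include temperature information in the report.
REPORT_TEMP = True
```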
The state of Python (both 2 and 3) in Ubuntu 22.04 LTS
Ubuntu 22.04 LTS has just been released and is on our minds, because we have a lot of Ubuntu 18.04 machines to upgrade to 22.04 in the next year. Since both we and our users use Python, I've been looking into the state of it on 22.04 and it's time for an entry on it, with various bits of news.
On the Python 2 front, 22.04 still provides a package for it but takes some visible (to us) steps towards eventually not having Python 2 at all. First, there is no official package provided to make /usr/bin/python point to Python 2. You can install a package (python-is-python3) to make /usr/bin/python point to Python 3, or you can have nothing official. Ubuntu is not currently forcing /usr/bin/python to be anything, so you can create your own symlink by hand (or with your own package) if you want to. We probably will because it turns out we have a reasonable number of users using '/usr/bin/python' and currently they're getting Python 2.
(If we were very energetic we would try to identify all of the users with the Linux auditing system and then go nag them. But probably not.)
Ubuntu 22.04 also drops support for the Python 2 version of Apache's mod_wsgi, which is still relevant for our Django application. It's possible that 22.04 still provides enough Python 2 support that you could build it yourself, but I haven't looked into this. Theoretically there is a Python 2 version of pip available, as 'python-pip'; in practice this conflicts with the Python 3 version, which is much more important now. If you need the Python 2 version of pip, you're going to have to install it yourself somehow (I haven't checked to see if the information from my entry on this issue works on 22.04).
Python 3 on Ubuntu 22.04 is currently in great shape. Ubuntu 22.04 comes with Python 3.10.4, which is the most current version as of right now and, impressively, was only released a month ago. Someone pushed hard to get that into 22.04 (the actual binary says it was built on April 2nd). It also packages the Python 3.8 version of PyPy 7.3.9 (as well as a Python 2 version). This is also the current version as of writing this entry (a day after Ubuntu 22.04's official release). How current both PyPy and Python 3 are is a pleasant surprise; they may drift out of date in time in the usual Ubuntu LTS way, but at least they're starting out in the best state possible.
(Ubuntu 22.04 also packages Django 3.2.12, the current Django LTS release; as I write this, Django 4.0.4 is the latest non-LTS release. I happen to think that relying on Ubuntu's Django is probably a bad idea. Based on the Django project support timeline here, 3.2 will only be supported by the Django project for two more years, until April of 2024; after that, Canonical is on its own to keep up with security issues for the remaining three years of 22.04 LTS. The package appears to be in the 'main' repository that Canonical says they support, but what that means in practice I don't know.)
Helpfully, Ubuntu 22.04 has a current version of pipx, which is now my favorite tool for handling third party Python programs that either aren't packaged by Ubuntu at all or where you don't want to be stuck with the Ubuntu versions. However, pipx has some challenges when moving from Python version to Python version, for example if you're upgrading your version of Ubuntu.
Fixing Pipx when you upgrade your system Python version
If you use your system's Python for pipx and then upgrade your system and its version of Python, pipx can have a bad problem that renders your pipx managed virtual environments more or less unrecoverable if you do the wrong thing. Fortunately there turns out to be a way around it, which I tested as part of upgrading my office desktop to Fedora 35 today.
Pipx's problem is that it stashes a bunch of stuff in a ~/.local/pipx/shared virtual environment that depends on the Python version. If this virtual environment exists but doesn't work in the new version of Python that pipx is now running with, pipx fails badly. However, pipx will rebuild this virtual environment any time it needs it, and once rebuilt, the new virtual environment works.
So the workaround is to delete the virtual environment, run a pipx command to get pipx to rebuild it, and then tell pipx to reinstall all your pipx environments. You need to do this after you've upgraded your system (or your Python version). What you do is more or less:
    # get rid of the shared venv
    rm -rf ~/.local/pipx/shared
    # get pipx to re-create it
    pipx list
    # have pipx fix all of your venvs
    pipx reinstall-all
Perhaps there is an easier way to fix up all of your pipx managed virtual environments other than 'pipx reinstall-all', but that's what I went with after my Fedora 35 upgrade and it worked. In any case, I feel that it's not a bad idea to recreate pipx managed virtual environments from scratch every so often just to clean out any lingering cruft.
(It also seems unlikely that there is any better way in general. In one way or another, all of the Python packages have to get reinstalled under the new version of Python. Sometimes you can do this by just renaming files, but any package with a compiled component may need (much) more work. Actually doing the pip installation all over again ensures that all of this gets done right, with no hacks that might fail.)
Some problems that Python's cgi.FieldStorage has
In my entry on our limited use of the cgi module, I mentioned cgi.FieldStorage as a nice simple way to write Python CGIs that deal with parameters, including POST forms. Unfortunately there are some dark sides to cgi.FieldStorage (apart from any bugs it may have), and in fairness I should discuss them. Overall, cgi.FieldStorage is probably safe for internal usage, but I would be a bit wary of exposing it to the Internet in hostile circumstances. The ultimate problem is that in the name of convenience and just working, cgi.FieldStorage is pretty trusting of its input, and on the general web one of the big rules of security is that your input is entirely under the control of an attacker.
So here are some of the problems that cgi.FieldStorage has if you expose it to hostile parties. The first broad issue is that FieldStorage doesn't have any limits:
- it allows people to upload files to you, whether or not you expected this; the files are written to the local filesystem. Modern versions of FieldStorage do at least delete the files when the Python garbage collector destroys the FieldStorage object.
- it has no limits on how large a POST body it will accept or how long it will wait to read a POST body in (or how long it will wait to upload files). Some web server CGI environments may impose their own limits on these, especially time, but an attacker can probably at least flood your memory.
(The FieldStorage init function does have some parameters that could be used to engineer some limits, with additional work like wrapping standard input in a file-like thing that imposes size and time limits. For size limits you can also pre-check the Content-Length.)
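As a sketch of the Content-Length pre-check idea (the size limit here is an arbitrary example value):

```python
# Reject an oversized POST before handing anything to cgi.FieldStorage,
# by checking the Content-Length that the web server passed to the CGI.
import os

MAX_POST = 1024 * 1024  # 1 MByte; an arbitrary illustrative limit

def post_too_large(environ=os.environ):
    try:
        clen = int(environ.get("CONTENT_LENGTH", "0"))
    except ValueError:
        # A malformed Content-Length is itself suspicious; reject it.
        return True
    return clen > MAX_POST
```

This doesn't protect you against a client that declares a small Content-Length and then trickles the body in slowly; time limits need separate handling.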
Then there is the general problem that form submissions are not actually really like a Python dict (or any language's form of it). All dictionary like things require unique keys, but attackers are free to feed you duplicate ones in their requests. FieldStorage's behavior here is not well defined, but it probably takes the last version of any given parameter as the true one. If something else in your software stack has a different interpretation of duplicate parameters, your CGI and that other component are actually seeing two different requests. This is a classic way to get security problems.
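The ambiguity is easy to demonstrate with the standard library's urllib.parse, which at least preserves all of the duplicate values so you can decide for yourself:

```python
# Two components that parse the same query string can disagree about
# what 'user' is if one takes the first duplicate and the other the last.
from urllib.parse import parse_qs

qs = "user=alice&user=mallory"
values = parse_qs(qs)["user"]   # ['alice', 'mallory']
first, last = values[0], values[-1]
# 'first' and 'last' are different people; which one did you authorize?
```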
(FieldStorage also has liberal parsing by default, although you can change this with an init function parameter. Incidentally, none of the init function parameters are covered in the cgi documentation; you have to resort to help() or the cgi.py source.)
Broadly speaking, cgi.FieldStorage feels like a product of an earlier age of web programming, one where CGIs were very much a thing and the web was a smaller and ostensibly friendlier place. For a more or less intranet application that only has to deal with friendly input sent from properly programmed browsers, it's still perfectly good and is unlikely to blow up. For general modern Internet usage, well, not so much, even if you're still using CGIs.
(Wandering Thoughts is still a CGI, although with a lot of work involved. So it can be done.)
Our limited use of Python's cgi module
The news of the time interval is that Python is going to remove some standard library modules (via). This news caught my eye because two of the modules to be removed are cgi and its closely related companion cgitb. We have a number of little CGIs in our environment for internal use, and many of them are written in Python, so I expected to find us using these modules all over the place. When I actually looked, our usage was much lower than I expected, except for one thing.
Some of our CGIs are purely informational; they present some dynamic information on a web page, and don't take any parameters or otherwise particularly interact with people. These CGIs tend to use cgitb so that if they have bugs, we have some hope of catching things. When these CGIs were written, cgitb was the easy way to do something, but these days I would log tracebacks to syslog using my good way to format them.
(It will probably surprise no one that in the twelve years since I wrote that entry, none of our internal CGIs were changed away from using cgitb. Inertia is an extremely powerful force.)
Others of our CGIs are interactive, such as the CGIs we use for our self-serve network access registration systems. These CGIs need to extract information from submitted forms, so of course they use the cgi.FieldStorage class. As far as I know there is and will be no standard library replacement for this, so in theory we will have to do something here. Since we don't want file uploads, it actually isn't that much work to read and parse a standard POST form body, or we could just keep our own copy of cgi.py and use it going forward.
(The real answer is that all of these CGIs are still Python 2 and are probably going to stay that way, with them running under PyPy if it becomes necessary because Ubuntu removes Python 2 entirely someday.)
PS: DWiki, the pile of Python that is rendering Wandering Thoughts for you to read, has its own code to handle POST forms, which is why I know that doing that isn't too much work. A very long time ago DWiki did use cgi.FieldStorage and I had some problems as a result, but that got entirely rewritten when I moved DWiki to being based on WSGI.
A Python program can be outside of a virtual environment it uses
A while ago I wrote about installing modules to a custom location, and in that entry one reason I said for not doing this with a virtual environment was that I didn't want to put the program involved into a virtual environment just to use some Python modules. Recently I realized that you don't have to, because of how virtual environments add themselves to sys.path. As long as you run your program using the virtual environment's Python, it gets to use all the modules you installed in the venv. It doesn't matter where the program is and you don't have to move it from its current location; you just have to change what 'python' it uses.
The full extended version of this is that if you have your program set up to run using '#!/usr/bin/env python3', you can change what Python and thus what virtual environment you use simply by changing the $PATH that it uses. The downside of this is that you can accidentally use a different Python than you intended because your $PATH isn't set up the way you thought it was, although in many cases this will result in immediate and visible problems because some modules you expected aren't there.
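One way to see which Python (and thus which virtual environment, if any) a program actually ended up with is to ask the interpreter itself:

```python
# A quick check of what interpreter and environment we're running under.
import sys

print(sys.executable)    # the Python binary actually being used
print(sys.prefix)        # points into the venv if we're inside one
# In Python 3, sys.prefix differs from sys.base_prefix inside a venv.
in_venv = sys.prefix != sys.base_prefix
print(in_venv)
```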
(One way this might happen is if you run the program using the system Python because you're starting it with a default $PATH; one classical way this can happen is running things from crontab.)
Another possible use for this, especially in the $PATH based version, is assembling a new virtual environment with new, updated versions of the modules you use in order to test your existing program with them. You can also use this to switch module versions back and forth in live usage just by changing the $PATH your program runs with (or by repeatedly editing its #! line, but that's more work).
Realizing this makes me much more likely in the future to just use virtual environments for third party modules. The one remaining irritation is that the virtual environment is specific to the Python version, but there are various ways of dealing with that. This is one of the cases where I think we're going to want to use 'pip freeze' (in advance) and then exactly reproduce our previous install in a new virtual environment. Or maybe we can get 'python3 -m venv --upgrade <venv-dir>' to work, although I'm not going to hold my breath on that one.
(A quick test suggests that upgrading the virtual environment doesn't work, at least for going from the Ubuntu 18.04 LTS Python 3 to the Ubuntu 20.04 LTS Python 3. This is more or less what I expected, given what would be involved, so building a new virtual environment from scratch it is. I can't say I'm particularly happy with this limitation of virtual environments, especially given that we always have at least two versions of Python 3 around because we always have two versions of Ubuntu LTS in service.)
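The 'pip freeze in advance' approach might look more or less like this (the /tmp paths are purely illustrative; in real use you'd point at your actual old venv and run the new Python to build the new one):

```shell
# Record what the old venv has, then rebuild from scratch with the new
# Python and reinstall exactly those module versions.
python3 -m venv /tmp/old-venv
/tmp/old-venv/bin/pip freeze > /tmp/requirements.txt
python3 -m venv /tmp/new-venv
/tmp/new-venv/bin/pip install -r /tmp/requirements.txt
```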
os.environ is surprisingly liberal in some ways
The way you access and modify Unix environment variables in Python programs is generally through os.environ; Python 3 being Python 3, sometimes you need os.environb. In Unix, what can go in the environment is somewhat fuzzy and while Python has some issues with character encodings, it's otherwise surprisingly liberal in a number of ways.
The first way that os.environ is liberal is that it allows environment variables to have blank values:
    >>> os.environ["FRED"] = ""
    >>> subprocess.run("printenv")
    [...]
    FRED=
    [...]
It's possible to do this with some Unix shells as well, but traditionally environment variables are generally assumed to have non-blank values. Quite a lot of code is likely to assume that a blank value is the same as the variable being unset, although in Python you can tell the difference, since os.environ raises a KeyError if the environment variable doesn't exist at all.
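A quick demonstration of the difference between a blank variable and an unset one:

```python
# A blank environment variable is present but empty; an unset one is
# genuinely absent, and indexing it raises KeyError.
import os

os.environ["BLANK"] = ""
os.environ.pop("UNSET", None)

present = os.environ.get("BLANK")   # "" -- set, but blank
absent = os.environ.get("UNSET")    # None -- not set at all
try:
    os.environ["UNSET"]
    raised = False
except KeyError:
    raised = True
```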
A bigger way that os.environ is liberal is that it will allow you to use non-traditional characters in the names of environment variables:
    >>> os.environ["FRED/BAR"] = "Yes"
    >>> subprocess.run("printenv")
    [...]
    FRED/BAR=Yes
On Unix, setting an environment variable uses setenv(), which generally only requires that you avoid '='. Python specifically checks for an '=' in your name so that it can generate a specific error, and otherwise passes things through.
Python itself doesn't particularly restrict environment variable names beyond that. As a result you can do all sorts of odd things with environment variable names, including putting spaces and Unicode into them (at least in a UTF-8 environment). Some or many of these environment variables won't be accessible to a shell program, but not everything that interprets the environment follows the shell's rules.
The case where this came up for me recently was in Dovecot post-login scripting, which in some cases can require you to create environment variables with '/' in their names. Typical shells disallow this, but I was quite happy to find that Python was perfectly willing to go ahead and everything worked fine.
Python's Global Interpreter Lock is not there for Python programmers
I recently read Evan Ovadia's Data Races in Python, Despite the Global Interpreter Lock (via), which discusses what its title says. Famously, the Global Interpreter Lock only covers the execution of individual Python bytecodes (more or less), and what this does and doesn't cover is tricky, subtle, and depends on the implementation details of Python code. For example, making a Python class better and more complete can reduce what's safe to do with it without explicit locking.
These days, I've come to feel that the Global Interpreter Lock is not really for Python programmers. Who the GIL is for is the authors of CPython packages that are written in C (or in general any compiled language). The GIL broadly allows authors of those packages to not implement any sort of locking in their own code, even when they're manipulating C level data structures, because they're guaranteed that their code will never be called concurrently or in parallel. This extends to the Python standard objects themselves, so that (in theory) Python dicts don't need any sort of internal locks in order to avoid your CPython process dumping core or otherwise malfunctioning spectacularly. Concurrency only enters into your CPython extension if you explicitly release the GIL, and the rules of the CPython API make you re-take the GIL before doing much with interpreter state.
(There are probably traps lurking even for C level extensions that allow calls back into Python code to do things like get attributes. Python code can come into the picture in all sorts of places. But for simple operations, you have a chance.)
Avoiding internal locks while looking into or manipulating objects matters a lot for single threaded performance (Python code looks into objects and updates object reference counts quite frequently). It also makes the life of C extensions simpler. I'm not sure when threading was added to Python (it was a very long time ago), but there might have been C extensions that predated it and which would have been broken in multi-threaded programs if CPython added a requirement for internal locking in C-level code.
The Global Interpreter Lock can be exploited by Python programmers; doing so is even fun. But we really shouldn't do it, because it's not designed for us and it doesn't necessarily work that well when we try to use it anyway. Python has a variety of explicit locking facilities available in the standard library threading module, and we should generally use them even if it's a bit more annoying.
(Honesty compels me to admit that I will probably never bother to use locking around 'simple' operations like appending to a list or adding an entry to a dict. I suspect at least some people would even see using explicit locks for that (in threaded code) to be un-Pythonic.)
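The classic illustration of the kind of data race the GIL doesn't save you from is an unlocked '+=' on a shared counter: it's a read-modify-write sequence spread over several bytecodes, so threads can interleave between them and lose updates (whether you actually observe losses depends on the CPython version and timing):

```python
# Increment a shared counter from several threads without a lock. With
# an explicit threading.Lock around the increment the result is always
# 400000; without it, updates can be lost because 'counter += 1' is not
# atomic under the GIL.
import threading

counter = 0

def bump(n):
    global counter
    for _ in range(n):
        counter += 1

threads = [threading.Thread(target=bump, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # may well be less than 400000
```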
Some things on Django's CSRF protection, sessions, and REMOTE_USER
We have a Django application where we've had mysterious CSRF problems in the past, which I've theorized was partly because we use it behind Apache HTTP Basic Authentication. As part of recovering my understanding of Django and Apache HTTP Basic Authentication, I've been digging into how Django's CSRF protection works and how it interacts with all of this.
Our starting point is Django's documentation on Cross Site Request Forgery protection. How it works is that Django sets a CSRF cookie and then embeds a hidden form field; on form submission, the two pieces of information must be present and match (everyone does something like this). The CSRF cookie and the form field are both derived from a shared secret to protect from BREACH attacks. The important thing about this shared secret in some situations is, well, let me quote the documentation:
For security reasons, the value of the secret is changed each time a user logs in.
In a Django environment with normal authentication, it's clear when a user logs in; it's when they go through the Django login process, providing Django with a clear moment to establish an authenticated session, rotate secrets, and so on. In an environment where Django is instead relying on external authentication via REMOTE_USER, it's not so clear. The documentation says only that RemoteUserMiddleware will detect the username to authenticate and auto-login that user. The answer to this turns out to involve Django sessions.
When you have sessions enabled in Django, which you normally do, all requests have an associated session (visible in request.session). To simplify, important sessions are identified and tracked by browser cookies, with one created on the fly if necessary (along with a new session). A session may be anonymous or may be for an authenticated user. If the session object for the current request lacks an authenticated user but the request has a REMOTE_USER, RemoteUserMiddleware 'logs in' the indicated user, which will rotate the CSRF secret.
(I'm not sure how Django handles CSRF secrets for anonymous, unauthenticated people. Some versions appear to set the CSRF browser cookie without any session cookie.)
In the default Django configuration, this creates an important split between when you think you've logged in and when Django thinks you've logged in. You think you're logging in any time you have to enter your login and password for HTTP Basic Authentication (which is normally only once, until you quit the browser). However, Django only thinks you're logging in if your session is unauthenticated, and the session cookie Django sets in your browser normally lasts for two weeks (cf). Before then you can quit your browser, start it up again, re-do HTTP Basic Authentication, and not log in from Django's perspective because your session is still fine. Equally, you can keep your browser running and authenticated for more than two weeks, at which point your session cookie will expire and Django will consider you to be logging back in again (with a CSRF secret rotation) even though you were never challenged for a password.
(If you use the relevant setting to tell Django to use a browser session cookie to identify the Django session, you at least more or less synchronize Django's view of you logging in with your view of it.)
The other wrinkle is that if RemoteUserMiddleware sees an authenticated session for a request without REMOTE_USER set, it logs the session out. This is half-documented by implication, but you have to remember (or know) that 'all authenticated requests' means 'all requests with a session that thinks it's authenticated' (and the documentation doesn't actually say that your session gets logged out). This matters if part of your application is generally accessible (for anyone to submit an account request) while part of it is protected by HTTP Basic Authentication (for authorized people to approve those requests for accounts). Suppose that you go to approve an account request, which involves a CSRF protected form, but then pause and in another window go look at the unprotected account request submission page. You're now invisibly logged out, and when you submit the form in your first window, you will be logged back in, which triggers CSRF secret rotation, which invalidates the CSRF secret that underlies both the cookie and the form you just submitted.
To get around this, I think you want to use PersistentRemoteUserMiddleware instead. Or tell people not to do this.
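In settings terms the change is just a different middleware class (the two middleware class names here are Django's; the rest of the MIDDLEWARE list is illustrative):

```python
MIDDLEWARE = [
    # ...
    "django.contrib.sessions.middleware.SessionMiddleware",
    "django.contrib.auth.middleware.AuthenticationMiddleware",
    # PersistentRemoteUserMiddleware, unlike RemoteUserMiddleware, doesn't
    # log the session out just because a particular request arrived
    # without REMOTE_USER set.
    "django.contrib.auth.middleware.PersistentRemoteUserMiddleware",
    # ...
]
```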
(Much or all of this goes back at least to Django 1.10 and I don't think it changed between 1.10 and 1.11, so all of this still doesn't really explain our CSRF issue in 1.11. But at least I can now probably make problems much less likely in any version of Django.)
PS: One thing that the sessions documentation tells you that I didn't previously know is that in the default configuration where sessions are saved in your database, you need to clear old expired ones out of it periodically with 'django-admin clearsessions'. We hadn't been doing that, and so had entries for ones going back to 2016. The saving grace is that I don't think sessions get written to the database until they really have something in them, like an authenticated user; otherwise we'd have a lot more of them in the database than we do.
Django and Apache HTTP Basic Authentication (and REMOTE_USER)
We have a Django application, and part of it exists behind Apache HTTP Basic Authentication. For reasons beyond the scope of this entry, I was recently rediscovering some things about how Django interacts with Apache HTTP Basic Authentication, and so I want to write them down for myself before I forget them again.
First, the starting point in the Django documentation for this is not to search for 'HTTP Basic Authentication' or anything like that, but for the howto on authenticating with REMOTE_USER, which is the environment variable that Apache injects when it's already authenticated something. I believe that if you search for 'Django' with 'Basic Authentication' on search engines, you tend to get information about making Django or Django-related things actually perform the server side of HTTP Basic authentication itself. This is fair enough but can be confusing.
Second, you only need to configure Django itself to authenticate via REMOTE_USER if you want to use Django's own authentication for something, such as access and authorization in its admin site. It's perfectly valid (although potentially annoying) to authenticate and limit access to your Django site (or parts of it) in your Apache configuration with Apache's HTTP Basic Authentication but have a separate Django login step to access the Django admin site or even parts of your application (which will then be tracked with cookies and so on). If you want to do this, you don't want to add Django's RemoteUserMiddleware and so on into your Django settings.
(You'll have to manage Apache users and Django users separately, passwords included, and they won't be the same thing. This might wind up being confusing.)
If you do have Django authenticating with REMOTE_USER, you need your Django database superuser to be something you can authenticate with through Apache. If you cleverly set your database superuser to 'admin' but you have no 'admin' in your Basic Auth database, you will be sad. It's possible to get yourself out of this in a couple of ways, but it's better to avoid it in the first place.
(When you do have Django authenticating this way, every person who uses your Django app through HTTP Basic Authentication will wind up with an entry in the Django 'User' table. Purging old logins that no longer exist is up to you, if you care. For people who you want to be able to use the Django admin site, you need to set them as at least 'Staff' in the Django User table. You can set them as database superusers too.)
It's not necessary to use Django's REMOTE_USER support in order to make use of the authentication information yourself, as long as Apache has HTTP Basic Authentication active. You can retrieve the login name from the $REMOTE_USER environment variable and look it up in your own 'User' table by hand, as we do. You may or may not want to automatically create new entries for new users, the way Django does by default. We don't, because new people require some additional configuration on our side.
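The direct approach can be as simple as this sketch (what you then do with the login name against your own User table is application-specific and left out here):

```python
# Use Apache's authentication result directly, without Django's
# RemoteUserMiddleware: Apache only sets $REMOTE_USER in the CGI/WSGI
# environment after HTTP Basic Authentication has succeeded.
import os

def current_login(environ=os.environ):
    # Returns the authenticated login name, or None if Apache didn't
    # authenticate this request.
    return environ.get("REMOTE_USER")
```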
The corollary to this is that you can use and test your entire site under Apache HTTP Basic Authentication without having Django properly wired up to use REMOTE_USER, without noticing. I believe that this potentially actually matters, because I believe that Django does some things with sessions differently when you have the RemoteUser* things enabled, and this interacts with Django's CSRF protections. Which we've had mysterious problems with (also).