Installing Pip in Python 2 environments that don't provide it already
In theory any modern version of Python 2 (or Python 3) is bundled with pip, although it may be an out of date version that you could update (with something like 'python2 -m pip install --user --upgrade pip'). In practice, some Linux distributions split pip off into its own package and have stopped providing this separate package for their version of Python 2. This is definitely the case for Fedora 32, and may soon be the case for other distributions. If you still want a Python 2 version of Pip (for example so that you can keep updating the Python 2 version of the Python language server), you need to install one by hand, somehow.
When I had to do this on my Fedora 32 machine I was lucky enough that I had already done an update of the Python 2 pip on one machine where I used '--user' to install the new version in my $HOME, so I had all of the Pip code in .local/lib/python2.7/site-packages and could just copy it over, along with .local/bin/pip2. It turns out that this simple brute force approach is probably not necessary and there is a completely convenient alternative, which is different from the situation I expected before I started writing this entry.
(Since pip is normally installed with your Python, I expected that bootstrapping pip outside of that was not very well supported because it was infrequently used. For whatever reason, this is not at all the case currently.)
The pip people have an entire document on installing pip that walks you through a number of options. The important one for my case is Installing with get-pip.py, where you download a get-pip.py Python program to bootstrap pip. One of the options it supports is installing pip as a user package, resulting in a .local/bin/pip2 for you to use. The simple command line required is:
python2 get-pip.py --user
One of the reasons this works so well is that, well, get-pip is actually pip itself (the full version, as far as I know). The comment at the start of get-pip.py explains what is going on so well that I am just going to quote it wholesale:
You may be wondering what this giant blob of binary data here is, you might even be worried that we're up to something nefarious (good for you for being paranoid!). This is a base85 encoding of a zip file, this zip file contains an entire copy of pip (version 20.2.4).
Pip is a thing that installs packages, pip itself is a package that someone might want to install, especially if they're looking to run this get-pip.py script. Pip has a lot of code to deal with the security of installing packages, various edge cases on various platforms, and other such sort of "tribal knowledge" that has been encoded in its code base. Because of this we basically include an entire copy of pip inside this blob. We do this because the alternatives are attempt to implement a "minipip" that probably doesn't do things correctly and has weird edge cases, or compress pip itself down into a single file.
As a sysadmin, I fully support this very straightforward and functional approach to bootstrapping pip. The get-pip.py file that results is large for a Python program, but as installers (and executables) go, 1.9 Mbytes is not all that much.
However, there is a wrinkle probably coming up in the near future. Very soon, versions of pip itself will stop supporting Python 2; the official statement (currently here) is:
pip 20.3 was the last version of pip that supported Python 2. [...]
(The current version of pip is 20.3.3.)
The expected release date of pip 21.0 is some time this month. At some time after that point, get-pip.py may stop supporting Python 2 and you (I) will have a more difficult time bootstrapping the Python 2 version of pip on any machine I still need to add it on. Of course, at some point I will also stop having any use for a Python 2 pip, because the Python language server itself will drop support for Python 2 and I won't have any reason to upgrade my Python 2 version of it.
(Pip version 21.0 should fix, or at least work around, a long stall on startup that's experienced in some Linux configurations.)
PS: What PyPy will do about this is a good question, since they are so far planning to support Python 2 for a very long time. Perhaps they will freeze and ship pip 20.3.3 basically forever.
In Python 3, types are classes (as far as repr() is concerned)
In yesterday's entry, I put in a little aside, saying 'the distinction between what is considered a 'type' and what is considered a 'class' by repr() is somewhat arbitrary'. It turns out that this is not true in Python 3, which exposes an interesting difference between Python 2 and Python 3 and a bit of old Python 1 and Python 2 history too.
(So the sidebar in this old entry of mine is not applicable to Python 3.)
To start with, let's show the situation in Python 2:
>>> class A:
...     pass
>>> class B(object):
...     pass
>>> repr(A)
'<class __main__.A at 0x7fd804cacf30>'
>>> repr(B)
"<class '__main__.B'>"
>>> repr(type)
"<type 'type'>"
Old style and new style classes in Python 2 are reported slightly differently, but they are both 'class', while type (or any other built in type such as int) is reported as a 'type'. This distinction is made at a quite low level, as described in the sidebar in my old entry.
However, in Python 3 things have changed and repr()'s output is:
>>> class B(object):
...     pass
>>> repr(B)
"<class '__main__.B'>"
>>> repr(type)
"<class 'type'>"
Both Python classes and built-in types are 'class'. This change was specifically introduced in Python 3, as issue 2565 (the change appeared in 3.0a5). The issue's discussion has a hint as to what was going on here.
To simplify a bit, in Python 1.x, there was no unification between classes and built in types. As part of this difference, their repr() results were different in the way you'd expect; one said 'class' and the other said 'type'. When Python 2 came along, it unified types with new style classes. The initial implementation of this unification caused repr() to report new style classes as types. However, at some point relatively early in 2.x development, this code was changed to report new style classes as 'class ...' instead. What was reported for built in types was left unchanged for backwards compatibility with the Python 1.x output of repr() for types. In the run up to Python 3, this backwards compatibility was removed and now all built in types (or if you prefer, classes) are reported as classes.
(I was going to say something about what type() reports, but then I actually thought about it. In reality type() doesn't report any sort of string; type() returns an object, and if you're just running that in an interactive session the interpreter prints its repr() (str() of a class is normally the same thing). The reason to use 'repr(B)' instead of 'type(B)' in my interactive example is that 'type(B)' is B's metaclass, normally type itself, so its repr() doesn't show how B itself is described.)
Sidebar: The actual commit message for the 2001 era change
issue 2565 doesn't quote the full commit message, and it turns out that the omitted bit is interesting (especially since it's a change made by Guido van Rossum):
Change repr() of a new-style class to say <class 'ClassName'> rather than <type 'ClassName'>. Exception: if it's a built-in type or an extension type, continue to call it <type 'ClassName>. Call me a wimp, but I don't want to break more user code than necessary.
As far as I can tell from reading old Python changelogs, this change appeared in Python 2.2a4. In a way, this is surprisingly late in Python 2.x development. The 'what's new' snippet about the change reiterates that not changing the output for built in types is for backward compatibility:
The repr() of new-style classes has changed; instead of <type 'M.Foo'> a new-style class is now rendered as <class 'M.Foo'>, except for built-in types, which are still rendered as <type 'Foo'> (to avoid upsetting existing code that might parse or otherwise rely on repr() of certain type objects).
Of course, at that point it was also for compatibility with people relying on what repr() of built in types reported in 2.0 and 2.1.
In CPython, types implemented in C actually are part of the type tree
In Python, in theory all types descend from object (they are direct or indirect subclasses of it). For years, I've believed (and written) that this was not the case at the implementation level for types written in native C code in CPython (the standard implementation of Python and the one you're probably using). Types written in C might behave as if they descended from object, but I thought their behavior was actually entirely stand-alone, implemented by each type separately in C. Courtesy of Python behind the scenes #6: how Python object system works, I've discovered that I'm wrong.
In CPython, C level Python types are not literally subclasses of the C level version of object, because of course C doesn't have classes and subclasses in that sense. Instead, you usually describe your type by defining a PyTypeObject struct for it, with all sorts of fields that you fill in or don't fill in as you need them, including a tp_base field for your base type (if you want more than one base type, you need to take the alternate path of a heap type). When CPython needs to execute special methods or other operations on your type, it will directly use fields on your PyTypeObject structure (and as far as I know, it only uses those fields, with no fallbacks). On the surface, this looks like the tp_base field is essentially decorative and is only used to report your claimed __base__ if people ask.
However, there is a bit of CPython magic hiding behind the scenes. In order to actually use a PyTypeObject as a type, you must register it and make it ready by calling PyType_Ready. As part of this, PyType_Ready will use your type's tp_base to fill in various fields of your PyTypeObject if you didn't already do that, which effectively means that your C level type will inherit those fields from its base type (and so on all the way up to object). This is outlined in a section of the C API, but of course I never read the C API myself because I never needed to use it.
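You can observe the result of this wiring from the Python level, where built in types implemented in C really do report each other and object as their bases:

```python
# From Python you can see the tree that PyType_Ready sets up: C level
# built-in types genuinely inherit from each other and from object.
print(bool.__mro__)              # bool -> int -> object
print(int.__base__)              # object
print(issubclass(dict, object))  # True
```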
The how [the] Python object system works article has more details on how this works, if you're curious, along with details on how special methods also work (which is more interesting than I had any idea, and I've looked at this area before).
(The distinction between what is considered a 'type' and what is considered a 'class' by repr() is somewhat arbitrary; see the sidebar here. C level things defined with PyTypeObject will probably always be considered types instead of classes.)
Using constant Python hash functions for fun and no real profit
In one of the examples of wtfpython, the author uses a constant __hash__ function in order to make a version of plain dicts and ordered dicts that can be put in a set. When I saw this, I had some reactions.
My first reaction was to wonder if this was safe. With a lot of qualifications, the answer is yes. Two important qualities of a __hash__ function are that it always returns the same result for a given object and that it returns the same hash for any two objects that potentially compare the same (see also understanding hashing in Python). Returning a constant (here '0') makes both trivially true, provided that your objects cannot be equal to anything other than other instances of your class (or classes). Returning a constant hash for instances that aren't going to compare equal is safe, as object hashes don't have to be unique.
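A minimal sketch of the trick (the class name here is invented, not wtfpython's):

```python
# A dict subclass with a constant __hash__, so instances can be put
# into sets even though dicts are normally unhashable.
class HashableDict(dict):
    def __hash__(self):
        return 0  # constant: trivially stable, and equal objects agree

a = HashableDict(x=1)
b = HashableDict(x=2)
s = {a, b}
print(len(s))                  # 2: same hash, but unequal, so both stay
print(HashableDict(x=1) in s)  # True: hash 0 matches, equality finds a
```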
(This doesn't mean that you can safely mutate instances of your classes in ways that affect their equality comparison. Doing so is a great way to get two copies of the same key in a dict or a set, which is likely to be bad.)
My second reaction was to wonder if this was useful, and I think the answer is generally not really. The problem with a constant hash function is that it's going to guarantee dictionary key collisions for any such objects that you add to the dict or set. If you put very many objects with the same key into a dict (or a set), checking for a given key turns into doing an equality check on all of the other keys you've already added. Adding an entry, getting an entry, checking whether an entry is there, whatever, they all become a linear search.
If you don't have very many objects in total in a dict this is probably okay. A linear search through ten or twenty objects is not terrible (hopefully the equality check itself is efficient). Even a linear search through a hundred might be tolerable if it's important enough. But after a certain point you're going to see visible and significant slowdowns, and it would be more honest to use a list instead of a dict or set (since you're effectively getting the performance of a list).
If you need to do better, you probably want to go all of the way to implementing some sort of proper hash function that implements the rules of hashing in Python. If you're willing to live daringly, you don't have to make your objects literally immutable once created, you just have to never mutate them while they're in a dict or a set.
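A sketch of that more proper route: derive the hash from the same fields that equality uses, and then avoid mutating those fields while the object is in a dict or set (the class here is invented for illustration):

```python
class Point(object):
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __eq__(self, other):
        if not isinstance(other, Point):
            return NotImplemented
        return (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # Equal points hash equal, and unequal points usually differ,
        # so lookups stay near constant time instead of going linear.
        return hash((self.x, self.y))

s = {Point(1, 2), Point(1, 2), Point(3, 4)}
print(len(s))  # 2
```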
Logging fatal exceptions in my Python programs is not enough
We have a few Python programs which run automatically, need to produce very rigid output (or lack of output) to standard output and even standard error, and are complex enough (and use enough outside code) that they may reasonably run into unhandled exceptions. One example is our program to report on email attachment type information under Exim; this runs a lot of code on untrusted input, and our Exim configuration expects its output to have a pretty rigid format (cf). Allowing Python to dump out the normal unhandled exception to standard error is not what we wanted. So for years that program has had a chunk of top level code to catch and syslog otherwise unhandled exceptions. I wrote it, deployed it, and considered it all good.
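A minimal sketch of this sort of top level catch (not our actual code; the program name and the failure are invented):

```python
import syslog
import traceback

def main():
    # Stand-in for the real program; here it just fails internally.
    raise ValueError("simulated internal error")

try:
    main()
except Exception:
    # Send the whole traceback to syslog, one line per entry, instead
    # of letting Python dump it on standard error.
    syslog.openlog("attach-report")  # hypothetical program name
    for line in traceback.format_exc().splitlines():
        syslog.syslog(syslog.LOG_ERR, line)
```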
The other day I discovered that this program had been periodically experiencing, catching, and dutifully syslogging an exception about an internal error (caused by a package we use), going back months. In fact, more than one error about more than one thing. I hadn't known, because I don't normally go look through the logs for these exception traces. Why would I? They aren't supposed to happen and they mostly don't happen, and humans are very bad at consistently looking for things that don't happen.
Django has a very nice feature where it will email error reports to you, which has periodically been handy here. I'm not sure I trust myself to write that much code that absolutely must run, but I certainly could make my exception logging code also run an external script with very minimal arguments and that script could email me to notify me. Since the exception is being logged, I don't need a copy in email; I just need to know that I should go look at the logs.
(Django emails the whole exception along with a bunch of additional information, but I believe the email is the only place that information is captured. There are various tradeoffs here, but my starting point is that I'm already logging the exception.)
I could likely benefit from going through PyPI to see how other people have solved this particular problem, and maybe even use their code rather than write my own. I've traditionally avoided outside packages, but we're already using a bunch of them in this program as it is and I should probably get over that hangup in general.
(It helps that I'm slowly acquiring a better understanding of using pip in practice.)
In Python, using the logging package is part of your API, or should be
We have a Python program for logging email attachment type information. As part of doing this, it wants to peer inside various sorts of archive types to see what's inside of them, because malware puts bad stuff there. One of the Python modules we use for this is the Ubuntu packaged version of libarchive-c, which is a Python API for libarchive. Our program prints out information in a very specific output format, which our Exim configuration then reads and makes use of.
Very recently, I was looking at our logs for an email message and noticed that it had a very unusual status report. Normal status reports look like this:
1kX88D-0004Mb-PR attachment application/zip; MIME file ext: .zip; zip exts: .iso
This message's status report was:
Pathname cannot be converted from UTF-16BE to current locale.
That's not a message that our program emits. It's instead a warning message from the C libarchive library. However, it is not printed out directly by the C code; instead this report is passed up as an additional warning attached to the results of library calls. It is libarchive-c that is deciding to print it out, in a general FFI support function. More specifically, libarchive-c is deciding to 'log' it through the logging package; the default logging environment then prints it out to standard error.
(Our program does not otherwise use logging, and I had no idea it was in use until I tried to track this down.)
A program's output is often part of its API in practice. When code does things that in default conditions produce output, this alters the API of the program it is in. This should not be done casually. If warning information should be exposed, then it should be surfaced through an actual API (an accessible one), not thrown out randomly. If your code does use logging, this should be part of its documented API, not stuffed away in a corner as an implementation detail, because people will quite reasonably want to know this (so they can configure logging in general) and may want to turn it off.
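For illustration, a library can log through a named logger with a NullHandler attached, so nothing reaches stderr unless the program opts in, and a program that knows the logger name can quiet that library explicitly (the logger name here is invented; check a real library's documentation or source for its actual name):

```python
import logging

# Library side: a named logger with a NullHandler means warnings go
# nowhere by default instead of falling through to stderr.
liblog = logging.getLogger("somelib")  # hypothetical logger name
liblog.addHandler(logging.NullHandler())
liblog.warning("pathname conversion problem")  # silently dropped

# Program side: raise the level on that logger to suppress its
# routine warnings even after logging has been configured.
logging.getLogger("somelib").setLevel(logging.ERROR)
```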
In a related issue, notice that libarchive-c constructs the logger it will use at import time (here), before your Python code normally will have had a chance to configure logging, and will even use it at import time under some circumstances as it is dynamically building some bindings. I suspect that it is far from alone as far as constructing and even using its logger at import time goes.
(It's natural to configure logging as part of program startup, in your main() function or something descending from it, not at program load time before you start doing imports. This is especially the case since how you do logging in a program may depend on command line arguments or other configuration information.)
(This is the background for this tweet of mine.)
global statement and imports in functions
Python programmers are familiar with the global statement, which is how Python lets you assign to global variables inside functions (otherwise any variable that's assigned to is assumed to be a local variable). Well, that's not quite all that global does.
In languages like C, global variables must exist before you can use them in a function. In common Python usage of global, the variable is created at the module (global) level and then assigned to inside a function, in the rough analog of the C requirement:
aglobal = False

def enable_thing():
    global aglobal
    aglobal = True
There are good reasons to always create the variables at the module level, but Python does not actually require that you do this. You can actually create a new module level variable inside a function:

def set_thing():
    global a_new_name
    a_new_name = <something>
(If you read between the lines of the language specification, you can see that this is implied.)
Now, suppose that you want to import another module as part of initializing some things, but not do it when your module is first imported (for example, you might be dealing with a module that can be very slow to import). It turns out that you can do this; with global you can import something for module-wide use inside a function. The following works:

def import_slowmodule():
    global slowmodule
    import slowmodule

def use_slowmodule():
    slowmodule.something()
If you do an import inside a function, it normally binds the imported name only as a function local thing (as import defines names in the local scope). Using global changes that; when the module's name (or whatever you're importing it as) is declared as a global identifier, import binds the name at the module level.
(The actual CPython bytecode does imports in two operations; there is an IMPORT_NAME and then some form of STORE_FAST or, with a global in effect, STORE_GLOBAL.)
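You can see that store-opcode difference with the dis module (json stands in for the slow module here):

```python
import dis

def local_import():
    import json          # binds 'json' as a function local

def global_import():
    global json
    import json          # binds 'json' at the module level

# The compiled difference is in the store opcode after IMPORT_NAME.
print("STORE_FAST" in dis.Bytecode(local_import).dis())    # True
print("STORE_GLOBAL" in dis.Bytecode(global_import).dis()) # True
```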
This is sufficiently tricky and clever that if you need to use it, I think you should put a big comment at the top of the file to explain that there is a module that is conditionally imported at the module level and is not visible in your normal imports. Otherwise, sooner or later someone is going to get rather confused (and it may be a future you).
An illustration of why running code during import is a bad idea (and how it happens anyway)
It's a piece of received wisdom in Python programming that while you can make your module run code when it's import'd, you normally shouldn't. Importing a module is supposed to be both fast and predictable, doing as little as possible. But this rule is not always followed, and when it's not followed you can get bad results:
If you've remotely logged in to a Fedora machine (and have no console session there) and the python3-keyring package is installed, 'python3 -c "import keyring"' takes 25 seconds or so as the module tries to talk to keyrings on import and waits for some long timeouts. Nice work.
On the one hand this provides yet another poster child of why running code on import is very bad, since merely importing a module should clearly not stop your Python program for 25 seconds. On the other hand, I think that this case makes an interesting illustration of how it is possible to drift into this state through a reasonably sensible API choice.
Keyring has a notion of backends, which actually talk to the various different system keyring services. To use keyring, you need to pick a backend to use and initialize it, and by 'you' we mean 'keyring', because people calling keyring just want to use a generic API without having to care what backend is in use on this system. So when you import the keyring module, core.py picks and initializes a backend during the import:

# init the _keyring_backend
init_backend()
Automatically selecting and initializing a backend on import means that keyring's API is ready for callers to use right away without any further work. This is a friendly API, but assumes that everyone who imports keyring will go on to use it. While this sounds reasonable, a Python program may only need to talk to the keyring for some operations under some circumstances, and may mostly never use it. One such program is pip, which needs the keyring only rarely but imports it all of the time.
(Unconditional imports are the obvious and Pythonic thing to do. People look at you funny if your program does 'import' in a function or a class, and it's harder to use the result.)
However, selecting the backend on import has a drawback, at least on Linux, which is that keyring has to figure out which system keyring services are actually active right now, because in the Linux way there's more than one of them (keyring supports SecretStorage and direct use of KWallet, plus third party plugins). Since keyring has decided to choose the backend it will use at import time, it has to determine which of its supported system keyring services are active at import time.
Some of keyring's backends determine whether or not the corresponding system service is active by trying to make a DBus connection to the service. Under the right (or the wrong) circumstances, this DBus action can stall for a significant amount of time. For instance, you can see this in the kwallet backend code; it attempts to get the DBus object /modules/kwalletd5 from org.kde.kwalletd5. Under some circumstances, this DBus action can fail only after a long timeout, and now you have a 25 second import stall.
This import delay isn't a simple case where the keyring module is running a bunch of heavyweight code. Instead keyring is doing a potentially dangerous operation by talking to an outside service during import. It's not necessarily obvious that this is happening, because you need to understand both what happens in a specific backend and what's done at import time (and in isolation each piece sounds sensible). And a lot of time talking to the outside service will either work fine and be swift, or will fail immediately.
An issue with Pip installed packages and Python versions (on Unix)
Suppose, not hypothetically, that you want to install pyls, a LSP server for Python, so that you can use it with (for example) GNU Emacs' lsp-mode. Pyls is probably not packaged for your Unix (it's not for Fedora or Ubuntu), but you can install it with Pip (since it's on PyPI), either with 'sudo pip install' to install it system wide (which may conflict with your package manager) or with 'pip install --user' to install it just for you.
(If this is a shared Unix machine, you probably need to do the latter.)
Then you upgrade your Unix version (or it gets upgraded), for example from Fedora 31 to Fedora 32. Suddenly the pyls program doesn't work any more and even more puzzlingly, 'pip list --user' doesn't even list anything. It's as if your personal installation of pyls was somehow wiped out by the upgrade.
What's going on is that pip installs things under a path that is specific to the minor version of Python, and when the minor version changes in the upgrade, the new version of Python doesn't find your old packages because it's looking in a different place. Fedora 31 had Python 3.7, which expects to find your personal packages in ~/.local/lib/python3.7/site-packages, where pip put them for you. Fedora 32 has Python 3.8, which expects to find the same packages in ~/.local/lib/python3.8/site-packages, and ignores the versions in python3.7/site-packages.
(The same thing happens on Ubuntu, where 18.04 LTS has 3.6.9 and 20.04 LTS has 3.8.5.)
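You can see the version-specific directory your Python expects with the site module (the exact path varies by platform and distribution):

```python
import site
import sys

# e.g. (3, 8) and a path ending in 'python3.8/site-packages' on Linux
print(sys.version_info[:2])
print(site.getusersitepackages())
```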
As far as I can see there is no good way out of this. The same thing happens if you install things system wide with 'sudo pip install' (and I hope you kept notes on what you installed through pip and what was already put there by the system). I think that it also happens if you put pyls into a venv, because venvs normally use the system Python and inherit this version specific site-packages directory.
(There is a 'python3 -m venv --upgrade <dir>' venv command to upgrade the version of Python in a venv, but looking at the code suggests that it doesn't do anything to migrate installed packages to the new version. I can't test this, though, so perhaps I'm missing something.)
My personal solution was to just rename the ~/.local/lib/python3.7 directory to 'python3.8'. Pip seems happy with the result, as does pyls. The more correct approach is probably to restart from scratch and reinstall all packages and programs like pyls.
(This elaborates on a tweet of mine. At the time of the tweet I hadn't realized that this applies to basically all uses of pip to install things, not just 'pip --user'.)
Fifteen years of DWiki, the Python engine of Wandering Thoughts
DWiki, the wiki engine that underlies Wandering Thoughts (this blog), is fifteen years old. That makes it my oldest Python program that's in active, regular, and even somewhat demanding use (we serve up a bunch of requests a day, although mostly from syndication feed fetchers and bots on a typical day). As is usual for my long-lived Python programs, DWiki's not in any sort of active development, as you can see in its github repo, although I did add an important feature just last year (that's another story, though).
DWiki has undergone a long process of sporadic development, where I've added important features slowly over time (including performance improvements). This sporadic development generally means that I come back to DWiki's code each time having forgotten much of the details and have to recover them. Unfortunately this isn't as easy as I'd like and is definitely complicated by historical decisions that seemed right at the time but which have wound up creating some very tangled and unclear objects that sit at the core of various important processes.
(I try to add comments for what I've worked out when I revisit code. It's probably not always successful at helping future me on the next time through.)
DWiki itself has been extremely stable in operation and has essentially never blown up or hit an unhandled exception that wasn't caused by a very recent code change of mine. This stability is part of why I can ignore DWiki's code for long lengths of time. However, DWiki operates in an environment where DWiki processes are either transient or restarted on a regular basis; if it was a persistent daemon, more problems might have come up (or I might have been forced to pay more attention to reference leaks and similar issues).
Given that it's a Unix based project started in 2005, Python has been an excellent choice out of the options available at the time. Using Python has given me long life, great stability in the language (since I started as Python 2 was reaching stability and slowing down), good enough performance, and a degree of freedom and flexibility in coding that was probably invaluable as I was ignorantly fumbling my way through the problem space. Even today I'm not convinced that another language would make DWiki better or easier to write, and most of the other options might make it harder to operate in practice.
(To put it one way, the messy state of DWiki's code is not really because of the language it's written in.)
Several parts of Python's standard library have been very useful in making DWiki perform better without too much work; in particular, the various pickle modules make it essentially trivial to serialize an object to disk and then reload it later, in another process, which is at the core of DWiki's caching strategies. That you can pickle arbitrary objects inside your program without having to make many changes to them has let me easily add pickle based disk caches to various things without too much effort.
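The core pattern is only a few lines (the cached data and file handling here are invented for illustration, not DWiki's actual cache layout):

```python
import os
import pickle
import tempfile

# Some arbitrary in-memory object we want to cache on disk.
cache = {"rendered": "<p>hello</p>", "mtime": 1234567890}

# Serialize it to a file in one process...
fd, path = tempfile.mkstemp(suffix=".pickle")
with os.fdopen(fd, "wb") as f:
    pickle.dump(cache, f)

# ...and reload it later, possibly in an entirely different process.
with open(path, "rb") as f:
    reloaded = pickle.load(f)
os.unlink(path)

print(reloaded == cache)  # True
```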
At the same time, the very strong performance split in CPython between things implemented in C and things implemented in Python has definitely affected how DWiki is coded, not necessarily for the better. This is particularly obvious in the parsing of DWikiText, which is almost entirely done with complex regular expressions (some of them generated by code) because that's by far the fastest way to do it in CPython. The result is somewhat fragile in the face of potential changes to DWikiText and definitely hard for me to follow when I come back to it.
(With that said, I feel that parsing all wikitext dialects is a hard problem and a high performance parser is probably going to be tricky to write and follow regardless of the implementation language.)
DWiki is currently written in Python 2, but will probably eventually be ported to Python 3. I have no particular plans for when I'll try to do that for various reasons, although one of the places where I run a DWiki instance will probably drop Python 2 sooner or later and force my hand. Right now I would be happy to leave DWiki as a Python 2 program forever; Python 3 is nicer, but since I'm not changing DWiki much anyway I'll probably never use many of those nicer things in it.