PyPy starts fast enough for our Python 2 commands
Some day, Linux distributions like Ubuntu will stop packaging Python 2 even in the limited form they provide now. One way for us to deal with this would be to migrate all of our remaining little Python 2 programs and scripts to Python 3. Another option is to run them under PyPy, which says that it will always support Python 2.7.
One of the potential issues with PyPy is that its JIT has a high warm-up cost, which means that small, short-running programs are going to be slower, perhaps significantly slower. Most of the Python 2 that we have left is in small administrative commands that are mostly run automatically, where on the one hand I would expect PyPy's overhead to be at its largest and on the other hand we probably don't really care about the overhead if it's not too big. So I decided to do some quick tests.
(I've been hit by the startup overhead of small programs in Python even without PyPy, but it was in an interactive situation.)
I did my tests on one of our Ubuntu 20.04 servers, which has PyPy version 7.3.1, and the results turned out to be more interesting than I expected. The artificial and irrelevant worst case was a Python 3 program that went from about 0.05 second to about 0.17 second (under pypy3) to actually do its work. Our typical small Python 2 commands seem to go from 0.01 or 0.02 second to about 0.07 second or so. The surprising best case was a central program used for managing our password file, where the runtime under PyPy actually dropped from around 0.40 second to 0.33 second. And a heavily multithreaded program that runs a lot of concurrent ssh commands had essentially the same runtime on a different 20.04 machine.
(In retrospect, the password file processing program does have to process several thousand lines of text, so perhaps I should not have been surprised that it's CPU-intensive enough for PyPy to speed it up. Somehow it's in my mind as a small, lightweight thing.)
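(For illustration, here's a minimal sketch of the sort of before and after timing I mean. It's a reconstruction, not my actual test harness, and 'somecommand' is a hypothetical stand-in for one of our small Python 2 programs.)

import subprocess
import time

def avg_runtime(argv, runs=10):
    # Average wall-clock time to run a command to completion,
    # which includes interpreter startup and any JIT warm-up.
    start = time.monotonic()
    for _ in range(runs):
        subprocess.run(argv, stdout=subprocess.DEVNULL)
    return (time.monotonic() - start) / runs

for interp in ("python2", "pypy"):
    print(interp, avg_runtime([interp, "somecommand"]))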
All of this says that PyPy starts (and runs) our Python programs more than fast enough to serve us as an alternate implementation of Python 2 if we need to turn to it.
Packaging Python 2 doesn't mean that Linux distributions support it
One of the reasons I've been optimistic about Python 2's continued
afterlife for at least a few more years is that various Linux
distributions with long term support have packaged it in versions with
support that would last for years to come. Those distributions would
provide fixes for any security issues that came up, as they do for all
of their packages (more or less), and people running Python 2 elsewhere
could take those updated versions of Python 2, recompile them, and
use them even on platforms without that sort of support. The recent
ctypes security issue was the first serious test
of my optimistic belief. I'm afraid to report that it has partially failed.
As I write this, most Linux distributions that still provide Python 2 have provided an updated Python 2 package that fixes this issue; for instance, Fedora is updated. The relatively glaring exception that I know of is Ubuntu in 20.04 LTS. Although Ubuntu had an initial stumble in the updates for 16.04 LTS and 18.04 LTS, they have fixed them by now. Unfortunately there's no sign of any update for 20.04 LTS. Ubuntu knows that an update is needed (per their page for CVE-2021-3177), and they have the code update that they need (since they've fixed this in 18.04 and 16.04, including their fixed fix), but they aren't doing anything.
At one level this has surprised me. At another level, it shouldn't have. All of the Linux distributions have been clear that they want to get rid of Python 2 and are only still providing it reluctantly. In retrospect, it was optimistic to assume that despite this reluctance, all of the distributions would always still fix issues in all versions of Python 2 instead of shrugging and pointing out that in general, Python 2 had explicitly reached the end of its life. What's happened in Ubuntu 20.04 so far may be an accident, but it shouldn't surprise me if some day Linux distributions start doing this deliberately.
(Fortunately I don't think this issue is serious for us, so for now I feel that we're okay even on 20.04.)
PS: Not all Linux distributions are likely to stop updating Python 2. Red Hat Enterprise Linux especially has a serious commitment to long term bug fixes, so I do expect them to keep fixing their version of Python 2 for as long as they provide it in a supported RHEL version. Well, probably. Some things involving Red Hat Enterprise Linux have been shaken up recently.
ctypes security issue and Python 2
In the middle of February, the Python developers revealed that Python had been affected by a buffer overflow security issue, CVE-2021-3177. The relatively full details are covered in ctypes: Buffer overflow in PyCArg_repr, and conveniently the original bug report has a very simple reproduction that can also serve to test your Python to see if it's fixed:
$ python2
Python 2.7.17 (default, Feb 25 2021, 14:02:55)
>>> from ctypes import *
>>> c_double.from_param(1e300)
*** buffer overflow detected ***: python2 terminated
Aborted (core dumped)
(A fixed version will report '<cparam 'd' (1e+300)>' here.)
The official bug report only covers Python 3, because Python 2.7 is not supported any more, but as you can see here the bug is present in Python 2 as well (this is the Ubuntu 18.04 version, which is unfixed for reasons).
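(If you want to test an interpreter without blowing up your interactive session, something like the following sketch works; it's my own construction rather than anything official, and it just runs the reproduction in a subprocess.)

import subprocess

PROBE = "from ctypes import *; print(c_double.from_param(1e300))"

def is_fixed(interpreter):
    # A fixed interpreter prints the repr and exits 0; a vulnerable
    # one aborts on the buffer overflow with a non-zero exit status.
    proc = subprocess.run([interpreter, "-c", PROBE],
                          stdout=subprocess.PIPE,
                          stderr=subprocess.DEVNULL)
    return proc.returncode == 0 and b"cparam" in proc.stdout

for python in ("python2", "python3"):
    try:
        print(python, "fixed" if is_fixed(python) else "vulnerable")
    except OSError:
        print(python, "not installed")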
I'm on record as saying that it was very unlikely for security issues to be discovered in Python 2 after this long. Regardless of how significant this issue is in practice, I was and am wrong. A buffer overflow has lurked in the standard Python library, including in Python 2, and was only discovered after official support for Python 2 had stopped. There have been other recent security issues in Python 3, per Python security vulnerabilities, and some of them may also apply to Python 2 and be significant for you.
(Linux distributions are still fixing issues like this in Python 2. Well, more or less. Ubuntu hasn't worked out a successful fix for 18.04 and hasn't even tried one for 20.04, but Fedora has fixed the issue.)
This CVE is not an issue for our Python 2 code, where we don't use
ctypes. But it
does make me somewhat more concerned about our remaining Python 2
programs, for the simple reason that I was wrong about one of my beliefs
about Python 2 after its end of support. To use a metaphor, what I
thought was a strong, well-inspected pillar has turned out to have some
previously unnoticed cracks of a sort that matter, even if they've
not yet been spotted in an area that's load-bearing for us. Also,
now I should clearly be keeping an eye on Python security issues and
testing new ones (if possible) to see if they apply to Python 2. If they
do, we'll need to explicitly consider what programs of ours might be affected.
(The answer is often likely to be 'no programs are affected', but we can no longer take for granted that the issues are not serious and don't affect Python 2 or us.)
As far as the severity of this issue goes, on the one hand buffer overruns are quite bad, but on the other hand this is a relatively obscure corner of Python for most people. This is not the sort of Python security issue that would let people break ordinary Python 2 programs (and I still think that those are very unlikely by now). But I'm a bit biased here, since we're not going to drop everything and port all of our remaining Python 2 programs to Python 3 right now (well, not unless we absolutely have to).
(People's views of the severity may vary; these are just mine.)
PS: To be explicit, this issue has not changed my view that it's
reasonable (and not irresponsible) to continue running Python 2
programs and code. This is not a great sign for people who use
ctypes, but it's not a fatal vulnerability or a major problem sign.
Where the default values for Python function arguments are stored
One of the things that surprised me when I was researching yesterday's entry on using 'is' with literals was that I couldn't work out where (C)Python kept the default values for function arguments. In the end I didn't need to know for sure because I was able to demonstrate that function default argument values are shared with constants used in the function code, but it bugged me. Today I worked it out and now I can show some more interesting things.
In the end, finding the answer was as simple as reading the documentation for the inspect module. Constants used in code are found in the co_consts attribute on code objects, but the default values for function arguments are found in the __defaults__ and __kwdefaults__ attributes of function objects.
Once I thought about it this split made a lot of sense. Code objects
can come from many sources (for instance,
compile()) and not all of
those sources actually have any concept of arguments (with or without
default arguments). So attaching 'default values for function arguments'
to code objects would be wrong; they need to go on function objects,
where they make sense.
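(As a quick illustration of my own, a code object you get from compile() has constants but no notion of default argument values:)

>>> co = compile("x + 1", "<demo>", "eval")
>>> co.co_consts
(1,)
>>> hasattr(co, "__defaults__")
False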
(The difference in naming style is (likely) due to Python 3 limiting how much code it was willing to break and force people to update. In Python 2, function default argument values are exposed in a func_defaults attribute, along with a number of other func_* ones. In Python 3, all of those were renamed to __<name>__ versions, while code objects had their attribute names left alone. If Python was being recreated from scratch today, I suspect that code objects would have only __<name>__ attributes too.)
This means that CPython's constant interning is being somewhat more clever than I expected. Since default argument values don't go in co_consts, CPython is somehow building an overall constant pool, then (re)using it for both co_consts and __defaults__. CPython is definitely making use of the same objects in these two attributes, which I can now demonstrate in a different and more direct way than I did in the last entry:
>>> def a(b=3000):
...   return b == 3000
...
>>> a.__defaults__
(3000,)
>>> a.__code__.co_consts
(None, 3000)
>>> a.__defaults__[0] is a.__code__.co_consts[1]
True
One of the minor uses of function
__defaults__ that I can see is to
examine the current state of function default argument values, just in
case someone has managed to mutate one of them.
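(The classic case of this is a mutable default argument; here's my own quick illustration:)

>>> def f(x, lst=[]):
...   lst.append(x)
...   return lst
...
>>> f(1)
[1]
>>> f(2)
[1, 2]
>>> f.__defaults__
([1, 2],)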
PS: In reading the CPython code, I discovered that you can actually set new values for these default argument values by assigning to __defaults__. This is described in the Python data model sort of implicitly (because it lists the __defaults__ field as writable, and what other effect would that have).
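(A quick demonstration of my own that the assignment works; __defaults__ must be a tuple or None:)

>>> def a(b=3000):
...   return b
...
>>> a()
3000
>>> a.__defaults__ = (42,)
>>> a()
42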
An interesting issue around using is with a literal in Python
For reasons outside of the scope of this entry, I recently installed Ubuntu's package (for Ubuntu 20.04) of the Python netaddr module on a system of ours. When I did, I got an interesting Python warning that I hadn't seen before:
SyntaxWarning: "is not" with a literal. Did you mean "!="?
I was curious enough to look up the code in question, which boils down to something that looked like this:
def int_to_bits(int_val, word_size, num_words, word_sep=''):
    [...]
    if word_sep is not '':
        [...]
(The current code replaces this with a '!=' comparison, which is what the other similar code in that file uses. Ubuntu being Ubuntu, they will probably never update or fix the 20.04 'python3-netaddr' package.)
The intention of this code is clear; it wants to check if you supplied your own word_sep argument. On the one hand, using 'is not' here is not the correct thing to do. When you use 'is not' this way you need to have a sentinel object, not a sentinel value, and this code uses the value '', the empty string. On the other hand, this code actually works, for at least three reasons. One of them might be surprising.
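(For contrast, here is a minimal sketch of the sentinel object version; _MISSING is my own name, not anything from netaddr:)

_MISSING = object()

def int_to_bits(int_val, word_size, num_words, word_sep=_MISSING):
    # Each object() is unique, so 'is not' reliably detects whether
    # the caller supplied their own word_sep, even if they passed ''.
    if word_sep is not _MISSING:
        ...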
The first reason the code works is mechanical, because I left out
the body of the
if and the rest of the code that actually uses
word_sep. Here is the almost full code:
if word_sep is not '':
    if not _is_str(word_sep):
        raise ValueError(...)

return word_sep.join(bit_words)
So the only thing the code does differently if it thinks that it has a non-default word_sep is check that it really is a string. Since the empty string passes that check, everything is fine. Given this, the 'is not' check isn't all that necessary; you could just as well always check to see that word_sep is a string. However, this first reason is specific to the code itself.
The second and third reasons are general, and would happen regardless of what use the code made of word_sep and what it did in the body of the if. I'll start by presenting the second reason in illustrated form:
>>> def a(b=''):
...   return b is not ''
...
<stdin>:2: SyntaxWarning: "is not" with a literal. Did you mean "!="?
>>> a()
False
>>> a(b='')
False
In CPython, a number of specific strings and other (immutable)
values are what is called interned. Regardless of
how many times they're used in different places all over your Python
code, there's only ever one instance of these values. For instance,
there is only one instance of an empty tuple, '
()', and only one
instance of many small integers. Integers are especially useful to
illustrate this vividly, because you can manipulate current ones
to create new values:
>>> a = 10
>>> b = 5
>>> c = 4
>>> (b+c+1) is a
True
If you change a to be 300 and b to be 295, this will be False (as of Python 3.8.7).
The empty string, '', is one of those interned (string) values. All copies of the empty string are the same object, regardless of where they come from. Because they're the same object, you can use object identity ('is') to compare values to them and it will always work. This is of course not guaranteed by the language specification or by CPython, but it's such a fundamental optimization that it would be very unusual if it ever stopped being the case. Still, you should use '!=' and not be so tricky.
The third reason is best presented in illustrated form again:

>>> def a(b=3000):
...   return b is 3000
[...]
>>> a()
True
>>> a(b=3000)
False
This is another CPython optimization, but it's an optimization
within a single function. When CPython is generating the bytecode
for a function it's smart enough to only keep one copy of every
constant value, and this merging of constants includes the default
arguments. So within the
a function, the integer '3000' of the
b default value and the integer literal '3000' from the code are
the same object and '
is' will tell you this. However, an integer
of '3000' that comes from the outside is a different object (since
3000 is a large enough integer that Python doesn't intern it).
This optimization is probably going to stay in CPython, but I would strongly suggest that you not take advantage of it in your code. Just do as the warning says and don't use 'is' or 'is not' on literals. The very slight performance improvement you might get from exploiting this isn't worth the confusion you're going to create.
Time for Python 2 users to make sure we have a copy of Pip and other pieces
The news of the time period is that as the Pip developers said they
would, the just released Pip 21.0 has dropped support for Python
2 (via). In
theory this doesn't matter for modern users of Python 2.7, because
Python itself should ship with a bundled version of pip so that you
don't have to install one from scratch. In practice, some Linux
distributions split the
pip command off into a separate sub-package
and no longer make a Python 2 version available, although they
continue to ship Python 2 itself for legacy uses (this is the case
in Fedora 32 and later). Now that Pip no longer supports Python 2,
I wouldn't be surprised if more Linux (and Unix) distributions did
this, because Pip's change means that they need to ship two different
versions of Pip.
The release of Pip 21.0 and all of this has made me realize that now is
a good time for us (and for all Python 2 users) to make sure that we
have our own working copy of Pip, quite possibly the latest one. This
is unfortunately now more difficult than it used to be just a week or
two ago, because the special get-pip.py Python program is now Pip 21.0, and so won't work with Python 2 any more. Fortunately you can still find the older version of get-pip.py for Python 2.7. You
might want to save a copy, or go all the way to making a '--user'
install of the Python 2 pip in some user account and saving the
.local/lib/python2.7/site-packages/pip and other artifacts that you get.
To the best of my knowledge they can be moved around freely between machines.
Of course, Pip itself relies on the Python Package Index (PyPI). My understanding is that PyPI has not made any
announcements about dropping support for Python 2 packages or for
Python 2 versions of
pip fetching packages, but I wouldn't count
on either being available forever. If you have additional dependencies
for your Python 2 programs, it's probably a good time to make sure
you have local copies of them. This especially includes dependencies
that you get through packages provided by your Linux distribution
(for example as Ubuntu packages), because pretty much every Linux
distribution will be dropping most or all of their additional Python
2 packages soon (if they haven't already).
(If nothing else, someday PyPI may change its API in a way that requires changes to Pip and other programs that talk to it.)
I also wouldn't be surprised if Pip's move prompts more third party Python packages to drop support for Python 2, which is of course a movement that's been going on for some time now. Presumably this doesn't matter much to most Python 2 people, who have probably already more or less frozen their package versions.
(Fortunately we have very little to worry about. I believe that almost all of our remaining Python 2 code uses only built in modules, not third-party packages. Our major consumer of third party packages is already a Python 3 program.)
PS: While PyPy is going to provide a Python 2.7 implementation for the foreseeable future, I wouldn't count on the rest of the Python ecosystem to support it (Pip included, obviously). People who use the Python 2 PyPy, perhaps someday including us, will be on their own.
Installing Pip in Python 2 environments that don't provide it already
In theory any modern version of Python 2 (or Python 3) comes bundled with pip, although it may be an out of date version that you could update (with something like 'python2 -m pip install --user --upgrade pip'). In practice, some Linux distributions split pip off into its own package and have stopped providing this separate package
for their version of Python 2. This is definitely the case for
Fedora 32, and may soon be the case for other distributions. If you
still want a Python 2 version of Pip (for example so that you can
keep updating the Python 2 version of the Python language server), you need to install one by hand, somehow.
When I had to do this on my Fedora 32 machine I was lucky enough
that I had already done an update of the Python 2 pip on one machine
where I used '
--user' to install the new version in my $HOME, so
I had all of the Pip code in .local/lib/python2.7/site-packages and
could just copy it over, along with .local/bin/pip2. It turns out
that this simple brute force approach is probably not necessary
and there is a completely convenient alternative, which is different
than the situation I expected before I started writing this entry.
(Since pip is normally installed with your Python, I expected that bootstrapping pip outside of that was not very well supported because it was infrequently used. For whatever reason, this is not at all the case currently.)
The pip people have an entire document on installing pip that walks you through a number of options. The important one for my case is Installing with get-pip.py, where you download a get-pip.py Python program to bootstrap pip. One of the options it supports is installing pip as a user package, resulting in a .local/bin/pip2 for you to use. The simple command line required is:
python2 get-pip.py --user
One of the reasons this works so well is that, well, get-pip is
actually pip itself (the full version, as far as I know). The comment
at the start of
get-pip.py explains what is going on so well that
I am just going to quote it wholesale:
You may be wondering what this giant blob of binary data here is, you might even be worried that we're up to something nefarious (good for you for being paranoid!). This is a base85 encoding of a zip file, this zip file contains an entire copy of pip (version 20.2.4).
Pip is a thing that installs packages, pip itself is a package that someone might want to install, especially if they're looking to run this get-pip.py script. Pip has a lot of code to deal with the security of installing packages, various edge cases on various platforms, and other such sort of "tribal knowledge" that has been encoded in its code base. Because of this we basically include an entire copy of pip inside this blob. We do this because the alternatives are attempt to implement a "minipip" that probably doesn't do things correctly and has weird edge cases, or compress pip itself down into a single file.
As a sysadmin, I fully support this very straightforward and functional approach to bootstrapping pip. The get-pip.py file that results is large for a Python program, but as installers (and executables) go, 1.9 Mbytes is not all that much.
However, there is a wrinkle probably coming up in the near future. Very soon, versions of pip itself will stop supporting Python 2; the official statement (currently here) is:
pip 20.3 was the last version of pip that supported Python 2. [...]
(The current version of pip is 20.3.3.)
The expected release date of pip 21.0 is some time this month. At some time after that point, get-pip.py may stop supporting Python 2 and you (I) will have a more difficult time bootstrapping the Python 2 version of pip on any machine I still need to add it on. Of course, at some point I will also stop having any use for a Python 2 pip, because the Python language server itself will drop support for Python 2 and I won't have any reason to upgrade my Python 2 version of it.
(Pip version 21.0 should fix, or at least work around, a long stall on startup that's experienced in some Linux configurations.)
PS: What PyPy will do about this is a good question, since they are so far planning to support Python 2 for a very long time. Perhaps they will freeze and ship pip 20.3.3 basically forever.
In Python 3, types are classes (as far as repr() is concerned)
In yesterday's entry, I put in a little
aside, saying 'the distinction between what is considered a 'type'
and what is considered a 'class' by
repr() is somewhat arbitrary'.
It turns out that this is not true in Python 3, which exposes an
interesting difference between Python 2 and Python 3 and a bit of
old Python 1 and Python 2 history too.
(So the sidebar in this old entry of mine is not applicable to Python 3.)
To start with, let's show the situation in Python 2:
>>> class A:
...   pass
...
>>> class B(object):
...   pass
...
>>> repr(A)
'<class __main__.A at 0x7fd804cacf30>'
>>> repr(B)
"<class '__main__.B'>"
>>> repr(type)
"<type 'type'>"
Old style and new style classes in Python 2 are reported slightly differently, but they are both 'class', while type (or any other built in type such as int) is reported as a 'type'. This distinction is made at a quite low level, as described in the sidebar in my old entry.
However, in Python 3 things have changed and repr()'s output is uniform:
>>> class B(object):
...   pass
...
>>> repr(B)
"<class '__main__.B'>"
>>> repr(type)
"<class 'type'>"
Both Python classes and built-in types are 'class'. This change was specifically introduced in Python 3, as issue 2565 (the change appeared in 3.0a5). The issue's discussion has a hint as to what was going on here.
To simplify a bit, in Python 1.x, there was no unification between
classes and built in types. As part of this difference, their
repr() results were different in the way you'd expect; one said
'class' and the other said 'type'. When new style classes came along during Python 2's development, they unified types and classes. The initial implementation of this unification caused repr() to report new style classes as types. However, at some point relatively early in this work, the code was changed to report new style classes as 'class ...' instead. What was reported for built in types was left unchanged for backwards compatibility with the Python 1.x output of repr(). In the run up to Python 3, this backwards compatibility was removed and now all built in types (or if you prefer, classes) are reported as 'class'.
(I was going to say something about what type() reports, but then I actually thought about it. In reality type() doesn't report any sort of string; type() returns an object, and if you're just running that in an interactive session the interpreter prints it with str(), which for classes is normally the same as repr(). The reason to use 'repr(B)' instead of 'type(B)' in my interactive example is that 'type(B)' is the metaclass, type itself, rather than B.)
Sidebar: The actual commit message for the 2001 era change
issue 2565 doesn't quote the full commit message, and it turns out that the omitted bit is interesting (especially since it's a change made by Guido van Rossum):
Change repr() of a new-style class to say <class 'ClassName'> rather than <type 'ClassName'>. Exception: if it's a built-in type or an extension type, continue to call it <type 'ClassName'>. Call me a wimp, but I don't want to break more user code than necessary.
As far as I can tell from reading old Python changelogs, this change appeared in Python 2.2a4. In a way, this is surprisingly late in Python 2.x development. The 'what's new' snippet about the change reiterates that not changing the output for built in types is for backward compatibility:
The repr() of new-style classes has changed; instead of <type 'M.Foo'> a new-style class is now rendered as <class 'M.Foo'>, except for built-in types, which are still rendered as <type 'Foo'> (to avoid upsetting existing code that might parse or otherwise rely on repr() of certain type objects).
Of course, at that point it was also for compatibility with people relying on what repr() of built in types reported in 2.0 and 2.1.
In CPython, types implemented in C actually are part of the type tree
In Python, in theory all types descend from
object (they are
direct or indirect subclasses of it). For years, I've believed (and
written) that this was not the case at the implementation level for
types written in native C code in CPython (the standard implementation
of Python and the one you're probably using). Types written in C
might behave as if they descended from
object, but I thought their
behavior was actually entirely stand-alone, implemented by each
type separately in C. Courtesy of Python behind the scenes #6:
how Python object system works,
I've discovered that I'm wrong.
In CPython, C level Python types are not literally subclasses of
the C level version of
object, because of course C doesn't have
classes and subclasses in that sense. Instead, you usually describe
your type by defining a
PyTypeObject struct for it, with
all sorts of fields that you fill in or don't fill in as you need
them, including a tp_base field for your base type (if you want more than one base type, you need
to take the alternate path of a heap type). When
CPython needs to execute special methods or other operations on
your type, it will directly use fields on your PyTypeObject structure (and as far as I know, it only uses those fields, with no fallbacks). On the surface, this looks like the tp_base field is essentially decorative and is only used to report your claimed __base__ if people ask.
However, there is a bit of CPython magic hiding behind the scenes.
In order to actually use a
PyTypeObject as a type, you must
register it and make it ready by calling
PyType_Ready. As part of making your type ready, PyType_Ready will use your type's
tp_base to fill
in various fields of your
PyTypeObject if you didn't already do
that, which effectively means that your C level type will inherit
those fields from its base type (and so on all the way up to
object). This is outlined in a section of the C API, but of course
I never read the C API myself because I never needed to use it.
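(You can see the effect of this from the Python level; a quick check of my own:)

>>> int.__base__
<class 'object'>
>>> bool.__base__
<class 'int'>
>>> bool.__mro__
(<class 'bool'>, <class 'int'>, <class 'object'>)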
The how [the] Python object system works article has more details on how this works, if you're curious, along with details on how special methods also work (which is more interesting than I had any idea, and I've looked at this area before).
(The distinction between what is considered a 'type' and what is considered a 'class' by repr() is somewhat arbitrary; see the sidebar here. C level things defined with PyTypeObject will probably always be considered types instead of classes.)
Using constant Python hash functions for fun and no real profit
In one of the examples of wtfpython,
the author uses a constant
__hash__ function in order to make a
version of plain dicts and ordered dicts that can be put in a set.
When I saw this, I had some reactions.
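(Here is a minimal sketch of my own of the trick, not wtfpython's exact code:)

class HashableDict(dict):
    def __hash__(self):
        # A constant hash is legal: equal objects trivially get equal
        # hashes, at the cost of guaranteed collisions in dicts and sets.
        return 0

s = {HashableDict(a=1), HashableDict(b=2), HashableDict(a=1)}
print(len(s))   # 2; the duplicate compares equal and is dropped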
My first reaction was to wonder if this was safe. With a lot of
qualifications, the answer is yes. Two important qualities of a
__hash__ function are that it always return the same result for
a given object and that it returns the same hash for any two objects
that potentially compare the same (see also understanding hashing
in Python). Returning a constant (here '0')
makes both trivially true, provided that your objects cannot be
equal to anything other than other instances of your class (or
classes). Returning a constant hash for instances that aren't going
to compare equal is safe, as object hashes don't have to be unique.
(This doesn't mean that you can safely mutate instances of your classes in ways that affect their equality comparison. Doing so is a great way to get two copies of the same key in a dict or a set, which is likely to be bad.)
My second reaction was to wonder if this was useful, and I think the answer is generally not really. The problem with a constant hash function is that it's going to guarantee dictionary key collisions for any such objects that you add to the dict or set. If you put very many objects with the same key into a dict (or a set), checking for a given key turns into doing an equality check on all of the other keys you've already added. Adding an entry, getting an entry, checking whether an entry is there, whatever, they all become a linear search.
If you don't have very many objects in total in a dict this is probably okay. A linear search through ten or twenty objects is not terrible (hopefully the equality check itself is efficient). Even a linear search through a hundred might be tolerable if it's important enough. But after a certain point you're going to see visible and significant slowdowns, and it would be more honest to use a list instead of a dict or set (since you're effectively getting the performance of a list).
If you need to do better, you probably want to go all of the way to implementing some sort of proper hash function that implements the rules of hashing in Python. If you're willing to live daringly, you don't have to make your objects literally immutable once created, you just have to never mutate them while they're in a dict or a set.
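(A sketch of my own of that direction: hash the contents, and then treat instances as frozen while they're being used as keys.)

class FrozenDict(dict):
    def __hash__(self):
        # Hashing the contents requires all values to be hashable, and
        # assumes the dict is never mutated while it's in a dict or set.
        return hash(frozenset(self.items()))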