Wandering Thoughts

2021-01-13

Installing Pip in Python 2 environments that don't provide it already

In theory any modern version of Python 2 (or Python 3) is bundled with pip, although it may be an out of date version that you could update (with something like 'python2 -m pip install --user --upgrade pip'). In practice, some Linux distributions split pip off into its own package and have stopped providing this separate package for their version of Python 2. This is definitely the case for Fedora 32, and may soon be the case for other distributions. If you still want a Python 2 version of Pip (for example so that you can keep updating the Python 2 version of the Python language server), you need to install one by hand, somehow.

When I had to do this on my Fedora 32 machine, I was lucky enough that I had already updated the Python 2 pip on one machine using '--user', which installed the new version in my $HOME. So I had all of the Pip code in .local/lib/python2.7/site-packages and could just copy it over, along with .local/bin/pip2. It turns out that this simple brute force approach is probably not necessary and that there is a perfectly convenient alternative, which is different from the situation I expected before I started writing this entry.

(Since pip is normally installed with your Python, I expected that bootstrapping pip outside of that was not very well supported because it was infrequently used. For whatever reason, this is not at all the case currently.)

The pip people have an entire document on installing pip that walks you through a number of options. The important one for my case is Installing with get-pip.py, where you download a get-pip.py Python program to bootstrap pip. One of the options it supports is installing pip as a user package, resulting in a .local/bin/pip2 for you to use. The simple command line required is:

python2 get-pip.py --user

One of the reasons this works so well is that, well, get-pip is actually pip itself (the full version, as far as I know). The comment at the start of get-pip.py explains what is going on so well that I am just going to quote it wholesale:

Hi There!
You may be wondering what this giant blob of binary data here is, you might even be worried that we're up to something nefarious (good for you for being paranoid!). This is a base85 encoding of a zip file, this zip file contains an entire copy of pip (version 20.2.4).

Pip is a thing that installs packages, pip itself is a package that someone might want to install, especially if they're looking to run this get-pip.py script. Pip has a lot of code to deal with the security of installing packages, various edge cases on various platforms, and other such sort of "tribal knowledge" that has been encoded in its code base. Because of this we basically include an entire copy of pip inside this blob. We do this because the alternatives are attempt to implement a "minipip" that probably doesn't do things correctly and has weird edge cases, or compress pip itself down into a single file.

As a sysadmin, I fully support this very straightforward and functional approach to bootstrapping pip. The get-pip.py file that results is large for a Python program, but as installers (and executables) go, 1.9 Mbytes is not all that much.

However, there is a wrinkle probably coming up in the near future. Very soon, versions of pip itself will stop supporting Python 2; the official statement (currently here) is:

pip 20.3 was the last version of pip that supported Python 2. [...]

(The current version of pip is 20.3.3.)

The expected release date of pip 21.0 is some time this month. At some time after that point, get-pip.py may stop supporting Python 2 and you (I) will have a more difficult time bootstrapping the Python 2 version of pip on any machine I still need to add it on. Of course, at some point I will also stop having any use for a Python 2 pip, because the Python language server itself will drop support for Python 2 and I won't have any reason to upgrade my Python 2 version of it.

(Pip version 21.0 should fix, or at least work around, a long stall on startup that's experienced in some Linux configurations.)

PS: What PyPy will do about this is a good question, since they are so far planning to support Python 2 for a very long time. Perhaps they will freeze and ship pip 20.3.3 basically forever.

Python2GettingPip written at 22:53:44; Add Comment

2020-12-25

In Python 3, types are classes (as far as repr() is concerned)

In yesterday's entry, I put in a little aside, saying 'the distinction between what is considered a 'type' and what is considered a 'class' by repr() is somewhat arbitrary'. It turns out that this is not true in Python 3, which exposes an interesting difference between Python 2 and Python 3 and a bit of old Python 1 and Python 2 history too.

(So the sidebar in this old entry of mine is not applicable to Python 3.)

To start with, let's show the situation in Python 2:

>>> class A:
...     pass
>>> class B(object):
...     pass
>>> repr(A)
'<class __main__.A at 0x7fd804cacf30>'
>>> repr(B)
"<class '__main__.B'>"
>>> repr(type)
"<type 'type'>"

Old style and new style classes in Python 2 are reported slightly differently, but both are a 'class', while type (or any other built in type such as int) is a 'type'. This distinction is made at a quite low level, as described in the sidebar in my old entry.

However, in Python 3 things have changed and repr()'s output is uniform:

>>> class B(object):
...   pass
>>> repr(B)
"<class '__main__.B'>"
>>> repr(type)
"<class 'type'>"

Both Python classes and built-in types are 'class'. This change was specifically introduced in Python 3, as issue 2565 (the change appeared in 3.0a5). The issue's discussion has a hint as to what was going on here.

To simplify a bit, in Python 1.x there was no unification between classes and built in types. As part of this difference, their repr() results were different in the way you'd expect; one said 'class' and the other said 'type'. When Python 2 unified types and new style classes (as part of the 2.2 development cycle), the initial implementation caused repr() to report new style classes as types. However, partway through that work the code was changed to report new style classes as 'class ...' instead. What was reported for built in types was left unchanged, for backwards compatibility with the Python 1.x output of repr(). In the run up to Python 3, this backwards compatibility was removed, and now all built in types (or if you prefer, classes) are reported as classes.

(I was going to say something about what type() reports, but then I actually thought about it. type() doesn't produce any sort of string; it returns an object, and if you're just running 'type(B)' in an interactive session, the interpreter prints that object's repr(). The reason to use 'repr(B)' instead of 'type(B)' in my interactive example is that 'type(B)' is type itself, not B.)
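To make that concrete, here is what the interactive interpreter shows for the same class B (this is standard Python 3 behavior, shown purely as an illustration):

>>> type(B)
<class 'type'>
>>> repr(B)
"<class '__main__.B'>"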

Sidebar: The actual commit message for the 2001 era change

issue 2565 doesn't quote the full commit message, and it turns out that the omitted bit is interesting (especially since it's a change made by Guido van Rossum):

Change repr() of a new-style class to say <class 'ClassName'> rather than <type 'ClassName'>. Exception: if it's a built-in type or an extension type, continue to call it <type 'ClassName>. Call me a wimp, but I don't want to break more user code than necessary.

As far as I can tell from reading old Python changelogs, this change appeared in Python 2.2a4, surprisingly far into the 2.2 release cycle that introduced new style classes in the first place. The 'what's new' snippet about the change reiterates that not changing the output for built in types is for backward compatibility:

The repr() of new-style classes has changed; instead of <type 'M.Foo'> a new-style class is now rendered as <class 'M.Foo'>, except for built-in types, which are still rendered as <type 'Foo'> (to avoid upsetting existing code that might parse or otherwise rely on repr() of certain type objects).

Of course, at that point it was also for compatibility with people relying on what repr() of built in types reported in 2.0 and 2.1.

Python3TypesAreClasses written at 00:32:25; Add Comment

2020-12-24

In CPython, types implemented in C actually are part of the type tree

In Python, in theory all types descend from object (they are direct or indirect subclasses of it). For years, I've believed (and written) that this was not the case at the implementation level for types written in native C code in CPython (the standard implementation of Python and the one you're probably using). Types written in C might behave as if they descended from object, but I thought their behavior was actually entirely stand-alone, implemented by each type separately in C. Courtesy of Python behind the scenes #6: how Python object system works, I've discovered that I'm wrong.

In CPython, C level Python types are not literally subclasses of the C level version of object, because of course C doesn't have classes and subclasses in that sense. Instead, you usually describe your type by defining a PyTypeObject struct for it, with all sorts of fields that you fill in or don't fill in as you need them, including a tp_base field for your base type (if you want more than one base type, you need to take the alternate path of a heap type). When CPython needs to execute special methods or other operations on your type, it will directly use fields on your PyTypeObject structure (and as far as I know, it only uses those fields, with no fallbacks). On the surface, this looks like the tp_base field is essentially decorative and is only used to report your claimed __base__ if people ask.

However, there is a bit of CPython magic hiding behind the scenes. In order to actually use a PyTypeObject as a type, you must register it and make it ready by calling PyType_Ready. As part of this, PyType_Ready will use your type's tp_base to fill in various fields of your PyTypeObject if you didn't already do that, which effectively means that your C level type will inherit those fields from its base type (and so on all the way up to object). This is outlined in a section of the C API, but of course I never read the C API myself because I never needed to use it. The how [the] Python object system works article has more details on how this works, if you're curious, along with details on how special methods also work (which is more interesting than I had any idea, and I've looked at this area before).
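(As a quick Python level check of my own, not something from the article, you can see the visible side of this type tree interactively in Python 3; even C implemented types report object as their base and have it in their MRO.)

>>> dict.__base__
<class 'object'>
>>> int.__mro__
(<class 'int'>, <class 'object'>)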

(The distinction between what is considered a 'type' and what is considered a 'class' by repr() is somewhat arbitrary; see the sidebar here. C level things defined with PyTypeObject will probably always be considered types instead of classes.)

CPythonCTypesHaveTree written at 00:05:53; Add Comment

2020-12-23

Using constant Python hash functions for fun and no real profit

In one of the examples of wtfpython, the author uses a constant __hash__ function in order to make a version of plain dicts and ordered dicts that can be put in a set. When I saw this, I had some reactions.
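(The trick in question looks roughly like this. This is my own minimal version of the idea, with a made up class name, not the wtfpython code itself.)

class HashableDict(dict):
    # A constant hash makes instances hashable; dict's inherited __eq__
    # still decides whether two of them are actually equal.
    def __hash__(self):
        return 0

s = {HashableDict(a=1), HashableDict(b=2)}  # both end up in the set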

My first reaction was to wonder if this was safe. With a lot of qualifications, the answer is yes. Two important qualities of a __hash__ function are that it always returns the same result for a given object and that it returns the same hash for any two objects that may compare equal (see also understanding hashing in Python). Returning a constant (here '0') makes both trivially true, provided that your objects cannot be equal to anything other than other instances of your class (or classes). Returning the same constant hash for instances that aren't going to compare equal is safe, as object hashes don't have to be unique.

(This doesn't mean that you can safely mutate instances of your classes in ways that affect their equality comparison. Doing so is a great way to get two copies of the same key in a dict or a set, which is likely to be bad.)

My second reaction was to wonder if this was useful, and I think the answer is generally not really. The problem with a constant hash function is that it guarantees hash collisions for every such object that you add to a dict or set. If you put very many objects with the same hash into a dict (or a set), checking for a given key turns into doing an equality check against all of the other keys you've already added. Adding an entry, getting an entry, checking whether an entry is there, whatever, they all become a linear search.

If you don't have very many objects in total in a dict this is probably okay. A linear search through ten or twenty objects is not terrible (hopefully the equality check itself is efficient). Even a linear search through a hundred might be tolerable if it's important enough. But after a certain point you're going to see visible and significant slowdowns, and it would be more honest to use a list instead of a dict or set (since you're effectively getting the performance of a list).

If you need to do better, you probably want to go all of the way to writing a proper hash function that follows the rules of hashing in Python. If you're willing to live daringly, you don't have to make your objects literally immutable once created; you just have to never mutate them while they're in a dict or a set.
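For the record, one common way to do that is to hash an immutable view of the contents, assuming the values themselves are hashable (this is a sketch of my own, not anything from wtfpython):

class FrozenishDict(dict):
    # Only safe if you never mutate an instance while it's a key in a
    # dict or a member of a set.
    def __hash__(self):
        return hash(frozenset(self.items()))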

ConstantHashFunctions written at 00:27:47; Add Comment

2020-11-11

Logging fatal exceptions in my Python programs is not enough

We have a few Python programs which run automatically, need to produce very rigid output (or lack of output) to standard output and even standard error, and are complex enough (and use enough outside code) that they may reasonably run into unhandled exceptions. One example is our program to report on email attachment type information under Exim; this runs a lot of code on untrusted input, and our Exim configuration expects its output to have a pretty rigid format (cf). Allowing Python to dump out the normal unhandled exception to standard error is not what we wanted. So for years that program has had a chunk of top level code to catch and syslog otherwise unhandled exceptions. I wrote it, deployed it, and considered it all good.
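Stripped down, that top level code is basically a wrapper around the program's real main function. The following sketch gives the general shape (the program name, syslog facility, and other details here are made up, not our actual code):

import sys
import syslog
import traceback

def main():
    ...     # the program's real work goes here

if __name__ == "__main__":
    try:
        main()
    except Exception:
        syslog.openlog("attachment-logger", syslog.LOG_PID, syslog.LOG_MAIL)
        for line in traceback.format_exc().splitlines():
            syslog.syslog(syslog.LOG_ERR, line)
        sys.exit(1)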

The other day I discovered that this program had been periodically experiencing, catching, and dutifully syslogging an exception about an internal error (caused by a package we use), going back months. In fact, more than one error about more than one thing. I hadn't known, because I don't normally go look through the logs for these exception traces. Why would I? They aren't supposed to happen and they mostly don't happen, and humans are very bad at consistently looking for things that don't happen.

Django has a very nice feature where it will email error reports to you, which has periodically been handy here. I'm not sure I trust myself to write that much code that absolutely must run, but I certainly could make my exception logging code also run an external script with very minimal arguments and have that script email me a notification. Since the exception is being logged, I don't need a copy in email; I just need to know that I should go look at the logs.

(Django emails the whole exception along with a bunch of additional information, but I believe the email is the only place that information is captured. There are various tradeoffs here, but my starting point is that I'm already logging the exception.)
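The 'run a small external script' idea might look something like the following; the script path and its argument are invented for illustration, and the real thing would be called from the exception handler above:

import subprocess

def notify_of_exception(progname):
    # Notification is strictly best effort; a broken notifier must never
    # take down the program whose exception we're trying to report.
    try:
        subprocess.run(["/usr/local/sbin/notify-exception", progname],
                       timeout=30, check=False)
    except Exception:
        pass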

I could likely benefit from going through PyPI to see how other people have solved this particular problem, and maybe even use their code rather than write my own. I've traditionally avoided outside packages, but we're already using a bunch of them in this program as it is and I should probably get over that hangup in general.

(It helps that I'm slowly acquiring a better understanding of using pip in practice.)

ExceptionNotificationNeed written at 00:45:46; Add Comment

2020-11-04

In Python, using the logging package is part of your API, or should be

We have a Python program for logging email attachment type information. As part of doing this, it wants to peer inside various sorts of archive types to see what's inside of them, because malware puts bad stuff there. One of the Python modules we use for this is the Ubuntu packaged version of libarchive-c, which is a Python API for libarchive. Our program prints out information in a very specific output format, which our Exim configuration then reads and makes use of.

Very recently, I was looking at our logs for an email message and noticed that it had a very unusual status report. Normal status reports look like this:

1kX88D-0004Mb-PR attachment application/zip; MIME file ext: .zip; zip exts: .iso

This message's status report was:

Pathname cannot be converted from UTF-16BE to current locale.

That's not a message that our program emits. It's instead a warning message from the C libarchive library. However, it is not printed out directly by the C code; this report is passed up as an additional warning attached to the results of library calls. It is libarchive-c that decides to print it out, in a general FFI support function. More specifically, libarchive-c 'logs' it through the Python logging package, and the default logging environment then prints it out to standard error.

(Our program does not otherwise use logging, and I had no idea it was in use until I tried to track this down.)

A program's output is often part of its API in practice. When library code produces output under default conditions, it alters the API of the program it is part of, and this should not be done casually. If warning information should be exposed, it should be surfaced through an actual, accessible API, not thrown out at random. If your code does use logging, this should be part of its documented API rather than stuffed away in a corner as an implementation detail, because people will quite reasonably want to know (so they can configure logging in general) and may want to turn it off.
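(Once you know the logger exists, the caller's side of turning it off is small. Assuming libarchive-c's logger is simply named after its package, which I haven't verified, something like this keeps its messages from falling through to logging's handler of last resort:)

import logging

# Give the library's logger a do-nothing handler so that logging's
# 'handler of last resort' never prints its warnings to stderr.
logging.getLogger("libarchive").addHandler(logging.NullHandler())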

In a related issue, notice that libarchive-c constructs the logger it will use at import time (here), before your Python code normally will have had a chance to configure logging, and will even use it at import time under some circumstances (here and here), as it is dynamically building some bindings. I suspect that it is far from alone as far as constructing and even using its logger at import time goes.

(It's natural to configure logging as part of program startup, in a main() function or something descending from it, not at program load time before you start doing imports. This is especially the case since how you do logging in a program may depend on command line arguments or other configuration information.)

(This is the background for this tweet of mine.)

LoggingPackageAndYourAPI written at 23:45:30; Add Comment

2020-11-01

Python's global statement and imports in functions

Python programmers are familiar with the global statement, which is how Python lets you assign to global variables inside functions (otherwise any variable that's assigned to is assumed to be a local variable). Well, that's not quite what global does.

In languages like C, global variables must exist before you can use them in a function. In common Python usage of global, the variable is created at the module (global) level and then assigned to inside a function, in the rough analog of the C requirement:

aglobal = False
def enable_thing():
  global aglobal
  aglobal = True

There are good reasons to always create such variables at the module level, but Python does not actually require that you do this. You can create a brand new module level variable inside a function using global:

def set_thing():
  global a_new_name
  a_new_name = <something>

(If you read between the lines of the language specification, you can see that this is implied.)

Now, suppose that you want to import another module as part of initializing some things, but not do it when your module is import'ed (for example, you might be dealing with a module that can be very slow to import). It turns out that you can do this; with global you can import something for module-wide use inside a function. The following works:

def import_slowmodule():
  global slowmodule
  import slowmodule

def use_slowmodule():
  slowmodule.something()

If you do import inside a function, it normally binds the imported name only as a function local thing (as import defines names in the local scope). However, global changes that; when the module's name (or whatever you're importing it as) is set as a global identifier, import binds the name at the module level.

(The actual CPython bytecode does imports in two operations; there is an IMPORT_NAME and then some form of STORE_* bytecode, normally STORE_FAST for a function-local import or, with a global statement in effect, STORE_GLOBAL.)
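You can see this two step sequence for yourself with the dis module; the exact disassembly varies between CPython versions, but the two opcodes are what matter:

import dis

def import_slowmodule():
    global slowmodule
    import slowmodule

dis.dis(import_slowmodule)
# The output includes (roughly):
#   IMPORT_NAME   slowmodule
#   STORE_GLOBAL  slowmodule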

This is sufficiently tricky and clever that if you need to use it, I think you should put a big comment at the top of the file to explain that there is a module that is conditionally imported at the module level that is not visible in your normal import list. Otherwise, sooner or later someone is going to get rather confused (and it may be a future you).

GlobalAndImports written at 23:45:57; Add Comment

2020-10-29

An illustration of why running code during import is a bad idea (and how it happens anyway)

It's a piece of received wisdom in Python programming that while you can make your module run code when it's import'd, you normally shouldn't. Importing a module is supposed to be both fast and predictable, doing as little as possible. But this rule is not always followed, and when it's not followed you can get bad results:

If you've remotely logged in to a Fedora machine (and have no console session there) and the python3-keyring package is installed, 'python3 -c "import keyring"' takes 25 seconds or so as the module tries to talk to keyrings on import and waits for some long timeouts. Nice work.

(The keyring module (also) provides "an easy way to access the system keyring service".)

On the one hand this provides yet another poster child of why running code on import is very bad, since merely importing a module should clearly not stop your Python program for 25 seconds. On the other hand, I think that this case makes an interesting illustration of how it is possible to drift into this state through a reasonably sensible API choice.

Keyring has a notion of backends, which actually talk to the various different system keyring services. To use keyring, you need to pick a backend to use and initialize it, and by 'you' we mean 'keyring', because people calling keyring just want to use a generic API without having to care what backend is in use on this system. So when you import the keyring module, core.py picks and initializes a backend during the import:

# init the _keyring_backend
init_backend()

Automatically selecting and initializing a backend on import means that keyring's API is ready for callers to use right away without any further work. This is a friendly API, but assumes that everyone who imports keyring will go on to use it. While this sounds reasonable, a Python program may only need to talk to the keyring for some operations under some circumstances, and may mostly never use it. One such program is pip, which needs the keyring only rarely but imports it all of the time.

(Unconditional imports are the obvious and Pythonic thing to do. People look at you funny if your program does 'import' in a function or a class, and it's harder to use the result.)
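(The mitigation on the consumer side is to defer the import until the keyring is actually needed, looks funny or not. Here's a sketch; lookup_password() is a name I made up, but keyring.get_password() is keyring's actual API.)

def lookup_password(service, username):
    # Pay the cost of importing keyring (and its import time backend
    # probing) only when we actually need a credential.
    import keyring
    return keyring.get_password(service, username)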

However, selecting the backend on import has a drawback, at least on Linux, which is that keyring has to figure out which system keyring services are actually active right now, because in the Linux way there's more than one of them (keyring supports SecretStorage and direct use of KWallet, plus third party plugins). Since keyring has decided to choose the backend it will use at import time, it has to determine which of its supported system keyring services are active at import time.

Some of keyring's backends determine whether or not the corresponding system service is active by trying to make a DBus connection to the service. Under the right (or the wrong) circumstances, this DBus action can stall for a significant amount of time. For instance, you can see this in the kwallet backend code; it attempts to get the DBus object /modules/kwalletd5 from org.kde.kwalletd5. Under some circumstances, this DBus action can fail only after a long timeout, and now you have a 25 second import delay.

This import delay isn't a simple case where the keyring module is running a bunch of heavyweight code. Instead keyring is doing a potentially dangerous operation by talking to an outside service during import. It's not necessarily obvious that this is happening, because you need to understand both what happens in a specific backend and what's done at import time (and in isolation each piece sounds sensible). And a lot of the time, talking to the outside service will either work fine and be swift, or will fail immediately.

ImportTimeCodeStall written at 00:54:21; Add Comment

2020-10-28

An issue with Pip installed packages and Python versions (on Unix)

Suppose, not hypothetically, that you want to install pyls, an LSP server for Python, so that you can use it with (for example) GNU Emacs' lsp-mode. Pyls is probably not packaged for your Unix (it's not for Fedora or Ubuntu), but you can install it with Pip (since it's on PyPI), either with 'sudo pip install' to install it system wide (which may conflict with your package manager) or with 'pip install --user' to install it just for you.

(If this is a shared Unix machine, you probably need to do the latter.)

Then you upgrade your Unix version (or it gets upgraded), for example from Fedora 31 to Fedora 32. Suddenly the pyls program doesn't work any more and even more puzzlingly, 'pip list --user' doesn't even list anything. It's as if your personal installation of pyls was somehow wiped out by the upgrade.

What's going on is that pip installs things under a path that is specific to the minor version of Python, and when the minor version changes in the upgrade, the new version of Python doesn't find your old packages because it's looking in a different place. Fedora 31 had Python 3.7, which expects to find your personal packages in ~/.local/lib/python3.7/site-packages, where pip put them for you. Fedora 32 has Python 3.8, which expects to find the same packages in ~/.local/lib/python3.8/site-packages, and ignores the versions in python3.7/site-packages.

(The same thing happens on Ubuntu, where 18.04 LTS has 3.6.9 and 20.04 LTS has 3.8.5.)
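You can ask Python where it expects per-user packages to live, which makes the version dependence easy to see (a quick check of my own, not from the original discussion):

import site, sys

print(sys.version_info[:2])        # e.g. (3, 7) on Fedora 31, (3, 8) on Fedora 32
print(site.getusersitepackages())  # ends in lib/pythonX.Y/site-packages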

As far as I can see there is no good way out of this. The same thing happens if you install things system wide with 'sudo pip install' (and I hope you kept notes on what you installed through pip and what was already put there by the system). I think it also happens if you put pyls into a venv, because venvs normally use the system Python and have their own version specific site-packages directory.

(There is a 'python3 -m venv --upgrade <dir>' venv command to upgrade the version of Python in a venv, but looking at the code suggests that it doesn't do anything to migrate installed packages to the new version. I can't test this, though, so perhaps I'm missing something.)

My personal solution was to just rename the ~/.local/lib/python3.7 directory to 'python3.8'. Pip seems happy with the result, as does pyls. The more correct approach is probably to restart from scratch and reinstall all packages and programs like pyls.

(This elaborates on a tweet of mine. At the time of the tweet I hadn't realized that this applies to basically all uses of pip to install things, not just 'pip --user'.)

PipPythonVersionIssue written at 00:41:52; Add Comment

2020-10-26

Fifteen years of DWiki, the Python engine of Wandering Thoughts

DWiki, the wiki engine that underlies Wandering Thoughts (this blog), is fifteen years old. That makes it my oldest Python program that's in active, regular, and even somewhat demanding use (we serve up a bunch of requests a day, although on a typical day mostly from syndication feed fetchers and bots). As is usual for my long-lived Python programs, DWiki's not in any sort of active development, as you can see in its github repo, although I did add an important feature just last year (that's another story, though).

DWiki has undergone a long process of sporadic development, where I've added important features slowly over time (including performance improvements). This sporadic development generally means that I come back to DWiki's code each time having forgotten much of the details and have to recover them. Unfortunately this isn't as easy as I'd like and is definitely complicated by historical decisions that seemed right at the time but which have wound up creating some very tangled and unclear objects that sit at the core of various important processes.

(I try to add comments for what I've worked out when I revisit code. It's probably not always successful at helping future me on the next time through.)

DWiki itself has been extremely stable in operation and has essentially never blown up or hit an unhandled exception that wasn't caused by a very recent code change of mine. This stability is part of why I can ignore DWiki's code for long lengths of time. However, DWiki operates in an environment where DWiki processes are either transient or restarted on a regular basis; if it was a persistent daemon, more problems might have come up (or I might have been forced to pay more attention to reference leaks and similar issues).

Given that it's a Unix based project started in 2005, Python has been an excellent choice out of the options available at the time. Using Python has given DWiki a long life, great stability in the language (since I started just as Python 2 was reaching stability and slowing down), good enough performance, and a degree of freedom and flexibility in coding that was probably invaluable as I was ignorantly fumbling my way through the problem space. Even today I'm not convinced that another language would make DWiki better or easier to write, and most of the other options might make it harder to operate in practice.

(To put it one way, the messy state of DWiki's code is not really because of the language it's written in.)

Several parts of Python's standard library have been very useful in making DWiki perform better without too much work, especially pickle. The various pickle modules make it essentially trivial to serialize an object to disk and then reload it later, in another process, which is at the core of DWiki's caching strategies. That you can pickle arbitrary objects inside your program without having to make many changes to them has let me easily add pickle based disk caches to various things without too much effort.
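The basic pattern is about as small as serialization gets (this is a generic sketch, not DWiki's actual cache code):

import pickle

def save_cache(path, obj):
    with open(path, "wb") as f:
        pickle.dump(obj, f)

def load_cache(path):
    with open(path, "rb") as f:
        return pickle.load(f)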

At the same time, the very strong performance split in CPython between things implemented in C and things implemented in Python has definitely affected how DWiki is coded, not necessarily for the better. This is particularly obvious in the parsing of DWikiText, which is almost entirely done with complex regular expressions (some of them generated by code) because that's by far the fastest way to do it in CPython. The result is somewhat fragile in the face of potential changes to DWikiText and definitely hard for me to follow when I come back to it.

(With that said, I feel that parsing all wikitext dialects is a hard problem and a high performance parser is probably going to be tricky to write and follow regardless of the implementation language.)

DWiki is currently written in Python 2, but will probably eventually be ported to Python 3. I have no particular plans for when I'll try to do that for various reasons, although one of the places where I run a DWiki instance will probably drop Python 2 sooner or later and force my hand. Right now I would be happy to leave DWiki as a Python 2 program forever; Python 3 is nicer, but since I'm not changing DWiki much anyway I'll probably never use many of those nicer things in it.

DWikiFifteenYears written at 00:14:28; Add Comment
