2024-01-30
Putting a Python executable in venvs is probably a necessary thing
When I wrote about getting the Python LSP server working with venvs in a brute force way, Ian Z aka nobrowser commented (and I'm going to quote rather than paraphrase):
I'd say that venvs themselves are "aesthetically displeasing". After all, having a separate Python executable for every project differs from having a separate LSP in degree only.
On Unix, this separate executable is normally only a symbolic link, although other platforms may differ and the venv normally will have its own copy of pip, setuptools, and some other things, which can amount to 20+ Mbytes even on Linux. However, when I thought about it, I don't think there's any good option other than for the venv to have its own (nominal) copy of Python. The core problem is that venvs are very convenient when they're more or less transparently activated.
A Python venv is marked by a special file in the root of the venv, pyvenv.cfg. There are two ways that Python could plausibly decide when to automatically activate a venv without you having to set any environment variables; it can look around the environment of the Python executable you ran for this marker (which is what it does today), or it could look around the environment of your current directory, traversing up the filesystem to see if it could find a pyvenv.cfg (in much the same way that version control systems look for their special .git or .hg directory to mark the repository root).
The problem with automatically activating a venv based on what you find in the current directory and its parents is that it makes Python programs (and the Python interpreter) behave differently depending on where you are when you run them, including random system utilities that just happen to be written in Python. If the program requires any packages beyond the standard library, it may well fail outright because those packages aren't installed in the venv, and if they are installed in the venv they may not be the version the program needs or expects. This isn't a particularly good experience and I'm pretty confident that people would be very unhappy if this was what Python did with venvs.
The other option is to not automatically activate venvs at all and always require you to set environment variables (or the local equivalent). The problem for this is that it's a terrible experience for actually using venvs to, for example, deploy programs as encapsulated entities. You can't just ship the venv and have people run programs that have been installed into its bin/ subdirectory; now they need cover scripts to set the venv environment variables (which might be automatically generated by pip or whatever, but still).
So on the whole embedding the Python interpreter seems the best choice to me. That creates a clear logic to which venv is automatically activated, if any, that can be predicted by people; it's the venv whose Python you're running. Of course I wish it didn't take all of that disk space for extra copies of pip and setuptools, but you can't have everything.
2024-01-27
Getting the Python LSP server working with venvs the brute force way
Recently I wound up doing some Django work using a Python venv, since this is the easiest way to get a self-contained Python environment that has some version of Django (or other applications) installed. However, one part of the experience was a little bit less than ideal. I normally write Python using GNU Emacs and the Python LSP server, and this environment was complaining about being unable to find Django modules to do code intelligence things with them. A little thought told me why; GNU Emacs was running my regular LSP server, which is installed through pipx into its own venv, and that venv didn't have Django installed (of course).
As far as I know, the Python LSP server doesn't have any specific support for recognizing and dealing with venvs, and in general this is a bit difficult (the version of Python being used by a venv may not even match the version that pylsp is running with; in fact this was my case, since my installed pylsp was using pypy but the Django venv was using the system Python 3). Rather than try to investigate deeply into this, I decided to solve the problem with brute force, which is to say that I installed the Python LSP server (with the right set of plugins) into the venv, along with all of the rest of things, and then ran that instance of GNU Emacs with its $PATH set to use the venv's bin/ directory and pick up everything there, including its Python 3 and python-lsp-server.
This is a little bit aesthetically displeasing for at least two reasons. First, the Python LSP server and its plugins and their dependencies aren't a small thing and anyway they're not a runtime package dependency, it's purely for development convenience. Second, the usual style of using GNU Emacs is to start it once and then reuse that single Emacs instance for everything, which naturally gives that Emacs instance a single $PATH and make it want to use a single version of python-lsp-server. I'm okay with deviating from this bit of Emacs practice, but other people may be less happy.
(A hack that deals with the second issue would be a 'pylsp' cover script that hunts through the directory tree to see if you're running it from inside a venv and if that venv has its own 'pylsp' binary; if both are true, you run that pylsp instead of your regular system-wide one. I may write this hack someday, partly so that I can stop having to remember to add the venv to my $PATH any time I want to fire up Emacs on the code I'm working on in the venv.)
2024-01-19
A Django gotcha with Python 3 and the encoding of CharFields
We have a long standing Django web application, which has been static for a long time in Python 2. I've recently re-started working on moving it to Python 3 (an initial round of work was done some time ago), and in the process of this I ran into a surprising issue involving text encoding and database text fields (a 'CharField' in Django terminology).
As part of one database model, we have a random key, which for various reasons is represented in the database model as a modest size string:
class Request(models.Model): [...] access_hash = models.CharField(..., default=gen_random_hash)
(We use SQLite as our database, which may be relevant here.)
The actual access hash is a 64-bit random value read from /dev/urandom. We could represent this in a variety of ways; for instance, I could have just treated it as a 64-bit unsigned decimal number in string form, or a 64-bit (unsigned) hex number. But for no particularly strong reason, long ago I decided to base64 encode the raw random value. Omitting error checking, the existing version of this is:
def gen_random_hash(): fp = open("/dev/urandom", "rb") c = fp.read(8) fp.close() # Trim off trailing awkward '=' character return base64.urlsafe_b64encode(c)[:-1]
(The use of "rb" as the open mode stems from my first round of updates for Python 3.)
When I ran our web application under Python 3 in testing mode and looked at the uses of this access hash, I discovered that the URLs we were generating for it in email included a couple of ''' in them. Inspection of the database table entry itself in the Django admin interface showed that the actual value for access hash was, for example (and this is literal):
b'm5AWGlUSR1c'
(0x27 is ', so I was getting "b'm5AWGlUSR1c'" in the URL when written out for email in HTML. Once I looked I could see that giveaway leading 'b' too.)
There are two things going on here. The first is that in Python
3, base64.urlsafe_b64encode
operates on bytes (which we're giving it since we read /dev/urandom
in binary mode, making c
a bytes object) and returns a bytes
object, not a string. The second thing is that when we ask Django to
store a bytes object in a CharField (possibly only as the result of
a callable default value),
Django string-ized it, yielding the b'...' form as the stored value.
At one level this is reasonable; Django is doing its best to store some random Python object into a string field, probably by just doing a str() on it. At another level, I wish Django would specifically refuse to do this conversion for bytes objects, because one Python 3 issue is definitely bytes/str confusions and this specific representation conversion is almost certainly a bad idea, unlike str()'ing things in general. Raising an exception by default would be much more useful.
The solution is to explicitly convert to a Unicode str and specify a suitable character set encoding, which for base64 can be 'ascii':
return str(base64.urlsafe_b64encode(c)[:-1], "ascii")
This causes the CharField values to look like they should, which means that URLs using the access hash no longer have ''' in them.
Hopefully there aren't any other cases of this lurking in our Django web application, but I suppose I should do some more testing and examine the database for alarming characters (which is relatively readily done with the management dumpdata command).