Wandering Thoughts

2020-06-29

Adapting our Django web app to changing requirements by not doing much

We have a Django web application to handle (Unix) account requests, which is now nine years old. I've called this utility code, but I mentioned recently that over that time there have been some changes in how graduate students were handled that needed some changes in the application. Except not very much change was necessary, in some ways, and in other ways the changes are hacks. So here are some stories of those changes.

When we (I) initially wrote the web application, our model of how new graduate students got Unix accounts was straightforward. All graduate students were doing a thesis (either a Masters or a PhD) and so all of them had a supervising professor. As a long standing matter of policy, that supervisor was their account sponsor, and so approved their account request. Professors can also sponsor accounts for other people associated with them, such as postdocs.

(This model already has a little glitch; some students are co-supervised by more than one professor. Our system requires one to be picked as the account sponsor, instead of somehow recording them as co-sponsored, which has various consequences that no one has complained about so far.)

The first change that showed up was that the department developed a new graduate program, the Master of Science in Applied Computing. Graduate students in the MScAC program don't write a thesis and as a result they don't have a supervising professor. As it happened, we already had a model for solving this, because Unix accounts for administrative and technical staff are not sponsored by professors either; they have special non-professor sponsors. So we added another such special sponsor for MScAC students. This was not sufficient by itself, because the account request system sometimes emails new graduate students and the way those messages were written assumed that the student's sponsor was supervising them.

Rather than develop a general solution to this, we took the brute force solution of an '{% if ... %}' condition in the relevant Django template. Because of how our data is set up, this condition has to both reach through several foreign keys and use a fixed text match against a magic name, instead of checking any sort of flag or status marker (because no such status marker was in the original data model). Fortunately the name it matches against is not exposed to people, because the official name for the program has actually changed over time but our internal name has never been updated (partly because it was burned into the text template). This is a hack, but it works.

The second change is that while all graduate students must eventually get a specific supervisor, not all of them have one initially when they arrive. In particular, there is one research group that accepts most new graduate students collectively and then sorts out who they will be supervised by later, once the graduate students know more about the group and their own interests. In the past, this had been solved artificially by assigning nominal sponsors immediately even if they weren't going to be the student's supervisor, but eventually the group got tired of this and asked us to do better. The solution here was similar to the MScAC program (and staff accounts); we invented a synthetic 'supervisor' for them, with a suitable generic name. Unlike with the MScAC program, we didn't customize the Django templates for this new situation, and unfortunately the result does look a little ugly and awkward.

(This is where a general solution would have been useful. If we were templating this from a database table or the like, we could have just added a new entry for this general research group case. Adding another Django '{% if ... %}' to the template would have made it too tangled, so we didn't.)
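
As a minimal sketch of what 'templating this from a table' could look like, here is the idea in plain Python rather than Django. Everything in it is hypothetical: the 'kind' marker, the table, the message wording, and the names are made up for illustration and are not what our application has.

```python
# Hypothetical sketch: none of these names or messages come from the real app.
from dataclasses import dataclass

# One entry per kind of sponsor. Supporting a new case means adding a row
# here, not another branch in the message template.
SPONSOR_BLURBS = {
    "professor": "Your account sponsor is your supervisor, {sponsor}.",
    "mscac": "Your account is sponsored through the MScAC program.",
    "group": "Your account is sponsored by your research group for now.",
}


@dataclass
class Sponsor:
    name: str
    kind: str = "professor"  # the status marker the original data model lacks


def sponsor_blurb(sponsor: Sponsor) -> str:
    """Pick the message text for a sponsor, defaulting to the professor wording."""
    template = SPONSOR_BLURBS.get(sponsor.kind, SPONSOR_BLURBS["professor"])
    return template.format(sponsor=sponsor.name)


print(sponsor_blurb(Sponsor("A. Professor")))
print(sponsor_blurb(Sponsor("MScAC Graduate Student", kind="mscac")))
```

The point of the table is only that the research group case becomes one more entry; whether the table would live in the database or in code is a separate choice.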

I don't think we did anything clever in our Django application's code or its data model. A lot of the changes we were able to make were inherent in having a system that was driven by database tables and being able to add relatively arbitrary things to those tables (with some hacks involved). Where our changes start breaking down is exactly where the limitations of that start appearing, such as multiple cases in templates when we didn't design that into the database.

(Could we have added it later? Perhaps. But I've always been too nervous about database migrations to modify our original database tables, partly because I've never done one with Django. This is a silly fear and in some ways it's holding back the evolution of our web application.)

PS: You might think that properly dealing with the co-supervision situation would make the research group situation easy to deal with, by just having new graduate students 'co-sponsored' by the entire research group. It's actually not clear if this is the right answer, because the situations are somewhat different on the Unix side. When you actively work with a supervisor, you normally get added to their Unix group so you can access group-specific things (if there are any), so for co-supervisors you should really get added to the Unix groups for both supervisors. However, it's not clear if people collectively sponsored by a research group should be added to every professor's Unix group in the same way. This implies that the Django application should know the difference between the two cases so that it can signal our Unix account creation process to treat them differently.

Sidebar: Our name hack for account sponsors

When someone goes to our web page to request an account, they have to choose their sponsor from a big <select> list of them. The list is sorted on the sponsor's last name, to make it easier to find. The idea of 'first name' and 'last name' is somewhat tricky (as is their order), and automatically picking them out from a text string is even harder. So we deal with the problem the other way around. Our Django data model has a 'first name' and a 'last name' field, but what they really mean is 'optional first part of the name' and 'last part of the name (that will determine the sort order)'.

As part of this, the synthetic account sponsors generally don't have a 'first name', because we want them to sort in order based on the full description (such as 'MScAC Graduate Student', which sorts in M not G or S).

(Sorting on 'last name' is somewhat arbitrary, but part of it is that we expect people requesting accounts to be more familiar with the last name of their sponsor than the first name.)
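
Here is a small sketch of the effect of this in plain Python instead of the Django model. The class, the field names, and the people are all invented for the example; only the 'MScAC Graduate Student' label comes from the description above.

```python
from typing import NamedTuple


class Sponsor(NamedTuple):
    first: str  # optional first part of the name; may be empty
    last: str   # last part of the name; this determines the sort order

    def display(self) -> str:
        return f"{self.first} {self.last}".strip()


# Made-up entries; the synthetic sponsor has no 'first name' at all.
sponsors = [
    Sponsor("Sam", "Young"),
    Sponsor("", "MScAC Graduate Student"),
    Sponsor("Pat", "Jones"),
    Sponsor("Lee", "Abbott"),
]


def select_order(items):
    """Sort on the 'last name' field, using 'first name' only to break ties."""
    return sorted(items, key=lambda s: (s.last.casefold(), s.first.casefold()))


for s in select_order(sponsors):
    print(s.display())
```

Because the synthetic sponsor's whole description lives in the 'last name' field, it lands between Jones and Young, under M.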

DjangoAppAdaptations written at 01:09:47

2020-06-28

Understanding why Django's goals are not our goals for our web application

A while back I wrote about how Django's goals are probably not our goals for our web application, but at the time, I didn't have a succinct way of talking about why this was the case. Recently I wrote about a realization I'd come to about product code and utility code, where product code is used as part of delivering your business but utility code sits in the background doing other things. That realization gives me a better way to talk about Django and us.

Right from its beginning as a newspaper's publishing platform, Django has been product code and been used for product code. Probably most large sized Python projects (such as Twisted) see themselves this way and are often used this way, with people building big projects that support the business on top of them (after all, you rarely build big projects if you don't need them). As direct and indirect product code, Django is constantly evolving as the needs of people's businesses pull it in various directions. Django mostly has good API stability, but this stability is to enable people with product code that use Django to move faster.

Our Django based Unix account request handling system is not product code, it's utility code. The business rules and processes for authorizing new accounts are set by policies that are extremely stable, and the department doesn't operate in a way where we suddenly change the sort of accounts that we set up. How the department teaches and what sort of programs it offers have changed (although slowly), but that's as far as it goes.

(Looking back, there actually have been some modest policy changes about some aspects of incoming graduate students. We've patched around these in the account request system in some simple ways, which is actually an interesting story of flexibility and adaptation. But the fundamental ideas of who can have an account here and who decides that haven't changed. University departments are like that, and unlike businesses.)

As we've found out, basing utility code on top of product code is not a great path to happiness. This isn't really surprising since the two are pulling in different directions; utility code wants to be static, while product code needs to evolve as business activities do. Django has done a good job of being stable (in its API) despite that, but there is still work to keep up with it (beyond the shift to Python 3), and that work is not what utility code wants.

DjangoIsProductCode written at 00:05:45

2020-05-24

A cheatsheet for Python's pip for how I use it

To save me having to look up or try to remember the various pip arguments and usage the next time I need to do something like update the pyls Python LSP server, here is a cheatsheet for how I use pip.

First, I always use pip with a 'user' install (the --user argument), which installs things in $HOME/.local. On my machines, pip puts binaries in .local/bin and installed Python packages in .local/lib/pythonX.Y; some might appear in .local/libexec if they had compiled portions, but I'm not sure. This is also where running a setup.py with --user puts things, which is unsurprising (I install Django test versions this way).
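
If you want to check where a 'user' install will go for a particular Python, the standard library's site module can report it. This is just a quick sketch; the paths it prints depend on your platform and on settings such as $PYTHONUSERBASE, so I'm not showing any expected output.

```python
import site

# Base directory for 'user' installs (normally $HOME/.local on Linux).
user_base = site.getuserbase()
# Where 'user' installed packages for this specific Python version live.
user_site = site.getusersitepackages()

print("user base:", user_base)
print("user site-packages:", user_site)
```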

To install something, the basic usage is 'pip install --user <package>'. Once packages are installed, I can check for what packages have updates available with 'pip list --user --outdated'. To update a package, it's 'pip install --user --upgrade <package>'. I'm not sure what happens if you leave out the --upgrade.

(Plain 'pip list --user' lists what you have installed and leaves out checking for updates.)

Now that I've looked it up, removing a package is done with 'pip uninstall <package>'. There is 'pip check' to see if all your dependencies are fine, but this has potentially confusing output because it has no '--user' argument and so apparently checks both your packages and the system installed packages; on Ubuntu, the system packages may not have dependencies that 'pip check' is happy with. Similarly, 'pip uninstall' has no --user argument and will happily try to remove system packages instead of your own packages. Also, I don't think removing packages warns you about breaking dependencies.

Really there isn't much to my pip usage and I probably don't normally need a cheatsheet. But sometimes I don't deal with this level of Python stuff for long enough that it starts dropping out of my memory.

(So far, my only use of pip is to keep python-language-server up to date, and I don't necessarily remember to check and update it on a regular basis.)

PipCheatsheetForMe written at 21:11:41

2020-05-20

How I work on Python 2 and Python 3 with the Python Language Server (in GNU Emacs)

Python is one of the programming languages that I usually edit in GNU Emacs. These days, that means using the Language Server protocol through lsp-mode and the pyls Python language server. Back when I first set this up and wrote early notes on using LSP-based editing of Python, I had not solved the problem of wanting to edit both Python 3 and Python 2 based code in my GNU Emacs sessions.

If you want to do this, it turns out to be important to run either the Python 2 pyls or the Python 3 pyls, depending on whether the file you're editing is written in Python 2 or Python 3. This creates several problems that I had to solve, and eventually did with brute force (if I used Python virtual environments, it probably would be easier).

First, it's obviously necessary to install both versions at once. I install pyls into $HOME/.local/ by using pip's '--user' switch, so I created a .local/bin/py2-pyls subdirectory and manually moved the Python 2 version of pyls from .local/bin into it. This requires me to always update the Python 2 version of pyls before the Python 3 version, which is a bit annoying, but that's life in a world of both versions.

To pick the right version of pyls to run, I use a cover script; the cover script uses various heuristic checks to try to figure out if it's being run in a directory with Python 2 or Python 3 code (it doesn't work for the case of mixed code and I'm not sure that would work in general anyway). I don't have $HOME/.local/bin on my $PATH and GNU Emacs will conveniently just try to run 'pyls' to start the Python LSP server, so I put the cover script in my $HOME/bin.

The most important checks the cover script uses are to look for which version of Python seems to be being run by '#!' lines in any *.py files in the current directory, and whether there are any obvious 'print' statements (which indicate Python 2). If you're going to do this, note that some Python programs are installed with their '#!' line being '#!/usr/bin/env python3' or the like, instead of directly running the Python interpreter. I missed this in the first version of the cover script because all of our Python scripts directly use '#!/usr/bin/python3'.

(Also, the default Python version for my cover script is Python 3, because Python 3 is what I'm writing all my new Python code in.)
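
To make the heuristics concrete, here is a rough sketch of them in Python. This is not my actual cover script; the function names, the regular expressions, and the simple majority vote are all assumptions made for illustration.

```python
import re
from pathlib import Path

# A 'print' followed by whitespace and something other than '(' or '='
# looks like a Python 2 print statement.
PRINT_STMT_RE = re.compile(r"^\s*print\s+[^(\s=]", re.MULTILINE)


def guess_from_source(text):
    """Return 2 or 3 if the source gives a hint, or None if it doesn't."""
    lines = text.splitlines()
    if lines and lines[0].startswith("#!"):
        # Covers both '#!/usr/bin/python3' and '#!/usr/bin/env python3'.
        m = re.search(r"python([23])", lines[0])
        if m:
            return int(m.group(1))
    if PRINT_STMT_RE.search(text):
        return 2
    return None


def guess_for_directory(directory=".", default=3):
    """Let every *.py file in the directory vote; ties go to the default."""
    votes = {2: 0, 3: 0}
    for path in sorted(Path(directory).glob("*.py")):
        try:
            text = path.read_text(errors="replace")
        except OSError:
            continue
        guess = guess_from_source(text)
        if guess is not None:
            votes[guess] += 1
    if votes[2] > votes[3]:
        return 2
    if votes[3] > votes[2]:
        return 3
    return default


print(guess_for_directory("."))
```

A real cover script would then exec the matching pyls; that part is left out here because it depends entirely on where you put the two versions.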

All of this is basically a hack but it works pretty well for me, especially in combination with how I'm dealing with my Python indentation problem. The result is a pretty seamless LSP-based Python editing experience in GNU Emacs where everything basically works. I'm not sure I'm sold on the whole LSP-based experience for editing Python, but that's not the fault of my hacks.

Python2And3LanguageServer written at 23:44:07

2020-04-30

The afterlife of Python 2

Python 2 is officially dead now (cf), with the python.org release of 2.7.18, the last release of Python 2 (also). That means that what happens now is its afterlife, because Python 2 being 'dead' doesn't mean that it's gone, and Python 2 is shaping up to have quite an active and extended afterlife.

Most obviously, various current Unix distributions have versions of Python 2 in them, including the just released Ubuntu 20.04 LTS. This means that Ubuntu will be supporting Python 2 through early 2025 (as much as they support anything, but it probably means making security patches, which can be borrowed by other people). Red Hat is similarly supporting the Python 2 in RHEL 8 until June 2024 (per here). Debian's situation with current and future support of Python 2 is not entirely clear, but my impression of the situation is that on the one hand the Debian Python team wants to drop it but on the other hand other people may step up to support the basic Python 2 interpreter as a Debian package and so keep it available even beyond early 2025.

Beyond that, as pointed out in The final Python 2 release marks the end of an era (via), this is merely the final release of the main CPython implementation of Python 2. PyPy, probably the most major alternate Python implementation, has said that they will be supporting Python 2 for as long as the project exists (also). Since a great deal of the Python standard library is written in Python, it's likely that any security fixes for it that PyPy makes could be readily adopted into CPython (and vice versa, while people continue to support CPython). There's also Tauthon, a fork of CPython 2. I'm not all that interested in its backports of new Python 3 features, which I wrote about back when it was 'Python 2.8', but I'd be perfectly happy to use it as a well supported way to keep having a 'python2' after 2024.

As a practical matter I expect that Python 2 code will be running for at least a decade more in various places, and people will find some way to run it even if it's building Python 2.7.18 from source themselves. Hopefully this will mostly be in places where the security of (C)Python isn't any more relevant than the security properties of a C compiler (or people switch to PyPy), but I'm not counting on that.

(For a start, I wonder how many people are in the same situation with Django applications as we are with ours, where it works fine with Python 2 but lacks the tests and other things necessary to be confident about moving it to Python 3.)

Python2Afterlife written at 21:08:39

2020-04-29

Dealing with my worries about Django and HTTP Basic Authentication

Last year, I attempted to upgrade our Django web app from Django 1.10 to 1.11, but ran into rather mysterious CSRF validation failures that caused me to revert back to 1.10. We have stayed there since, and the potential for this issue resurfacing has been a major blocker for moving to more recent Django versions or porting it to Python 3. I was lucky that moving from Django 1.10 to 1.11 required neither significant code changes nor a database migration, and so could be rolled back; a more major update would have left us basically marooned and having to debug Django itself, in production.

Ever since then, one of the growing focuses of my suspicions has been an interaction between our use of Apache HTTP Basic Authentication and Django's idea of a 'session' in the face of that (which apparently interacts with its CSRF protection). I rather suspect that not very many people use HTTP Basic Authentication with Django, which would allow bugs to linger here undetected for some time, and certainly Django having to magically materialize fake sessions for users authenticated this way seems like a potential source of fun problems.

Even reproducing the problem is, well, a problem, because it only manifests in a full production setup; setting up an entire duplicate of our app that runs under Django 1.11 is more than a bit tricky and I haven't done it so far. While thinking about this recently, though, it belatedly struck me that I probably don't need to go as far as a copy of our web app. If there really is a general CSRF validation issue when you're running under HTTP Basic Authentication, it should reproduce with a very basic Django app that just displays a form for you to update. I should be able to put that together fairly easily and it's easy to run it beside our web app since they have nothing to do with each other (unlike two copies of the real app, which if left alone would be trying to interact with the same things).

If I can reproduce our problems with a test app under Django 1.11, this gives me a path forward to Python 3 and a current version of Django because I can finally have confidence that we won't run into this issue in a big upgrade. Or I can find out in advance that we will run into the issue, which is much better than doing a lot of work, rolling forward, and then having it blow up badly.

(I have more thoughts about the future of our web app, but they don't fit in the margins of this entry. The short version is that we'd like to not have to do anything to our web app, but that doesn't seem very viable, cf.)

DjangoBasicAuthWorry written at 01:03:48

2020-03-29

I set up Python program options and arguments in a separate function

Pretty much every programming language worth using has a standard library or package for parsing command line options and arguments, and Python is no exception; the standard for doing it is argparse. Argparse handles a lot of the hard work for you, but you still have to tell it what your command line options are, provide help text for things, and so on. In my own Python programs, I almost always do this setup in a separate function that returns a fully configured argparse.ArgumentParser instance.

My standard way of writing all of it looks like this:

def setup():
  p = argparse.ArgumentParser(usage="...",
                              ....)
  p.add_argument(...)
  p.add_argument(...)

  return p

def main():
  p = setup()
  opts = p.parse_args()
  ...

I don't like putting all of this directly in my main() because in most programs I write, this setup work is long and verbose enough to obscure the rest of what main() is doing. The actual top level processing and argument handling is the important thing in main(), not the setup of options, so I want all of the setup elsewhere where it's easy to skip over. In theory I could put it at the module level, not in a function, but I have a strong aversion to running code at import time. Among other issues, if I got something wrong I would much rather have the stack trace clearly say that it's happening in setup() than something more mysterious.

Putting it in a function that's run explicitly can have some advantages in specialized situations. For instance, it's much more natural to use complex logic (or run other functions) to determine the default arguments for some command line options. For people who want to write tests for this sort of thing, having all of the logic in a function also makes it possible to run the function repeatedly and inspect the resulting ArgumentParser object.

(I think it's widely accepted that you shouldn't run much or any code at import time by putting it in the top level. But setting up an ArgumentParser may look very much like setting up a simple Python data structure like a map or a list, even though it's not really.)
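
Here is a filled-in, runnable sketch of the same pattern. The specific options (-j, -v, and the file arguments) are invented for the example; the parts that matter are the default that is computed when setup() runs and the fact that setup() can be called more than once.

```python
import argparse
import os


def setup():
    """Return a fully configured parser; nothing is parsed here."""
    p = argparse.ArgumentParser(description="Example program (hypothetical).")
    # A default computed when setup() is called, not at import time.
    p.add_argument("-j", "--jobs", type=int, default=os.cpu_count() or 1,
                   help="number of parallel jobs (default: %(default)s)")
    p.add_argument("-v", "--verbose", action="store_true",
                   help="report what is being done")
    p.add_argument("files", nargs="*", help="files to process")
    return p


def main(argv=None):
    # With argv=None, argparse falls back to sys.argv[1:].
    opts = setup().parse_args(argv)
    if opts.verbose:
        print("jobs:", opts.jobs, "files:", opts.files)
    return 0


# A test can call setup() repeatedly and feed each parser its own arguments.
opts = setup().parse_args(["-j", "2", "a.txt"])
print(opts.jobs, opts.files)
```

Passing an explicit argument list to parse_args() is what makes this easy to exercise without touching the real command line.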

ArgparseSetupWhere written at 00:22:07

2020-03-03

One impact of the dropping of Python 2 from Linux distributions

Due to uncertainty over the future of the Python 2 interpreter in future Linux distributions, I've been looking at some of our Python 2 code, especially the larger programs. This caused me to express some views over on Twitter, which came out long enough that I'm recycling them here with additional commentary:

Everyone's insistence on getting rid of Python 2 is magically transforming all of this perfectly functional and useful Python 2 code we have from an asset to a liability. You can imagine how I feel about that.

Functioning code that you don't have to maintain and that just works is an asset; it sits there, doing a valuable job, and requires no work. Code that you have to do significant work on just so that it doesn't break (not to add any features) is a liability; you have to do work and inject risk and you get nothing for it.

Some code is straightforward to lift to Python 3 because it doesn't do anything complicated. Some code is not like that:

Today's 'what am I going to do about this' Python 2 code is my client implementation of the Sendmail milter protocol, which is all about manipulating strings as binary over a network connection. I guess I shotgun b"..." and then start guessing.

My milter implementation has been completely stable since written in Python 2 in 2011. Now I have to destabilize it because people are taking Python 2 away.

(I do not have tests. Tests would require another milter implementation that was known to be correct.)

(What I meant by the end of the first tweet is making various strings into bytestrings, especially protocol literals, and trying to push that through the protocol handling.)

As a side note, testing protocol implementations is hard when you don't have some sort of reference version that you can embed in your tests, even if you implement both the client and the server side. Talking to yourself doesn't ensure that you haven't made some mistake, either in the initial implementation or in a translation into the Python 3 world of bytestrings and Unicode strings and trying to handle network IO in that world and so on.

(For instance, since UTF-8 can encode every codepoint you can put into a Unicode string, including control characters and so on, you could write an encoder and decoder that actually operated on Unicode strings without you realizing, then have Python 3's magic string handling convert them to UTF-8 over the wire as you sent them back and forth between yourself during tests. Your implementation would talk to itself, but not to any outside version that did not UTF-8 encode what were supposed to be raw bytes. You could even pass tests against golden pre-encoded protocol messages if they were embedded in your Python test code and you forgot that you needed to turn them into bytestrings.)
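
A tiny demonstration of this trap, using a made-up send() and receive() pair rather than anything from my milter code:

```python
# The protocol payload is supposed to be the single raw byte 0xFF.
as_text = "\xff"    # a one-character Unicode string (U+00FF)
as_bytes = b"\xff"  # the actual byte the protocol wants


def send(msg):
    """Hypothetical sender that 'helpfully' encodes str as UTF-8."""
    return msg.encode("utf-8") if isinstance(msg, str) else msg


def receive(wire):
    """Hypothetical matching receiver that decodes UTF-8 back to str."""
    return wire.decode("utf-8")


# Talking to ourselves round-trips perfectly...
print(receive(send(as_text)) == as_text)
# ...but what went over the wire is two bytes, not the one byte
# that an outside implementation expects.
print(send(as_text), send(as_bytes))
```

The self-test passes, yet the first value on the last line is b'\xc3\xbf' while a correct implementation would have sent b'\xff'.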

I also had an opinion on the idea that we've known this for a while and it's just a cost of using Python:

Python 2 is only legacy through fiat (multiple fiats, both the main CPython developers and then OS distributions). Otherwise it is perfectly functional and almost certainly completely secure, and would keep running fine for a great deal longer.

Just because software is not being updated doesn't mean that it stops working. If people would leave Python 2 alone (and keep it available in Linux distributions as a low-support or unsupported package, like so many others), it would likely keep going on fine for years, but because they won't, our Python 2 code is steadily being converted from an asset to a liability. Of course, part of the fun is that we don't even know for sure if people will be getting rid of the Python 2 interpreter itself, much less a timetable for it.

(Maybe the current statements from Debian and Ubuntu are supposed to answer that question, but if so they're not clear to me and they certainly don't give a timeline for when the Python 2 interpreter itself will be gone.)

PS: All of this is completely separate from the virtues of Python 3 for new code, where I default to it over some other options in our environment.

Python2DroppingImpact written at 21:36:15

2020-02-04

What 'is' translates to in CPython bytecode

The main implementation of Python, usually called CPython, translates Python source code into bytecode before interpreting it. How this translation happens can make some things fast, such as how local variables are implemented. When I wrote in yesterday's entry that having 'is' as a keyword can make it faster than if it was a built-in function because as a keyword it doesn't have to be looked up all the time just in case you changed it, I wondered how CPython actually translated 'a is b' to bytecode. The answer turns out to be somewhat more interesting than I expected.

(Bytecode can be most conveniently inspected with the dis module, and the module's documentation helpfully explains a fair bit about what the disassembled representation means.)

Let's define a little function:

def f(a):
   return a is 10

Now we can disassemble this with 'dis.dis(f.__code__)' and get:

2   0 LOAD_FAST      0 (a)
    2 LOAD_CONST     1 (10)
    4 COMPARE_OP     8 (is)
    6 RETURN_VALUE

CPython bytecodes can have an auxiliary value associated with them (shown here as the rightmost column, along with their meaning for the particular bytecode operation). Rather than have separate bytecodes for different comparison operators, all comparisons are implemented with a single bytecode, COMPARE_OP, that picks which comparison to do based on the auxiliary value. The 'is' comparison is just the same as any other; if we used 'return a > 10' in our function, the only difference in the bytecode would be the auxiliary value for COMPARE_OP (it would become 4 instead of 8).

The next obvious question to ask is how 'is not' is implemented, and the answer is that it's another comparison type. If we change our function to use 'is not', the only change is this:

    4 COMPARE_OP     9 (is not)

CPython has one last trick up its sleeve. If we write 'not a is 10', CPython specifically recognizes this and rather than translating it as a COMPARE_OP followed by a UNARY_NOT, translates it straight into the 'is not' comparison. This isn't a general transformation, for various reasons; 'return not a > 10' won't be similarly translated to the bytecode equivalent of 'return a <= 10'.

(CPython does go the extra distance to translate 'not a is not 10' into 'a is 10'. I'm a little bit surprised, since I wouldn't expect people to write that very often.)
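
If you want to repeat this comparison on your own Python, here is a small sketch using dis.get_instructions(). The helper and function names are made up, and I compare two arguments instead of using a literal 10 so that newer versions don't warn about 'is' with a literal. The exact opcodes are version dependent; as far as I know, CPython 3.9 and later use a dedicated IS_OP bytecode instead of the COMPARE_OP shown above, so don't expect the output to match this entry exactly.

```python
import dis


def is_same(a, b):
    return a is b


def is_not_same(a, b):
    return a is not b


def not_is_same(a, b):
    return not a is b


def ops(func):
    """Return (opname, argrepr) pairs for a function's bytecode."""
    return [(ins.opname, ins.argrepr) for ins in dis.get_instructions(func)]


# Print the three instruction lists so they can be compared by eye.
for func in (is_same, is_not_same, not_is_same):
    print(func.__name__, ops(func))
```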

PS: One advantage of 'is' being a keyword is that it allows CPython to do this transformation, since CPython always knows what 'is' does here. It wouldn't be safe to transform a hypothetical 'not isidentity(a, 10)' in the same way, since what isidentity does could always be changed by rebinding the name.

IsCPythonBytecode written at 21:08:03

The place of the 'is' syntax in Python

Over on Twitter, I said:

A Python cold take (given how long it's taken me to arrive at it): 'is' should not be a keyword, it should be a built-in function that you're discouraged from using unless you really know what you're doing. As a keyword it's too tempting.

Python has two versions of equality, in ==, which is plain equality, and is, which is object identity; 'a is b' is true if and only if a and b refer to the same object. Since the distinction between names and values is fundamental to Python, we definitely need a way of testing this (for example, to explore a puzzling mistake I once made). However, I'm not so sure it should be a language keyword.

The issue with 'is' as a language keyword is that it makes using object identity temptingly easy; after all, there's a keyword for it, part of the language syntax. It's as if you're supposed to use it. The first problem with this is simply that object identity is a relatively advanced Python concept, one that's a bit tricky to get your head around. Python code that genuinely needs to use is instead of == is almost invariably doing something tricky, and we should generally avoid inviting people to routinely write code that at least looks like tricky code. The second problem is that in practice object identity can be tricky because Python implementations (especially CPython) can quietly make objects be the same thing (and thus 'a is b' will be true) when you didn't expect them to be. It's possible to write safe code that uses 'is', but you need to know a fair bit about what you're doing; perfectly sensible looking code can conceal subtle bugs.

(When Python will give you the same object for two apparently different things depends on the specific version of (C)Python and also sometimes the exact way that you created the objects. It can get quite weird and involved.)
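
A short illustration, with no expected output given for the 'is' lines because the results really do depend on the implementation and version:

```python
# Two integers built at runtime from the same text.
a = int("1000")
b = int("1000")
print(a == b)  # equality of values: this is what you almost always want
print(a is b)  # identity: implementation-dependent, don't rely on it

# CPython has historically cached small integers, so the same experiment
# can give a different identity answer just because the value is small.
c = int("5")
d = int("5")
print(c == d)
print(c is d)

# The common safe use of 'is': comparing against a singleton such as None.
value = None
print(value is None)
```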

There are at least two reasons I can think of to still have is as a keyword. The first is that as a keyword, what it does is guaranteed by the language and is not subject to being modified by people who play games with namespaces in the way that, say, isinstance() can be changed. Changing what isinstance() does by defining your own version is probably a terrible idea, but you can do it if you feel the urge. Meanwhile, is is beyond the reach of anything but bytecode rewriting. The second is that because is is part of the language and isn't subject to being changed, it can be implemented in a way that makes it faster than a built-in function. Built-in functions need to go through a global name lookup when they're used, just in case, while is can be just done directly since it's part of the language.

(Local variables are fast because they avoid this lookup.)
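
To show what 'playing games with namespaces' means in practice, here is a deliberately bad sketch that temporarily replaces the built-in isinstance(). The helper function name is made up, and you should not do this in real code.

```python
import builtins


def looks_like_int(value):
    # 'isinstance' is looked up by name every time this runs.
    return isinstance(value, int)


real_isinstance = builtins.isinstance
before = looks_like_int(3)

# Rebind the built-in name to something that lies, then put it back.
builtins.isinstance = lambda obj, cls: False
try:
    during = looks_like_int(3)
finally:
    builtins.isinstance = real_isinstance

after = looks_like_int(3)
print(before, during, after)

# There is no equivalent name to rebind for 'is'.
x = object()
print(x is x)
```

The function's own code never changed, but its answer did while the name was rebound, which is exactly the sort of thing that can't happen to 'is'.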

PS: Of course by now all of this is entirely theoretical. It's entirely too late for Python to drop 'is' as a keyword, and even thinking about it is a bit silly. But I apparently twitch a bit when I see 'is' casually used in code examples, and that's sort of what inspired the tweet that led to this entry.

IsSyntaxPlace written at 00:28:49



This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.