2022-01-27
The Linux kernel, simultaneous multithreading, and process scheduling
Back in an earlier entry on simultaneous multithreading, I said that I expected operating systems, Linux included, to generally schedule processes on to a single CPU of each core before they started doubling up processes on two CPUs of a single core. This raises the question of whether the Linux process scheduler is SMT-aware and what it seems to do in practice.
The answer to the first question is that it clearly is SMT aware, although I haven't traced through the code to see exactly what effects it has. There is a SCHED_SMT kernel configuration option that affects various things, and there are various comments about this (and code) in kernel/sched/fair.c and kernel/sched/topology.c, among other spots. The comments I've skimmed through suggest that the kernel does all of the obvious things with SMT pairs, like considering the caches 'hot' for a process on both CPUs if it was running recently on one of them.
For how things go in practice, a fairly current Linux kernel (the Fedora 34 version of 5.15.15) running one CPU consuming process per core (not CPU) seems to mostly distribute CPU load the way I'd expect. Using tools like htop and mpstat doesn't let me see the CPU (or core) scheduling history of individual processes, but the kernel doesn't do anything obviously weird looking. CPU utilization does hop from CPU to CPU from time to time, which I suspect is an artifact of other running processes preempting the CPU hogs off their original CPU and on to the other CPU for that core. A CPU load that makes a bunch of system calls (basically an infinite loop in the shell) looks more erratic in htop; the CPU load appears to bounce around a lot and there's a bunch of system time involved too.
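One rough way to approximate a per-process CPU history is to poll /proc/<pid>/stat, whose 39th field ('processor') is the CPU the process last ran on. The following is only a sketch under that assumption. The function names, sample count, and polling interval are made up for illustration, and polling will miss any migrations that happen between samples.

```python
import sys
import time


def last_cpu_from_stat(stat_line):
    """Return field 39 ('processor') of a /proc/<pid>/stat line.

    The command name (field 2) is wrapped in parentheses and may itself
    contain spaces or ')', so only split what follows the final ')'.
    """
    rest = stat_line[stat_line.rindex(")") + 1:].split()
    # rest[0] is field 3 (state), so field 39 sits at index 36.
    return int(rest[36])


def sample_last_cpu(pid, samples=20, interval=0.5):
    """Poll the CPU a process last ran on, yielding one int per sample."""
    for _ in range(samples):
        with open(f"/proc/{pid}/stat") as f:
            yield last_cpu_from_stat(f.read())
        time.sleep(interval)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 lastcpu.py PID
    print(" ".join(str(cpu) for cpu in sample_last_cpu(int(sys.argv[1]))))
```

Run against the PID of one of the CPU hogs, this prints a sequence of CPU numbers that shows roughly how often the process moves.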
My conclusion from looking is that the Linux kernel in a normally operating system doesn't do anything glaringly obvious about what CPU gets used across cores. For instance, it's not like the first CPU of each core gets used almost all of the time and the load only spills over to the second CPU during high load periods. Even with very low load, every CPU can get used from time to time (you can see this in both htop and mpstat).
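Reading per-CPU numbers from htop or mpstat as per-core numbers requires knowing which logical CPUs are SMT siblings of each other. A minimal sketch of working that out, assuming the usual sysfs topology files (/sys/devices/system/cpu/cpuN/topology/thread_siblings_list) are present; the helper names here are invented for illustration:

```python
from pathlib import Path


def parse_cpu_list(text):
    """Expand a kernel CPU list such as '0,4' or '0-1,8-9' into sorted ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def group_by_core(siblings_by_cpu):
    """Collapse per-CPU sibling lists into one tuple of CPUs per core."""
    cores = {tuple(parse_cpu_list(text)) for text in siblings_by_cpu.values()}
    return sorted(cores)


def read_siblings(sysfs_root="/sys/devices/system/cpu"):
    """Read thread_siblings_list for every cpuN directory that has one."""
    result = {}
    pattern = "cpu[0-9]*/topology/thread_siblings_list"
    for path in Path(sysfs_root).glob(pattern):
        cpu = int(path.parent.parent.name[3:])  # directory name is 'cpuN'
        result[cpu] = path.read_text()
    return result


if __name__ == "__main__":
    for core in group_by_core(read_siblings()):
        print("core with CPUs:", ",".join(str(c) for c in core))
```

How sibling CPUs are numbered varies between machines, so the pairs are not necessarily adjacent CPU numbers.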
(In short, it's all boring, with no surprises or interesting things that I could see.)
Django and Apache HTTP Basic Authentication (and REMOTE_USER)
We have a Django application, and part of it exists behind Apache HTTP Basic Authentication. For reasons beyond the scope of this entry, I was recently rediscovering some things about how Django interacts with Apache HTTP Basic Authentication, and so I want to write them down for myself before I forget them again.
First, the starting point in the Django documentation for this is not to search for 'HTTP Basic Authentication' or anything like that, but for the howto on authenticating with REMOTE_USER, which is the environment variable that Apache injects when it's already authenticated something. I believe that if you search for 'Django' with 'Basic Authentication' on search engines, you tend to get information about making Django or Django-related things actually perform the server side of HTTP Basic authentication itself. This is fair enough but can be confusing.
Second, you only need to configure Django itself to authenticate with REMOTE_USER if you want to use Django's own authentication for something, such as access and authorization in its admin site. It's perfectly valid (although potentially annoying) to authenticate and limit access to your Django site (or parts of it) in your Apache configuration with Apache's HTTP Basic Authentication but have a separate Django login step to access the Django admin site or even parts of your application (which will then be tracked with cookies and so on). If you want to do this, you don't want to add Django's RemoteUserMiddleware and so on into your Django settings.
(You'll have to manage Apache users and Django users separately, passwords included, and they won't be the same thing. This might wind up being confusing.)
If you do have Django authenticating with REMOTE_USER, you need your Django database superuser to be something you can authenticate with through Apache. If you cleverly set your database superuser to 'admin' but you have no 'admin' in your Basic Auth database, you will be sad. It's possible to get yourself out of this in a couple of ways, but it's better to avoid it in the first place.
(When you do have Django authenticating this way, every person who uses your Django app through HTTP Basic Authentication will wind up with an entry in the Django 'User' table. Purging old logins that no longer exist is up to you, if you care. For people who you want to be able to use the Django admin site, you need to set them as at least 'Staff' in the Django User table. You can set them as database superusers too.)
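For reference, wiring Django up to REMOTE_USER is a settings change along the lines described in Django's REMOTE_USER howto. This is a sketch of a settings.py fragment, not a complete settings file; the other middleware entries shown are just common defaults and your own list will differ.

```python
# settings.py fragment (a sketch; your real MIDDLEWARE list will be longer)
MIDDLEWARE = [
    "django.contrib.sessions.middleware.SessionMiddleware",
    "django.middleware.csrf.CsrfViewMiddleware",
    "django.contrib.auth.middleware.AuthenticationMiddleware",
    # RemoteUserMiddleware must come after AuthenticationMiddleware.
    "django.contrib.auth.middleware.RemoteUserMiddleware",
]

# Replacing the default ModelBackend means Django trusts REMOTE_USER
# instead of checking passwords against its own User table.
AUTHENTICATION_BACKENDS = [
    "django.contrib.auth.backends.RemoteUserBackend",
]
```

If you don't want a User entry created automatically for every new login, Django's documentation describes subclassing RemoteUserBackend and setting its create_unknown_user attribute to False.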
It's not necessary to use Django's REMOTE_USER support in order to make use of the authentication information yourself, as long as Apache has HTTP Basic Authentication active. You can retrieve the login name from the $REMOTE_USER environment variable and look it up in your own 'User' table by hand, as we do. You may or may not want to automatically create new entries for new users, the way Django does by default. We don't because new people require some additional configuration on our side.
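A minimal sketch of the by-hand approach, with some assumptions: in a Django view the value normally shows up as request.META['REMOTE_USER'] (the WSGI environment) rather than in the process environment, and the known_users mapping and function name here are hypothetical stand-ins for your own 'User' table and lookup code.

```python
def lookup_remote_user(meta, known_users):
    """Return our own record for the Apache-authenticated login, or None.

    meta is the request's environment mapping (request.META in a Django
    view). known_users is a hypothetical stand-in for a local user table.
    Unknown logins are deliberately not created automatically.
    """
    login = meta.get("REMOTE_USER")
    if not login:
        # Apache did not authenticate this request (or isn't in front of us).
        return None
    return known_users.get(login)


# Example with plain dictionaries standing in for the request and the table.
users = {"cks": {"name": "cks", "is_staff": True}}
print(lookup_remote_user({"REMOTE_USER": "cks"}, users))
print(lookup_remote_user({"REMOTE_USER": "newperson"}, users))
```

A view that gets None back would then typically return an error page explaining that the person isn't set up yet.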
The corollary to this is that you can use and test your entire site under Apache HTTP Basic Authentication without having Django properly wired up to use REMOTE_USER, and never notice. I believe that this can actually matter, because I believe that Django does some things with sessions differently when you have the RemoteUser* things enabled, and this interacts with Django's CSRF protections (which we've also had mysterious problems with).