2024-07-22
The challenges of working out how many CPUs your program can use on Linux
In yesterday's entry, I talked about our giant (Linux) login server and how we limit each person to only a small portion of that server's CPUs and RAM. These limits sometimes expose issues in how programs attempt to work out how many CPUs they have available so that they can automatically parallelize themselves, or parallelize their build process. This crops up even in areas where you might not expect it; for example, both the Go and Rust compilers attempt to parallelize various parts of compilation using multiple threads within a single compiler process.
In Linux, there are at least three different ways to count the
number of 'CPUs' that you might be able to use. First, your program
can read /proc/cpuinfo and count up how many online CPUs there are;
code that does this on our giant login server will get 112 CPUs.
Second, your program can call sched_getaffinity() and count how many
bits are set in the result; this will detect if you've been limited
to a subset of the CPUs by a tool such as taskset(1). Finally, you
can read /proc/self/cgroup and then try to find your cgroup to see
if you've been given cgroup-based resource limits. These limits
won't be phrased in terms of the number of CPUs, but you can work
backward from any CPU quota you've been assigned.
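As a concrete illustration of the first two approaches, here is a
minimal Go sketch; the golang.org/x/sys/unix package and
sched_getaffinity() are real, but the helper names are mine and the
error handling is minimal:

  package main

  import (
      "fmt"
      "os"
      "strings"

      "golang.org/x/sys/unix"
  )

  // cpusFromProcCpuinfo counts "processor" entries in /proc/cpuinfo,
  // which gives the total number of online CPUs on the machine.
  func cpusFromProcCpuinfo() (int, error) {
      data, err := os.ReadFile("/proc/cpuinfo")
      if err != nil {
          return 0, err
      }
      n := 0
      for _, line := range strings.Split(string(data), "\n") {
          if strings.HasPrefix(line, "processor") {
              n++
          }
      }
      return n, nil
  }

  // cpusFromAffinity counts the CPUs in our scheduling affinity
  // mask, which reflects restrictions applied with taskset(1) and
  // the like.
  func cpusFromAffinity() (int, error) {
      var set unix.CPUSet
      if err := unix.SchedGetaffinity(0, &set); err != nil {
          return 0, err
      }
      return set.Count(), nil
  }

  func main() {
      total, _ := cpusFromProcCpuinfo()
      allowed, _ := cpusFromAffinity()
      fmt.Printf("online CPUs: %d, CPUs in affinity mask: %d\n", total, allowed)
  }

If you run this under something like 'taskset -c 0-3', the second
count drops to 4 while the first stays at the machine's full CPU
count.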
In a shell script, you can do the second with nproc, which will
also give you the full CPU count if there are no particular limits.
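(In a build script this typically shows up as something like
'make -j"$(nproc)"', which parallelizes the build to however many
CPUs nproc reports.)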
As far as I know, there's no straightforward API or program that
will give you information on your cgroup CPU quota, if you have
one. The closest you seem to be able to come is to use cgget (if
it's even installed), but you have to walk all the way back up the
cgroup hierarchy to check for CPU limits; they're not necessarily
visible in the cgroup (or cgroups) listed in /proc/self/cgroup.
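To make that hierarchy walk concrete, here is a rough Go sketch of
working an effective CPU count out of cgroup v2 CPU quotas. It
assumes a pure cgroup v2 system with the usual /sys/fs/cgroup
mount; the function names are mine, and it ignores cgroup v1,
cpuset restrictions, and various other corner cases:

  package main

  import (
      "fmt"
      "os"
      "path/filepath"
      "strconv"
      "strings"
  )

  // ownCgroup returns our cgroup v2 path from /proc/self/cgroup,
  // ie the part after "0::" on that file's cgroup v2 line.
  func ownCgroup() (string, error) {
      data, err := os.ReadFile("/proc/self/cgroup")
      if err != nil {
          return "", err
      }
      for _, line := range strings.Split(strings.TrimSpace(string(data)), "\n") {
          if strings.HasPrefix(line, "0::") {
              return strings.TrimPrefix(line, "0::"), nil
          }
      }
      return "", fmt.Errorf("no cgroup v2 entry found")
  }

  // effectiveCPUs walks from our cgroup up to the cgroup root,
  // reading cpu.max at each level and keeping the most restrictive
  // quota it finds. It returns 0 if no level imposes a CPU quota.
  func effectiveCPUs(cgroup string) float64 {
      limit := 0.0
      dir := filepath.Join("/sys/fs/cgroup", cgroup)
      for strings.HasPrefix(dir, "/sys/fs/cgroup") {
          data, err := os.ReadFile(filepath.Join(dir, "cpu.max"))
          dir = filepath.Dir(dir)
          if err != nil {
              continue // the root cgroup has no cpu.max, for example
          }
          fields := strings.Fields(string(data))
          if len(fields) != 2 || fields[0] == "max" {
              continue // "max" means no quota at this level
          }
          quota, _ := strconv.ParseFloat(fields[0], 64)
          period, _ := strconv.ParseFloat(fields[1], 64)
          if cpus := quota / period; limit == 0 || cpus < limit {
              limit = cpus
          }
      }
      return limit
  }

  func main() {
      cg, err := ownCgroup()
      if err != nil {
          fmt.Println("cannot determine our cgroup:", err)
          return
      }
      if cpus := effectiveCPUs(cg); cpus > 0 {
          fmt.Printf("cgroup CPU quota is about %.1f CPUs\n", cpus)
      } else {
          fmt.Println("no cgroup CPU quota found")
      }
  }

If some cgroup above you has a cpu.max of, say, '200000 100000',
this works out to two CPUs' worth of quota.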
Given the existence of nproc and sched_getaffinity() (and how using
them is easier than reading /proc/cpuinfo), I think a lot of scripts
and programs will notice CPU affinity restrictions and restrict
their parallelism accordingly. My experience suggests that almost
nothing is looking for cgroup-based restrictions. This occasionally
creates amusing load average situations on our giant login server
when a program sees 112 CPUs 'available' and promptly tries to use
all of them, resulting in the user's CPU quota being massively
over-subscribed and the load average going quite high without
actually affecting anyone else.
(I once did this myself on the login server by absently firing up a heavily parallel build process without realizing I was on the wrong machine for it.)
PS: The corollary of this is that if you want to limit the multi-CPU
load impact of something, such as building Firefox from source,
it's probably better to use taskset(1) than to do it with systemd
features, because it's much more likely that things will notice the
taskset limits and not flood your process table and spike the load
average. This will work best on single-user machines, such as your
desktop, where you don't have to worry about coordinating taskset
CPU ranges with anyone or anything else.
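(As a concrete illustration, something like 'taskset -c 0-7 make -j8'
confines the build and all of its child processes to CPUs 0 through
7; the particular CPU range and build command are just examples.)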