Wandering Thoughts archives

2024-07-22

The challenges of working out how many CPUs your program can use on Linux

In yesterday's entry, I talked about our giant (Linux) login server and how we limit each person to only a small portion of that server's CPUs and RAM. These limits sometimes expose issues in how programs attempt to work out how many CPUs they have available so that they can automatically parallelize themselves, or parallelize their build process. This crops up even in areas where you might not expect it; for example, both the Go and Rust compilers attempt to parallelize various parts of compilation using multiple threads within a single compiler process.

In Linux, there are at least three different ways to count the number of 'CPUs' that you might be able to use. First, your program can read /proc/cpuinfo and count up how many online CPUs there are; if code does this in our giant login server, it will get 112 CPUs. Second, your program can call sched_getaffinity() and count how many bits are set in the result; this will detect if you've been limited to a subset of the CPUs by a tool such as taskset(1). Finally, you can read /proc/self/cgroup and then try to find your cgroup to see if you've been given cgroup-based resource limits. These limits won't be phrased in terms of the number of CPUs, but you can work backward from any CPU quota you've been assigned.
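The first two of these are easy to sketch in a few lines of Python (the third, cgroup quotas, is messier; more on that below). This is a minimal illustration, not a robust implementation:

```python
import os

# Method 1: count 'processor' entries in /proc/cpuinfo, which gives
# you the number of online CPUs on the machine.
def cpus_from_cpuinfo():
    with open("/proc/cpuinfo") as f:
        return sum(1 for line in f if line.startswith("processor"))

# Method 2: count the bits set in our scheduling affinity mask; this
# notices restrictions applied with taskset(1) or sched_setaffinity().
def cpus_from_affinity():
    return len(os.sched_getaffinity(0))

print(cpus_from_cpuinfo(), cpus_from_affinity())
```

On an unrestricted machine the two numbers agree; under taskset(1) the second one shrinks while the first stays at the full hardware count.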

In a shell script, you can do the second with nproc, which will also give you the full CPU count if there are no particular limits. As far as I know, there's no straightforward API or program that will give you information on your cgroup CPU quota if there is one. The closest you can come appears to be cgget (if it's even installed), but you have to walk all the way back up the cgroup hierarchy to check for CPU limits; they're not necessarily visible in the cgroup (or cgroups) listed in /proc/self/cgroup.
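To make that walk concrete, here is a sketch of checking for a CPU quota at every level of your cgroup hierarchy, assuming cgroup v2 with the unified hierarchy mounted at /sys/fs/cgroup (cgroup v1 systems lay this out differently):

```python
import os

# Walk from our own cgroup back up toward the root, checking cpu.max
# at each level; the effective limit is the smallest quota we find.
def effective_cpu_quota():
    with open("/proc/self/cgroup") as f:
        for line in f:
            # cgroup v2 entries have the form '0::/some/cgroup/path'
            if line.startswith("0::"):
                cg = line.strip()[3:]
                break
        else:
            return None   # no cgroup v2 entry at all
    limit = None
    while cg:
        try:
            with open("/sys/fs/cgroup" + cg + "/cpu.max") as f:
                quota, period = f.read().split()
            if quota != "max":
                cpus = int(quota) / int(period)
                limit = cpus if limit is None else min(limit, cpus)
        except OSError:
            pass
        cg = cg.rpartition("/")[0]   # step up one level
    return limit                     # None means 'no CPU quota found'
```

The quota is expressed as a bandwidth (quota microseconds per period microseconds), so the division is what converts it back into an (often fractional) 'number of CPUs'.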

Given the existence of nproc and sched_getaffinity() (and how using them is easier than reading /proc/cpuinfo), I think a lot of scripts and programs will notice CPU affinity restrictions and restrict their parallelism accordingly. My experience suggests that almost nothing is looking for cgroup-based restrictions. This occasionally creates amusing load average situations on our giant login server when a program sees 112 CPUs 'available' and promptly tries to use all of them, resulting in its CPU quota being massively over-subscribed and the load average going quite high without actually affecting anyone else.

(I once did this myself on the login server by absently firing up a heavily parallel build process without realizing I was on the wrong machine for it.)

PS: The corollary of this is that if you want to limit the multi-CPU load impact of something, such as building Firefox from source, it's probably better to use taskset(1) than to do it with systemd features, because it's much more likely that things will notice the taskset limits and not flood your process table and spike the load average. This will work best on single-user machines, such as your desktop, where you don't have to worry about coordinating taskset CPU ranges with anyone or anything else.
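Since taskset(1) works by setting the scheduling affinity mask (the same mechanism as sched_setaffinity()), you can see its effect from inside a process. A small demonstration, restricting ourselves to a single CPU:

```python
import os

before = os.sched_getaffinity(0)
print("CPUs before:", len(before))

# Restrict ourselves to one CPU, the in-process equivalent of
# running under 'taskset -c N'; pick a CPU we definitely have.
one_cpu = min(before)
os.sched_setaffinity(0, {one_cpu})
print("CPUs after:", len(os.sched_getaffinity(0)))   # 1

# Any child we now start (a parallel build, say) inherits this mask,
# so nproc and affinity-aware tools in it will report just one CPU.
```

This inheritance across fork/exec is exactly why taskset limits get noticed by build tools while cgroup quotas mostly don't.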

linux/CPUCountingChallenges written at 22:20:00;
