2019-01-18
Linux CPU numbers are not necessarily contiguous
In Linux, the kernel gives all CPUs a number; you can see this number
in, for example, /proc/stat:
    cpu0 [...]
    cpu1 [...]
    cpu2 [...]
    cpu3 [...]
Under normal circumstances, Linux has contiguous CPU numbers that
start at 0 and go up to however many CPUs the system has. However,
this is not guaranteed and is not always the case on certain live
configurations. It's perfectly possible to have a configuration
where, for example, you have sixteen CPUs that are numbered 0 to 7
and 16 to 23, with 8 to 15 missing. In this situation, /proc/stat
will match the kernel's numbering, with lines for cpu0 through cpu7
and cpu16 through cpu23. If your code sees this and
decides to fill in the missing CPUs 8 through 15, it will be wrong.
You might think that no code could possibly make this mistake, but it's not quite that simple. If, for example, you make a straightforward array to hold CPU status, read in information from various sources, and then print out your accumulated data for CPUs 0 through the highest CPU you saw, you will invent those missing CPUs 8 through 15 (possibly with random unset data for them). In situations like this, you need to actively keep track of what CPUs in your array are valid and what ones aren't, or you need a more sophisticated data structure.
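One way to avoid inventing CPUs is to key the data by CPU number instead of by array position. The following is a minimal sketch, assuming the usual /proc/stat layout of per-CPU lines that start with 'cpuN' followed by whitespace-separated counters; the function names parse_cpu_stats and read_cpu_stats are made up for illustration.

```python
import re

# Per-CPU lines look like "cpu16 123 0 456 ...". The aggregate "cpu ..."
# line has no number after "cpu", so this pattern skips it.
_CPU_LINE = re.compile(r"^cpu(\d+)\s+(.*)$")


def parse_cpu_stats(text):
    """Return {cpu_number: [counter, ...]} for only the CPUs listed in text."""
    stats = {}
    for line in text.splitlines():
        m = _CPU_LINE.match(line)
        if m:
            stats[int(m.group(1))] = [int(f) for f in m.group(2).split()]
    return stats


def read_cpu_stats(path="/proc/stat"):
    """Convenience wrapper (hypothetical name) that reads the real file."""
    with open(path) as f:
        return parse_cpu_stats(f.read())


# Usage with made-up sample data that has a gap in the numbering.
sample = "cpu  6 0 4 100\ncpu0 1 0 2 50\ncpu7 2 0 1 20\ncpu16 3 0 1 30\n"
stats = parse_cpu_stats(sample)
for cpu in sorted(stats):
    # Iterate over the CPUs that exist, never over range(max_cpu + 1).
    print(f"cpu{cpu}: {stats[cpu]}")
```

Because the loop walks the dictionary's keys, CPUs 1 through 6 and 8 through 15 never appear in the output for this sample.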
(If you've created an API that says 'I return an array of CPU information for CPUs 0 through N', well, you have a problem. You're probably going to need an API change; if this is in a structure, at least an API addition of a new field to tell people which CPUs are valid.)
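If you're stuck with an array-shaped result, the 'new field' approach might look something like this. This is only a sketch: CpuTable and its fields are hypothetical names, not anything from a real library.

```python
from dataclasses import dataclass, field


@dataclass
class CpuTable:
    """Hypothetical array-style result that also says which slots are real."""

    counters: list = field(default_factory=list)  # indexed by CPU number
    valid: list = field(default_factory=list)     # valid[n] is True only if CPU n was seen

    def set(self, cpu, values):
        # Grow both lists together so they always have the same length.
        while len(self.counters) <= cpu:
            self.counters.append(None)
            self.valid.append(False)
        self.counters[cpu] = values
        self.valid[cpu] = True

    def present(self):
        """CPU numbers that actually carry data, in ascending order."""
        return [n for n, ok in enumerate(self.valid) if ok]


table = CpuTable()
table.set(0, [1, 0, 2])
table.set(16, [3, 0, 1])
print(table.present())
```

Existing callers can keep indexing counters by CPU number, while new callers check valid (or use present()) before trusting a slot.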
I can see why people make this mistake. It's tempting to have simple code, displays, and so on, and almost all Linux machines have contiguous CPU numbering, so your code will work almost everywhere (we only wound up with non-contiguous numbering through bad luck). But, sadly, it is a mistake and sooner or later it will bite either you or someone who uses your code.
(It's unfortunate that doing this right is more complicated. Life certainly would be simpler if Linux guaranteed that CPU numbers were always contiguous, but given that CPUs can come and go, that could cause CPU numbers to not always refer to the same actual CPU over time, which is worse.)
Sidebar: How we have non-contiguous CPU numbers
We have one dual-socket machine with hyperthreading where one socket has cooling problems and we've shut it down by offlining the CPUs. Each socket has eight cores, and Linux enumerated one side of the HT pairs for both sockets before starting on the other side of the HT pairs. CPUs 0 through 7 and 16 through 23 are the two HTs for the eight cores on the first socket; CPUs 8-15 would be the first set of CPUs for the second socket, if they were online, and then CPUs 24-31 would be the other side of the HT pairs.
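The kernel reports which CPUs are currently online in /sys/devices/system/cpu/online, as a comma-separated list of numbers and ranges. Here is a minimal sketch of expanding that format, assuming the file is present and uses that list syntax; parse_cpu_list and online_cpus are names I've made up.

```python
def parse_cpu_list(text):
    """Expand a CPU list string such as "0-7,16-23" into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # tolerate an empty string
        lo, sep, hi = part.partition("-")
        # A bare number like "3" is a one-element range.
        cpus.update(range(int(lo), int(hi if sep else lo) + 1))
    return sorted(cpus)


def online_cpus(path="/sys/devices/system/cpu/online"):
    """Read the online CPU numbers from sysfs (path is the assumed default)."""
    with open(path) as f:
        return parse_cpu_list(f.read())


# The numbering described above, written the way the kernel would list it.
print(parse_cpu_list("0-7,16-23"))
```

For the string "0-7,16-23" this gives the sixteen numbers 0 through 7 and 16 through 23, with nothing in between.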
In general, HT pairing is unpredictable. Some machines will pair adjacent CPU numbers (so CPU 0 and CPU 1 are an HT pair) and some machines will enumerate all of one side before they enumerate all of the other. My Ryzen-based office workstation enumerates HT pairs as adjacent CPU numbers, so CPU 0 and CPU 1 are a pair, while my Intel-based home machine enumerates all of one HT side before flipping over to enumerate all of the other, so CPU 0 and CPU 6 are a pair.
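Since the pairing can't be worked out from the CPU numbers alone, code that cares can ask the kernel instead of guessing. This is a rough sketch, assuming the sysfs file cpuN/topology/thread_siblings_list exists under /sys/devices/system/cpu for the CPU in question and holds a CPU list; thread_siblings and parse_cpu_list are illustrative names.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Expand a CPU list string such as "0,6" or "0-1" into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, sep, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi if sep else lo) + 1))
    return sorted(cpus)


def thread_siblings(cpu, sysfs_root="/sys/devices/system/cpu"):
    """Return the CPU numbers sharing a core with `cpu`, including `cpu` itself.

    Raises FileNotFoundError if the topology file is missing for this CPU.
    """
    path = Path(sysfs_root) / f"cpu{cpu}" / "topology" / "thread_siblings_list"
    return parse_cpu_list(path.read_text())
```

On a machine with adjacent pairing, thread_siblings(0) would be expected to come back as [0, 1]; on one that enumerates all of one side first, it would be something like [0, 6].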
(I prefer the Ryzen ordering because it makes life simpler.)
It's possible that we should be doing something less or other than offlining all of the CPUs for the socket with the cooling problem (perhaps the BIOS has an option to disable one socket entirely). But offlining them all seemed like the most thorough and sure option, and it certainly was simple.