== A realization about the Linux CPU pressure stall information

Modern versions of the Linux kernel have a set of metrics that are intended to give you a better high-level indicator of where (or why) your system is 'loaded' than the venerable load average, in the form of [[Pressure Stall Information https://www.kernel.org/doc/html/latest/accounting/psi.html]] ([[also https://facebookmicrosites.github.io/psi/docs/overview]]). The top-level version of this is exposed as three files in _/proc/pressure_, called _cpu_, _memory_, and _io_. If your distribution uses cgroup2, each cgroup also has its own version of these files that's specific to the cgroup. (Recent versions of systemd use this information as part of [[systemd-oomd https://www.freedesktop.org/software/systemd/man/systemd-oomd.service.html]] and probably other things.)

As covered in the documentation, the _io_ and _memory_ PSI files have two lines, one for 'some' and one for 'full', while _cpu_ has only 'some'. The 'some' metric is for when some (one or more) tasks were delayed for lack of the resource, while the 'full' metric is for when *all* non-idle tasks were delayed at the same time.

When I first read about all of this, I didn't immediately see why the _cpu_ pressure information only had the 'some' metric; I had to think about it. The answer is that unlike memory and IO, where tasks can be entirely stalled with none of them getting any of the resource yet, ~~there's always some task getting CPU if there's demand for it~~. A 'full' stall on CPU would require that you have runnable tasks but that nothing was actually being scheduled. Looking at the total system (for _/proc/pressure_), it's basically impossible for CPU to stall this way without a serious kernel problem; the kernel scheduler would have to be stuck somehow.
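To make the system-wide argument concrete, here is a toy model in Python. It is my own simplification and not how the kernel actually does PSI accounting. Each tick records how many tasks in some scope want CPU and how many are actually running. A tick counts toward 'some' if at least one runnable task is waiting, and toward 'full' if there are runnable tasks but none of them is running. The names _Tick_, _stall_fractions_, and _system_ticks_ are hypothetical. With a work-conserving scheduler, which always runs as many runnable tasks as it has CPUs, the 'full' fraction for the whole system can never be nonzero.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Tick:
    runnable: int  # tasks in this scope that want CPU during the tick
    running: int   # tasks in this scope actually on a CPU during the tick


def stall_fractions(ticks: List[Tick]) -> Tuple[float, float]:
    """Return the (some, full) stall fractions over the given ticks.

    Toy model only: 'some' means at least one runnable task is waiting;
    'full' means there are runnable tasks but none of them is running.
    """
    if not ticks:
        return 0.0, 0.0
    some = sum(1 for t in ticks if t.runnable > t.running)
    full = sum(1 for t in ticks if t.runnable > 0 and t.running == 0)
    return some / len(ticks), full / len(ticks)


def system_ticks(demand: List[int], ncpus: int) -> List[Tick]:
    """Model a work-conserving scheduler for the whole system.

    It never leaves a CPU idle while a task is runnable, so each tick it
    runs min(runnable, ncpus) tasks.
    """
    return [Tick(runnable=d, running=min(d, ncpus)) for d in demand]


if __name__ == "__main__":
    ticks = system_ticks([0, 1, 3, 5, 2], ncpus=2)
    print(stall_fractions(ticks))  # (0.4, 0.0)
```

By contrast, a scope whose tasks are runnable but never get scheduled, such as _[Tick(runnable=1, running=0)] * 4_, scores _(1.0, 1.0)_. No work-conserving system-wide scheduler can produce that pattern.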
However, I'm not sure that this is impossible for an individual cgroup. Since you can arrange a hierarchy of per-cgroup priorities for CPU time, it wouldn't surprise me if you could completely starve a victim cgroup. Right now the _cpu.pressure_ file for cgroups only has the 'some' metric, just like the _/proc/pressure_ version, but perhaps that will change in the future. (Linux also has low-priority 'idle' scheduling (SCHED_IDLE), as covered in [[sched(7) https://man7.org/linux/man-pages/man7/sched.7.html]], so you might be able to move all of the tasks in a cgroup into that scheduling policy so that they get starved that way.)
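If you want to read these files yourself, here is a minimal Python sketch that parses the line format described in the kernel's PSI documentation. Each line is a 'some' or 'full' keyword followed by _avg10_, _avg60_, and _avg300_ percentages and a cumulative _total_ stall time in microseconds. The function names _parse_psi_ and _read_psi_ are my own. The parser doesn't assume that a 'full' line exists, so it handles a CPU file that has only 'some' as well as one that also reports 'full'. According to the kernel documentation, later kernels (5.13 onward) add a 'full' line to the CPU files, and that line is always zero at the system level.

```python
from typing import Dict


def parse_psi(text: str) -> Dict[str, Dict[str, float]]:
    """Parse PSI text such as the contents of /proc/pressure/cpu.

    Only the lines actually present appear in the result, so a CPU file
    without a 'full' line simply has no 'full' key.
    """
    result: Dict[str, Dict[str, float]] = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind, *pairs = fields  # 'some' or 'full', then key=value pairs
        values: Dict[str, float] = {}
        for pair in pairs:
            key, _, raw = pair.partition("=")
            # avg10/avg60/avg300 are percentages; total is cumulative microseconds.
            values[key] = int(raw) if key == "total" else float(raw)
        result[kind] = values
    return result


def read_psi(path: str = "/proc/pressure/cpu") -> Dict[str, Dict[str, float]]:
    """Read and parse a PSI file, e.g. /proc/pressure/cpu or a cgroup's cpu.pressure."""
    with open(path) as f:
        return parse_psi(f.read())


if __name__ == "__main__":
    sample = "some avg10=1.50 avg60=0.75 avg300=0.20 total=123456\n"
    psi = parse_psi(sample)
    print(psi["some"]["avg10"], "full" in psi)  # 1.5 False
```

On a live system, _read_psi()_ with no argument reads _/proc/pressure/cpu_. For a cgroup you would pass the path of its _cpu.pressure_ file under wherever cgroup2 is mounted, which is commonly _/sys/fs/cgroup_.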