Notes on the Linux kernel's 'pressure stall information' and its meanings

June 29, 2022

As we increasingly move to Ubuntu 22.04, enough of our machines now have the Linux kernel's somewhat new Pressure Stall Information (PSI) that I've been investigating adding PSI information to our per-host dashboards in our metrics setup. In the process I realized that I didn't understand the PSI information as well as I needed to, so here are some notes.

The raw global PSI information appears in /proc/pressure in files called 'cpu', 'io', and 'memory'. Each file looks like this:

some avg10=4.35 avg60=1.11 avg300=0.25 total=46338094
full avg10=3.92 avg60=1.02 avg300=0.22 total=37445468
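
(To make the format concrete, here's a minimal sketch, in Python and entirely my own, of pulling one of these files apart; it only assumes the 'key=value' layout shown above.)

# A small sketch of parsing a /proc/pressure file, assuming the
# 'some'/'full' lines shown above (the global 'cpu' file may lack a
# 'full' line on some kernels).
def read_pressure(path):
    pressure = {}
    with open(path) as f:
        for line in f:
            kind, rest = line.split(None, 1)      # 'some' or 'full'
            fields = dict(kv.split("=") for kv in rest.split())
            pressure[kind] = {
                "avg10": float(fields["avg10"]),      # percent, 0-100
                "avg60": float(fields["avg60"]),
                "avg300": float(fields["avg300"]),
                "total_secs": int(fields["total"]) / 1e6,   # usec -> secs
            }
    return pressure

print(read_pressure("/proc/pressure/io"))

(The same format appears in cgroup v2's per-cgroup cpu.pressure, io.pressure, and memory.pressure files, so the same parsing works there too.)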

As covered in the documentation, the 'some' line is about when some processes (tasks) are stalled waiting for the particular thing, while the 'full' line is when all non-idle tasks are stalled and some of them are waiting on the resource. This isn't the definition given in the documentation, but it is what's described in kernel/sched/psi.c. The difference between 'some' and 'full' is nicely summed up as part of the comments in psi.c (emphasis mine):

The percentage of wallclock time spent in those compound stall states gives pressure numbers between 0 and 100 for each resource, where the SOME percentage indicates workload slowdowns and the FULL percentage indicates reduced CPU utilization: [...]

(The global 'cpu' file doesn't have a meaningful full line for good reasons. In cgroup v2, a specific cgroup can have a meaningful 'full' CPU line with non-zero values, for example if you're limiting its CPU usage.)

The 'avg' numbers are percentages of the time over the past 10, 60, or 300 seconds when the condition is true, running from 0 to 100. The total= number is the cumulative number of microseconds when the condition has been true. The Prometheus host agent reports the 'total=' number normalized to seconds, as probably do other metrics systems.

(The Prometheus host agent reports 'some' pressure lines as metrics with 'waiting' in their names and 'full' lines as metrics with 'stalled' in their names.)

The 'full' numbers are a subset of the 'some' numbers. If all non-idle tasks are stalled on the resource, both 'some' and 'full' count up; if only some are, only 'some' counts up. At times of high IO or memory pressure, 'full' and 'some' will probably be almost equal (which suggests that the difference between them may be worth paying attention to). As far as the 'total=' time goes, it normally increases no faster than real time, since the 'total=' time merely counts the time during which the condition has been true.

(As a result, if you take a per-second rate of increase in the total time, for example by using the PromQL rate() function, what you get is a fraction from 0.0 to 1.0 of the time over your time range, or equivalently the average amount of time per second that the condition has been true.)
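
(As a made-up illustration of that, here's a small sketch of what such a rate works out to from two samples of a 'total=' counter; with the Prometheus host agent, rate() over something like node_pressure_io_waiting_seconds_total, assuming I have the metric name right, gives you this directly.)

# A sketch of what a per-second rate over the cumulative 'total='
# counter means: the fraction of wall clock time the condition was
# true. The sample values here are invented.
total_t1_usec = 46_338_094        # 'total=' at time t1
total_t2_usec = 48_338_094        # 'total=' ten seconds later
elapsed_secs = 10.0

stalled_secs = (total_t2_usec - total_t1_usec) / 1e6
fraction = stalled_secs / elapsed_secs
print(fraction)    # 0.2, ie the condition held about 20% of the time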

As covered in the extensive comments in kernel/sched/psi.c, the kernel's view of lost potential, which drives these pressure indicators, is scaled by the number of CPUs relative to the amount of contention. Those comments are detailed and well worth reading to see the fine details and examples. Based on them, a machine with a lot of CPUs is going to need a lot of delayed tasks to reach large 'full' numbers. Although I haven't carefully read the code, I'd assume that this scaling is also applied to the 'total=' time numbers so that everything works out right.
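
(Based on my reading of the formulas in the psi.c comments, the instantaneous pressure works out to roughly the following; take this as a sketch of the scaling rather than what the kernel literally computes, since the real code tracks state changes over time.)

# A rough sketch of the instantaneous SOME/FULL percentages as
# described in the kernel/sched/psi.c comments, to show the scaling
# by CPU count; this is not the kernel's actual bookkeeping.
def psi_instant(nr_cpus, nr_nonidle_tasks, nr_delayed_tasks, nr_productive_tasks):
    threads = min(nr_nonidle_tasks, nr_cpus)
    if threads == 0:
        return 0.0, 0.0
    some = min(nr_delayed_tasks / threads, 1.0)
    full = (threads - min(nr_productive_tasks, threads)) / threads
    return some * 100, full * 100

# On a 32-CPU machine, 4 delayed tasks among 36 non-idle ones is only
# 12.5% SOME and 0% FULL; the same 4 delayed tasks with nothing else
# running is 100% SOME and 100% FULL.
print(psi_instant(32, 36, 4, 32))   # (12.5, 0.0)
print(psi_instant(32, 4, 4, 0))     # (100.0, 100.0)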

Because of how they're defined, the 'full' states for both IO and memory are inverses of CPU busyness; as your CPUs go to full usage, 'full' stalls on IO and memory must go to zero, because there are active non-idle tasks and you're not wasting any CPU time waiting. This implies that the higher your 'some' CPU metric, the more you're driving the 'full' IO and memory metrics to zero. If you manage 100% 'some' CPU, you should definitely have 0% 'full' IO and memory stalls. However, I believe that you can have high 'some' IO or memory at the same time as high 'some' CPU, provided that you have enough tasks doing enough things at once.

I believe that it's possible to have high 'full' IO stalls with only a few processes waiting for IO, provided that your system is otherwise idle. If I'm working through the math from kernel/sched/psi.c correctly, if you have one task waiting on IO and no other non-idle tasks, you will have a 100% 'full' IO number. You probably can't get a system that idle, though. The corollary of this is that a high 'full' IO number doesn't necessarily mean that you have a problem; it may mean that your system isn't doing much other than IO.
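
(Working that case through the same assumed scaling as the sketch above, the machine's CPU count drops out entirely:)

# The 'one task waiting on IO and nothing else non-idle' case, worked
# through the assumed psi.c scaling; the CPU count doesn't matter
# because threads is capped by the number of non-idle tasks.
nr_cpus = 16
nr_nonidle_tasks = 1      # the lone task, which is waiting on IO
nr_delayed_tasks = 1      # it is stalled
nr_productive_tasks = 0   # nothing else is doing useful work

threads = min(nr_nonidle_tasks, nr_cpus)                        # 1
full = (threads - min(nr_productive_tasks, threads)) / threads  # 1.0
print(full * 100)     # 100.0, ie a 100% 'full' IO pressure while this lasts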

Since IO pressure requires processes (tasks) to actually wait on IO being completed, it's implicitly biased toward read IO and against write IO. Read IO is almost always synchronous, with things waiting for it to finish and thus being counted for IO pressure, while write IO is often asynchronous, with no task explicitly waiting for it to complete. The IO pressure information is fair (tasks really are waiting for their IO), but you can't take it as a complete picture of how loaded and busy your IO system is.

(As usual, the process of writing this entry has left me much better informed than I was when I started.)

Sidebar: The documentation versus psi.c

The kernel PSI documentation describes the 'full' line this way (emphasis mine):

The “full” line indicates the share of time in which all non-idle tasks are stalled on a given resource simultaneously.

Interpreted as written, this definition would imply that 'full' IO and memory stalls cannot happen at the same time. You'd need all tasks to be stalled on a single thing (IO or memory), not some stalled on one and others stalled on the other.

The kernel/sched/psi.c definition for 'full' merely requires there to be no productive tasks running:

FULL = nr_delayed_tasks != 0 && nr_productive_tasks == 0

Life is not quite this simple, because the definition of 'productive' is subtly different between IO and memory; I think it's possible for IO to consider there to be a productive task running while memory considers there not to be. See the comments in psi.c for the details.
