2024-01-18
Notes on the Linux kernel's 'irq' pressure stall information and its meaning
For some time, the Linux kernel has had both general and per-cgroup 'Pressure Stall Information', which is intended to tell you something about when things on your system are stalling on various resources. The initial implementation provided this information for cpu usage, obtaining memory, and waiting on IO, as I wrote up in my notes on PSI. In kernel 6.1, an additional PSI file was added, 'irq' (if your kernel is built with CONFIG_IRQ_TIME_ACCOUNTING, which current Fedora kernels are).
One important reference for this is the kernel commit that added this feature. Another is Eva Lacy's Pressure Stall Information in Linux. However, both of these can be a little opaque about what's actually being calculated and reported in 'irq'.
The /proc/pressure/irq file will typically look like the other pressure files, with the exception that it only has a 'full' line:
full avg10=0.00 avg60=0.00 avg300=0.00 total=3753500244
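As a minimal sketch of reading this format, the following Python splits such a line into its kind ('full' here) and its key=value fields. The function name parse_psi_line is made up for illustration and isn't part of any existing tool. It assumes the field layout shown in the sample line above.

```python
def parse_psi_line(line):
    """Parse one PSI line into (kind, fields).

    Example input: 'full avg10=0.00 avg60=0.00 avg300=0.00 total=3753500244'
    """
    kind, *pairs = line.split()
    fields = {}
    for pair in pairs:
        key, value = pair.split("=", 1)
        # 'total' is an integer count of microseconds; the avgN values
        # are percentages over the last N seconds.
        fields[key] = int(value) if key == "total" else float(value)
    return kind, fields


sample = "full avg10=0.00 avg60=0.00 avg300=0.00 total=3753500244"
kind, fields = parse_psi_line(sample)
# Convert the cumulative microseconds into seconds for easier reading.
print(kind, fields["total"] / 1_000_000, "seconds")
```

For the sample line, this works out to a bit over an hour of cumulative stall time.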
As usual, the 'total=' number is the cumulative time in microseconds that tasks have been stalled, here on IRQs or softirqs. What 'stalled' means here is that at the end of every round of IRQ and softirq handling, the kernel works out the total amount of time that it spent doing this (the 'delta time' in the commit message) and looks to see if there's a meaningful current task (I believe 'on this CPU'). If there is one, the time is added to 'total'.
There is no 'some' line, for the inverse of the reason why there's no 'full' line in the global 'cpu' pressure file. In the CPU case, there's always something running (globally), so you can't have a complete stall on CPU the way you can have on memory or IO, where all tasks could be waiting to get more memory or have their IO complete. In the case of IRQ handling, either there was no task running (on the CPU), in which case nothing is impeded by the IRQ handling time, or there was a task running at the time the IRQ handling happened, in which case it completely stalled for the duration.
If I'm understanding all of this correctly, one corollary is that 'irq' pressure only happens to the extent that your system is busy. Given a fixed amount of time spent handling IRQs and softirqs, the amount of that time that shows up in /proc/pressure/irq depends on how often it's interrupting a (running) task, which depends on how many running tasks you have. On an idle system, the IRQ and softirq time isn't preempting anything and it's 'free', at least from the perspective of the PSI system.
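To make this corollary concrete, here is a toy model in Python. It is only a simplified rendering of the accounting as described above, not the kernel's actual code, and the function name and the burst numbers are invented for illustration. Each round of IRQ handling is a (microseconds, task was running) pair, and the time only counts as pressure when a task was running.

```python
def irq_pressure(irq_bursts):
    """Toy model of 'irq' pressure accounting.

    irq_bursts is a list of (delta_usec, task_running) pairs, one per
    round of IRQ and softirq handling. Returns a tuple of
    (total IRQ time, time charged as 'irq' pressure).
    """
    total_irq = 0
    stalled = 0
    for delta_usec, task_running in irq_bursts:
        total_irq += delta_usec
        # Only IRQ time that interrupted a running task counts as a stall.
        if task_running:
            stalled += delta_usec
    return total_irq, stalled


# Made-up durations; the same IRQ work under three different loads.
durations = [50, 30, 20, 100]
idle = [(d, False) for d in durations]
busy = [(d, True) for d in durations]
mixed = [(50, False), (30, True), (20, False), (100, False)]

for name, bursts in (("idle", idle), ("mixed", mixed), ("busy", busy)):
    total, stalled = irq_pressure(bursts)
    print(f"{name}: {stalled} of {total} usec counted as pressure")
```

The total IRQ time is identical in all three cases; only the fraction that lands in 'total=' changes with how busy the (imaginary) CPU is.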
Based on reading proc(5), you can get the total amount of time that the system has spent handling IRQs and softirqs from the 6th and 7th numbers on the first 'cpu' line in /proc/stat (the 6th number will be zero if IRQ time accounting isn't enabled for your kernel). On most machines, this will be in units of 100ths of a second. You can then cross-compare this to the total in /proc/pressure/irq. On my home Fedora machine (the one the sample line comes from), the irq pressure time is about 3% of the total IRQ handling time; on my work desktop, it's currently about 6%.
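A rough sketch of this cross-comparison in Python follows. The function names are made up for illustration. It assumes the /proc/stat field order from proc(5), takes the tick rate from os.sysconf("SC_CLK_TCK") rather than hard-coding 100, and assumes the 'full' line layout shown earlier.

```python
import os


def irq_seconds_from_stat(stat_text, ticks_per_second):
    """Sum the irq and softirq fields of the aggregate 'cpu' line, in seconds."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            # fields[0] is the 'cpu' label, so the 6th and 7th numbers
            # are at indexes 6 and 7.
            irq, softirq = int(fields[6]), int(fields[7])
            return (irq + softirq) / ticks_per_second
    raise ValueError("no aggregate 'cpu' line found")


def irq_pressure_seconds(pressure_text):
    """Return the 'total=' value of the 'full' line, converted to seconds."""
    for line in pressure_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "full":
            for pair in fields[1:]:
                key, _, value = pair.partition("=")
                if key == "total":
                    return int(value) / 1_000_000  # microseconds to seconds
    raise ValueError("no 'full' line with a total= field found")


def main():
    """Print what fraction of IRQ handling time showed up as 'irq' pressure."""
    ticks = os.sysconf("SC_CLK_TCK")
    with open("/proc/stat") as f:
        irq_time = irq_seconds_from_stat(f.read(), ticks)
    with open("/proc/pressure/irq") as f:
        stalled = irq_pressure_seconds(f.read())
    if irq_time > 0:
        print(f"irq pressure is {100 * stalled / irq_time:.1f}% of IRQ time")
    else:
        print("no IRQ time recorded (is IRQ time accounting enabled?)")
```

Calling main() on a machine with /proc/pressure/irq prints the percentage. Both numbers are cumulative since boot, so this is a since-boot average and not a view of current conditions.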
(I suspect that all of this means that /proc/pressure/irq won't be very interesting on many systems, which is good because tools like the Prometheus host agent may not have been updated to report it.)
PS: Ubuntu 22.04 kernels don't set CONFIG_IRQ_TIME_ACCOUNTING, although they're too old to have /proc/pressure/irq in any case. As far as I can tell, the option is still not set in the future 24.04 kernel ('Noble Numbat', and thus 'noble' on places like packages.ubuntu.com). This is potentially a little bit unfortunate, but it's apparently been this way for some time.