Wandering Thoughts archives

2024-01-18

Notes on the Linux kernel's 'irq' pressure stall information and meaning

For some time, the Linux kernel has had both general and per-cgroup 'Pressure Stall Information', which is intended to tell you something about when things on your system are stalling on various resources. The initial implementation provided this information for cpu usage, obtaining memory, and waiting on IO, as I wrote up in my notes on PSI. In kernel 6.1, an additional PSI file was added, 'irq' (if your kernel is built with CONFIG_IRQ_TIME_ACCOUNTING, which current Fedora kernels are).
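If you want to check whether your own kernel was built with this option, one way is to look at its build configuration (the /boot path is the usual one for distribution kernels; /proc/config.gz only exists if the kernel was built with CONFIG_IKCONFIG_PROC):

grep CONFIG_IRQ_TIME_ACCOUNTING /boot/config-$(uname -r)
# or, if your kernel exposes its configuration:
zgrep CONFIG_IRQ_TIME_ACCOUNTING /proc/config.gz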

One important reference for this is the kernel commit that added this feature. Another is Eva Lacy's Pressure Stall Information in Linux. However, both of these can be a little opaque about what's actually being calculated and reported in 'irq'.

The /proc/pressure/irq file will typically look like the other pressure files, with the exception that it only has a 'full' line:

full avg10=0.00 avg60=0.00 avg300=0.00 total=3753500244

As usual, the 'total=' number is the cumulative time in microseconds that tasks have been stalled on IRQ or soft IRQs. What 'stalled' means here is that at the end of every round of IRQ and softirq handling, the kernel works out the total amount of time that it spent doing this (the 'delta time' in the commit message), looks to see if there's a meaningful current task (I believe 'on this CPU'), and if there is, the time is added to 'total'.

There is no 'some' line for the inverse reason of why there's no 'full' line in the global 'cpu' pressure file. In the CPU case, there's always something running (globally), so you can't have a complete stall on CPU the way you can have on memory or IO, where all tasks could be waiting to get more memory or have their IO complete. In the case of IRQ handling, either there was no task running (on the CPU), in which case nothing is impeded by the IRQ handling time, or there was a task running at the time the IRQ handling happened, in which case it completely stalled for the duration.

If I'm understanding all of this correctly, one corollary is that 'irq' pressure only happens to the extent that your system is busy. Given a fixed amount of time spent handling IRQs and softirqs, the amount of that time that shows up in /proc/pressure/irq depends on how often it's interrupting a (running) task, which depends on how many running tasks you have. On an idle system, the IRQ and softirq time isn't preempting anything and it's 'free', at least from the perspective of the PSI system.

Based on reading proc(5), you can get the total amount of time that the system has spent handling IRQs and softirqs from the 6th and 7th numbers on the first 'cpu' line in /proc/stat (the 6th number will be zero if IRQ time accounting isn't enabled for your kernel). On most machines, this will be in units of 100ths of a second. You can then cross-compare this to the total in /proc/pressure/irq. On my home Fedora machine (the one the sample line comes from), the irq pressure time is about 3% of the total IRQ handling time; on my work desktop, it's currently about 6%.
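As a rough sketch of this cross-comparison (assuming the usual USER_HZ of 100, so the /proc/stat numbers are in 100ths of a second):

# fields 6 and 7 after 'cpu' are cumulative irq and softirq time, in ticks
awk '/^cpu / { printf "irq+softirq time: %.1f seconds\n", ($7 + $8) / 100 }' /proc/stat
# the total= value in /proc/pressure/irq is in microseconds
awk -F= '/^full/ { printf "irq stall time: %.1f seconds\n", $NF / 1000000 }' /proc/pressure/irq

Dividing the second number by the first gives the sort of percentage I mention above.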

(I suspect that all of this means that /proc/pressure/irq won't be very interesting on many systems, which is good because tools like the Prometheus host agent may not have been updated to report it.)

PS: Ubuntu 22.04 kernels don't set CONFIG_IRQ_TIME_ACCOUNTING, although they're too old to have /proc/pressure/irq. As far as I can tell, this is still the case in the future 24.04 kernel ('Noble Numbat', and thus 'noble' on places like packages.ubuntu.com). This is potentially a little bit unfortunate, but it's apparently been this way for some time.

PSIIRQNumbersAndMeanings written at 22:24:00;

2024-01-17

Some interesting metrics you can get from cgroup V2 systems

In my roundup of what Prometheus exporters we use, I mentioned that we didn't have a way of generating resource usage metrics for systemd services, which in practice means unified cgroups (cgroup v2). This raises the good question of what resource usage and performance metrics are available in cgroup v2 that one might be interested in collecting for systemd services.

You can want to know about resource usage of systemd services (or more generally, systemd units) for a variety of reasons. Our reason is generally to find out what specifically is using up some resource on a server, and more broadly to have some information on how much of an impact a service is having. I'm also going to assume that all of the relevant cgroup resource controllers are enabled, which is increasingly the case on systemd based systems.

In each cgroup, you get the following:

  • pressure stall information for CPU, memory, IO, and these days IRQs. This should give you a good idea of where contention is happening for these resources.

  • CPU usage information, primarily the classical count of user, system, and total usage.

  • IO statistics (if you have the right things enabled; they're enabled on some but not all of our systems). For us, this appears to have the drawback that it doesn't capture information for NFS IO, only local disk IO, and it needs decoding to produce useful information (ie, information associated with a named device; you can find the device name mappings in /proc/partitions and /proc/self/mountinfo).

    (This might be more useful for virtual machine slices, where it will probably give you an indication of how much IO the VM is doing.)

  • memory usage information, giving both a simple amount assigned to that cgroup ('memory.current') and a relatively detailed breakdown of how much of what sorts of memory has been assigned to the cgroup ('memory.stat'). As I've found out repeatedly, the simple number can be misleading depending on what you really want to know, because it includes things like inactive file cache and inactive, reclaimable kernel slab memory.

    (You also get swap usage, in 'memory.swap.current', and there's also 'memory.zswap.current'.)

    In a Prometheus exporter, I might simply report all of the entries in memory.stat and sort it out later. This would have the drawback of creating a bunch of time series, but it's probably not an overwhelming number of them.

Although the cgroup doesn't directly tell you how many processes and threads it contains, you can read 'cgroup.procs' and 'cgroup.threads' to count how many entries they have. It's probably worth reporting this information.
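As a concrete sketch of reading all of this by hand (using 'cron.service' under system.slice purely as an illustration, and assuming the standard /sys/fs/cgroup mount point and that the relevant controllers are enabled):

cg=/sys/fs/cgroup/system.slice/cron.service
# memory currently charged to the cgroup, in bytes
cat $cg/memory.current
# cumulative CPU usage in microseconds (usage_usec, user_usec, system_usec)
grep _usec $cg/cpu.stat
# per-cgroup pressure stall information
cat $cg/cpu.pressure $cg/memory.pressure $cg/io.pressure
# count how many processes and threads are currently in the cgroup
wc -l <$cg/cgroup.procs
wc -l <$cg/cgroup.threads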

The root cgroup has some or many of these files, depending on your setup. Interestingly, in Fedora and Ubuntu 22.04, it seems to have an 'io.stat' even when other cgroups don't have it, although I'm not sure how useful this information is for the root cgroup.

Were I to write a systemd cgroup metric collector, I'd probably only have it report on first level and second level units (so 'system.slice' and then 'cron.service' under system.slice). Going deeper than that doesn't seem likely to be very useful in most cases (and if you go into user.slice, you have cardinality issues). I would probably skip 'io.stat' for the first version and leave it until later.
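A minimal sketch of walking just those two levels (the glob patterns are illustrative; real setups also have things like init.scope and nested user slices) might look like:

for cg in /sys/fs/cgroup/*.slice /sys/fs/cgroup/*.slice/*.service; do
    [ -r "$cg/memory.current" ] || continue
    printf '%s %s\n' "${cg#/sys/fs/cgroup/}" "$(cat "$cg/memory.current")"
done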

PS: I believe that some of this information can be visualized live through systemd-cgtop. This may be useful to see if your particular set of systemd services and so on even have useful information here.

CgroupV2InterestingMetrics written at 22:40:46;

2024-01-12

What we use ZFS on Linux's ZED 'zedlets' for

One of the components of OpenZFS is the ZFS Event Daemon ('zed'). Old ZFS hands will understand me if I say that it's the OpenZFS equivalent of the Solaris/Illumos fault management system as applied to ZFS; for other people, it's best described as ZFS's system for handling (kernel) ZFS events such as ZFS pools experiencing disk errors. Although the manual page obfuscates this a bit, what ZED does is it runs scripts (or programs in general) from a particular directory, normally /etc/zfs/zed.d, choosing what scripts to run for particular events based on their names. OpenZFS ships with a number of zedlets ('zedlet' is the name for these scripts), and you can add your own, which we do in our ZFS fileserver environment.

The standard ZED setup supports a number of common notification methods, including email; we enable email in our /etc/zfs/zed.d/zed.rc. The email you get through these stock notifications is a bit generic, but it's a useful starting point and fallback. Beyond this, we add three additional zedlets of our own:

  • one zedlet simply syslogs full details about almost all events by doing almost literally the following:

    printenv | fgrep 'ZEVENT_' | sort | fmt -999 |
      logger -p daemon.info -t 'cslab-zevents'
    

    ZED has an 'all-syslog.sh' zedlet that's normally enabled, but it doesn't capture absolutely everything this way and it believes in reformatting information a bit. We wanted to capture full event information so we could do as complete a reconstruction of things as possible later.

  • one zedlet syslogs when vdev state changes happen (and what they are) and immediately triggers our ZFS status reporting and spares handling system. Because ZED treats individual disks as vdevs, this is triggered for things like loss of disks and disk read, write, or checksum errors. Our own system for this will then email us a report about issues and start any sparing that's necessary (which will probably result in more email).

  • one zedlet syslogs when resilvers complete and triggers a run of our ZFS status reporting and spares handling system. This will report to us when a pool becomes healthy again and possibly start another round of sparing if we were holding back to not have too many resilvers happening at once.

Because ZED has a hard-coded ten second timeout on zedlets, we have to run our status reporting and spares handling in the background of the zedlet, which means we need to use some straightforward shell locking.
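I won't reproduce our actual zedlets here, but a minimal sketch of the pattern for the vdev state change case (the script name and the 'cslab-zfs-status' command are made up for illustration; the ZEVENT_* variables are the sort of thing ZED puts in the zedlet environment) looks something like this:

#!/bin/sh
# statechange-cslab.sh: ZED runs this for 'statechange' events because of
# the file name prefix.
logger -p daemon.info -t cslab-zevents \
  "vdev ${ZEVENT_VDEV_PATH:-?} is now ${ZEVENT_VDEV_STATE_STR:-?} in pool ${ZEVENT_POOL:-?}"

# The real work can take longer than ZED's ten second zedlet timeout, so do
# it in the background, serialized with flock so overlapping events don't
# start multiple copies.
(
  flock -x 9
  cslab-zfs-status --report-and-spare   # stand-in for our reporting and sparing system
) 9>/run/cslab-zfs-status.lock &

exit 0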

The net effect of this setup is that we'll generally get at least two emails if a disk has problems. One email will be generically formatted and come from the standard ZED email notification generated by the various '*-notify.sh' zedlets. The second email comes from our own ZFS status reporting system, using our own tools to report and summarize ZFS pool status with informative (for us) disk names and so on.

Sidebar: Why we have our own email reporting

A typical status report can look something like this:

Subject: sanhealthmon: details of ZFS pool problems on sanshui
Newly degraded pools:
  fs16-matter-02 fs16-rahulgk-01 fs16-vision-02

[...]
pool:     fs16-rahulgk-01
overall:  problems
problems: disk(s) have repaired errors
config:
  mirror      ONLINE   
    disk01/0  ONLINE   
    disk09/0  REPAIRED (errors: 1 read/0 write/0 checksum)
[...]

This is a lot more readable (for us) than decoding the equivalent in the normal ZFS email, and it also often summarizes the state of multiple pools if all of them have experienced errors simultaneously (because, for example, they all use the same physical disk and that physical disk has had a problem).

ZFSZEDOurZedletUse written at 22:46:04;

