Where Linux's load average comes from in the kernel
Suppose, not hypothetically, that you have a machine that periodically has its load average briefly soar to relatively absurd levels for no obvious reason; the machine is normally at, say, 0.5 load average but briefly spikes to 10 or 15 a number of times a day. You would like to know why this happens. Capturing 'top' output once a minute doesn't show anything revealing, and since these spikes are unpredictable it's difficult to watch top continuously to possibly catch one in action. A starting point is to understand how the Linux kernel puts the load average together.
(This is on a multi-user login machine with a lot of people logged in, so one obvious general hypothesis is that there is some per-user background daemon or process that periodically wakes up, and that sometimes all of the instances are synchronized, creating a brief load spike as they all try to get scheduled and run at once.)
The core calculations are in kernel/sched/loadavg.c.
As lots of things will tell you, the load average is "an exponentially
decaying average" of a series of instantaneous samples. These samples
are taken at intervals of
LOAD_FREQ (currently set to 5 seconds).
To simplify a complicated implementation, the samples are the sum
of the per-CPU counts of the number of running tasks and uninterruptible
tasks (nr_running and nr_uninterruptible). The load average
calculation every five seconds doesn't compute these two counts
on the fly; instead they're maintained by the general kernel
scheduling code and then sampled. This means that if we somehow hook into
this periodic sampling with things like eBPF, we can't see the exact tasks
and programs involved in creating the load average; more or less the best
we could do would be to see the total numbers at each sample.
(This would already tell us more information than is generally exposed by 'top' and the load average; if nothing else, it might tell us how fast the spike happens, how high it reaches, and whether it's running tasks or tasks waiting on IO.)
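To make the "exponentially decaying average" concrete, here is a minimal Python sketch of the update step. The constants and the calc_load() arithmetic follow the kernel's include/linux/sched/loadavg.h as I understand it; the simulate() helper and the sample sequence are my own hypothetical illustration, not anything the kernel does.

```python
# Fixed-point constants as defined in include/linux/sched/loadavg.h.
FSHIFT = 11
FIXED_1 = 1 << FSHIFT   # 1.0 in fixed point
EXP_1 = 1884            # decay factor for the 1-minute average
EXP_5 = 2014            # decay factor for the 5-minute average
EXP_15 = 2037           # decay factor for the 15-minute average


def calc_load(load, exp, active):
    """One update step, mirroring the kernel's calc_load().

    'load' and 'active' are fixed-point values (real value * FIXED_1).
    """
    newload = load * exp + active * (FIXED_1 - exp)
    if active >= load:
        newload += FIXED_1 - 1  # round up while the load is not falling
    return newload // FIXED_1


def simulate(samples, exp=EXP_1, start=0.0):
    """Hypothetical helper: feed one task count per LOAD_FREQ interval
    through the average and return the load after each sample as a float."""
    load = int(start * FIXED_1)
    out = []
    for count in samples:
        load = calc_load(load, exp, count * FIXED_1)
        out.append(load / FIXED_1)
    return out


if __name__ == "__main__":
    # Assumed scenario: idle, then 100 tasks for two samples, then idle again.
    samples = [0, 0, 100, 100, 0, 0, 0, 0, 0, 0]
    for i, value in enumerate(simulate(samples)):
        print(f"t={i * 5:3d}s sample={samples[i]:3d} load1={value:.2f}")
```

One thing this makes visible is that a single sample moves the one-minute figure by only about 8% of the sample's value (10 tasks seen once from an idle start gives roughly 0.8), so a visible spike to 10 or 15 needs either a very large instantaneous count or several high samples in a row.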
The points where a task increases or decreases the number of uninterruptible tasks are somewhat accessible. A task exits this state in ttwu_do_activate() in kernel/sched/core.c under conditions that are probably accessible to eBPF. A task increases the number of uninterruptible tasks in the depths of __schedule(), depths which don't seem amenable to hooking by eBPF; however I think you might be able to conditionally hook deactivate_task() to record this.
As you might expect, tasks become runnable or stop being runnable all over the place, and the code that tracks and implements this is distributed around a number of kernel functions, some of them inlined ones from headers (plus, the kernel has more than one scheduler). It's not clear to me whether there are any readily accessible eBPF tracepoints or kernel functions that could be used to hook into when a particular task becomes runnable or stops being runnable. There does seem to be a scheduler tracepoint for when this number changes, but I'm not certain if you can extract the task information from the tracepoints (and I think sometimes a number of tasks can become runnable all at once).
The current instantaneous value of nr_running is exposed in /proc/loadavg, as covered in proc(5), and as a result often makes it into metrics systems. I don't think nr_uninterruptible is exposed anywhere that's readily accessible. However, cgroups v1 does report it through the general taskstats accounting interface, sort of per cgroupstats.rst, with the disclaimer that I'm not sure how this interacts with systemd's cgroups stunts. The kernel's tools/accounting/getdelays.c does seem to work system-wide, if you run it as e.g. 'getdelays -C /sys/fs/cgroup/cpu,cpuacct', but you need to build the right version of it; the version in the kernel's current source tree may not compile on older kernels.
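As a small illustration of pulling nr_running out of /proc/loadavg, here is a Python sketch based on the field layout described in proc(5) (three load averages, then 'running/total', then the most recently created PID). The LoadAvg type and the function names are hypothetical, made up for this example.

```python
from typing import NamedTuple


class LoadAvg(NamedTuple):
    load1: float
    load5: float
    load15: float
    nr_running: int  # currently runnable scheduling entities
    nr_total: int    # scheduling entities that currently exist
    last_pid: int    # PID of the most recently created process


def parse_loadavg(text):
    """Parse the contents of /proc/loadavg into a LoadAvg tuple."""
    fields = text.split()
    running, total = fields[3].split("/")
    return LoadAvg(
        float(fields[0]),
        float(fields[1]),
        float(fields[2]),
        int(running),
        int(total),
        int(fields[4]),
    )


def read_loadavg(path="/proc/loadavg"):
    """Read and parse the live file (Linux only)."""
    with open(path) as f:
        return parse_loadavg(f.read())


if __name__ == "__main__":
    print(read_loadavg())
```

Note that this only gives the runnable count at the moment of the read, not the value the kernel sampled at its last load average update, and it says nothing about uninterruptible tasks.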
Having gone through all of this, what I've learned is that tracing this area with eBPF is probably too much work, but we could probably get a fair way by dumping basic process information every few seconds, since the load average is only updated every five seconds or so and what matters is the current state of things close to that time. Sadly I don't think the kernel offers a way to get a bulk dump of current processes and their states, say via netlink; instead I think you have to go through /proc yourself (or let ps do it with an appropriate output format).
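Going through /proc yourself could look something like the following Python sketch, which lists processes that are currently running or runnable (state R) or in uninterruptible sleep (state D), using the /proc/<pid>/stat format from proc(5). The function names are hypothetical, and this is an assumption-laden starting point: it only looks at the top-level /proc/<pid> entries, so individual threads (which live under /proc/<pid>/task/ and which the load average does count) are not broken out.

```python
import os


def parse_stat(text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the last ')' instead of splitting on whitespace.
    lparen = text.index("(")
    rparen = text.rindex(")")
    pid = int(text[:lparen])
    comm = text[lparen + 1:rparen]
    state = text[rparen + 1:].split()[0]
    return pid, comm, state


def snapshot(states=("R", "D"), proc="/proc"):
    """List (pid, comm, state) for processes whose state is in 'states'."""
    found = []
    for name in os.listdir(proc):
        if not name.isdigit():
            continue
        try:
            with open(os.path.join(proc, name, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError):
            continue  # the process exited (or was unreadable) mid-scan
        if state in states:
            found.append((pid, comm, state))
    return found


if __name__ == "__main__":
    for pid, comm, state in sorted(snapshot()):
        print(f"{pid:>7} {state} {comm}")
```

Run from a loop or a timer every few seconds with a timestamp added, this would at least show which programs are in R or D state around the time of a spike, although the scan itself is not atomic and can miss very short-lived tasks.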