2022-04-18
How to talk to a local IPMI under OpenBSD
Much like Linux, modern versions of OpenBSD are theoretically able to talk to a suitable local IPMI using the standard ipmi(4) kernel driver. This is imprecise although widely understood terminology; in more precise terms, OpenBSD can talk to a machine's BMC (Baseboard Management Controller) that implements the IPMI specification using one of a number of standard interfaces, as covered in the "System Interfaces" section of ipmi(4). However, OpenBSD throws us a curve ball in that the ipmi(4) driver is normally present in the default OpenBSD kernel but not enabled.
If the ipmi driver is present but not enabled and your machine has an IPMI that OpenBSD can talk to, the kernel boot messages will report something like:
ipmi at mainbus0 not configured
If you don't see any mention of 'ipmi' in the boot messages and you're using a normal kernel, your machine almost certainly doesn't have a recognized IPMI and you can stop here. If you do see this 'not configured' message, you most likely have an IPMI that OpenBSD can talk to and you now need to enable the IPMI driver.
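If you don't have the boot messages handy, you can search the kernel message buffer on a running machine:
# dmesg | grep ipmi
ipmi at mainbus0 not configured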
If you're using OpenBSD 7.0 or later, you enable the driver by creating or editing the file /etc/bsd.re-config (see bsd.re-config(5)) to contain:
enable ipmi
(This will often be the only line in bsd.re-config, partly because the file format doesn't allow comments.)
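(If you don't already have a /etc/bsd.re-config with other contents, a one-liner to create it is:
# echo 'enable ipmi' > /etc/bsd.re-config
)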
After you've set up bsd.re-config, you need to reboot at least once and perhaps twice. After this the kernel will recognize your IPMI with messages that look something like this:
ipmi0 at acpi0: version 2.0 interface KCS iobase 0xca8/8 spacing 4
ipmi at mainbus0 not configured
iic0: skipping sensors to avoid ipmi0 interactions
(You may not see the iic0 message.)
In OpenBSD 6.9 and previous versions there is no bsd.re-config, so you need to manually create a new kernel image with config(8) that has the ipmi driver specifically enabled. A typical usage would be (with 'ukc>' being the prompts from config):
# config -e -o /bsd.new /bsd
[...]
ukc> enable ipmi
[some messages about it]
ukc> quit
# mv /bsd /bsd.last && mv /bsd.new /bsd
# reboot
(Then you'll see the same sort of kernel messages as in OpenBSD 7.0.)
Unfortunately, using config(8) this way conflicts with OpenBSD's KARL kernel relinking. Enabling the ipmi driver this way will survive reboots (or it has so far for me), but it will apparently be lost if you use syspatch to apply kernel patches, and perhaps if you apply any patch at all.
Once your IPMI is configured under any OpenBSD version, you can do at least two new things. The first is that you can see IPMI sensors in 'sysctl hw.sensors', usually under hw.sensors.ipmi0. OpenBSD seems to be able to read IPMI sensors quite readily and without delays, which is a nice change from the usual Linux situation. The output of this on one of our machines looks like:
hw.sensors.ipmi0.temp0=26.00 degC (CPU Temp), OK
hw.sensors.ipmi0.temp1=32.00 degC (PCH Temp), OK
[...]
hw.sensors.ipmi0.fan0=9800 RPM (FAN1), OK
[...]
hw.sensors.ipmi0.volt0=12.29 VDC (12V), OK
hw.sensors.ipmi0.volt1=5.12 VDC (5VCC), OK
[...]
hw.sensors.ipmi0.indicator0=Off (Chassis Intru), OK
(Unfortunately, the Prometheus host agent currently doesn't read and report any of the hw.sensors sysctls. As always, the sensors you get will vary between server models and not all of them may make sense or be valid.)
The second thing is that you can install and use ipmitool, with it working the same as on Linux (and probably other *BSDs). Ipmitool comes from OpenBSD's ports collection and can be added with pkg_add. Once installed it will automatically use the /dev/ipmi0 device that OpenBSD has set up and everything just works. This can let you take an OpenBSD machine's IPMI from an unconfigured state to being up and on your management network without having to take the machine down into BIOS (although you do have to reboot at least once).
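As an illustration of what that can look like, here's a sketch of a session that puts a BMC on the network with a static address; the channel number (1 here) is common but not universal, and the addresses are made up, so treat all of these as placeholders for your own situation:
# pkg_add ipmitool
# ipmitool lan print 1
# ipmitool lan set 1 ipsrc static
# ipmitool lan set 1 ipaddr 192.168.100.10
# ipmitool lan set 1 netmask 255.255.255.0
# ipmitool lan set 1 defgw ipaddr 192.168.100.1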
(In theory, you can also do things like control what will happen to the machine if power goes out and then comes back on. Your mileage may vary as to whether your BMC really supports this portion of IPMI and it works right.)
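(The ipmitool side of that is 'chassis policy'; 'ipmitool chassis policy list' reports what the BMC claims to support, and for example 'ipmitool chassis policy always-on' asks it to power the machine back on after AC power loss. As said above, whether your BMC actually implements this is another matter.)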
Where Linux's load average comes from in the kernel
Suppose, not hypothetically, that you have a machine that periodically has its load average briefly soar to relatively absurd levels for no obvious reason; the machine is normally at, say, 0.5 load average but briefly spikes to 10 or 15 a number of times a day. You would like to know why this happens. Capturing 'top' output once a minute doesn't show anything revealing, and since these spikes are unpredictable it's difficult to watch top continuously to possibly catch one in action. A starting point is to understand how the Linux kernel puts the load average together.
(This is on a multi-user login machine with a lot of people logged in, so one obvious general hypothesis is that there is some per-user background daemon or process that periodically wakes up and sometimes they're all synchronized together, creating a brief load spike as they all try to get scheduled and run at once.)
The core calculations are in kernel/sched/loadavg.c. As lots of things will tell you, the load average is "an exponentially decaying average" of a series of instantaneous samples. These samples are taken at intervals of LOAD_FREQ (currently set to 5 seconds in include/linux/sched/loadavg.h).
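The decay itself is done with fixed point arithmetic in include/linux/sched/loadavg.h. Each of the three load averages is updated from a sample of 'active' tasks with, roughly:

load = (load * exp + active * (FIXED_1 - exp)) / FIXED_1

Here FIXED_1 is 2048 (a fixed point 1.0) and exp is a precomputed per-average decay factor; EXP_1 is 1884, which is 2048 * e^(-5/60), with corresponding EXP_5 and EXP_15 values for the five and fifteen minute averages.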
To simplify a complicated implementation, the samples are the sum of the per-CPU counts of the number of running tasks and uninterruptible tasks (nr_running and nr_uninterruptible). The load average calculation that runs every five seconds doesn't compute these two counts on the fly; instead they're maintained by the general kernel scheduling code and then sampled. This means that if we somehow hook into this periodic sampling with things like eBPF, we can't see the exact tasks and programs involved in creating the load average; more or less the best we could do would be to see the total numbers at each sample.
(This would already tell us more information than is generally exposed by 'top' and the load average; if nothing else, it might tell us how fast the spike happens, how high it reaches, and whether it's running tasks or tasks waiting on IO.)
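As a sketch of what hooking the sampling might look like (this is an assumption-laden example I haven't verified: it assumes calc_global_load() hasn't been inlined away on your kernel and that the calc_load_tasks symbol is resolvable), a bpftrace one-liner would be something like:

# bpftrace -e 'kprobe:calc_global_load { printf("active: %ld\n", *(int64 *)kaddr("calc_load_tasks")); }'

calc_load_tasks is the kernel's running total of running plus uninterruptible tasks, which is what the load average samples from.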
The points where a task adds to or reduces the number of uninterruptible tasks are somewhat accessible. A task exits this state in ttwu_do_activate() in kernel/sched/core.c, under conditions that are probably accessible to eBPF. A task increases the number of uninterruptible tasks in the depths of __schedule(), depths which don't seem amenable to hooking by eBPF; however, I think you might be able to conditionally hook deactivate_task() to record this.
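For the deactivate_task() half, a heavily hedged bpftrace sketch might be (assuming a 5.14 or later kernel, where the relevant task_struct field is __state; on older kernels it's state, and 2 is TASK_UNINTERRUPTIBLE):

# bpftrace -e 'kprobe:deactivate_task { $p = (struct task_struct *)arg1; if ($p->__state & 2) { @unint[comm] = count(); } }'

Since __schedule() deactivates the current task, comm here is normally the task that's going to sleep uninterruptibly.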
As you might expect, tasks become runnable or stop being runnable somewhat all over the place, and the code that tracks and implements this is distributed across a number of kernel functions, some of them inlined ones from headers (plus, the kernel has more than one scheduler). It's not clear to me if there are any readily accessible eBPF tracepoints or kernel functions that could be used to hook into when a particular task becomes runnable or stops being runnable. There does seem to be a scheduler tracepoint for when this number changes, but I'm not certain you can extract the task information from the tracepoints (and I think sometimes a number of tasks can become runnable all at once).
The current instant value of nr_running is exposed in /proc/loadavg, as covered in proc(5), and as a result often makes it into metrics systems. I don't think nr_uninterruptible is exposed anywhere that's readily accessible. However, cgroups v1 does report it through the general taskstats accounting interface, sort of per cgroupstats.rst, with the disclaimer that I'm not sure how this interacts with systemd's cgroups stunts. The kernel's tools/accounting/getdelays.c does seem to work system-wide, if you run it as eg 'getdelays -C /sys/fs/cgroup/cpu,cpuacct', but you need to build the right version of it; the version in the kernel's current source tree may not compile on older kernels.
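To illustrate the /proc/loadavg side of this, the output looks like (numbers made up):

$ cat /proc/loadavg
0.52 0.58 0.59 3/1482 210519

Per proc(5), the fourth field is the number of currently runnable kernel scheduling entities over the total number of them (and the fifth is the most recently allocated PID); the runnable count is nr_running alone, without nr_uninterruptible.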
Having gone through all of this, what I've learned is that tracing this area with eBPF is probably too much work, but we could probably get a fair way by dumping basic process information every few seconds, since the load average is only updated every five seconds or so and what matters is the current state of things close to that time. Sadly, I don't think the kernel offers a way to get a bulk dump of current processes and their states, say via netlink; instead I think you have to go through /proc yourself (or let ps do it with an appropriate output format).
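A minimal version of the ps approach, showing only tasks that are running or in uninterruptible sleep (the two states that feed the load average), might look like:

$ ps -eLo state=,pid=,comm= | awk '$1 ~ /^[RD]/'

The -L makes ps report individual threads, which matters because the kernel's counts are of scheduling entities (threads), not whole processes.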