Wandering Thoughts archives

2024-03-31

Some thoughts on switching daemons to be socket activated via systemd

Socket activation is a systemd feature for network daemons where systemd is responsible for opening and monitoring the Internet or local socket for a daemon, and it only starts the actual daemon when a client connects. This behavior mimics the venerable inetd, but with rather more sophistication and features. A number of Linux distributions are a little bit in love with switching various daemons over to being socket activated this way, away from the traditional approach where the daemon itself sets up and listens on its sockets. Sometimes this goes well, and sometimes it doesn't.

There are a number of advantages to having a service (a daemon) activated by systemd through socket activation instead of running all the time:

  • Services can simplify their startup ordering because their socket is ready (and other services can start trying to talk to it) before the daemon itself is ready. In fact, systemd can reliably know when a socket is 'ready' instead of having to guess when a service has gotten that far in its startup.

  • Heavy-weight daemons don't have to be started until they're actually needed. As a consequence, these daemons and their possibly slow startup don't delay the startup of the overall system.

  • The service (daemon) responsible for handling a particular socket can often be restarted or swapped around without clients having to care.

  • The daemon responsible for the service can shut down after a while if there's no activity, reducing resource usage on the host; since systemd still has the socket active, the service will just get restarted if there's a new client that wants to talk to it.

Socket activated daemons don't have to ever time out and exit on their own; they can hang around until restarted or explicitly stopped if they want to. But it's common to make them exit on their own after a timeout, since this is seen as a general benefit. Often this is actually convenient, especially on typical systems. For example, I believe many libvirt daemons exit if they're unused; on my Fedora workstations, this means they're often not running (I'm usually not running VMs on my desktops).
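
As a concrete sketch, socket activation is normally a pair of units along these lines (the unit names, socket path, and daemon here are made up for illustration, and the daemon itself has to know how to accept its listening socket from systemd, for example via sd_listen_fds()):

# mydaemon.socket -- systemd owns and listens on the socket
[Socket]
ListenStream=/run/mydaemon.sock

[Install]
WantedBy=sockets.target

# mydaemon.service -- only started when the first client connects
[Service]
ExecStart=/usr/sbin/mydaemon

If the daemon later exits on its own when idle, systemd keeps the socket open and will start the service again on the next connection.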

Apart from requiring another systemd unit and a deeper involvement between the daemon and systemd, the downside of socket activation is that your daemon isn't started immediately and sometimes it may not be running at all. The advantage of daemons starting immediately on boot is that you know right away whether or not they could start, and if they're always running you don't have to worry about whether they'll restart under the system's current conditions (and perhaps some updated configuration settings). If the daemon has an expensive startup process, socket activation can mean that you have to wait for that startup on the first connection (or the first connection after things go idle), as systemd starts the daemon to handle your connection and the daemon works through its initialization.

Similarly, having the theoretical possibility for a daemon to exit if it's unused for long enough doesn't matter if it will never be unused for that long once it starts. For example, if a daemon has a deactivation timeout of two minutes of idleness and your system monitoring connects to it for a health check every 59 seconds, it's never going to time out (and it's going to be started very soon after the system boots, when the first post-boot health check happens).

PS: If you want to see all currently enabled systemd socket activations on your machine, you want 'systemctl list-sockets'. Most of them will be local (Unix) sockets.

SystemdSocketActivationThoughts written at 22:33:49;

2024-03-22

The Linux kernel.task_delayacct sysctl and why you might care about it

If you run a recent enough version of iotop on a typical Linux system, it may nag at you to the effect of:

CONFIG_TASK_DELAY_ACCT and kernel.task_delayacct sysctl not enabled in kernel, cannot determine SWAPIN and IO %

You might wonder whether you should turn on this sysctl, how much you care, and why it was defaulted to being disabled in the first place.
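
Assuming your kernel was built with CONFIG_TASK_DELAY_ACCT, you can check and change the sysctl at runtime; as covered later, turning it on only affects processes started afterward:

# see the current setting (0 is off, 1 is on)
sysctl kernel.task_delayacct

# turn it on for this boot
sysctl -w kernel.task_delayacct=1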

This sysctl enables (Task) Delay accounting, which tracks things like how long tasks wait for the CPU or for their IO to complete, on a per-task basis (which in Linux means 'thread', more or less). General system information will provide you an overall measure of this in things like 'iowait%' and pressure stall information, but those are aggregates; you may be interested in knowing things like how much specific processes are being delayed or how long they're waiting for IO.

(Also, overall system iowait% is a conservative measure and won't give you a completely accurate picture of how much processes are waiting for IO. You can get per-cgroup pressure stall information, which in some cases can come close to a per-process number.)

In the context of iotop specifically, the major thing you will miss is 'IO %', which is the percentage of the time that a particular process is waiting for IO. Task delay accounting can give you information about per-process (or task) run queue latency, but I don't know if there are any tools similar to iotop that will give you this information. There is a program in the kernel source, tools/accounting/getdelays.c, that will dump the raw information on a one-time basis (and in some versions, compute averages for you, which may be informative). The (current) task delay accounting information you can theoretically get is documented in comments in include/uapi/linux/taskstats.h (there is also a version of this in the kernel documentation). You may also want to look at include/linux/delayacct.h, which I think is the kernel internal version that tracks this information.

(You may need the version of getdelays.c from your kernel's source tree, as the current version may not be backward compatible to your kernel. This typically comes up as compile errors, which are at least obvious.)
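
For reference, building and using getdelays.c is normally something like the following (the exact compile requirements can vary between kernel versions, so check the comments at the top of the file; the PID is just an example):

# in your kernel source tree
cd tools/accounting
gcc -o getdelays getdelays.c

# dump delay accounting information for one process
./getdelays -d -p 1234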

How you can access this information yourself is sort of covered in Per-task statistics interface, but in practice you'll want to read the source code of getdelays.c or the Python source code of iotop. If you specifically want to track how long a task spends delaying for IO, there is also a field for it in /proc/<pid>/stat; per proc(5), field 42 is delayacct_blkio_ticks. As far as I can tell from the kernel source, this is the same information that the netlink interface will provide, although it only has the total time waiting for 'block' (filesystem) IO and doesn't have the count of block IO operations.
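
As a quick sketch, you can pull that field out with standard tools, although note that naive field splitting is thrown off if the process's command name (field 2) contains spaces or parentheses:

# total time this shell has spent blocked on block IO, in clock ticks
awk '{print $42}' /proc/$$/stat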

Task delay accounting can theoretically be requested on a per-cgroup basis (as I saw in a previous entry on where the Linux load average comes from), but in practice this only works for cgroup v1. This (task) delay accounting has never been added to cgroup v2, which may be a sign that the whole feature is a bit neglected. I couldn't find much that says why delay accounting was changed (in 2021) to default to being off. The commit that made this change seems to imply it was defaulted to off on the assumption that it wasn't used much. Also see this kernel mailing list message and this reddit thread.

Now that I've discovered kernel.task_delayacct and played around with it a bit, I think it's useful enough to us for diagnosing issues that we're going to turn it on by default until and unless we see problems (performance or otherwise). I'll probably stick to doing this with an /etc/sysctl.d/ drop-in file, because I think that gets applied early enough in boot to cover most processes of interest.
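
The drop-in file itself is a single line; the file name here is just an example:

# /etc/sysctl.d/80-task-delayacct.conf
kernel.task_delayacct = 1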

(As covered somewhere, if you turn delay accounting on through the sysctl, it apparently only covers processes that were started after the sysctl was changed. Processes started before have no delay accounting information, or perhaps only 'CPU' delay accounting information. One such process is init, PID 1, which will always be started before the sysctl is set.)

PS: The per-task IO delays do include NFS IO, just as iowait does, which may make it more interesting if you have NFS clients. Sometimes it's obvious which programs are being affected by slow NFS servers, but sometimes not.

TaskDelayAccountingNotes written at 23:09:37;

2024-03-21

Reading the Linux cpufreq sysfs interface is (deliberately) slow

The Linux kernel has a CPU frequency (management) system, called cpufreq. As part of this, Linux (on supported hardware) exposes various CPU frequency information under /sys/devices/system/cpu, as covered in Policy Interface in sysfs. Reading these files can provide you with some information about the state of your system's CPUs, especially their current frequency (more or less). This information is considered interesting enough that the Prometheus host agent collects (some) cpufreq information by default. However, there is one caution, which is that apparently the kernel deliberately slows down reading this information from /sys (as I learned recently). A comment in the relevant Prometheus code says that this delay is 50 milliseconds, but the comment dates from 2019 and may be out of date now (I wasn't able to spot the slowdown in the kernel code itself).

On a machine with only a few CPUs, reading this information is probably not going to slow things down enough that you really notice. On a machine with a lot of CPUs, the story can be very different. We have one AMD 512-CPU machine, and on this machine reading every CPU's scaling_cur_freq one at a time takes over ten seconds:

; cd /sys/devices/system/cpu/cpufreq
; time cat policy*/scaling_cur_freq >/dev/null
10.25 real 0.07 user 0.00 kernel

On a 112-CPU Xeon Gold server, things are not so bad at 2.24 seconds; a 128-Core AMD takes 2.56 seconds. A 64-CPU server is down to 1.28 seconds, a 32-CPU one 0.64 seconds, and on my 16-CPU and 12-CPU desktops (running Fedora instead of Ubuntu) the time is reported as '0.00 real'.

This potentially matters on high-CPU machines where you're running any sort of routine monitoring that tries to read this information, including the Prometheus host agent in its default configuration. The Prometheus host agent reduces the impact of this slowdown somewhat, but it's still noticeably slower to collect all of the system information if we have the 'cpufreq' collector enabled on these machines. As a result of discovering this, I've now disabled the Prometheus host agent's 'cpufreq' collector on anything with 64 cores or more, and we may reduce that in the future. We don't have a burning need to see CPU frequency information and we would like to avoid slow data collection and occasional apparent impacts on the rest of the system.
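
For reference, the Prometheus host agent (node_exporter) has per-collector command line flags, so turning this collector off is just a matter of starting it with something like:

node_exporter --no-collector.cpufreq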

(Typical Prometheus configurations magnify the effect of the slowdown because it's common to query ('scrape') the host agent quite often, for example every fifteen seconds. Every time you do this, the host agent re-reads these cpufreq sysfs files and hits this delay.)

PS: I currently have no views on how useful the system's CPU frequencies are as a metric, and how much they might be perturbed by querying them (although the Prometheus host agent deliberately pretends it's running on a single-CPU machine, partly to avoid problems in this area). If you do, you might either universally not collect CPU frequency information or take the time impact to do so even on high-CPU machines.

CpufreqSlowToRead written at 23:09:03;

2024-03-18

Sorting out PIDs, Tgids, and tasks on Linux

In the beginning, Unix only had processes and processes had process IDs (PIDs), and life was simple. Then people added (kernel-supported) threads, so processes could be multi-threaded. When you add threads, you need to give them some user-visible identifier. There are many options for what this identifier is and how it works (and how threads themselves work inside the kernel). The choice Linux made was that threads were just processes (that shared more than usual with other processes), and so their identifier was a process ID, allocated from the same global space of process IDs as regular independent processes. This has created some ambiguity in what programs and other tools mean by 'process ID' (including for me).

The true name for what used to be a 'process ID', which is to say the PID of the overall entity that is 'a process with all its threads', is a TGID (Thread or Task Group ID). The TGID of a process is the PID of the main thread; a single-threaded program will have a TGID that is the same as its PID. You can see this in the 'Tgid:' and 'Pid:' fields of /proc/<PID>/status. Although some places will talk about 'pids' as separate from 'tids' (eg some parts of proc(5)), the two types are both allocated from the same range of numbers because they're both 'PIDs'. If I just give you a 'PID' with no further detail, there's no way to know if it's a process's PID or a task's PID.

In every /proc/<PID> directory, there is a 'task' subdirectory; this contains the PIDs of all tasks (threads) that are part of the thread group (ie, that have the same TGID). All PIDs have a /proc/<PID> directory, but for convenience things like 'ls /proc' only list the PIDs of processes (which you can think of as TGIDs). The /proc/<PID> directories for other tasks aren't returned by the kernel when you ask for the directory contents of /proc, although you can use them if you access them directly (and you can also access or discover them through /proc/<PID>/task). I'm not sure which information in the /proc/<PID> directories for tasks is specific to the task itself and which is a total across all tasks in the TGID. The proc(5) manual page sometimes talks about processes and sometimes about tasks, but I'm not sure its coverage is comprehensive.
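
You can see all of this from the shell; a quick illustration using the shell itself as the process:

# the process's PID and its TGID (identical for a single-threaded process)
grep -E '^(Tgid|Pid):' /proc/$$/status

# the PIDs of all tasks (threads) in the thread group
ls /proc/$$/task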

(Much of the time when you're looking at what is actually a TGID, you want the total information across all threads in the TGID. If /proc/<PID> always gave you only task information even for the 'process' PID/TGID, multi-threaded programs could report confusingly low numbers for things like CPU usage unless you went out of your way to sum the /proc/<PID>/task/* information yourself.)

Various tools will normally return the PID (TGID) of the overall process, not the PID of a random task in a multi-threaded process. For example 'pidof <thing>' behaves this way. Depending on how the specific process works, this may or may not be the 'main thread' of the program (some multi-threaded programs more or less park their initial thread and do their main work on another one created later), and the program may not even have such a thing (I believe Go programs mostly don't, as they multiplex goroutines on to actual threads as needed).

If a tool or system offers you the choice to work on or with a 'PID' or a 'TGID', you are being given the choice to work with a single thread (task) or the overall process. Which one you want depends on what you're doing, but if you're doing things like asking for task delay information, using the TGID may better correspond to what you expect (since it will be the overall information for the entire process, not information for a specific thread). If a program only talks about PIDs, it's probably going to operate on or give you information about the entire process by default, although if you give it the PID of a task within the process (instead of the PID that is the TGID), you may get things specific to that task.

In a kernel context such as eBPF programs, I think you'll almost always want to track things by PID, not TGID. It is PIDs that do things like experience run queue scheduling latency, make system calls, and incur block IO delays, not TGIDs. However, if you're selecting what to report on, monitor, and so on, you'll most likely want to match on the TGID, not the PID, so that you report on all of the tasks in a multi-threaded program, not just one of them (unless you're specifically looking at tasks/threads, not 'a process').
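
This naming split shows up directly in tools. In bpftrace, for example, the 'pid' builtin is the kernel's tgid (the process) and the 'tid' builtin is the kernel's pid (the task). A quick sketch that prints both for anything calling write():

bpftrace -e 'tracepoint:syscalls:sys_enter_write { printf("%s: pid (tgid) %d, tid %d\n", comm, pid, tid); }'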

(I'm writing this down partly to get it clear in my head, since I had some confusion recently when working with eBPF programs.)

PidsTgidsAndTasks written at 21:59:58;

2024-03-16

Some more notes on Linux's ionice and kernel IO priorities

In the long ago past, Linux gained some support for block IO priorities, with some limitations that I noticed the first time I looked into this. These days the Linux kernel supports more forms of IO scheduling and limiting, for example in cgroups v2 and its IO controller. However, ionice is still there, and now I want to note some more things about it, since I just looked at ionice again (for reasons outside the scope of this entry).

First, ionice and the IO priorities it sets are specifically only for read IO and synchronous write IO, per ioprio_set(2) (this is the underlying system call that ionice uses to set priorities). This is reasonable, since IO priorities are attached to processes and asynchronous write IO is generally actually issued by completely different kernel tasks and in situations where the urgency of doing the write is unrelated to the IO priority of the process that originally did the write. This is a somewhat unfortunate limitation since often it's write IO that is the slowest thing and the source of the largest impacts on overall performance.
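
As a reminder of basic ionice usage (the class, level, and PID here are arbitrary examples):

# run a command in the best-effort class at the lowest priority level
ionice -c 2 -n 7 some-bulk-command

# move an already-running process into the idle class
ionice -c 3 -p 12345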

IO priorities are only effective with some Linux kernel IO schedulers, such as BFQ. For obvious reasons they aren't effective with the 'none' scheduler, which is also the default scheduler for NVMe drives. I'm (still) unable to tell if IO priorities work if you're using software RAID instead of sitting your (supported) filesystem directly on top of a SATA, SAS, or NVMe disk. I believe that IO priorities are unlikely to work with ZFS, partly because ZFS often issues read IOs through its own kernel threads instead of directly from your process and those kernel threads probably aren't trying to copy around IO priorities.
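
You can see (and change) the active scheduler for a disk through sysfs; a sketch with a hypothetical device name:

# the scheduler in [brackets] is the active one
cat /sys/block/sda/queue/scheduler

# switch to BFQ (if it's available in your kernel) so IO priorities have an effect
echo bfq >/sys/block/sda/queue/scheduler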

Even if they pass through software RAID, IO priorities apply at the level of disk devices (of course). This means that each side of a software RAID mirror will do IO priorities only 'locally', for IO issued to it, and I don't believe there will be any global priorities for read IO to the overall software RAID mirror. I don't know if this will matter in practice. Since IO priorities only apply to disks, they obviously don't apply (on the NFS client) to NFS read IO. Similarly, IO priorities don't apply to data read from the kernel's buffer/page caches, since this data is already in RAM and doesn't need to be read from disk. This can give you an ionice'd program that is still 'reading' lots of data (and that data will be less likely to be evicted from kernel caches).

Since we mostly use some combination of software RAID, ZFS, and NFS, I don't think ionice and IO priorities are likely to be of much use for us. If we want to limit the impact a program's IO has on the rest of the system, we need different measures.

IoniceNotesII written at 23:03:23;

2024-03-13

Restarting systemd-networkd normally clears your 'ip rules' routing policies

Here's something that I learned recently: if systemd-networkd restarts, for example because of a package update for it that includes an automatic daemon restart, it will clear your 'ip rules' routing policies (and also I think your routing table, although you may not notice that much). If you've set up policy based routing of your own (or some program has done that as part of its operation), this may produce unpleasant surprises.

Systemd-networkd does this fundamentally because you can set ip routing policies in .network files. When networkd is restarted, one of the things it does is re-set-up whatever routing policies you specified; if you didn't specify any, it clears them. This is a reasonably sensible decision, both to deal with changes from previously specified routing policies and to also give people a way to clean out their experiments and reset to a known good base state. Similar logic applies to routes.

This can be controlled through networkd.conf and its drop-in files, by setting ManageForeignRoutingPolicyRules=no and perhaps ManageForeignRoutes=no. Without testing it through a networkd restart, I believe that the settings I want are:

[Network]
ManageForeignRoutingPolicyRules=no
ManageForeignRoutes=no
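
A sketch of putting this in place as a drop-in file (per networkd.conf(5), drop-ins for networkd.conf live in /etc/systemd/networkd.conf.d/); the file name is arbitrary and, as mentioned, I haven't verified the effect through a networkd restart:

mkdir -p /etc/systemd/networkd.conf.d
cat >/etc/systemd/networkd.conf.d/10-keep-foreign.conf <<'EOF'
[Network]
ManageForeignRoutingPolicyRules=no
ManageForeignRoutes=no
EOF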

The minor downside of this for me is that certain sorts of route updates will have to be done by hand, instead of by updating .network files and then restarting networkd.

While having an option to do this sort of clearing is sensible, I am dubious about the current default. In practice, coherently specifying routing policies through .network files is so much of a pain that I suspect few people do it that way; instead I suspect that most people either script issuing the 'ip rule' commands themselves (as I do) or use software that does it for them (and I know that such software exists). It would be great if networkd could create and manage high level policies for you (such as isolated interfaces), but the current approach is both verbose and limited in what you can do with it.
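
The sort of scripting I mean is a few lines along these lines, with made-up addresses and an arbitrary table number:

# send traffic from this address out via a second interface
ip rule add from 192.0.2.10 lookup 100
ip route add default via 192.0.2.1 dev eth1 table 100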

(As far as I know, networkd can't express rules for networks that can be brought up and torn down, because it's not an event-based system where you can have it react to the appearance of an interface or a configured network. It's possible I'm wrong, but if so it doesn't feel well documented.)

All of this is especially unfortunate on Ubuntu servers, which normally configure their networking through netplan. Netplan will more or less silently use networkd as the backend to actually implement what you wrote in your Netplan configuration, leaving you exposed to this, and on top of that Netplan itself has limitations on what routing policies you can express (pushing you even more towards running 'ip rule' yourself).

SystemdNetworkdResetsIpRules written at 22:18:11;

2024-03-10

Scheduling latency, IO latency, and their role in Linux responsiveness

One of the things that I do on my desktops and our servers is collect metrics that I hope will let me assess how responsive our systems are when people are trying to do things on them. For a long time I've been collecting disk IO latency histograms, and recently I've been collecting runqueue latency histograms (using the eBPF exporter and a modified version of libbpf/tools/runqlat.bpf.c). This has caused me to think about the various sorts of latency that affect responsiveness and how I can measure them.

Run queue latency is the latency between when a task becomes able to run (or when it gets preempted in the middle of running) and when it actually runs. This latency is effectively a floor on the system's responsiveness and is primarily affected by CPU contention, since the major reason tasks have to wait to run is other tasks using the CPU. For obvious reasons, high(er) run queue latency is related to CPU pressure stalls, but a histogram can show you more information than an aggregate number. I expect run queue latency to be what matters most for a lot of programs that mostly talk to things over some network (including talking to other programs on the same machine) and perhaps spend some of their time burning CPU furiously. If your web browser can't get its rendering process running promptly after the HTML comes in, or if it gets preempted while running all of that Javascript, this will show up in run queue latency. The same is true for your window manager, which is probably not doing much IO.

Disk IO latency is the lowest level indicator of things having to wait on IO; it sets a lower bound on how little latency processes doing IO can have (assuming that they do actual disk IO). However, direct disk IO is only one level of the Linux IO system, and the Linux IO system sits underneath filesystems. What actually matters for responsiveness and latency is generally how long user-level filesystem operations take. In an environment with sophisticated, multi-level filesystems that have complex internal behavior (such as ZFS and its ZIL), the actual disk IO time may only be a small portion of the user-level timing, especially for things like fsync().

(Some user-level operations may also not do any disk IO at all before they return from the kernel (for example). A read() might be satisfied from the kernel's caches, and a write() might simply copy the data into the kernel and schedule disk IO later. This is where histograms and related measurements become much more useful than averages.)

Measuring user level filesystem latency can be done through eBPF, to at least some degree; libbpf-tools/vfsstat.bpf.c hooks a number of kernel vfs_* functions in order to just count them, and you could convert this into some sort of histogram. Doing this on a 'per filesystem mount' basis is probably going to be rather harder. On the positive side for us, hooking the vfs_* functions does cover the activity an NFS server does for NFS clients as well as truly local user level activity. Because there are a number of systems where we really do care about the latency that people experience and want to monitor it, I'll probably build some kind of vfs operation latency histogram eBPF exporter program, although most likely only for selected VFS operations (since there are a lot of them).
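
In the meantime, a rough version of this can be had with a bpftrace one-liner along these lines; this is a sketch that lumps all filesystems (and the NFS server's activity) together, covers only vfs_read() and vfs_write(), and is not the eventual exporter program:

bpftrace -e '
kprobe:vfs_read,kprobe:vfs_write { @start[tid] = nsecs; }
kretprobe:vfs_read,kretprobe:vfs_write /@start[tid]/ {
  @usecs[probe] = hist((nsecs - @start[tid]) / 1000);
  delete(@start[tid]);
}'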

I think that the straightforward way of measuring user level IO latency (by tracking the time between entering and exiting a top level vfs_* function) will wind up including run queue latency as well. You will get, basically, the time it takes to prepare and submit the IO inside the kernel, the time spent waiting for it, and then after the IO completes the time the task spends waiting inside the kernel before it's able to run.

Because of how Linux defines iowait, the higher your iowait numbers are, the lower the run queue latency portion of the total time will be, because iowait only happens on idle CPUs and idle CPUs are immediately available to run tasks when their IO completes. You may want to look at io pressure stall information for a more accurate picture of when things are blocked on IO.

A complication of measuring user level IO latency is that not all user visible IO happens through read() and write(). Some of it happens through accessing mmap()'d objects, and under memory pressure some of it will be in the kernel paging things back in from wherever they wound up. I don't know if there's any particularly easy way to hook into this activity.

SystemResponseLatencyMetrics written at 23:31:46;

2024-03-07

Some notes about the Cloudflare eBPF Prometheus exporter for Linux

I've been a fan of the Cloudflare eBPF Prometheus exporter for some time, ever since I saw their example of per-disk IO latency histograms. And the general idea is extremely appealing; you can gather a lot of information with eBPF (usually from the kernel), and the ability to turn it into metrics is potentially quite powerful. However, actually using it has always been a bit arcane, especially if you were stepping outside the bounds of Cloudflare's canned examples. So here's some notes on the current version (which is more or less v2.4.0 as I write this), written in part for me in the future when I want to fiddle with eBPF-created metrics again.

If you build the ebpf_exporter yourself, you want to use their provided Makefile rather than try to do it directly. This Makefile will give you the choice to build a 'static' binary or a dynamic one (with 'make build-dynamic'); the static is the default. I put 'static' into quotes because of the glibc NSS problem; if you're on a glibc-using Linux, your static binary will still depend on your version of glibc. However, it will contain a statically linked libbpf, which will make your life easier. Unfortunately, building a static version is impossible on some Linux distributions, such as Fedora, because Fedora just doesn't provide static versions of some required libraries (as far as I can tell, libelf.a). If you have to build a dynamic executable, a normal ebpf_exporter build will depend on the libbpf shared library you can find in libbpf/dest/usr/lib. You'll need to set a LD_LIBRARY_PATH to find this copy of libbpf.so at runtime.

(You can try building with the system libbpf, but it may not be recent enough for ebpf_exporter.)

To get metrics from eBPF with ebpf_exporter, you need an eBPF program that collects the metrics and then a YAML configuration that tells ebpf_exporter how to handle what the eBPF program provides. The original version of ebpf_exporter had you specify eBPF programs in text in your (YAML) configuration file and then compiled them when it started. This approach has fallen out of favour, so now eBPF programs must be pre-compiled to special .o files that are loaded at runtime. I believe these .o files are relatively portable across systems; I've used ones built on Fedora 39 on Ubuntu 22.04. The simplest way to build either a provided example or your own one is to put it in the examples directory and then do 'make <name>.bpf.o'. Running 'make' in the examples directory will build all of the standard examples.

To run an eBPF program or programs, you copy their <name>.bpf.o and <name>.yaml to a configuration directory of your choice, specify this directory in the ebpf_exporter '--config.dir' argument, and then use '--config.names=<name>,<name2>,...' to say what programs to run. The suffixes of the YAML configuration file and the eBPF object file are always fixed.
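
Putting that together, a minimal run looks something like this (the configuration directory is an arbitrary choice and <name> is whichever program you built):

mkdir -p /etc/ebpf_exporter
cp examples/<name>.bpf.o examples/<name>.yaml /etc/ebpf_exporter/
./ebpf_exporter --config.dir=/etc/ebpf_exporter --config.names=<name>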

The repository has some documentation on the YAML (and eBPF) that you have to write to get metrics. However, it is probably not sufficient to explain how to modify the examples or especially to write new ones. If you're doing this (for example, to revive an old example that was removed when the exporter moved to the current pre-compiled approach), you really want to read over existing examples and then copy their general structure more or less exactly. This is especially important because the main ebpf_exporter contains some special handling for at least histograms that assumes things are being done as in their examples. When reading examples, it helps to know that Cloudflare has a bunch of helpers that are in various header files in the examples directory. You want to use these helpers, not the normal, standard bpf helpers.

(However, although not documented in bpf-helpers(7), '__sync_fetch_and_add()' is a standard eBPF thing. It is not so much documented as mentioned in some kernel BPF documentation on arrays and maps and in bpf(2).)

One source of (e)BPF code to copy from that is generally similar to what you'll write for ebpf_exporter is bcc/libbpf-tools (in the <name>.bpf.c files). An eBPF program like runqlat.bpf.c will need restructuring to be used as an ebpf_exporter program, but it will show you what you can hook into with eBPF and how. Often these examples will be more elaborate than you need for ebpf_exporter, with more options and the ability to narrowly select things; you can take all of that out.

(When setting up things like the number of histogram slots, be careful to copy exactly what the examples do in both your .bpf.c and in your YAML, mysterious '+ 1's and all.)

EbpfExporterNotes written at 23:01:56;

2024-03-06

Where and how Ubuntu kernels get their ZFS modules

One of the interesting and convenient things about Ubuntu for people like us is that they provide pre-built and integrated ZFS kernel modules in their mainline kernels. If you want ZFS on your (our) ZFS fileservers, you don't have to add any extra PPA repositories or install any extra kernel module packages; it's just there. However, this leaves us with a little mystery, which is how the ZFS modules actually get there. The reason this is a mystery is that the ZFS modules are not in the Ubuntu kernel source, or at least not in the package source.

(One reason this matters is that you may want to see what patches Ubuntu has applied to their version of ZFS, because Ubuntu periodically backports patches to specific issues from upstream OpenZFS. If you go try to find ZFS patches, ZFS code, or a ZFS changelog in the regular Ubuntu kernel source, you will likely fail, and this will not be what you want.)

Ubuntu kernels are normally signed in order to work with Secure Boot. If you use 'apt source ...' on a signed kernel, what you get is not the kernel source but a 'source' that fetches specific unsigned kernels and does magic to sign them and generate new signed binary packages. To actually get the kernel source, you need to follow the directions in Build Your Own Kernel to get the source of the unsigned kernel package. However, as mentioned this kernel source does not include ZFS.

(You may be tempted to fetch the Git repository following the directions in Obtaining the kernel sources using git, but in my experience this may well leave you hunting around in confusion trying to find the branch that actually corresponds to even the current kernel for an Ubuntu release. Even if you have the Git repository cloned, downloading the source package can be easier.)

How ZFS modules get into the built Ubuntu kernel is that during the package build process, the Ubuntu kernel build downloads or copies a specific zfs-dkms package version and includes it in the tree that kernel modules are built from, which winds up including the built ZFS kernel modules in the binary kernel packages. Exactly what version of zfs-dkms will be included is specified in debian/dkms-versions, although good luck finding an accurate version of that file in the Git repository on any predictable branch or in any predictable location.
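
If you have the unsigned kernel source package unpacked, a crude but effective way to see which zfs-dkms version it will pull in is to just look for it in that file:

grep zfs debian/dkms-versions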

(The zfs-dkms package itself is the DKMS version of kernel ZFS modules, which means that it packages the source code of the modules along with directions for how DKMS should (re)build the binary kernel modules from the source.)

This means that if you want to know what specific version of the ZFS code is included in any particular Ubuntu kernel and what changed in it, you need to look at the source package for zfs-dkms, which is called zfs-linux and has its Git repository here. Don't ask me how the branches and tags in the Git repository are managed and how they correspond to released package versions. My current view is that I will be downloading specific zfs-linux source packages as needed (using 'apt source zfs-linux').

The zfs-linux source package is also used to build the zfsutils-linux binary package, which has the user space ZFS tools and libraries. You might ask if there is anything that makes zfsutils-linux versions stay in sync with the zfs-dkms versions included in Ubuntu kernels. The answer, as far as I can see, is no. Ubuntu is free to release new versions of zfsutils-linux and thus zfs-linux without updating the kernel's dkms-versions file to use the matching zfs-dkms version. Sufficiently cautious people may want to specifically install a matching version of zfsutils-linux and then hold the package.
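
A sketch of doing that with apt, where the version is a placeholder for whatever matching version you've determined:

apt install zfsutils-linux=<matching version>
apt-mark hold zfsutils-linux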

I was going to write something about how you get the ZFS source for a particular kernel version, but it turns out that there is no straightforward way. Contrary to what the Ubuntu documentation suggests, if you do 'apt source linux-image-unsigned-$(uname -r)', you don't get the source package for that kernel version, you get the source package for the current version of the 'linux' kernel package, at whatever is the latest released version. Similarly, while you can inspect that source to see what zfs-dkms version it was built with, 'apt source zfs-dkms' will only give you (easy) access to the current version of the zfs-linux source package. If you ask for an older version, apt will probably tell you it can't find it.

(Presumably Ubuntu has old source packages somewhere, but I don't know where.)

UbuntuKernelsZFSWhereFrom written at 22:59:21;

