2024-03-18
Sorting out PIDs, Tgids, and tasks on Linux
In the beginning, Unix only had processes and processes had process IDs (PIDs), and life was simple. Then people added (kernel-supported) threads, so processes could be multi-threaded. When you add threads, you need to give them some user-visible identifier. There are many options for what this identifier is and how it works (and how threads themselves work inside the kernel). The choice Linux made was that threads were just processes (that shared more than usual with other processes), and so their identifier was a process ID, allocated from the same global space of process IDs as regular independent processes. This has created some ambiguity in what programs and other tools mean by 'process ID' (including for me).
The true name for what used to be a 'process ID', which is to say the PID of the overall entity that is 'a process with all its threads', is a TGID (Thread or Task Group ID). The TGID of a process is the PID of the main thread; a single-threaded program will have a TGID that is the same as its PID. You can see this in the 'Tgid:' and 'Pid:' fields of /proc/<PID>/status. Although some places will talk about 'pids' as separate from 'tids' (eg some parts of proc(5)), the two types are both allocated from the same range of numbers because they're both 'PIDs'. If I just give you a 'PID' with no further detail, there's no way to know if it's a process's PID or a task's PID.
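As a quick illustration, here's a little Python sketch (mine, nothing standard) that reads those two fields out of a /proc/<PID>/status file, defaulting to /proc/self; for a process's main thread the two numbers match, for any other thread they won't:

#!/usr/bin/python3
# Print the Pid: and Tgid: fields of /proc/<PID>/status.
import sys

def pid_and_tgid(pid="self"):
    fields = {}
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            key, _, value = line.partition(":")
            if key in ("Pid", "Tgid"):
                fields[key] = int(value)
    return fields

print(pid_and_tgid(sys.argv[1] if len(sys.argv) > 1 else "self"))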
In every /proc/<PID> directory, there is a 'task' subdirectory; this contains the PIDs of all tasks (threads) that are part of the thread group (ie, have the same TGID). All PIDs have a /proc/<PID> directory, but for convenience things like 'ls /proc' only list the PIDs of processes (which you can think of as TGIDs). The /proc/<PID> directories for other tasks aren't returned by the kernel when you ask for the directory contents of /proc, although you can use them if you access them directly (and you can also access or discover them through /proc/<PID>/task). I'm not sure what information in the /proc/<PID> directories for tasks is specific to the task itself and what is a total across all tasks in the TGID. The proc(5) manual page sometimes talks about processes and sometimes about tasks, but I'm not sure that's comprehensive.
(Much of the time when you're looking at what is actually a TGID, you want the total information across all threads in the TGID. If /proc/<PID> always gave you only task information even for the 'process' PID/TGID, multi-threaded programs could report confusingly low numbers for things like CPU usage unless you went out of your way to sum the /proc/<PID>/task/* information yourself.)
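Here's a small Python sketch of that sort of by-hand summing, adding up the utime and stime tick counts from /proc/<PID>/task/*/stat across all of a process's tasks:

#!/usr/bin/python3
# Sum CPU time (utime + stime, in clock ticks) across every task of a
# process by walking its /proc/<PID>/task directory.
import os, sys

def total_cpu_ticks(pid):
    total = 0
    for tid in os.listdir(f"/proc/{pid}/task"):
        with open(f"/proc/{pid}/task/{tid}/stat") as f:
            # The comm field can contain spaces, so split after the closing ')'.
            rest = f.read().rsplit(")", 1)[1].split()
            total += int(rest[11]) + int(rest[12])   # utime and stime
    return total

print(total_cpu_ticks(sys.argv[1]))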
Various tools will normally return the PID (TGID) of the overall process, not the PID of a random task in a multi-threaded process. For example 'pidof <thing>' behaves this way. Depending on how the specific process works, this may or may not be the 'main thread' of the program (some multi-threaded programs more or less park their initial thread and do their main work on another one created later), and the program may not even have such a thing (I believe Go programs mostly don't, as they multiplex goroutines on to actual threads as needed).
If a tool or system offers you the choice to work on or with a 'PID' or a 'TGID', you are being given the choice to work with a single thread (task) or the overall process. Which one you want depends on what you're doing, but if you're doing things like asking for task delay information, using the TGID may better correspond to what you expect (since it will be the overall information for the entire process, not information for a specific thread). If a program only talks about PIDs, it's probably going to operate on or give you information about the entire process by default, although if you give it the PID of a task within the process (instead of the PID that is the TGID), you may get things specific to that task.
In a kernel context such as eBPF programs, I think you'll almost always want to track things by PID, not TGID. It is PIDs that do things like experience run queue scheduling latency, make system calls, and incur block IO delays, not TGIDs. However, if you're selecting what to report on, monitor, and so on, you'll most likely want to match on the TGID, not the PID, so that you report on all of the tasks in a multi-threaded program, not just one of them (unless you're specifically looking at tasks/threads, not 'a process').
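In eBPF code, the usual way to get both numbers at once is bpf_get_current_pid_tgid(), which returns a 64-bit value with the TGID in the upper 32 bits and the task's PID in the lower 32 bits. Here's a tiny Python illustration of that packing convention, just to make the layout concrete:

def split_pid_tgid(pid_tgid):
    # Upper 32 bits are the TGID ('the process'), lower 32 bits are the
    # task's PID (what user space tools often call the TID).
    return pid_tgid >> 32, pid_tgid & 0xFFFFFFFF

# A single-threaded process with PID 1234 has both halves equal.
assert split_pid_tgid((1234 << 32) | 1234) == (1234, 1234)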
(I'm writing this down partly to get it clear in my head, since I had some confusion recently when working with eBPF programs.)
2024-03-16
Some more notes on Linux's ionice and kernel IO priorities
In the long ago past, Linux gained some support for block IO priorities, with some limitations that I noticed the first time I looked into this. These days the Linux kernel has support for more ways of scheduling and limiting IO, for example in cgroups v2 and its IO controller. However ionice is still there, and now I want to note some more things, since I just looked at ionice again (for reasons outside the scope of this entry).
First, ionice and the IO priorities it sets apply only to read IO and synchronous write IO, per ioprio_set(2) (the underlying system call that ionice uses to set priorities). This is reasonable, since IO priorities are attached to processes, and asynchronous write IO is generally issued by completely different kernel tasks, in situations where the urgency of doing the write is unrelated to the IO priority of the process that originally did the write. It's still a somewhat unfortunate limitation, since write IO is often the slowest thing and the source of the largest impacts on overall performance.
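To make 'the underlying system call' concrete, here's a minimal Python sketch of the kind of ioprio_set(2) call that ionice makes, done through ctypes. The constants come from linux/ioprio.h and the syscall number is the x86_64 one, so treat this as an illustration rather than portable code:

import ctypes, os

libc = ctypes.CDLL(None, use_errno=True)

SYS_ioprio_set = 251        # x86_64 only
IOPRIO_WHO_PROCESS = 1
IOPRIO_CLASS_BE = 2         # the 'best effort' class, ionice -c2
IOPRIO_CLASS_SHIFT = 13

def set_best_effort_ioprio(pid, level):
    # Pack the class and the per-class level into one value, as ioprio_set(2) expects.
    ioprio = (IOPRIO_CLASS_BE << IOPRIO_CLASS_SHIFT) | level
    if libc.syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, pid, ioprio) < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))

# De-prioritize our own read and synchronous write IO, like 'ionice -c2 -n7 -p $$'.
set_best_effort_ioprio(os.getpid(), 7)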
IO priorities are only effective with some Linux kernel IO schedulers, such as BFQ. For obvious reasons they aren't effective with the 'none' scheduler, which is also the default scheduler for NVMe drives. I'm (still) unable to tell if IO priorities work if you're using software RAID instead of sitting your (supported) filesystem directly on top of a SATA, SAS, or NVMe disk. I believe that IO priorities are unlikely to work with ZFS, partly because ZFS often issues read IOs through its own kernel threads instead of directly from your process and those kernel threads probably aren't trying to copy around IO priorities.
Even if they pass through software RAID, IO priorities apply at the level of disk devices (of course). This means that each side of a software RAID mirror will do IO priorities only 'locally', for IO issued to it, and I don't believe there will be any global priorities for read IO to the overall software RAID mirror. I don't know if this will matter in practice. Since IO priorities only apply to disks, they obviously don't apply (on the NFS client) to NFS read IO. Similarly, IO priorities don't apply to data read from the kernel's buffer/page caches, since this data is already in RAM and doesn't need to be read from disk. This can give you an ionice'd program that is still 'reading' lots of data (and that data will be less likely to be evicted from kernel caches).
Since we mostly use some combination of software RAID, ZFS, and NFS, I don't think ionice and IO priorities are likely to be of much use for us. If we want to limit the impact a program's IO has on the rest of the system, we need different measures.
2024-03-13
Restarting systemd-networkd normally clears your 'ip rules' routing policies
Here's something that I learned recently: if systemd-networkd restarts, for example because of a package update for it that includes an automatic daemon restart, it will clear your 'ip rules' routing policies (and also I think your routing table, although you may not notice that much). If you've set up policy based routing of your own (or some program has done that as part of its operation), this may produce unpleasant surprises.
Systemd-networkd does this fundamentally because you can set ip routing policies in .network files. When networkd is restarted, one of the things it does is re-set-up whatever routing policies you specified; if you didn't specify any, it clears them. This is a reasonably sensible decision, both to deal with changes from previously specified routing policies and to also give people a way to clean out their experiments and reset to a known good base state. Similar logic applies to routes.
This can be controlled through networkd.conf and its drop-in files, by setting ManageForeignRoutingPolicyRules=no and perhaps ManageForeignRoutes=no. Without testing it through a networkd restart, I believe that the settings I want are:
[Network]
ManageForeignRoutingPolicyRules=no
ManageForeignRoutes=no
The minor downside of this for me is that certain sorts of route updates will have to be done by hand, instead of by updating .network files and then restarting networkd.
While having an option to do this sort of clearing is sensible, I am dubious about the current default. In practice, coherently specifying routing policies through .network files is so much of a pain that I suspect that few people do it that way; instead I suspect that most people either script it to issue the 'ip rule' commands (as I do) or use software that does it for them (and I know that such software exists). It would be great if networkd could create and manage high level policies for you (such as isolated interfaces), but the current approach is both verbose and limited in what you can do with it.
(As far as I know, networkd can't express rules for networks that can be brought up and torn down, because it's not an event-based system where you can have it react to the appearance of an interface or a configured network. It's possible I'm wrong, but if so it doesn't feel well documented.)
All of this is especially unfortunate on Ubuntu servers, which normally configure their networking through netplan. Netplan will more or less silently use networkd as the backend to actually implement what you wrote in your Netplan configuration, leaving you exposed to this, and on top of that Netplan itself has limitations on what routing policies you can express (pushing you even more towards running 'ip rule' yourself).
2024-03-10
Scheduling latency, IO latency, and their role in Linux responsiveness
One of the things that I do on my desktops and our servers is collect metrics that I hope will let me assess how responsive our systems are when people are trying to do things on them. For a long time I've been collecting disk IO latency histograms, and recently I've been collecting runqueue latency histograms (using the eBPF exporter and a modified version of libbpf-tools/runqlat.bpf.c). This has caused me to think about the various sorts of latency that affect responsiveness and how I can measure them.
Run queue latency is the latency between when a task becomes able to run (or when it got preempted in the middle of running) and when it actually does run. This latency is effectively the minimum (lack of) response from the system, and it's primarily affected by CPU contention, since the major reason tasks have to wait to run is other tasks using the CPU. For obvious reasons, high(er) run queue latency is related to CPU pressure stalls, but a histogram can show you more information than an aggregate number. I expect run queue latency to be what matters most for a lot of programs, which mostly talk to things over some network (including talking to other programs on the same machine) and perhaps spend some of their time burning CPU furiously. If your web browser can't get its rendering process running promptly after the HTML comes in, or if it gets preempted while running all of that Javascript, this will show up in run queue latency. The same is true for your window manager, which is probably not doing much IO.
Disk IO latency is the lowest level indicator of things having to wait on IO; it sets a lower bound on how little latency processes doing IO can have (assuming that they do actual disk IO). However, direct disk IO is only one level of the Linux IO system, and the Linux IO system sits underneath filesystems. What actually matters for responsiveness and latency is generally how long user-level filesystem operations take. In an environment with sophisticated, multi-level filesystems that have complex internal behavior (such as ZFS and its ZIL), the actual disk IO time may only be a small portion of the user-level timing, especially for things like fsync().
(Some user-level operations may also not do any disk IO at all before they return from the kernel. For example, a read() might be satisfied from the kernel's caches, and a write() might simply copy the data into the kernel and schedule disk IO later. This is where histograms and related measurements become much more useful than averages.)
Measuring user level filesystem latency can be done through eBPF, to at least some degree; libbpf-tools/vfsstat.bpf.c hooks a number of kernel vfs_* functions in order to just count them, and you could convert this into some sort of histogram. Doing this on a 'per filesystem mount' basis is probably going to be rather harder. On the positive side for us, hooking the vfs_* functions does cover the activity a NFS server does for NFS clients as well as truly local user level activity. Because there are a number of systems where we really do care about the latency that people experience and want to monitor it, I'll probably build some kind of vfs operation latency histogram eBPF exporter program, although most likely only for selected VFS operations (since there are a lot of them).
I think that the straightforward way of measuring user level IO latency (by tracking the time between entering and exiting a top level vfs_* function) will wind up including run queue latency as well. You will get, basically, the time it takes to prepare and submit the IO inside the kernel, the time spent waiting for it, and then after the IO completes the time the task spends waiting inside the kernel before it's able to run.
Because of how Linux defines iowait, the higher your iowait numbers are, the lower the run queue latency portion of the total time will be, because iowait only happens on idle CPUs and idle CPUs are immediately available to run tasks when their IO completes. You may want to look at IO pressure stall information for a more accurate picture of when things are blocked on IO.
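Pressure stall information is exposed in /proc/pressure/, so it's easy to read programmatically. Here's a small Python sketch that prints the 10-second 'some' and 'full' averages for CPU and IO, assuming your kernel has PSI enabled:

#!/usr/bin/python3
# Print the avg10 pressure stall numbers for CPU and IO from /proc/pressure/.
def read_psi(resource):
    averages = {}
    with open(f"/proc/pressure/{resource}") as f:
        for line in f:
            kind, rest = line.split(None, 1)
            fields = dict(item.split("=") for item in rest.split())
            averages[kind] = float(fields["avg10"])
    return averages

for resource in ("cpu", "io"):
    print(resource, read_psi(resource))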
A complication of measuring user level IO latency is that not all user visible IO happens through read() and write(). Some of it happens through accessing mmap()'d objects, and under memory pressure some of it will happen in the kernel, paging things back in from wherever they wound up. I don't know if there's any particularly easy way to hook into this activity.
2024-03-07
Some notes about the Cloudflare eBPF Prometheus exporter for Linux
I've been a fan of the Cloudflare eBPF Prometheus exporter for some time, ever since I saw their example of per-disk IO latency histograms. And the general idea is extremely appealing; you can gather a lot of information with eBPF (usually from the kernel), and the ability to turn it into metrics is potentially quite powerful. However, actually using it has always been a bit arcane, especially if you were stepping outside the bounds of Cloudflare's canned examples. So here's some notes on the current version (which is more or less v2.4.0 as I write this), written in part for me in the future when I want to fiddle with eBPF-created metrics again.
If you build the ebpf_exporter yourself, you want to use their provided Makefile rather than trying to do it directly. This Makefile will give you the choice to build a 'static' binary or a dynamic one (with 'make build-dynamic'); static is the default. I put 'static' into quotes because of the glibc NSS problem; if you're on a glibc-using Linux, your static binary will still depend on your version of glibc. However, it will contain a statically linked libbpf, which will make your life easier. Unfortunately, building a static version is impossible on some Linux distributions, such as Fedora, because Fedora just doesn't provide static versions of some required libraries (as far as I can tell, libelf.a). If you have to build a dynamic executable, a normal ebpf_exporter build will depend on the libbpf shared library you can find in libbpf/dest/usr/lib. You'll need to set LD_LIBRARY_PATH so that this copy of libbpf.so can be found at runtime.
(You can try building with the system libbpf, but it may not be recent enough for ebpf_exporter.)
To get metrics from eBPF with ebpf_exporter, you need an eBPF program that collects the metrics and then a YAML configuration that tells ebpf_exporter how to handle what the eBPF program provides. The original version of ebpf_exporter had you specify eBPF programs in text in your (YAML) configuration file and then compiled them when it started. This approach has fallen out of favour, so now eBPF programs must be pre-compiled to special .o files that are loaded at runtime. I believe these .o files are relatively portable across systems; I've used ones built on Fedora 39 on Ubuntu 22.04. The simplest way to build either a provided example or your own one is to put it in the examples directory and then do 'make <name>.bpf.o'. Running 'make' in the examples directory will build all of the standard examples.
To run an eBPF program or programs, you copy their <name>.bpf.o and <name>.yaml to a configuration directory of your choice, specify this directory in the ebpf_exporter '--config.dir' argument, and then use '--config.names=<name>,<name2>,...' to say what programs to run. The suffixes of the YAML configuration file and the eBPF object file are always fixed.
The repository has some documentation on the YAML (and eBPF) that you have to write to get metrics. However, it is probably not sufficient to explain how to modify the examples or especially to write new ones. If you're doing this (for example, to revive an old example that was removed when the exporter moved to the current pre-compiled approach), you really want to read over existing examples and then copy their general structure more or less exactly. This is especially important because the main ebpf_exporter contains some special handling for at least histograms that assumes things are being done as in their examples. When reading examples, it helps to know that Cloudflare has a bunch of helpers that are in various header files in the examples directory. You want to use these helpers, not the normal, standard bpf helpers.
(However, although not documented in bpf-helpers(7), '__sync_fetch_and_add()' is a standard eBPF thing. It is not so much documented as mentioned in some kernel BPF documentation on arrays and maps and in bpf(2).)
One source of (e)BPF code to copy from that is generally similar to what you'll write for ebpf_exporter is bcc/libbpf-tools (in the <name>.bpf.c files). An eBPF program like runqlat.bpf.c will need restructuring to be used as an ebpf_exporter program, but it will show you what you can hook into with eBPF and how. Often these examples will be more elaborate than you need for ebpf_exporter, with more options and the ability to narrowly select things; you can take all of that out.
(When setting up things like the number of histogram slots, be careful to copy exactly what the examples do in both your .bpf.c and in your YAML, mysterious '+ 1's and all.)
2024-03-06
Where and how Ubuntu kernels get their ZFS modules
One of the interesting and convenient things about Ubuntu for people like us is that they provide pre-built and integrated ZFS kernel modules in their mainline kernels. If you want ZFS on your (our) ZFS fileservers, you don't have to add any extra PPA repositories or install any extra kernel module packages; it's just there. However, this leaves us with a little mystery, which is how the ZFS modules actually get there. The reason this is a mystery is that the ZFS modules are not in the Ubuntu kernel source, or at least not in the package source.
(One reason this matters is that you may want to see what patches Ubuntu has applied to their version of ZFS, because Ubuntu periodically backports patches to specific issues from upstream OpenZFS. If you go try to find ZFS patches, ZFS code, or a ZFS changelog in the regular Ubuntu kernel source, you will likely fail, and this will not be what you want.)
Ubuntu kernels are normally signed in order to work with Secure Boot. If you use 'apt source ...' on a signed kernel, what you get is not the kernel source but a 'source' that fetches specific unsigned kernels and does magic to sign them and generate new signed binary packages. To actually get the kernel source, you need to follow the directions in Build Your Own Kernel to get the source of the unsigned kernel package. However, as mentioned this kernel source does not include ZFS.
(You may be tempted to fetch the Git repository following the directions in Obtaining the kernel sources using git, but in my experience this may well leave you hunting around in confusion trying to find the branch that actually corresponds to even the current kernel for an Ubuntu release. Even if you have the Git repository cloned, downloading the source package can be easier.)
How ZFS modules get into the built Ubuntu kernel is that during the package build process, the Ubuntu kernel build downloads or copies a specific zfs-dkms package version and includes it in the tree that kernel modules are built from, which winds up including the built ZFS kernel modules in the binary kernel packages. Exactly what version of zfs-dkms will be included is specified in debian/dkms-versions, although good luck finding an accurate version of that file in the Git repository on any predictable branch or in any predictable location.
(The zfs-dkms package itself is the DKMS version of kernel ZFS modules, which means that it packages the source code of the modules along with directions for how DKMS should (re)build the binary kernel modules from the source.)
This means that if you want to know what specific version of the ZFS code is included in any particular Ubuntu kernel and what changed in it, you need to look at the source package for zfs-dkms, which is called zfs-linux and has its Git repository here. Don't ask me how the branches and tags in the Git repository are managed and how they correspond to released package versions. My current view is that I will be downloading specific zfs-linux source packages as needed (using 'apt source zfs-linux').
The zfs-linux source package is also used to build the zfsutils-linux binary package, which has the user space ZFS tools and libraries. You might ask if there is anything that makes zfsutils-linux versions stay in sync with the zfs-dkms versions included in Ubuntu kernels. The answer, as far as I can see, is no. Ubuntu is free to release new versions of zfsutils-linux and thus zfs-linux without updating the kernel's dkms-versions file to use the matching zfs-dkms version. Sufficiently cautious people may want to specifically install a matching version of zfsutils-linux and then hold the package.
I was going to write something about how you get the ZFS source for a particular kernel version, but it turns out that there is no straightforward way. Contrary to what the Ubuntu documentation suggests, if you do 'apt source linux-image-unsigned-$(uname -r)', you don't get the source package for that kernel version, you get the source package for the current version of the 'linux' kernel package, at whatever is the latest released version. Similarly, while you can inspect that source to see what zfs-dkms version it was built with, 'apt source zfs-dkms' will only give you (easy) access to the current version of the zfs-linux source package. If you ask for an older version, apt will probably tell you it can't find it.
(Presumably Ubuntu has old source packages somewhere, but I don't know where.)
2024-02-23
Fixing my problem of a stuck 'dnf updateinfo info' on Fedora Linux
I apply Fedora updates only by hand, and as part of this I like to look at what 'dnf updateinfo info' will tell me about why they're being done. For some time, there's been an issue on my work desktop where 'dnf updateinfo info' would report on updates that I'd already applied, often drowning out information about the updates that I hadn't. This was a bit frustrating, because my home Fedora machine didn't do this but I couldn't spot anything obviously wrong (and at various times I'd cleaned all of the DNF caches that I could find).
(Now that I look, it seems I've been having some variant of this problem for a while.)
Recently I took another shot at troubleshooting this. In the system programmer way, I started by locating the Python source code of the DNF updateinfo subcommand and reading it. This showed me a bunch of subcommand specific options that I could have discovered by reading 'dnf updateinfo --help' and led me to find 'dnf updateinfo list', which lists which RPM (or RPMs) a particular update will update. When I used 'dnf updateinfo list' and looked at the list of RPMs, something immediately jumped out at me, and it turned out to be the cause.
My 'dnf updateinfo info' problems were because I had old Fedora 37 'debugsource' RPMs still installed (on a machine now running Fedora 39).
The '-debuginfo' and '-debugsource' RPMs for a given RPM contain, respectively, the debugging symbol information and the source code that are used to allow better debugging (see Debuginfo packages and this change to create debugsource as well). I tend to wind up installing them if I'm trying to debug a crash in some standard packaged program, or sometimes code that heavily uses system libraries. Possibly these packages get automatically cleaned up if you update Fedora releases in one of the officially supported ways, but I do a live upgrade using DNF (following this Fedora documentation). Clearly, when I do such an upgrade, these packages are not removed or updated.
(It's possible that these packages are also not removed or updated within a specific Fedora release when you update their base packages, but since they were installed a long time ago I can't tell at this point.)
With these old debugsource packages hanging around, DNF appears to have reasonably seen more recent versions of them available and duly reported the information on the 'upgrade' (in practice the current version of the package) in 'dnf updateinfo info' when I asked for it. That the packages would not be updated if I did a 'dnf update' was not updateinfo's problem. Removing the debugsource packages eliminated this and now 'dnf updateinfo info' is properly only reporting actual pending updates.
('dnf updateinfo' has various options for what packages to select, but as covered in the updateinfo command documentation apparently they're mostly the same in practice.)
In the future I'm going to have to remember to remove all debugsource and debuginfo packages before upgrading Fedora releases. Possibly I should remove them after I'm done with whatever I installed them for. If I needed them again (in that Fedora release) I'd have to re-fetch them, but that's rare.
PS: In reading the documentation, I've discovered that it's really 'dnf updateinfo --info'; updateinfo just accepts 'info' (and 'list') as equivalent to the switches.
(This elaborates on a Fediverse post I made at the time.)
2024-02-21
What ZIL metrics are exposed by (Open)ZFS on Linux
The ZFS Intent Log (ZIL) is effectively ZFS's version of a filesystem journal, writing out hopefully brief records of filesystem activity to make them durable on disk before their full version is committed to the ZFS pool. What the ZIL is doing and how it's performing can be important for the latency (and thus responsiveness) of various operations on a ZFS filesystem, since operations like fsync() on an important file must wait for the ZIL to write out (commit) their information before they can return from the kernel. On Linux, OpenZFS exposes global information about the ZIL in /proc/spl/kstat/zfs/zil, but this information can be hard to interpret without some knowledge of ZIL internals.
(In OpenZFS 2.2 and later, each dataset also has per-dataset ZIL information in its kstat file, /proc/spl/kstat/zfs/<pool>/objset-0xXXX, for some hexadecimal '0xXXX'. There's no overall per-pool ZIL information the way there is a global one, but for most purposes you can sum up the ZIL information from all of the pool's datasets.)
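Here's a rough Python sketch of that sort of summing, adding up the zil_* fields from a pool's objset-0x* kstat files; it assumes OpenZFS 2.2 or later and the usual 'name type value' kstat data line layout:

#!/usr/bin/python3
# Sum the per-dataset ZIL kstats across all of a pool's objset-0x* files.
import glob, sys
from collections import Counter

def pool_zil_totals(pool):
    totals = Counter()
    for path in glob.glob(f"/proc/spl/kstat/zfs/{pool}/objset-0x*"):
        with open(path) as f:
            for line in f:
                parts = line.split()
                if len(parts) == 3 and parts[0].startswith("zil_") and parts[1] == "4":
                    totals[parts[0]] += int(parts[2])
    return totals

for name, value in sorted(pool_zil_totals(sys.argv[1]).items()):
    print(name, value)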
The basic background here is the flow of activity in the ZIL and also the comments in zil.h about the members of the zil_stats struct. The (ZIL) data you can find in the "zil" file (and the per-dataset kstats in OpenZFS 2.2 and later) is as follows:
- zil_commit_count counts how many times a ZIL commit has been requested through things like fsync().
- zil_commit_writer_count counts how many times the ZIL has actually committed. More than one commit request can be merged into the same ZIL commit, if two people fsync() more or less at the same time.
- zil_itx_count counts how many intent transactions (itxs) have been written as part of ZIL commits. Each separate operation (such as a write() or a file rename) gets its own separate transaction; these are aggregated together into log write blocks (lwbs) when a ZIL commit happens.
When ZFS needs to record file data into the ZIL, it has three options, which it calls 'indirect', 'copied', and 'needcopy' in ZIL metrics. Large enough amounts of file data are handled with an indirect write, which writes the data to its final location in the regular pool; the ZIL transaction only records its location, hence 'indirect'. In a copied write, the data is directly and immediately put in the ZIL transaction (itx), even before it's part of a ZIL commit; this is done if ZFS knows that the data is being written synchronously and it's not large enough to trigger an indirect write. In a needcopy write, the data just hangs around in RAM as part of ZFS's regular dirty data, and if a ZIL commit happens that needs that data, the process of adding its itx to the log write block will fetch the data from RAM and add it to the itx (or at least the lwb).
There are ZIL metrics about this:
- zil_itx_indirect_count and zil_itx_indirect_bytes count how many indirect writes have been part of ZIL commits, and the total size of the indirect writes of file data (not of the 'itx' records themselves, per the comments in zil.h). Since these are indirect writes, the data written is not part of the ZIL (it's regular data blocks), although it is put on disk as part of a ZIL commit. However, unlike other ZIL data, the data written here would have been written even without a ZIL commit, as part of ZFS's regular transaction group commit process. A ZIL commit merely writes it out earlier than it otherwise would have been.
- zil_itx_copied_count and zil_itx_copied_bytes count how many 'copied' writes have been part of ZIL commits and the total size of the file data written (and thus committed) this way.
- zil_itx_needcopy_count and zil_itx_needcopy_bytes count how many 'needcopy' writes have been part of ZIL commits and the total size of the file data written (and thus committed) this way.
A regular system using ZFS may have little or no 'copied' activity. Our NFS servers all have significant amounts of it, presumably because some NFS data writes are done synchronously and so this trickles through to the ZFS stats.
In a given pool, the ZIL can potentially be written to either the main pool's disks or to a separate log device (a slog, which can also be mirrored). The ZIL metrics have a collection of zil_itx_metaslab_* metrics about data actually written to the ZIL in either the main pool (the 'normal' metrics) or to a slog (the 'slog' metrics).
- zil_itx_metaslab_normal_count counts how many ZIL log write blocks (not ZIL records, itxs) have been committed to the ZIL in the main pool. There's a corresponding 'slog' version of this and all further zil_itx_metaslab metrics, with the same meaning.
- zil_itx_metaslab_normal_bytes counts how many bytes have been 'used' in ZIL log write blocks (for ZIL commits in the main pool). This is a rough representation of how much space the ZIL log actually needed, but it doesn't necessarily represent either the actual IO performed or the space allocated for ZIL commits. As I understand things, this size includes the size of the intent transaction records themselves and also the size of the associated data for 'copied' and 'needcopy' data writes (because these are written into the ZIL as part of ZIL commits, and so use space in log write blocks). It doesn't include the data written directly to the pool as 'indirect' data writes.
If you don't use a slog in any of your pools, the 'slog' versions of these metrics will all be zero. I think that if you have only slogs, the 'normal' versions of these metrics will all be zero.
In ZFS 2.2 and later, there are two additional statistics for both normal and slog ZIL commits:
- zil_itx_metaslab_normal_write counts how many bytes have actually been written in ZIL log write blocks. My understanding is that this includes padding and unused space at the end of a log write block that can't fit another record.
- zil_itx_metaslab_normal_alloc counts how many bytes of space have been 'allocated' for ZIL log write blocks, including any rounding up to block sizes, alignments, and so on. I think this may also be the logical size before any compression done as part of IO, although I'm not sure if ZIL log write blocks are compressed.
You can see some additional commentary on these new stats (and the code) in the pull request and the commit itself.
PS: OpenZFS 2.2 and later has a currently undocumented 'zilstat' command, and its 'zilstat -v' output may provide some guidance on what ratios of these metrics the ZFS developers consider interesting. In its current state it will only work on 2.2 and later because it requires the two new stats listed above.
Sidebar: Some typical numbers
Here is the "zil" file from my office desktop, which has been up for long enough to make it interesting:
zil_commit_count                4    13840
zil_commit_writer_count         4    13836
zil_itx_count                   4    252953
zil_itx_indirect_count          4    27663
zil_itx_indirect_bytes          4    2788726148
zil_itx_copied_count            4    0
zil_itx_copied_bytes            4    0
zil_itx_needcopy_count          4    174881
zil_itx_needcopy_bytes          4    471605248
zil_itx_metaslab_normal_count   4    15247
zil_itx_metaslab_normal_bytes   4    517022712
zil_itx_metaslab_normal_write   4    555958272
zil_itx_metaslab_normal_alloc   4    798543872
With these numbers we can see interesting things, such as that the average number of ZIL transactions per commit is about 18 and that my machine has never done any synchronous data writes.
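That sort of arithmetic is easy to script. Here's a small Python sketch that parses the "zil" kstat file (skipping its header lines by only keeping the three-field data lines) and computes a couple of these ratios:

#!/usr/bin/python3
# Parse /proc/spl/kstat/zfs/zil and report itxs per ZIL commit and the
# average size of 'needcopy' writes.
def read_zil_kstats(path="/proc/spl/kstat/zfs/zil"):
    stats = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            # Data lines look like: zil_commit_count  4  13840
            if len(parts) == 3 and parts[1] == "4":
                stats[parts[0]] = int(parts[2])
    return stats

z = read_zil_kstats()
if z["zil_commit_writer_count"]:
    print("itxs per commit:", z["zil_itx_count"] / z["zil_commit_writer_count"])
if z["zil_itx_needcopy_count"]:
    print("avg needcopy write:", z["zil_itx_needcopy_bytes"] / z["zil_itx_needcopy_count"])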
Here's an excerpt from one of our Ubuntu 22.04 ZFS fileservers:
zil_commit_count                4    155712298
zil_commit_writer_count         4    155500611
zil_itx_count                   4    200060221
zil_itx_indirect_count          4    60935526
zil_itx_indirect_bytes          4    7715170189188
zil_itx_copied_count            4    29870506
zil_itx_copied_bytes            4    74586588451
zil_itx_needcopy_count          4    1046737
zil_itx_needcopy_bytes          4    9042272696
zil_itx_metaslab_normal_count   4    126916250
zil_itx_metaslab_normal_bytes   4    136540509568
Here we can see the drastic impact of NFS synchronous writes (the significant 'copied' numbers), and also of large NFS writes in general (the high 'indirect' numbers). This machine has written many times more data in ZIL commits as 'indirect' writes as it has written to the actual ZIL.
2024-02-20
NetworkManager won't share network interfaces, which is a problem
Today I upgraded my home desktop to Fedora 39. It didn't entirely go well; specifically, my DSL connection broke because Fedora stopped packaging some scripts with rp-pppoe, and Fedora's old ifup, which my very old-fashioned setup still uses, requires those scripts. After I got back on the Internet, I decided to try an idea I'd toyed with, namely using NetworkManager to handle (only) my DSL link. Unfortunately this did not go well:
audit: op="connection-activate" uuid="[...]" name="[...]" pid=458524 uid=0 result="fail" reason="Connection '[...]' is not available on device em0 because device is strictly unmanaged"
The reason that em0 is 'unmanaged' by NetworkManager is that it's managed by systemd-networkd, which I like much better. Well, also I specifically told NetworkManager not to touch it by setting it as 'unmanaged' instead of 'managed'.
Although I haven't tested, I suspect that NetworkManager applies this restriction to all VPNs and other layered forms of networking, such that you can only run a NetworkManager managed VPN over a network interface that NetworkManager is controlling. I find this quite unfortunate. There is nothing that NetworkManager needs to change on the underlying Ethernet link to run PPPoE or a VPN over it; the network is a transport (a low level transport in the case of PPPoE).
I don't know if it's theoretically possible to configure NetworkManager so that an interface is 'managed' but NetworkManager doesn't touch it at all, so that systemd-networkd and other things could continue to use em0 while NetworkManager was willing to run PPPoE on top of it. Even if it's possible in theory, I don't have much confidence that it will be problem free in practice, either now or in the future, because fundamentally I'd be lying to NetworkManager and networkd. If NetworkManager really had a 'I will use this interface but not change its configuration' category, it would have a third option besides 'managed' and '(strictly) unmanaged'.
(My current solution is a hacked together script to start pppd and pppoe with magic options researched through extrace, and a systemd service that runs that script. I have assorted questions about how this is going to interact with various things, but someday I will get answers, or perhaps unpleasant surprises.)
PS: Where this may be a special problem someday is if I want to run a VPN over my DSL link. I can more or less handle running PPPoE by hand, but the last time I looked at a by hand OpenVPN setup I rapidly dropped the idea. NetworkManager is or would be quite handy for this sort of 'not always there and complex' networking, but it apparently needs to own the entire stack down to Ethernet.
(To run a NetworkManager VPN over 'ppp0', I would have to have NetworkManager manage it, which would presumably require I have NetworkManager handle the PPPoE DSL, which requires NetworkManager not considering em0 to be unmanaged. It's NetworkManager all the way down.)
2024-02-13
What is in (Open)ZFS's per-pool "txgs" /proc file on Linux
As part of (Open)ZFS's general 'kstats' system for reporting information about ZFS overall and your individual pools and datasets, there is a per-pool /proc file that reports information about the most recent N transaction groups ('txgs'), /proc/spl/kstat/zfs/<pool>/txgs. What N is depends on the zfs_txg_history parameter, which defaults to 100. The information in here may be quite important for diagnosing certain sorts of performance problems, but I haven't found much documentation on what's in it. Well, let's try to fix that.
The overall format of this file is:
txg      birth             state  ndirty   nread  nwritten  reads  writes  otime       qtime  wtime  stime
5846176  7976255438836187  C      1736704  0      5799936   0      299     5119983470  2707   49115  27910766
[...]
5846274  7976757197601868  C      1064960  0      4702208   0      236     5119973466  2405   48349  134845007
5846275  7976762317575334  O      0        0      0         0      0       0           0      0      0
(This example is coming from a system with four-way mirrored vdevs, which is going to be relevant in a bit.)
So let's take these fields in order:
- txg is the transaction group number, which is a steadily increasing number. The file is ordered from the oldest txg to the newest, which will be the current open transaction group. (In the example, txg 5846275 is the current open transaction group and 5846274 is the last one that committed.)
- birth is the time when the transaction group (txg) was 'born', in nanoseconds since the system booted.
- state is the current state of the txg; this will most often be either 'C' for committed or 'O' for open. You may also see 'S' for syncing, 'Q' (being quiesced), and 'W' (waiting for sync). An open transaction group will most likely have 0s for the rest of the numbers, and will be the last txg (there's only one open txg at a time). Any transaction group except the second last will be in state 'C', because you can only have one transaction group in the process of being written out. Update: per the comment from Arnaud Gomes, you can have multiple transaction groups at the end that aren't committed. I believe you can only have one that is syncing ('S'), because that happens in a single thread for only one txg, but you may have another that is quiescing or waiting to sync.
A transaction group's progress through its life cycle is open, quiescing, waiting for sync, syncing, and finally committed. In the open state, additional transactions (such as writing to files or renaming them) can be added to the transaction group; once a transaction group has been quiesced, nothing further will be added to it.
(See also ZFS fundamentals: transaction groups, which discusses how a transaction group can take a while to sync; the content has also been added as a comment in the source code in txg.c.)
- ndirty is how many bytes of directly dirty data had to be written out as part of this transaction; these bytes come, for example, from user write() IO. It's possible to have a transaction group commit with a '0' for ndirty. I believe that this means no IO happened during the time the transaction group was open, and it's just being closed on the timer.
- nread is how many bytes of disk reads the pool did between when syncing of the txg starts and when it finishes ('during txg sync').
- nwritten is how many bytes of disk writes the pool did during txg sync.
- reads is the number of disk read IOs the pool did during txg sync.
- writes is the number of disk write IOs the pool did during txg sync.
I believe these IO numbers include at least any extra IO needed to read in on-disk data structures to allocate free space and any additional writes necessary. I also believe that they track actual bytes written to your disks, so for example with two-way mirrors they'll always be at least twice as big as the ndirty number (in my example above, with four way mirrors, their base is four times ndirty). As we can see it's not unusual for nread and reads to be zero. However, I don't believe that the read IO numbers are restricted to transaction group commit activities; if something is reading from the pool for other reasons during the transaction group commit, that will show up in nread and reads. They are thus a measure of the amount of read IO going on during the txg sync process, not the amount of IO necessary for it. I don't know if ongoing write IO to the ZFS Intent Log can happen during a txg sync. If it can, I would expect it to show up in the nwritten and writes numbers. Unlike read IO, regular write IO can only happen in the context of a transaction group and so by definition any regular writes during a txg sync are part of that txg and show up in ndirty.
- otime is how long the txg was open and accepting new write IO, in nanoseconds. Often this will be around the default zfs_txg_timeout time, which is normally five seconds. However, under (write) IO pressure this can be shorter or longer (if the current open transaction group can't be closed because there's already a transaction group in the process of trying to commit).
- qtime is how long the txg took to be quiesced, in nanoseconds; it's usually small.
- wtime is how long the txg took waiting to start syncing, in nanoseconds; it's usually pretty small, since all it involves is that the separate syncing thread pick up the txg and start syncing it.
- stime is how long the txg took to actually sync and commit, again in nanoseconds. It's often appreciable, since it's where the actual disk write IO happens.
In the example "txgs" I gave, we can see that despite the first committed txg listed having more dirty data than the last committed txg, its actual sync time was only about a quarter of the last txg's sync time. This might cause you to look at underlying IO activity patterns, latency patterns, and so on.
As far as I know, there's no per-pool source of information about the current amount of dirty data in the current open transaction group (although once a txg has quiesced and is syncing, I believe you do see a useful ndirty for it in the "txgs" file). A system wide dirty data number can more or less be approximated from the ARC memory reclaim statistics in the anon_size kstat plus the arc_tempreserve kstat, although the latter seems to never get very big for us.
A new transaction group normally opens as the current transaction group begins quiescing. We can verify this in the example output by adding the birth time and the otime of txg 5846274, which add up to exactly the birth time of txg 5846275, the current open txg. If this sounds suspiciously exact down to the nanosecond, that's because the code involved freezes the current time at one point and uses it for both the end of the open time of the current open txg and the birth time of the new txg.
Sidebar: the progression through transaction group states
Here is what I can deduce from reading through the OpenZFS kernel code, and since I had to go through this I'm going to write it down.
First, although there is a txg 'birth' state, 'B' in the 'state' column, you will never actually see it. Transaction groups are born 'open', per spa_txg_history_add() in spa_stats.c. Transaction groups move from 'O' open to 'Q' quiescing in txg_quiesce() in txg.c, which 'blocks until all transactions in the group are committed' (which I believe means they are finished fiddling around adding write IO). This function is also where the txg finishes quiescing and moves to 'W', waiting for sync. At this point the txg is handed off to the 'sync thread', txg_sync_thread() (also in txg.c). When the sync thread receives the txg, it will advance the txg to 'S', syncing, call spa_sync(), and then mark everything as done, finally moving the transaction group to 'C', committed.
(In the spa_stats.c code, the txg state is advanced by a call to spa_txg_history_set(), which will always be called with the old state we are finishing. Txgs advance to syncing in spa_txg_history_init_io(), and finish this state to move to committed in spa_txg_history_fini_io(). The tracking of read and write IO during the txg sync is done by saving a copy of the top level vdev IO stats in spa_txg_history_init_io(), getting a second copy in spa_txg_history_fini_io(), and then computing the difference between the two.)
Why it might take some visible time to quiesce a transaction group is more or less explained in the description of how ZFS's implementations of virtual filesystem operations work, in the comment at the start of zfs_vnops_os.c. Roughly, each operation (such as creating or renaming a file) starts by obtaining a transaction that will be part of the currently open txg, then doing its work, and then committing the transaction. If the transaction group starts quiescing while the operation is doing its work, the quiescing can't finish until the work does and commits the transaction for the rename, create, or whatever.