Chris's Wiki :: blog/linux

<div class="wikitext"><p>If you run a recent enough version of <a href="https://man7.org/linux/man-pages/man8/iotop.8.html">iotop</a> on a typical Linux
system, it may nag at you to the effect of:</p>
<blockquote><p>CONFIG_TASK_DELAY_ACCT and <a href="https://docs.kernel.org/admin-guide/sysctl/kernel.html#task-delayacct">kernel.task_delayacct</a>
sysctl not enabled in kernel, cannot determine SWAPIN and IO %</p>
</blockquote>
<p>You might wonder whether you should turn on this sysctl, how much
you care, and why it was defaulted to being disabled in the first
place.</p>
<p>This sysctl enables <a href="https://docs.kernel.org/accounting/delay-accounting.html">(Task) Delay accounting</a>, which
tracks things like how long things wait for the CPU or wait for
their IO to complete on a per-task basis (which in Linux means
'thread', more or less). General system information will provide
you an overall measure of this in things like 'iowait%' and <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/PSINumbersAndMeanings">pressure
stall information</a>, but those are aggregates;
you may be interested in knowing things like how much specific processes
are being delayed or are waiting for IO.</p>
<p>(Also, <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/LinuxMultiCPUIowait">overall system iowait% is a conservative measure</a> and won't give you a completely accurate
picture of how much processes are waiting for IO. You can get
per-cgroup pressure stall information, which in some cases can
come close to a per-process number.)</p>
<p>In the context of <a href="https://man7.org/linux/man-pages/man8/iotop.8.html">iotop</a> specifically, the major thing you will
miss is 'IO %', which is the percent of the time that a particular
process is waiting for IO. Task delay accounting can give you
information about <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SystemResponseLatencyMetrics">per-process (or task) run queue latency</a> but I don't know if there are any
tools similar to iotop that will give you this information. There
is a program in the kernel source, <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/accounting/getdelays.c">tools/accounting/getdelays.c</a>,
that will dump the raw information on a one-time basis (and in some
versions, compute averages for you, which may be informative). The
(current) task delay accounting information you can theoretically
get is documented in comments in <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/uapi/linux/taskstats.h">include/uapi/linux/taskstats.h</a>,
or <a href="https://docs.kernel.org/accounting/taskstats-struct.html">this version in the documentation</a>. You
may also want to look at <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/delayacct.h">include/linux/delayacct.h</a>,
which I think is the kernel internal version that tracks this
information.</p>
<p>(You may need the version of getdelays.c from your kernel's source tree,
as the current version may not be backward compatible with your kernel.
This typically comes up as compile errors, which are at least obvious.)</p>
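<p>(As an illustration rather than a recipe: from a kernel source tree,
building and running it usually goes something like the following,
although the exact options supported vary from version to version, so
check its usage output. It needs root in order to talk to the taskstats
netlink interface.)</p>
<blockquote><pre style="white-space: pre-wrap;">
; make -C tools/accounting
; sudo ./tools/accounting/getdelays -d -p <pid>
</pre>
</blockquote>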
<p>How you can access this information yourself is sort of covered in
<a href="https://docs.kernel.org/accounting/taskstats.html">Per-task statistics interface</a>, but in practice
you'll want to read the source code of getdelays.c or the Python
source code of <a href="https://repo.or.cz/iotop.git">iotop</a>. If you
specifically want to track how long a task spends delaying for IO,
there is also a field for it in /proc/<pid>/stat; per <a href="https://man7.org/linux/man-pages/man5/proc.5.html">proc(5)</a>, field 42 is
delayacct_blkio_ticks. As far as I can tell from the kernel
source, this is the same information that <a href="https://docs.kernel.org/accounting/taskstats.html">the netlink interface</a> will provide,
although it only has the total time waiting for 'block' (filesystem)
IO and doesn't have the count of block IO operations.</p>
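<p>For a quick look you can pull that field out with standard tools.
A sketch: the sed strips off the pid and 'comm' fields first, because
the comm field can contain spaces and parentheses, which makes overall
field 42 into field 40 of what's left. The value is in clock ticks,
normally 100 per second.</p>
<blockquote><pre style="white-space: pre-wrap;">
; sed 's/^.*) //' /proc/<pid>/stat | awk '{print "blkio delay ticks:", $40}'
</pre>
</blockquote>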
<p>Task delay accounting can theoretically be requested on a per-cgroup
basis (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/LoadAverageWhereFrom">as I saw in a previous entry on where the Linux load average
comes from</a>), but in practice this only works
for <a href="https://docs.kernel.org/admin-guide/cgroup-v1/index.html">cgroup v1</a>.
This (task) delay accounting has never been added to <a href="https://docs.kernel.org/admin-guide/cgroup-v2.html">cgroup v2</a>, which may be
a sign that the whole feature is a bit neglected.
I couldn't find much to say why delay accounting was changed (in
2021) to default to being off. <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e4042ad492357fa995921376462b04a025dd53b6">The commit that made this change</a>
seems to imply it was defaulted to off on the assumption that it
wasn't used much. Also see <a href="https://lore.kernel.org/all/20210505111525.308018373@infradead.org/T/">this kernel mailing list message</a> and
<a href="https://old.reddit.com/r/linuxquestions/comments/1b6bijd/downsides_to_kerneltask_delayacct/">this reddit thread</a>.</p>
<p>Now that I've discovered kernel.task_delayacct and played around
with it a bit, I think it's useful enough for us for diagnosing
issues that we're going to turn it on by default until and unless
we see problems (performance or otherwise). Probably I'll stick to
doing this with an /etc/sysctl.d/ drop in file, because I think
that gets activated early enough in boot to cover most processes
of interest.</p>
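<p>(The drop-in file itself is tiny; it's just something like this,
under whatever file name fits your local conventions:)</p>
<blockquote><pre style="white-space: pre-wrap;">
; cat /etc/sysctl.d/99-task-delayacct.conf
kernel.task_delayacct = 1
</pre>
</blockquote>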
<p>(As covered somewhere, if you turn delay accounting on through the
sysctl, it apparently only covers processes that were started after
the sysctl was changed. Processes started before have no delay
accounting information, or perhaps only 'CPU' delay accounting
information. One such process is init, PID 1, which will always
be started before the sysctl is set.)</p>
<p>PS: The per-task IO delays do include NFS IO, <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/NFSIOShowsInIowait">just as iowait does</a>, which may make it more interesting if you
have NFS clients. Sometimes it's obvious which programs are being
affected by slow NFS servers, but sometimes not.</p>
</div>
The Linux kernel.task_delayacct sysctl and why you might care about it

<div class="wikitext"><p>The Linux kernel has a CPU frequency (management) system, called
<a href="https://docs.kernel.org/admin-guide/pm/cpufreq.html">cpufreq</a>.
As part of this, Linux (on supported hardware) exposes various CPU
frequency information under /sys/devices/system/cpu, as covered in
<a href="https://docs.kernel.org/admin-guide/pm/cpufreq.html#policy-interface-in-sysfs">Policy Interface in sysfs</a>.
Reading these files can provide you with some information about the
state of your system's CPUs, especially their current frequency
(more or less). This information is considered interesting enough
that <a href="https://github.com/prometheus/node_exporter">the Prometheus host agent</a> collects (some) cpufreq
information by default. However, there is a little caution, which
is that apparently <strong>the kernel deliberately slows down reading
this information from <code>/sys</code></strong> (<a href="https://mastodon.social/@cks/112134498618733513">as I learned recently</a>). A comment in
<a href="https://github.com/prometheus/procfs/blob/master/sysfs/system_cpu.go#L229">the relevant Prometheus code</a>
says that this delay is 50 milliseconds, but <a href="https://github.com/prometheus/procfs/commit/6914037aeaef8fdaaefc4874864fe8ca9c9f8af1">this comment dates
from 2019</a>
and may be out of date now (I wasn't able to spot the slowdown in
the kernel code itself).</p>
<p>On a machine with only a few CPUs, reading this information is
probably not going to slow things down enough that you really notice.
On a machine with a lot of CPUs, the story can be very different.
We have one AMD 512-CPU machine, and on this machine reading every
CPU's scaling_cur_freq one at a time takes over ten seconds:</p>
<blockquote><pre style="white-space: pre-wrap;">
; cd /sys/devices/system/cpu/cpufreq
; time cat policy*/scaling_cur_freq >/dev/null
10.25 real 0.07 user 0.00 kernel
</pre>
</blockquote>
<p>On a 112-CPU Xeon Gold server, things are not so bad at 2.24
seconds; a 128-Core AMD takes 2.56 seconds. A 64-CPU server
is down to 1.28 seconds, a 32-CPU one 0.64 seconds, and on my
16-CPU and 12-CPU desktops (running Fedora instead of Ubuntu)
the time is reported as '0.00 real'.</p>
<p>This potentially matters on high-CPU machines where you're running
any sort of routine monitoring that tries to read this information,
including the Prometheus host agent in its default configuration.
The Prometheus host agent reduces the impact of this slowdown
somewhat, but it's still noticeably slower to collect all of the
system information if we have the 'cpufreq' collector enabled on
these machines. As a result of discovering this, I've now disabled
the Prometheus host agent's 'cpufreq' collector on anything with
64 cores or more, and we may reduce that in the future. We don't
have a burning need to see CPU frequency information and we would
like to avoid slow data collection and occasional apparent impacts
on the rest of the system.</p>
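<p>If I have it right, turning the collector off is just an extra command
line flag to the host agent, along these lines (check your version's
'--help' output to be sure):</p>
<blockquote><pre style="white-space: pre-wrap;">
; node_exporter --no-collector.cpufreq [...]
</pre>
</blockquote>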
<p>(Typical Prometheus configurations magnify the effect of the slowdown
because it's common to query ('scrape') the host agent quite often,
for example every fifteen seconds. Every time you do this, the host
agent re-reads these cpufreq sysfs files and hits this delay.)</p>
<p>PS: I currently have no views on how useful the system's CPU
frequencies are as a metric, and how much they might be perturbed
by querying them (although the Prometheus host agent deliberately
pretends it's running on a single-CPU machine, partly to avoid
problems in this area). If you do, you might either universally not
collect CPU frequency information or take the time impact to do so
even on high-CPU machines.</p>
</div>
Reading the Linux cpufreq sysfs interface is (deliberately) slow

<div class="wikitext"><p>In the beginning, Unix only had processes and processes had process
IDs (PIDs), and life was simple. Then people added (kernel-supported)
threads, so processes could be multi-threaded. When you add threads,
you need to give them some user-visible identifier. There are many
options for what this identifier is and how it works (and how threads
themselves work inside the kernel). The choice Linux made was that
threads were just processes (that shared more than usual with other
processes), and so their identifier was a process ID, allocated
from the same global space of process IDs as regular independent
processes. This has created some ambiguity in what programs and
other tools mean by 'process ID' (including for me).</p>
<p>The true name for what used to be a 'process ID', which is to say
the PID of the overall entity that is 'a process with all its
threads', is a <em>TGID</em> (Thread or Task Group ID). The TGID of a
process is the PID of the main thread; a single-threaded program
will have a TGID that is the same as its PID. You can see this in
the 'Tgid:' and 'Pid:' fields of /proc/<PID>/status. Although some
places will talk about 'pids' as separate from 'tids' (eg some parts
of <a href="https://man7.org/linux/man-pages/man5/proc.5.html"><code>proc(5)</code></a>),
the two types are both allocated from the same range of numbers
because they're both 'PIDs'. If I just give you a 'PID' with no
further detail, there's no way to know if it's a process's PID or
a task's PID.</p>
<p>In every /proc/<PID> directory, there is a 'task' subdirectory;
this contains the PIDs of all <em>tasks</em> (threads) that are part of
the thread group (ie, have the same TGID). All PIDs have a /proc/<PID>
directory, but for convenience things like 'ls /proc' only lists
the PIDs of processes (which you can think of as TGIDs). The
/proc/<PID> directories for other tasks aren't returned by the
kernel when you ask for the directory contents of /proc, although
you can use them if you access them directly (and you can also
access or discover them through /proc/<PID>/task). I'm not sure
which information in the /proc/<PID> directories for tasks is
specific to the task itself and which is the total across all tasks in the
TGID. The <a href="https://man7.org/linux/man-pages/man5/proc.5.html"><code>proc(5)</code></a>
manual page sometimes talks about processes and sometimes about
tasks, but I'm not sure that's comprehensive.</p>
<p>(Much of the time when you're looking at what is actually a TGID,
you want the total information across all threads in the TGID. If
/proc/<PID> always gave you only task information even for the
'process' PID/TGID, multi-threaded programs could report confusingly
low numbers for things like CPU usage unless you went out of your way
to sum /proc/<PID>/task/* information yourself.)</p>
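<p>You can see all of this directly from the shell. For a single-threaded
process such as your shell itself, the two numbers are the same and there
is only one task directory; a multi-threaded process will show one entry
per thread. The output will look something like this, with your own PIDs:</p>
<blockquote><pre style="white-space: pre-wrap;">
; grep -E '^(Tgid|Pid|Threads):' /proc/$$/status
Tgid:   215697
Pid:    215697
Threads: 1
; ls /proc/$$/task
215697
</pre>
</blockquote>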
<p>Various tools will normally return the PID (TGID) of the overall
process, not the PID of a random task in a multi-threaded process.
For example 'pidof <thing>' behaves this way. Depending on how the
specific process works, this may or may not be the 'main thread'
of the program (some multi-threaded programs more or less park their
initial thread and do their main work on another one created later),
and the program may not even have such a thing (I believe Go programs
mostly don't, as they multiplex goroutines on to actual threads as
needed).</p>
<p>If a tool or system offers you the choice to work on or with a 'PID'
or a 'TGID', you are being given the choice to work with a single
thread (task) or the overall process. Which one you want depends
on what you're doing, but if you're doing things like <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/LoadAverageWhereFrom">asking for
task delay information</a>, using the TGID may
better correspond to what you expect (since it will be the overall
information for the entire process, not information for a specific
thread). If a program only talks about PIDs, it's probably going
to operate on or give you information about the entire process by
default, although if you give it the PID of a task within the process
(instead of the PID that is the TGID), you may get things specific
to that task.</p>
<p>In a kernel context such as <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/EbpfExporterNotes">eBPF programs</a>, I
think you'll almost always want to track things by PID, not TGID.
It is PIDs that do things like <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SystemResponseLatencyMetrics">experience run queue scheduling
latency</a>, make system calls, and incur
block IO delays, not TGIDs. However, if you're selecting what to
report on, monitor, and so on, you'll most likely want to match on
the TGID, not the PID, so that you report on all of the tasks in a
multi-threaded program, not just one of them (unless you're
specifically looking at tasks/threads, not 'a process').</p>
<p>(I'm writing this down partly to get it clear in my head, since I
had some confusion recently when working with <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/EbpfExporterNotes">eBPF programs</a>.)</p>
</div>
Sorting out PIDs, Tgids, and tasks on Linux

<div class="wikitext"><p>In the long ago past, Linux gained some support for <a href="https://www.kernel.org/doc/Documentation/block/ioprio.txt">block IO
priorities</a>,
with some limitations that I noticed <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/IoniceNotes">the first time I looked into
this</a>. These days the Linux kernel has support for
more ways of scheduling and limiting IO, for example in <a href="https://docs.kernel.org/admin-guide/cgroup-v2.html">cgroups v2</a> and <a href="https://docs.kernel.org/admin-guide/cgroup-v2.html#io">its IO
controller</a>.
However <a href="https://man7.org/linux/man-pages/man1/ionice.1.html"><code>ionice</code></a>
is still there and now I want to note some more things, since I
just looked at ionice again (for reasons outside the scope of this
entry).</p>
<p>First, <a href="https://man7.org/linux/man-pages/man1/ionice.1.html"><code>ionice</code></a> and the IO priorities it sets are specifically
only for read IO and synchronous write IO, per <a href="https://man7.org/linux/man-pages/man2/ioprio_set.2.html"><code>ioprio_set(2)</code></a> (this is
the underlying system call that <code>ionice</code> uses to set priorities).
This is reasonable, since IO priorities are attached to processes
and asynchronous write IO is generally actually issued by completely
different kernel tasks and in situations where the urgency of doing
the write is unrelated to the IO priority of the process that
originally did the write. This is a somewhat unfortunate limitation
since often it's write IO that is the slowest thing and the source
of the largest impacts on overall performance.</p>
<p>IO priorities are only effective with some <a href="https://wiki.ubuntu.com/Kernel/Reference/IOSchedulers">Linux kernel IO schedulers</a>, such as <a href="https://docs.kernel.org/block/bfq-iosched.html">BFQ</a>. For obvious reasons
they aren't effective with the 'none' scheduler, which is also the
default scheduler for NVMe drives. I'm (still) unable to tell if IO
priorities work if you're using software RAID instead of sitting your
(supported) filesystem directly on top of a SATA, SAS, or NVMe disk. I
believe that IO priorities are unlikely to work with ZFS, partly because
ZFS often issues read IOs through its own kernel threads instead of
directly from your process and those kernel threads probably aren't
trying to copy around IO priorities.</p>
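<p>(You can see which scheduler a disk is actually using through sysfs;
the active one is the name in brackets. Illustrative output, which will
vary with your kernel and configuration:)</p>
<blockquote><pre style="white-space: pre-wrap;">
; cat /sys/block/nvme0n1/queue/scheduler
[none] mq-deadline kyber bfq
; cat /sys/block/sda/queue/scheduler
mq-deadline kyber [bfq] none
</pre>
</blockquote>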
<p>Even if they pass through software RAID, IO priorities apply at the
level of disk devices (of course). This means that each side of a
software RAID mirror will do IO priorities only 'locally', for IO
issued to it, and I don't believe there will be any global priorities
for read IO to the overall software RAID mirror. I don't know if
this will matter in practice. Since IO priorities only apply to
disks, they obviously don't apply (on the NFS client) to NFS read
IO. Similarly, IO priorities don't apply to data read from the
kernel's buffer/page caches, since this data is already in RAM and
doesn't need to be read from disk. This can give you an ionice'd
program that is still 'reading' lots of data (and that data will
be less likely to be evicted from kernel caches).</p>
<p>Since <a href="https://support.cs.toronto.edu/">we</a> mostly use some combination
of software RAID, ZFS, and NFS, I don't think <code>ionice</code> and IO priorities
are likely to be of much use for us. If we want to limit the impact a
program's IO has on the rest of the system, we need different measures.</p>
</div>
Some more notes on Linux's <code>ionice</code> and kernel IO priorities

<div class="wikitext"><p>Here's something that I learned recently: if <a href="https://www.freedesktop.org/software/systemd/man/latest/systemd-networkd.service.html#">systemd-networkd</a>
restarts, for example because of a package update for it that
includes an automatic daemon restart, it will clear your 'ip rules'
routing policies (and also I think your routing table, although you
may not notice that much). If you've set up policy based routing of
your own (or some program has done that as part of its operation),
this may produce unpleasant surprises.</p>
<p>Systemd-networkd does this fundamentally because <a href="https://www.freedesktop.org/software/systemd/man/latest/systemd.network.html#%5BRoutingPolicyRule%5D%20Section%20Options">you can set ip
routing policies in .network files</a>.
When networkd is restarted, one of the things it does is re-set-up
whatever routing policies you specified; if you didn't specify any,
it clears them. This is a reasonably sensible decision, both to
deal with changes from previously specified routing policies and
to also give people a way to clean out their experiments and reset
to a known good base state. Similar logic applies to routes.</p>
<p>This can be controlled through <a href="https://www.freedesktop.org/software/systemd/man/latest/networkd.conf.html">networkd.conf</a>
and its drop-in files, by setting <a href="https://www.freedesktop.org/software/systemd/man/latest/networkd.conf.html#ManageForeignRoutingPolicyRules="><code>ManageForeignRoutingPolicyRules=no</code></a>
and perhaps <a href="https://www.freedesktop.org/software/systemd/man/latest/networkd.conf.html#ManageForeignRoutes="><code>ManageForeignRoutes=no</code></a>.
Without testing it through a networkd restart, I believe that the
settings I want are:</p>
<blockquote><pre style="white-space: pre-wrap;">
[Network]
ManageForeignRoutingPolicyRules=no
ManageForeignRoutes=no
</pre>
</blockquote>
<p>The minor downside of this for me is that certain sorts of route updates
will have to be done by hand, instead of by updating .network files and
then restarting networkd.</p>
<p>While having an option to do this sort of clearing is sensible, I
am dubious about the current default. In practice, coherently
specifying routing policies through .network files is so much of a
pain that I suspect that few people do it that way; instead I suspect
that most people either script it to issue the 'ip rule' commands
(as I do) or use software that does it for them (and <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/LinuxIpFwmarkMasks">I know that
such software exists</a>). It would be great if
networkd could create and manage high level policies for you (such
as isolated interfaces), but the current approach is both verbose
and limited in what you can do with it.</p>
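<p>(To make the 'script it yourself' option concrete, such a script
mostly just runs a handful of commands of this general shape, with the
selectors, tables, and priorities being whatever your particular policy
calls for:)</p>
<blockquote><pre style="white-space: pre-wrap;">
; ip rule add from 192.168.100.0/24 lookup 100 pref 1000
; ip route add default via 192.168.100.1 table 100
</pre>
</blockquote>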
<p>(As far as I know, networkd can't express rules for networks that
can be brought up and torn down, because it's not an event-based
system where you can have it react to the appearance of an interface
or a configured network. It's possible I'm wrong, but if so it
doesn't feel well documented.)</p>
<p>All of this is especially unfortunate on Ubuntu servers, which normally
configure their networking through netplan. Netplan will more or less
silently use networkd as the backend to actually implement what you
wrote in your Netplan configuration, leaving you exposed to this, and on
top of that Netplan itself has limitations on what routing policies you
can express (pushing you even more towards running 'ip rule' yourself).</p>
</div>
Restarting systemd-networkd normally clears your 'ip rules' routing policies

<div class="wikitext"><p>One of the things that I do on my desktops and <a href="https://support.cs.toronto.edu/">our</a> servers is collect metrics that
I hope will let me assess how responsive our systems are when people
are trying to do things on them. For a long time I've been collecting
<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/PrometheusLinuxDiskIOStats">disk IO latency histograms</a>, and
recently I've been collecting runqueue latency histograms (using
<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/EbpfExporterNotes">the eBPF exporter</a> and a modified version of
<a href="https://github.com/iovisor/bcc/blob/master/libbpf-tools/runqlat.bpf.c">libbpf/tools/runqlat.bpf.c</a>).
This has caused me to think about the various sorts of latency that
affects responsiveness and how I can measure it.</p>
<p>Run queue latency is the latency between when a task becomes able
to run (or when it got preempted in the middle of running) and when
it does run. This latency is effectively the minimum (lack of)
response from the system and is primarily affected by CPU contention,
since the major reason tasks have to wait to run is other tasks
using the CPU. For obvious reasons, high(er) run queue latency is
related to <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/PSINumbersAndMeanings">CPU pressure stalls</a>, but a
histogram can show you more information than an aggregate number.
I expect run queue latency to be what matters most for a lot of
programs that mostly talk to things over some network (including
talking to other programs on the same machine), and perhaps spend some
of their time burning CPU furiously. If your web browser can't get
its rendering process running promptly after the HTML comes in, or
if it gets preempted while running all of that Javascript, this
will show up in run queue latency. The same is true for your window
manager, which is probably not doing much IO.</p>
<p>Disk IO latency is the lowest level indicator of things having to
wait on IO; it sets a lower bound on how little latency processes
doing IO can have (assuming that they do actual disk IO). However,
direct disk IO is only one level of the Linux IO system, and the
Linux IO system sits underneath filesystems. What actually matters
for responsiveness and latency is generally how long user-level
filesystem operations take. In an environment with sophisticated,
multi-level filesystems that have complex internal behavior (such
as <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSGlobalZILInformation">ZFS and its ZIL</a>), the actual disk
IO time may only be a small portion of the user-level timing,
especially for things like <code>fsync()</code>.</p>
<p>(Some user-level operations may also not do any disk IO at all
before they return from the kernel (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/UserIOCanBeSystemTime">for example</a>).
A <code>read()</code> might be satisfied from the kernel's caches, and a
<code>write()</code> might simply copy the data into the kernel and schedule
disk IO later. This is where histograms and related measurements
become much more useful than averages.)</p>
<p>Measuring user level filesystem latency can be done through eBPF,
to at least some degree; <a href="https://github.com/iovisor/bcc/blob/master/libbpf-tools/vfsstat.bpf.c">libbpf-tools/vfsstat.bpf.c</a>
hooks a number of kernel vfs_* functions in order to just count
them, and you could convert this into some sort of histogram. Doing
this on a 'per filesystem mount' basis is probably going to be
rather harder. On the positive side for us, hooking the vfs_*
functions does cover the activity a NFS server does for NFS clients
as well as truly local user level activity. Because there are a
number of systems where we really do care about the latency that
people experience and want to monitor it, I'll probably build some
kind of vfs operation latency histogram <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/EbpfExporterNotes">eBPF exporter program</a>, although most likely only for selected VFS
operations (since there are a lot of them).</p>
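<p>(If you want to eyeball this sort of latency interactively before
writing an exporter program, a bpftrace one-liner along these lines is a
quick sketch of the idea, although it's system-wide rather than
per-filesystem or per-operation:)</p>
<blockquote><pre style="white-space: pre-wrap;">
; sudo bpftrace -e 'kprobe:vfs_read { @start[tid] = nsecs; }
  kretprobe:vfs_read /@start[tid]/ {
    @usecs = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]);
  }'
</pre>
</blockquote>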
<p>I think that the straightforward way of measuring user level IO
latency (by tracking the time between entering and exiting a top
level vfs_* function) will wind up including run queue latency
as well. You will get, basically, the time it takes to prepare and
submit the IO inside the kernel, the time spent waiting for it, and
then after the IO completes the time the task spends waiting inside
the kernel before it's able to run.</p>
<p>Because of <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/LinuxMultiCPUIowait">how Linux defines iowait</a>, the
higher your iowait numbers are, the lower the run queue latency
portion of the total time will be, because iowait only happens on
idle CPUs and idle CPUs are immediately available to run tasks when
their IO completes. You may want to look at <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/PSINumbersAndMeanings">io pressure stall
information</a> for a more accurate track of
when things are blocked on IO.</p>
<p>A complication of measuring user level IO latency is that not all
user visible IO happens through <code>read()</code> and <code>write()</code>. Some of it
happens through accessing <code>mmap()</code>'d objects, and under memory
pressure some of it will be in the kernel paging things back in
from wherever they wound up. I don't know if there's any particularly
easy way to hook into this activity.</p>
</div>
Scheduling latency, IO latency, and their role in Linux responsiveness

<div class="wikitext"><p>I've been a fan of <a href="https://github.com/cloudflare/ebpf_exporter">the Cloudflare eBPF Prometheus exporter</a> for some time, ever
since I saw their example of per-disk IO latency histograms. And
the general idea is extremely appealing; you can gather a lot of
information with eBPF (usually from the kernel), and the ability
to turn it into metrics is potentially quite powerful. However,
actually using it has always been a bit arcane, especially if you
were stepping outside the bounds of Cloudflare's <a href="https://github.com/cloudflare/ebpf_exporter/tree/master/examples">canned examples</a>.
So here's some notes on the current version (which is more or less
v2.4.0 as I write this), written in part for me in the future when
I want to fiddle with eBPF-created metrics again.</p>
<p>If you build the ebpf_exporter yourself, you want to use their
provided Makefile rather than try to do it directly. This Makefile
will give you the choice to build a 'static' binary or a dynamic
one (with 'make build-dynamic'); the static is the default. I put
'static' into quotes because of <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/LinuxStaticLinkingVsGlibc">the glibc NSS problem</a>; if you're on a glibc-using Linux, your
static binary will still depend on your version of glibc. However,
it will contain a statically linked libbpf, which will make your
life easier. Unfortunately, building a static version is impossible
on some Linux distributions, such as Fedora, because Fedora just
doesn't provide static versions of some required libraries (as far
as I can tell, libelf.a). If you have to build a dynamic executable,
a normal ebpf_exporter build will depend on the libbpf shared
library you can find in libbpf/dest/usr/lib. You'll need to set a
<code>LD_LIBRARY_PATH</code> to find this copy of libbpf.so at runtime.</p>
<p>(You can try building with the system libbpf, but it may not be
recent enough for ebpf_exporter.)</p>
<p>To get metrics from eBPF with ebpf_exporter, you need an eBPF
program that collects the metrics and then a YAML configuration
that tells ebpf_exporter how to handle what the eBPF program
provides. The original version of ebpf_exporter had you specify
eBPF programs in text in your (YAML) configuration file and then
compiled them when it started. This approach has fallen out of
favour, so now eBPF programs must be pre-compiled to special .o
files that are loaded at runtime. I believe these .o files are
relatively portable across systems; I've used ones built on Fedora
39 on Ubuntu 22.04. The simplest way to build either a provided
example or your own one is to put it in <a href="https://github.com/cloudflare/ebpf_exporter/tree/master/examples">the <code>examples</code> directory</a>
and then do 'make <name>.bpf.o'. Running 'make' in the examples
directory will build all of the standard examples.</p>
<p>To run an eBPF program or programs, you copy their <name>.bpf.o and
<name>.yaml to a configuration directory of your choice, specify
this directory in the ebpf_exporter '<code>--config.dir</code>' argument,
and then use '<code>--config.names=<name>,<name2>,...</code>' to say what
programs to run. The suffixes of the YAML configuration file and the
eBPF object file are always fixed.</p>
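<p>In practice a run winds up looking something like the following, with
the directory and the program names being whatever .yaml and .bpf.o pairs
you've put there (biolatency is one of the stock examples; 'myrunqlat'
here is a stand-in for a local program):</p>
<blockquote><pre style="white-space: pre-wrap;">
; sudo ./ebpf_exporter --config.dir=/etc/ebpf_exporter --config.names=biolatency,myrunqlat
</pre>
</blockquote>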
<p>The repository has <a href="https://github.com/cloudflare/ebpf_exporter#configuration-concepts">some documentation on the YAML (and eBPF) that
you have to write to get metrics</a>.
However, it is probably not sufficient to explain how to modify the
examples or especially to write new ones. If you're doing this (for
example, to revive an old example that was removed when the exporter
moved to the current pre-compiled approach), you really want to
read over existing examples and then copy their general structure
more or less exactly. This is especially important because the main
ebpf_exporter contains some special handling for at least
histograms that assumes things are being done as in their examples.
When reading examples, it helps to know that Cloudflare has a bunch
of helpers that are in various header files in the examples directory.
You want to use these helpers, not the normal, standard <a href="https://man7.org/linux/man-pages/man7/bpf-helpers.7.html">bpf helpers</a>.</p>
<p>(However, although not documented in <a href="https://man7.org/linux/man-pages/man7/bpf-helpers.7.html">bpf-helpers(7)</a>,
'<code>__sync_fetch_and_add()</code>' is a standard eBPF thing. It is not
so much documented as mentioned in <a href="https://docs.kernel.org/bpf/map_array.html">some kernel BPF documentation
on arrays and maps</a>
and in <a href="https://man7.org/linux/man-pages/man2/bpf.2.html">bpf(2)</a>.)</p>
<p>One source of (e)BPF code to copy from that is generally similar
to what you'll write for ebpf_exporter is <a href="https://github.com/iovisor/bcc/tree/master/libbpf-tools">bcc/libbpf-tools</a> (in the
<name>.bpf.c files). An eBPF program like <a href="https://github.com/iovisor/bcc/tree/master/libbpf-tools/runqlat.bpf.c">runqlat.bpf.c</a>
will need restructuring to be used as an ebpf_exporter program,
but it will show you what you can hook into with eBPF and how.
Often these examples will be more elaborate than you need for
ebpf_exporter, with more options and the ability to narrowly
select things; you can take all of that out.</p>
<p>(When setting up things like the number of histogram slots, be
careful to copy exactly what the examples do in both your .bpf.c
and in your YAML, mysterious '+ 1's and all.)</p>
</div>
Some notes about the Cloudflare eBPF Prometheus exporter for Linux

<div class="wikitext"><p>One of the interesting and convenient things about Ubuntu for
people like <a href="https://support.cs.toronto.edu/">us</a> is that they
provide pre-built and integrated ZFS kernel modules in their
mainline kernels. If you want ZFS on <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSFileserverSetupIII">your (our) ZFS fileservers</a>, you don't have to add any extra PPA
repositories or install any extra kernel module packages; it's just
there. However, this leaves us with <a href="https://mastodon.social/@cks/112041217999758599">a little mystery</a>, which is how
the ZFS modules actually get there. The reason this is a mystery
is that <strong>the ZFS modules are not in the Ubuntu kernel source</strong>,
or at least not in the package source.</p>
<p>(One reason this matters is that you may want to see what patches
Ubuntu has applied to their version of ZFS, because Ubuntu periodically
backports patches to specific issues from upstream OpenZFS. If you
go try to find ZFS patches, ZFS code, or a ZFS changelog in the
regular Ubuntu kernel source, you will likely fail, and this will not
be what you want.)</p>
<p>Ubuntu kernels are normally signed in order to work with <a href="https://wiki.debian.org/SecureBoot">Secure
Boot</a>. If you use 'apt source
...' on a signed kernel, what you get is not the kernel source but
a 'source' that fetches specific unsigned kernels and does magic
to sign them and generate new signed binary packages. To actually
get the kernel source, you need to follow the directions in <a href="https://wiki.ubuntu.com/Kernel/BuildYourOwnKernel">Build
Your Own Kernel</a>
to get the source of the unsigned kernel package. However, as
mentioned this kernel source does not include ZFS.</p>
<p>(You may be tempted to fetch the Git repository following the
directions in <a href="https://wiki.ubuntu.com/Kernel/Dev/KernelGitGuide#Kernel.2FAction.2FGitTheSource.Obtaining_the_kernel_sources_for_an_Ubuntu_release_using_git">Obtaining the kernel sources using git</a>,
but in my experience this may well leave you hunting around in
confusion trying to find the branch that actually corresponds to
even the current kernel for an Ubuntu release. Even if you have the
Git repository cloned, downloading the source package can be easier.)</p>
<p>How ZFS modules get into the built Ubuntu kernel is that during the
package build process, <strong>the Ubuntu kernel build downloads or copies
a specific <code>zfs-dkms</code> package version and includes it in the tree
that kernel modules are built from</strong>, which winds up including the
built ZFS kernel modules in the binary kernel packages. Exactly
what version of zfs-dkms will be included is specified in
<a href="https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/jammy/tree/debian/dkms-versions?h=Ubuntu-5.15.0-88.98">debian/dkms-versions</a>,
although good luck finding an accurate version of that file in the
Git repository on any predictable branch or in any predictable
location.</p>
<p>(The zfs-dkms package itself is the <a href="https://en.wikipedia.org/wiki/Dynamic_Kernel_Module_Support">DKMS</a> version
of kernel ZFS modules, which means that it packages the source code
of the modules along with directions for how DKMS should (re)build
the binary kernel modules from the source.)</p>
<p>This means that if you want to know what specific version of the
ZFS code is included in any particular Ubuntu kernel and what changed
in it, you need to look at the source package for zfs-dkms, which
is called <a href="https://code.launchpad.net/ubuntu/+source/zfs-linux">zfs-linux</a>
and has its Git repository <a href="https://git.launchpad.net/ubuntu/+source/zfs-linux">here</a>. Don't ask me
how the branches and tags in the Git repository are managed and how
they correspond to released package versions. My current view is
that I will be downloading specific zfs-linux source packages as
needed (using 'apt source zfs-linux').</p>
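<p>(In practice, assuming you have 'deb-src' sources enabled, that looks
like the following; the unpacked package's debian/changelog is then the
easiest place to see what Ubuntu says changed in each version.)</p>
<blockquote><pre style="white-space: pre-wrap;">
; apt source zfs-linux
; less zfs-linux-*/debian/changelog
</pre>
</blockquote>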
<p>The zfs-linux source package is also used to build the zfsutils-linux
binary package, which has the user space ZFS tools and libraries.
You might ask if there is anything that makes zfsutils-linux versions
stay in sync with the zfs-dkms versions included in Ubuntu kernels.
The answer, as far as I can see, is no. Ubuntu is free to release
new versions of zfsutils-linux and thus zfs-linux without updating
the kernel's dkms-versions file to use the matching zfs-dkms version.
Sufficiently cautious people may want to specifically install a
matching version of zfsutils-linux and then hold the package.</p>
<p>I was going to write something about how you get the ZFS source for
a particular kernel version, but it turns out that there is no
straightforward way. Contrary to what the Ubuntu documentation
suggests, if you do 'apt source linux-image-unsigned-$(uname -r)',
you don't get the source package for that kernel version, you get
the source package for the current version of the 'linux' kernel
package, at whatever is the latest released version. Similarly,
while you can inspect that source to see what zfs-dkms version it
was built with, 'apt source zfs-dkms' will only give you (easy)
access to the current version of the zfs-linux source package. If
you ask for an older version, apt will probably tell you it can't
find it.</p>
<p>(Presumably Ubuntu has old source packages somewhere, but I don't
know where.)</p>
</div>
Where and how Ubuntu kernels get their ZFS modules

<div class="wikitext"><p>I apply Fedora updates only by hand, and as part of this I like to
look at what '<code>dnf updateinfo info</code>' will tell me about why they're
being done. For some time, there's been an issue on my work desktop
where 'dnf updateinfo info' would report on updates that I'd already
applied, often drowning out information about the updates that I
hadn't. This was a bit frustrating, because my home Fedora machine
didn't do this but I couldn't spot anything obviously wrong (and at
various times I'd cleaned all of the DNF caches that I could find).</p>
<p>(Now that I look, it seems <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/FedoraNotReadingUpdateinfo">I've been having some variant of this
problem for a while</a>.)</p>
<p>Recently I took another shot at troubleshooting this. In <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/OperatorsAndSystemProgrammers">the
system programmer way</a>,
I started by locating the Python source code of the DNF updateinfo
subcommand and reading it. This showed me a bunch of subcommand
specific options that I could have discovered by reading 'dnf
updateinfo --help' and led me to find 'dnf updateinfo list', which
lists which RPM (or RPMs) a particular update will update. When I
used 'dnf updateinfo list' and looked at the list of RPMs, something
immediately jumped out at me, and it turned out to be the cause.</p>
<p><strong>My 'dnf updateinfo info' problems were because I had old Fedora 37
'debugsource' RPMs still installed</strong> (on a machine now running Fedora
39).</p>
<p>The '-debugsource' and '-debuginfo' RPMs for a given RPM contain
symbol information and then source code that is used to allow better
debugging (see <a href="https://docs.fedoraproject.org/en-US/packaging-guidelines/Debuginfo/">Debuginfo packages</a> and
<a href="https://fedoraproject.org/wiki/Changes/SubpackageAndSourceDebuginfo">this change to create debugsource as well</a>). I
tend to wind up installing them if I'm trying to debug a crash in
some standard packaged program, or sometimes code that heavily uses
system libraries. Possibly these packages get automatically cleaned
up if you update Fedora releases in <a href="https://docs.fedoraproject.org/en-US/quick-docs/upgrading-fedora-new-release/">one of the officially supported
ways</a>,
but I do a live upgrade using DNF (following <a href="https://docs.fedoraproject.org/en-US/quick-docs/upgrading-fedora-online/">this Fedora documentation</a>).
Clearly, when I do such an upgrade, these packages are not removed
or updated.</p>
<p>(It's possible that these packages are also not removed or updated
within a specific Fedora release when you update their base packages,
but since they were installed a long time ago I can't tell at this
point.)</p>
<p>With these old debugsource packages hanging around, DNF appears to
have reasonably seen more recent versions of them available and
duly reported the information on the 'upgrade' (in practice the
current version of the package) in 'dnf updateinfo info' when I
asked for it. That the packages would not be updated if I did a
'dnf update' was not updateinfo's problem. Removing the debugsource
packages eliminated this and now 'dnf updateinfo info' is properly
only reporting actual pending updates.</p>
<p>('dnf updateinfo' has various options for what packages to select,
but as covered in <a href="https://dnf.readthedocs.io/en/latest/command_ref.html#updateinfo-command-label">the updateinfo command documentation</a>
apparently they're mostly the same in practice.)</p>
<p>In the future I'm going to have to remember to remove all debugsource
and debuginfo packages before upgrading Fedora releases. Possibly
I should remove them after I'm done with whatever I installed them
for. If I needed them again (in that Fedora release) I'd have to
re-fetch them, but that's rare.</p>
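<p>(DNF accepts glob patterns in package specifications, so finding and
then removing the leftovers is a matter of something like the following,
with the 'list' done first as a check:)</p>
<blockquote><pre style="white-space: pre-wrap;">
; dnf list installed '*-debugsource' '*-debuginfo'
; dnf remove '*-debugsource' '*-debuginfo'
</pre>
</blockquote>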
<p>PS: In reading the documentation, I've discovered that it's really
'<code>dnf updateinfo --info</code>'; updateinfo just accepts 'info' (and
'list') as equivalent to the switches.</p>
<p>(This elaborates on <a href="https://mastodon.social/@cks/111967593217874645">a Fediverse post I made at the time</a>.)</p>
</div>
Fixing my problem of a stuck '<code>dnf updateinfo info</code>' on Fedora Linux

<div class="wikitext"><p>The <a href="https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSTXGsAndZILs">ZFS Intent Log (ZIL)</a> is effectively
ZFS's version of a filesystem journal, writing out hopefully brief
records of filesystem activity to make them durable on disk before
their full version is committed to the ZFS pool. What the ZIL is
doing and how it's performing can be important for the latency (and
thus responsiveness) of various operations on a ZFS filesystem,
since operations like <code>fsync()</code> on an important file must wait for
the ZIL to write out (<em>commit</em>) their information before they can
return from the kernel. On Linux, <a href="https://openzfs.org/">OpenZFS</a>
exposes global information about the ZIL in <code>/proc/spl/kstat/zfs/zil</code>,
but this information can be hard to interpret without some knowledge
of ZIL internals.</p>
<p>(In OpenZFS 2.2 and later, each dataset also has per-dataset ZIL
information in its kstat file, /proc/spl/kstat/zfs/<pool>/objset-0xXXX,
for some hexadecimal '0xXXX'. There's no overall per-pool ZIL information
the way there is a global one, but for most purposes you can sum up the
ZIL information from all of the pool's datasets.)</p>
<p>The basic background here is <a href="https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSZILActivityFlow">the flow of activity in the ZIL</a> and also the comments in <a href="https://github.com/openzfs/zfs/blob/master/include/sys/zil.h">zil.h</a> about
the members of the <code>zil_stats</code> struct.</p>
<p>The (ZIL) data you can find in the "<code>zil</code>" file (and the per-dataset
kstats in OpenZFS 2.2 and later) is as follows:</p>
<ul><li><code>zil_commit_count</code> counts how many times a ZIL commit has been
requested through things like <code>fsync()</code>.</li>
<li><code>zil_commit_writer_count</code> counts how many times the ZIL has actually
committed. More than one commit request can be merged into the same ZIL
commit, if two people <code>fsync()</code> more or less at the same time.<p>
</li>
<li><code>zil_itx_count</code> counts how many <em>intent transactions</em> (itxs) have
been written as part of ZIL commits. Each separate operation (such
as a <code>write()</code> or a file rename) gets its own separate transaction;
these are aggregated together into <em>log write blocks</em> (lwbs) when
a ZIL commit happens.</li>
</ul>
<p>When ZFS needs to record file data into the ZIL, it has three options,
which it calls '<code>indirect</code>', '<code>copied</code>', and '<code>needcopy</code>' in ZIL
metrics. Large enough amounts of file data are handled with an
<em>indirect</em> write, <a href="https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSWritesAndZIL">which writes the data to its final location in the
regular pool</a>; the ZIL transaction only
records its location, hence 'indirect'. In a <em>copied</em> write, the data
is directly and immediately put in the ZIL transaction (itx), even
before it's part of a ZIL commit; this is done if ZFS knows that the
data is being written synchronously and it's not large enough to trigger
an indirect write. In a <em>needcopy</em> write, the data just hangs around in
RAM as part of ZFS's regular dirty data, and if a ZIL commit happens
that needs that data, the process of adding its itx to the log write
block will fetch the data from RAM and add it to the itx (or at least
the lwb).</p>
<p>There are ZIL metrics about this:</p>
<ul><li><code>zil_itx_indirect_count</code> and <code>zil_itx_indirect_bytes</code>
count how many indirect writes have been part of ZIL commits, and the
total size of the indirect writes of file data (not of the 'itx' records
themselves, per the comments in <a href="https://github.com/openzfs/zfs/blob/master/include/sys/zil.h">zil.h</a>).<p>
Since these are indirect writes, the data written is not part of
the ZIL (it's regular data blocks), although it is put on disk
as part of a ZIL commit. However, unlike other ZIL data, the data
written here would have been written even without a ZIL commit,
as part of ZFS's regular transaction group commit process. A ZIL
commit merely writes it out earlier than it otherwise would have
been.<p>
</li>
<li><code>zil_itx_copied_count</code> and <code>zil_itx_copied_bytes</code> count how
many 'copied' writes have been part of ZIL commits and the total size
of the file data written (and thus committed) this way.<p>
</li>
<li><code>zil_itx_needcopy_count</code> and <code>zil_itx_needcopy_bytes</code> count
how many 'needcopy' writes have been part of ZIL commits and the total
size of the file data written (and thus committed) this way.</li>
</ul>
<p>A regular system using ZFS may have little or no 'copied' activity.
Our NFS servers all have significant amounts of it, presumably
because some NFS data writes are done synchronously and so this
trickles through to the ZFS stats.</p>
<p>In a given pool, the ZIL can potentially be written to either the
main pool's disks or to a separate log device (a <em>slog</em>, which can
also be mirrored). The ZIL metrics have a collection of
<code>zil_itx_metaslab_*</code> metrics about data actually written to the
ZIL in either the main pool ('normal' metrics) or to a slog (the
'slog' metrics).</p>
<ul><li><code>zil_itx_metaslab_normal_count</code> counts how many ZIL <em>log
write blocks</em> (not ZIL records, itxs) have been committed to the
ZIL in the main pool. There's a corresponding 'slog' version of
this and all further zil_itx_metaslab metrics, with the same
meaning.<p>
</li>
<li><code>zil_itx_metaslab_normal_bytes</code> counts how many bytes have
been 'used' in ZIL log write blocks (for ZIL commits in the main
pool). This is a rough representation of how much space the ZIL
log actually needed, but it doesn't necessarily represent either
the actual IO performed or the space allocated for ZIL commits.<p>
As I understand things, this size includes the size of the intent
transaction records themselves and also the size of the associated
data for 'copied' and 'needcopy' data writes (because these are
written into the ZIL as part of ZIL commits, and so use space in log
write blocks). It doesn't include the data written directly to the
pool as 'indirect' data writes.</li>
</ul>
<p>If you don't use a slog in any of your pools, the 'slog' versions of
these metrics will all be zero. I think that if you have only slogs, the
'normal' versions of these metrics will all be zero.</p>
<p>In ZFS 2.2 and later, there are two additional statistics for
both normal and slog ZIL commits:</p>
<ul><li><code>zil_itx_metaslab_normal_write</code> counts how many bytes have
actually been written in ZIL log write blocks. My understanding
is that this includes padding and unused space at the end of a
log write block that can't fit another record.<p>
</li>
<li><code>zil_itx_metaslab_normal_alloc</code> counts how many bytes of space have
been 'allocated' for ZIL log write blocks, including any rounding up
to block sizes, alignments, and so on. I think this may also be the
logical size before any compression done as part of IO, although I'm
not sure if ZIL log write blocks are compressed.</li>
</ul>
<p>You can see some additional commentary on these new stats (and the
code) in <a href="https://github.com/openzfs/zfs/pull/14863">the pull request</a>
and <a href="https://github.com/openzfs/zfs/commit/b6fbe61fa6a75747d9b65082ad4dbec05305d496">the commit itself</a>.</p>
<p>PS: OpenZFS 2.2 and later has a currently undocumented '<code>zilstat</code>'
command, and its 'zilstat -v' output may provide some guidance on
what ratios of these metrics the ZFS developers consider interesting.
In its current state it will only work on 2.2 and later because it
requires the two new stats listed above.</p>
<h3>Sidebar: Some typical numbers</h3>
<p>Here is the "zil" file from <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/WorkMachine2017">my office desktop</a>,
which has been up for long enough to make it interesting:</p>
<blockquote><pre style="white-space: pre-wrap;">
zil_commit_count 4 13840
zil_commit_writer_count 4 13836
zil_itx_count 4 252953
zil_itx_indirect_count 4 27663
zil_itx_indirect_bytes 4 2788726148
zil_itx_copied_count 4 0
zil_itx_copied_bytes 4 0
zil_itx_needcopy_count 4 174881
zil_itx_needcopy_bytes 4 471605248
zil_itx_metaslab_normal_count 4 15247
zil_itx_metaslab_normal_bytes 4 517022712
zil_itx_metaslab_normal_write 4 555958272
zil_itx_metaslab_normal_alloc 4 798543872
</pre>
</blockquote>
<p>With these numbers we can see interesting things, such as that the
average number of ZIL transactions per commit is about 18 and
that my machine has never done any synchronous data writes.</p>
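<p>(This sort of arithmetic is easy to do directly from the kstat file,
since the value is always the third field of each line. For example, the
transactions per commit figure comes from:)</p>
<blockquote><pre style="white-space: pre-wrap;">
; awk '$1 == "zil_itx_count" {itx = $3}
       $1 == "zil_commit_writer_count" {c = $3}
       END {printf "itxs per ZIL commit: %.1f\n", itx / c}' /proc/spl/kstat/zfs/zil
</pre>
</blockquote>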
<p>Here's an excerpt from one of <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSFileserverSetupIII">our Ubuntu 22.04 ZFS fileservers</a>:</p>
<blockquote><pre style="white-space: pre-wrap;">
zil_commit_count 4 155712298
zil_commit_writer_count 4 155500611
zil_itx_count 4 200060221
zil_itx_indirect_count 4 60935526
zil_itx_indirect_bytes 4 7715170189188
zil_itx_copied_count 4 29870506
zil_itx_copied_bytes 4 74586588451
zil_itx_needcopy_count 4 1046737
zil_itx_needcopy_bytes 4 9042272696
zil_itx_metaslab_normal_count 4 126916250
zil_itx_metaslab_normal_bytes 4 136540509568
</pre>
</blockquote>
<p>Here we can see the drastic impact of NFS synchronous writes (the
significant 'copied' numbers), and also of large NFS writes in
general (the high 'indirect' numbers). This machine has written
many times more data in ZIL commits as 'indirect' writes than it
has written to the actual ZIL.</p>
</div>
What ZIL metrics are exposed by (Open)ZFS on Linux

<div class="wikitext"><p>Today I upgraded my home desktop to Fedora 39. It didn't entirely
go well; specifically, <a href="https://mastodon.social/@cks/111965809776629255">my DSL connection broke because Fedora
stopped packaging some scripts with rp-pppoe</a> and <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/NetworkScriptsAndPPPoE">Fedora's
old <code>ifup</code>, which is used by my very old-fashioned setup</a>, still requires those scripts. After I got
back on the Internet, I decided to try an idea I'd toyed with,
namely <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/NetworkManagerWhyConsidering">using NetworkManager to handle (only) my DSL link</a>. Unfortunately this did not go well:</p>
<blockquote><p>audit: op="connection-activate" uuid="[...]" name="[...]" pid=458524
uid=0 result="fail" reason="Connection '[...]' is not available on
device em0 because device is strictly unmanaged"</p>
</blockquote>
<p>The reason that em0 is 'unmanaged' by NetworkManager is that it's
managed by systemd-networkd, <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SystemdNetworkdWhy">which I like much better</a>. Well, also I specifically told NetworkManager
not to touch it by setting it as 'unmanaged' instead of 'managed'.</p>
<p>Although I haven't tested, I suspect that NetworkManager applies
this restriction to all VPNs and other layered forms of networking,
such that you can only run a NetworkManager managed VPN over a
network interface that NetworkManager is controlling. I find this
quite unfortunate. There is nothing that NetworkManager needs to
change on the underlying Ethernet link to run PPPoE or a VPN over
it; the network is a transport (a low level transport in the case
of <a href="https://en.wikipedia.org/wiki/Point-to-Point_Protocol_over_Ethernet">PPPoE</a>).</p>
<p>I don't know if it's theoretically possible to configure NetworkManager
so that an interface is 'managed' but NetworkManager doesn't touch
it at all, so that systemd-networkd and other things could continue
to use em0 while NetworkManager was willing to run PPPoE on top of
it. Even if it's possible in theory, I don't have much confidence
that it will be problem free in practice, either now or in the
future, because fundamentally I'd be lying to NetworkManager and
networkd. If NetworkManager really had a 'I will use this interface
but not change its configuration' category, it would have a third
option besides 'managed' or '(strictly) unmanaged'.</p>
<p>(My current solution is a hacked together script to start pppd and
pppoe with magic options researched through <a href="https://github.com/leahneukirchen/extrace">extrace</a> and a systemd service
that runs that script. I have assorted questions about how this is
going to interact with <a href="https://mastodon.social/@cks/111966685915895435">various things</a>, but someday I
will get answers, or perhaps unpleasant surprises.)</p>
<p>PS: Where this may be a special problem someday is if I want to run
a VPN over my DSL link. I can more or less handle running PPPoE by
hand, but the last time I looked at a by hand OpenVPN setup I rapidly
dropped the idea. NetworkManager is or would be quite handy for this
sort of 'not always there and complex' networking, but it apparently
needs to own the entire stack down to Ethernet.</p>
<p>(To run a NetworkManager VPN over 'ppp0', I would have to have
NetworkManager manage it, which would presumably require I have
NetworkManager handle the PPPoE DSL, which requires NetworkManager
not considering em0 to be unmanaged. It's NetworkManager all the
way down.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/NetworkManagerDoesNotShare?showcomments#comments">2 comments</a>.) </div>NetworkManager won't share network interfaces, which is a problem2024-02-26T21:43:53Z2024-02-21T03:55:01Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/ZFSPoolTXGsInformationcks<div class="wikitext"><p>As part of (Open)ZFS's general 'kstats' system for reporting
information about ZFS overall and your individual pools and datasets,
there is a per-pool /proc file that reports information about the
most recent N <a href="https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSTXGsAndZILs">transaction groups ('txgs')</a>, /proc/spl/kstat/zfs/<pool>/txgs.
How large N is depends on the <a href="https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Module%20Parameters.html#zfs-txg-history">zfs_txg_history</a>
parameter, which defaults to 100. The information in here may be quite
important for diagnosing certain sorts of performance problems but
I haven't found much documentation on what's in it. Well, let's try
to fix that.</p>
<p>The overall format of this file is:</p>
<blockquote><pre style="white-space: pre-wrap;">
txg birth state ndirty nread nwritten reads writes otime qtime wtime stime
5846176 7976255438836187 C 1736704 0 5799936 0 299 5119983470 2707 49115 27910766
[...]
5846274 7976757197601868 C 1064960 0 4702208 0 236 5119973466 2405 48349 134845007
5846275 7976762317575334 O 0 0 0 0 0 0 0 0 0
</pre>
</blockquote>
<p>(This example is coming from a system with four-way mirrored vdevs,
which is going to be relevant in a bit.)</p>
<p>So let's take these fields in order:</p>
<ol><li><code>txg</code> is the transaction group number, which is a steadily increasing
number. The file is ordered from the oldest txg to the newest, which
will be the current open transaction group.<p>
(In the example, txg 5846275 is the current open transaction group
and 5846274 is the last one that committed.)<p>
</li>
<li><code>birth</code> is the time when the transaction group (txg) was 'born', in
<em>nanoseconds</em> since the system booted.<p>
</li>
<li><code>state</code> is the current state of the txg; this will most often be either
'C' for committed or 'O' for open. You may also see 'S' for
syncing, 'Q' (being quiesced), and 'W' (waiting for sync). An
open transaction group will most likely have 0s for the rest of
the numbers, and will be the last txg (there's only one open txg
at a time). <strike>Any transaction group except the second last will be
in state 'C', because you can only have one transaction group in
the process of being written out.</strike><p>
Update: per the comment from Arnaud Gomes, you can have multiple
transaction groups at the end that aren't committed. I believe you
can only have one that is syncing ('S'), because that happens in a
single thread for only one txg, but you may have another that is
quiescing or waiting to sync.<p>
A transaction group's progress through its life cycle is open,
quiescing, waiting for sync, syncing, and finally committed. In
the open state, additional transactions (such as writing to files
or renaming them) can be added to the transaction group; once a
transaction group has been quiesced, nothing further will be added
to it.<p>
(See also <a href="https://www.delphix.com/blog/zfs-fundamentals-transaction-groups">ZFS fundamentals: transaction groups</a>,
which discusses how a transaction group can take a while to sync;
the content has also been added as a comment in the source
code in <a href="https://github.com/openzfs/zfs/blob/master/module/zfs/txg.c">txg.c</a>.)<p>
</li>
<li><code>ndirty</code> is how many bytes of directly dirty data had to be written
out as part of this transaction group; these bytes come, for example, from
user <code>write()</code> IO.<p>
It's possible to have a transaction group commit with a '0' for
<code>ndirty</code>. I believe that this means no IO happened during the
time the transaction group was open, and it's just being closed
on the timer.<p>
</li>
<li><code>nread</code> is how many bytes of disk reads the pool did between when
syncing of the txg starts and when it finishes ('during txg sync').</li>
<li><code>nwritten</code> is how many bytes of disk writes the pool did during txg sync.</li>
<li><code>reads</code> is the number of disk read IOs the pool did during txg sync.</li>
<li><code>writes</code> is the number of disk write IOs the pool did during txg sync.<p>
I believe these IO numbers include at least any extra IO needed
to read in on-disk data structures to allocate free space and any
additional writes necessary. I also believe that they track actual
bytes written to your disks, so for example with two-way mirrors
they'll always be at least twice as big as the <code>ndirty</code> number
(in my example above, with four way mirrors, their base is four
times <code>ndirty</code>).<p>
As we can see it's not unusual for <code>nread</code> and <code>reads</code> to be zero.
However, I don't believe that the read IO numbers are restricted
to transaction group commit activities; if something is reading
from the pool for other reasons during the transaction group commit,
that will show up in <code>nread</code> and <code>reads</code>. They are thus a measure
of the amount of read IO going during the txg sync process, not
the amount of IO necessary for it.<p>
I don't know if ongoing write IO to the ZFS Intent Log can happen
during a txg sync. If it can, I would expect it to show up in the
<code>nwritten</code> and <code>writes</code> numbers. Unlike read IO, regular write
IO can only happen in the context of a transaction group and so
by definition any regular writes during a txg sync are part of
that txg and show up in <code>ndirty</code>.<p>
</li>
<li><code>otime</code> is how long the txg was open and accepting new write IO, in
nanoseconds. Often this will be around the default <a href="https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Module%20Parameters.html#zfs-txg-timeout">zfs_txg_timeout</a>
time, which is normally five seconds. However, under (write) IO
pressure this can be shorter or longer (if the current open transaction
group can't be closed because there's already a transaction group in
the process of trying to commit).<p>
</li>
<li><code>qtime</code> is how long the txg took to be quiesced, in nanoseconds; it's
usually small.</li>
<li><code>wtime</code> is how long the txg took to wait to start syncing, in nanoseconds;
it's usually pretty small, since all it involves is the
separate syncing thread picking up the txg and starting to sync it.<p>
</li>
<li><code>stime</code> is how long the txg took to actually sync and commit, again
in nanoseconds. It's often appreciable, since it's where the actual
disk write IO happens.</li>
</ol>
<p>In the example "txgs" I gave, we can see that despite the first
committed txg listed having more dirty data than the last committed
txg, its actual sync time was only about a quarter of the last txg's
sync time. This might cause you to look at underlying IO activity
patterns, latency patterns, and so on.</p>
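<p>(If you want to pick out slow commits without eyeballing the whole
file, a little awk along these lines should do it; the one second
threshold is arbitrary and '&lt;pool&gt;' is a placeholder for your
pool's name.)</p>
<blockquote><pre style="white-space: pre-wrap;">
# committed txgs whose sync ('stime') took longer than a second
awk '$3 == "C" && $12 > 1e9 { printf "txg %s: ndirty %s bytes, stime %.0f ms\n", $1, $4, $12 / 1e6 }' /proc/spl/kstat/zfs/<pool>/txgs
</pre>
</blockquote>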
<p>As far as I know, there's no per-pool source of information about
the current amount of dirty data in the current open transaction
group (although once a txg has quiesced and is syncing, I believe
you do see a useful <code>ndirty</code> for it in the "txgs" file). A system
wide dirty data number can more or less be approximated from <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSOnLinuxARCMemoryReclaimStats">the
ARC memory reclaim statistics</a> in
the <code>anon_size</code> kstat plus the <code>arc_tempreserve</code> kstat, although
the latter seems to never get very big for us.</p>
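<p>(A sketch of that approximation, pulling the two kstats straight
out of the usual arcstats file:)</p>
<blockquote><pre style="white-space: pre-wrap;">
# rough system-wide dirty data: anon_size plus arc_tempreserve
awk '$1 == "anon_size" || $1 == "arc_tempreserve" { total += $3 } END { printf "~%.1f MiB dirty\n", total / 1048576 }' /proc/spl/kstat/zfs/arcstats
</pre>
</blockquote>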
<p>A new transaction group normally opens as the current transaction
group begins quiescing. We can verify this in the example output
by adding the birth time and the <code>otime</code> of txg 5846274, which add
up to exactly the birth time of txg 5846275, the current open txg.
If this sounds suspiciously exact down to the nanosecond, that's
because the code involved freezes the current time at one point and
uses it for both the end of the open time of the current open txg
and the birth time of the new txg.</p>
<h3>Sidebar: the progression through transaction group states</h3>
<p>Here is what I can deduce from reading through the OpenZFS kernel
code, and since I had to go through this I'm going to write it down.</p>
<p>First, although there is a txg 'birth' state, 'B' in the 'state'
column, you will never actually see it. Transaction groups are born
'open', per spa_txg_history_add() in <a href="https://github.com/openzfs/zfs/blob/master/module/zfs/spa_stats.c">spa_stats.c</a>.
Transaction groups move from 'O' open to 'Q' quiescing in
txg_quiesce() in <a href="https://github.com/openzfs/zfs/blob/master/module/zfs/txg.c">txg.c</a>, which
'blocks until all transactions in the group are committed' (which
I believe means they are finished fiddling around adding write IO).
This function is also where the txg finishes quiescing and moves
to 'W', waiting for sync. At this point the txg is handed off to
the 'sync thread', txg_sync_thread() (also in <a href="https://github.com/openzfs/zfs/blob/master/module/zfs/txg.c">txg.c</a>). When
the sync thread receives the txg, it will advance the txg to 'S',
syncing, call spa_sync(), and then mark everything as done,
finally moving the transaction group to 'C', committed.</p>
<p>(In the <a href="https://github.com/openzfs/zfs/blob/master/module/zfs/spa_stats.c">spa_stats.c</a> code, the txg state is advanced by a call
to spa_txg_history_set(), which will always be called with the
old state we are finishing. Txgs advance to syncing in
spa_txg_history_init_io(), and finish this state to move to
committed in spa_txg_history_fini_io(). The tracking of read
and write IO during the txg sync is done by saving a copy of
the top level vdev IO stats in spa_txg_history_init_io(),
getting a second copy in spa_txg_history_fini_io(), and then
computing the difference between the two.)</p>
<p>Why it might take some visible time to quiesce a transaction group
is more or less explained in the description of how ZFS's implementations
of virtual filesystem operations work, in the comment at the start
of <a href="https://github.com/openzfs/zfs/blob/master/module/os/linux/zfs/zfs_vnops_os.c">zfs_vnops_os.c</a>.
Roughly, each operation (such as creating or renaming a file) starts
by obtaining a transaction that will be part of the currently open
txg, then doing its work, and then committing the transaction. If
the transaction group starts quiescing while the operation is doing
its work, the quiescing can't finish until the work does and commits
the transaction for the rename, create, or whatever.</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSPoolTXGsInformation?showcomments#comments">2 comments</a>.) </div>What is in (Open)ZFS's per-pool "txgs" /proc file on Linux2024-02-26T21:43:53Z2024-02-14T03:26:14Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/AMDWithECCKernelMessagescks<div class="wikitext"><p>In general, <a href="https://utcc.utoronto.ca/~cks/space/blog/tech/UseECCIrritation">consumer x86 desktops have generally not supported
ECC memory</a>, at least not if you wanted
the 'ECC' bit to actually do anything. <a href="https://utcc.utoronto.ca/~cks/space/blog/tech/IntelCPUSegmentationIrritation">With Intel this seems to
have been an issue of market segmentation</a>, but things with AMD were
more confusing. The <a href="https://utcc.utoronto.ca/~cks/space/blog/tech/RyzenMemorySpeedAndECC">initial AMD Ryzen series seemed to generally
support ECC in the CPU</a>, but the
motherboard support was questionable, and even if your motherboard
accepted ECC DIMMs there was an open question of whether the ECC
was doing anything on any particular motherboard (<a href="https://utcc.utoronto.ca/~cks/space/blog/tech/ECCRAMSupportLevels">cf</a>). Later Ryzens have apparently had an
even more confusing ECC support story, but I'm out of touch on that.</p>
<p>When we put together <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/WorkMachine2017">my work desktop</a> we got ECC
DIMMs for it and I thought that theoretically the motherboard
supported ECC, but I've long wondered if it was actually doing
anything. Recently I was looking into this a bit <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/MyMachineDesires2024">for reasons</a> and ran across Rain's <a href="https://sunshowers.io/posts/am5-ryzen-7000-ecc-ram/">ECC RAM on AMD Ryzen
7000 desktop CPUs</a>,
which contained some extremely useful information about how to tell
from your boot messages on AMD systems. I'm going to summarize this
and add some extra information I've dug out of things.</p>
<p>Modern desktop CPUs talk to memory themselves, but not quite directly
from the main CPU; instead, they have a separate on-die memory
controller. On AMD Zen series CPUs, this is the AMD <a href="https://github.com/oxidecomputer/illumos-gate/blob/5f01ecd8941eadb64bc15b1a02c468604c1a503e/usr/src/uts/intel/sys/amdzen/umc.h#L22">Unified Memory
Controller</a>,
and there are special interfaces to talk to it. As I understand
things, ECC is handled (or not) in the UMC, where it receives the
raw bits from your DIMMs (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/CheckingRAMDIMMInfo">if your DIMMs are wide enough, which
you may or may not be able to tell</a>). Therefore,
to have ECC support active, you need ECC DIMMs and for ECC to be
enabled in your UMC (which I believe is typically controlled by
the BIOS, assuming the UMC supports ECC, which depends on the CPU).</p>
<p>In Linux, reporting and managing ECC is handled through a general
subsystem called <a href="https://www.kernel.org/doc/html/latest/driver-api/edac.html">EDAC</a>, with
specific hardware drivers. The normal AMD EDAC driver is amd64_edac,
and <a href="https://sunshowers.io/posts/am5-ryzen-7000-ecc-ram/">as covered by Rain</a>, it registers
for memory channels only if the memory channel has ECC on in the
on-die UMC. When this happens, you will see a kernel message to the
effect of:</p>
<blockquote><pre style="white-space: pre-wrap;">
EDAC MC0: Giving out device to module amd64_edac controller F17h: DEV 0000:00:18.3 (INTERRUPT)
</pre>
</blockquote>
<p>It follows that if you do see this kernel message during boot, you
almost certainly have fully supported ECC on your system. It's very
likely that your DIMMs are ECC DIMMs, your motherboard supports ECC
in the hardware and in its BIOS (and has it enabled in the BIOS if
necessary and applicable), and your CPU is willing to do ECC with
all of this. Since the above kernel message comes from <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/WorkMachine2017">my office
desktop</a>, it seems almost certain that it does
indeed fully support ECC, although I don't think I've ever seen
any kernel messages about detecting and correcting ECC issues.</p>
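<p>(A quick way to check on a running system, assuming the kernel
ring buffer or the journal still has the boot messages:)</p>
<blockquote><pre style="white-space: pre-wrap;">
dmesg | grep -i edac
# or, for the current boot via the systemd journal:
journalctl -k -b | grep -i edac
</pre>
</blockquote>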
<p>You can see more memory channels in larger systems and they're not
necessarily sequential; one of our large AMD machines has 'MC0' and
'MC2'. You may also see a message about 'EDAC PCI0: Giving out
device to [...]', which is about a different thing.</p>
<p>In the normal Linux kernel way, various EDAC memory controller
information can be found in sysfs under /sys/devices/system/edac/mc
(assuming that you have anything registered, which you may not on
a non-ECC system). This appears to include counts of corrected
errors and uncorrected errors both at the high level of an entire
memory controller and at the level of 'rows', 'ranks', and/or 'dimms'
depending on the system and the kernel version. You can also see
things like the memory EDAC mode, which could be 'SECDED' (what
<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/WorkMachine2017">my office desktop</a> reports) or 'S8ECD8ED' (what
a large AMD server reports).</p>
<p>(The 'MC<n>' number reported by the kernel at boot time doesn't
necessarily match the /sys/devices/system/edac/mc<n> number. We
have systems which report 'MC0' and 'MC2' at boot, but have 'mc0'
and 'mc1' in sysfs.)</p>
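<p>(A sketch of poking at the counters by hand; ce_count and ue_count
are the per-controller corrected and uncorrected error counts, while
the EDAC mode is reported per DIMM or rank and its exact path varies
by kernel version.)</p>
<blockquote><pre style="white-space: pre-wrap;">
grep -H . /sys/devices/system/edac/mc/mc*/ce_count /sys/devices/system/edac/mc/mc*/ue_count
# EDAC mode(s), reported per DIMM or rank depending on the kernel
cat /sys/devices/system/edac/mc/mc*/*/dimm_edac_mode 2>/dev/null | sort -u
</pre>
</blockquote>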
<p>The <a href="https://github.com/prometheus/node_exporter">Prometheus host agent</a>
exposes this EDAC information as metrics, primarily in
node_edac_correctable_errors_total and
node_edac_uncorrectable_errors_total. We have seen a few corrected
errors over time on one particular system.</p>
<h3>Sidebar: EDAC on Intel hardware</h3>
<p>While there's an Intel memory controller EDAC driver, I don't know
if it can get registered even if you don't have ECC support. If
it is registered with identified memory controllers, and you can
see eg 'SECDED' as the EDAC mode in /sys/devices/system/edac/mc/mcN,
then I think you can be relatively confident that you have ECC
active on that system. On <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/HomeMachine2018">my home desktop</a>, which
definitely doesn't support ECC, what I see on boot for EDAC (with
Fedora 38's kernel 6.7.4) is:</p>
<blockquote><pre style="white-space: pre-wrap;">
EDAC MC: Ver: 3.0.0
EDAC ie31200: No ECC support
EDAC ie31200: No ECC support
</pre>
</blockquote>
<p>As expected there are no 'mcN' subdirectories in
/sys/devices/system/edac/mc.</p>
<p>Two Intel servers where I'm pretty certain we have ECC support report,
respectively:</p>
<blockquote><pre style="white-space: pre-wrap;">
EDAC MC0: Giving out device to module skx_edac controller Skylake Socket#0 IMC#0: DEV 0000:64:0a.0 (INTERRUPT)
</pre>
</blockquote>
<p>and</p>
<blockquote><pre style="white-space: pre-wrap;">
EDAC MC0: Giving out device to module ie31200_edac controller IE31200: DEV 0000:00:00.0 (POLLED)
</pre>
</blockquote>
<p>As we can see here, Intel CPUs have more than one EDAC driver, depending
on CPU generation and so on. The first EDAC message comes from a system
with a Xeon Silver 4108, the second from a system with a Xeon E3-1230 v5.</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/AMDWithECCKernelMessages?showcomments#comments">One comment</a>.) </div>Linux kernel boot messages and seeing if your AMD system has ECC2024-02-26T21:43:53Z2024-02-13T03:37:18Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/HomeBackupPlans2024cks<div class="wikitext"><p>In theory, what I should do to back up <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/HomeMachine2018">my home desktop</a>
is fairly straightforward. I should get one or two USB hard drives
of sufficient size, then periodically connect one and do a backup
to it (probably using tar, and potentially not compressing the tar
archives to make them more recoverable in the face of disk errors).
If I'm energetic, I'll have two USB hard drives and periodically
rotate one to the office as an offsite backup. <a href="https://utcc.utoronto.ca/~cks/space/blog/tech/SortingOutModernUSB">Modern USB</a> should be fast enough for this,
and hopefully using (fast) USB drives will no longer <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/USBDrivesKillMyPerformance">kill my
performance the way it used to</a>.
Large HDDs are reasonably affordable, especially if I decide
to live with 5400 RPM ones (which I hope run cooler), so I
could store multiple full system backups on a single HDD.</p>
<p>In practice this is a lot of things to remember to do on a regular
basis, and although I have some of the pieces (and have for years),
those pieces have dust on them from disuse. So this approach isn't
workable as a way to get routine backups; at best I might manage
to do it once every few months. So instead I long ago came up with
a plan that is not so much better as more likely to succeed. The
short version of the plan is that I will make backups to an additional
live HDD in my home desktop.</p>
<p>My home desktop's storage used to be a mirrored pair of SSDs and a
mirrored but mismatched pair of HDDs. Back in early 2023, <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SoftwareRaidSwitchingDisks">this
became all solid state</a>, with a pair
of NVMe drives and a pair of SSDs (not the same SSDs, the new pair
is much larger). This leaves me with an unused 4 TB HDD, which I
actually (still) have in the case. So I can reuse this 4 TB HDD as
an always-live backup drive, or what is really 'a second copy'
drive. Because the drive will always be there and live, I can
automate copies to it, run them from cron, and more or less forget
about it (once it's working).</p>
<p>The obvious and most readily automated way to make the backups is
to use ZFS snapshots. I'll make a new ZFS pool on the HDD, and then
use snapshots with 'zfs send' and 'zfs receive' to move them from
the solid state storage to the HDD pool. ZFS's read only snapshots
will insure that I can't accidentally damage the backup copies, and
I can scrub the HDD's ZFS pool periodically as insurance against
disk corruption. My total space usage in both my current solid
state ZFS pools is still a bit under 2 TB, so I should have plenty
of space for both on a 4 TB HDD.</p>
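<p>(A minimal sketch of what such a cron-driven copy could look like;
the pool names 'ssdpool' and 'hdbackup', the snapshot naming, and the
total lack of error handling are all illustrative assumptions rather
than a finished script.)</p>
<blockquote><pre style="white-space: pre-wrap;">
#!/bin/sh
# take a dated recursive snapshot and replicate it incrementally to the HDD pool
# (the very first run needs a full, non-incremental 'zfs send' instead)
prev=$(zfs list -H -d 1 -t snapshot -o name -s creation ssdpool | tail -1 | cut -d@ -f2)
snap="backup-$(date +%Y-%m-%d)"
zfs snapshot -r "ssdpool@$snap"
zfs send -R -i "@$prev" "ssdpool@$snap" | zfs receive -d -u hdbackup
</pre>
</blockquote>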
<p>This is obviously imperfect, since various sorts of problems could
cost me both the live storage and the HDD, and I could have ZFS
problems too. But it's a lot better than nothing, and <a href="https://utcc.utoronto.ca/~cks/space/blog/tech/PerfectionTrap">sometimes
the perfect is the enemy of the good</a>.</p>
<p>(Having written this, perhaps I will actually implement it. The
current obstacle is that the old HDDs are still running my old LVM
setup, as backup for the ZFS pool I created on the new SSDs and
then theoretically moved all of the LVM's contents to. So I'd have
to hold my breath and tear down those filesystems and the LVM storage
first. Destroying even supposedly completely surplus data makes me
twitch just a bit, and so far it's been easier to do nothing.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/HomeBackupPlans2024?showcomments#comments">5 comments</a>.) </div>My plan for backups of my home machine (as of early 2024)2024-02-26T21:43:53Z2024-02-11T03:00:55Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/MyMachineDesires2024cks<div class="wikitext"><p>My current <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/WorkMachine2017">work desktop</a> and <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/HomeMachine2018">home desktop</a> are getting somewhat long in the tooth, which has
caused me to periodically think about what I'd want in new hardware
for them. Sometimes I even look at potential hardware choices for
such a replacement desktop (<a href="https://utcc.utoronto.ca/~cks/space/blog/tech/CPUIGPCoolingAdvantage">which can lead to grumbling</a>). Today I want to write down my
ideal broad specifications for such a new desktop, what I'd get if
I could get it all in one spot for an affordable price.</p>
<p>In addition to all of the expected things (like onboard sound),
I'd like:</p>
<ul><li>64 GB of RAM instead of my current 32 GB. It would be nice if it
was ECC RAM in a system that genuinely supported it, and it would
also be nice if it was fast, but those two attributes are often in
opposition to each other.<p>
(Today I suspect this means choosing DDR5 over DDR4.)<p>
</li>
<li>Three motherboard M.2 NVMe drive slots. I'd like three because I
currently have a mirrored pair of NVMe drives, and having a third
slot would let me replace one of the live two without having to
pull it outright. Two motherboard M.2 NVMe slots (both operating
at PCIe x4) is probably my minimum these days, and I already have
a PCIe M.2 NVMe card for the current work desktop.<p>
My work desktop has 500 GB NVMe drives currently and I'd like to
get bigger ones. My home desktop is fine with its current drives.<p>
</li>
<li>At least four SATA ports and ideally more. My office desktop has
two SSDs and a SATA DVD-RW drive (because we still sometimes use
those), and I want to be able to run three SSDs at once while
replacing one of the two SSDs. Six SATA ports would be better,
so perhaps I should say I can live with four SATA ports but I'd
like six.<p>
(My home desktop will also need three SATA ports on a routine
basis with a fourth available for drive replacement, but that's
for another entry.)<p>
</li>
<li>At least three 1G Ethernet ports for my work desktop. Since I don't
think there are any reasonable desktop motherboards with this
many Ethernet ports, this needs at least a dual-port PCIe card
and perhaps a quad-port card, which I already have at work. It
also needs a suitable PCIe slot to be free and usable given any
other cards in the machine. My home desktop can get by with one
port but I'd probably like to have two or three there too.<p>
(I wouldn't need that many but <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/VirtManagerMySetupSoFar">Linux's native virtualization
works best if you give it its own network port</a>.)<p>
Although various desktop motherboards have started offering
speeds above 1G (although often not full 10G-T), our work
wiring situation is such that there's no real prospect of
taking advantage of that any time soon. But if a motherboard
comes with '2.5G' or '5G' networking with a chipset that's
decent and well supported by Linux, I wouldn't say no.<p>
</li>
<li>At least two DisplayPort and/or HDMI outputs that support at least
4K at 60 Hz, and I'd like more for future-proofing. I would prefer
two DisplayPort outputs to a DisplayPort + HDMI pairing; this is
readily available in GPU cards but not really in motherboards and
integrated graphics. At work I currently have two 27" HiDPI
displays and at home I currently have one; in both locations the
biggest constraint on larger displays or more of them is physical
space.<p>
(I'd love it if we were moving into a bright future of high
resolution, high DPI, high refresh rate displays, but I don't
think we are, so I don't really expect to want more than dual 4K
at 60Hz for the next half decade or more. It's possible this is
too pessimistic and there are viable 5K+ monitors that I might
want at home in place of my current 27" 4K HiDPI display.)<p>
</li>
<li>Open source friendly graphics, which in practice excludes Nvidia
GPUs (especially if I care about good Wayland support), and
possibly the discrete Intel GPU cards (I'm not sure of their
state). I think anything reasonably modern will support whatever
OpenGL features Wayland needs or is likely to need. The easy way
to get this might well be integrated graphics on a current
generation CPU, assuming I can get the output ports that I want.<p>
On the other hand, the Intel ARC A380 seems to be okay on Linux
(from some Internet searches), and while it has a fan it's alleged
to be able to operate very quietly. It would give me the multiple
DisplayPort outputs and high resolution, high refresh rate support.<p>
</li>
<li>A decent number of both USB-A and USB-C ports. I'd like a reasonable
number of USB-A ports because I still have a lot of USB-A things
and I'd like not to have a whole collection of USB-A hubs sitting
around on my either my office or my home desk. But probably more
hubs (or larger ones) is in my future.</li>
</ul>
<p>I'd like it if the machine still supported old fashioned BIOS MBR
booting and didn't require (U)EFI booting (<a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/BIOSMBRBootingOverUEFI">I have my reasons</a>), although UEFI booting is
probably better on desktop motherboards <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/MBRToUEFIBootFailure">than it used to be</a>. The UEFI story for people who want booting
from mirrored pairs of drives may be better on Fedora than it used
to be, since <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/Ubuntu2204MultiDiskUEFI">Ubuntu 22.04 has some support for duplicate UEFI
boot partitions</a>.</p>
<p>(I'm absolutely not interested in trying to mirror the EFI System
Partition behind the back of the UEFI BIOS.)</p>
<p>It would be nice to get a good CPU performance increase from my
current desktops, but on the one hand I sort of assume that any
decent desktop CPU today is going to be visibly better than something
from more than five years ago, and on the other hand I'm not sure
how noticeable the performance improvement is these days, and on the
third hand <a href="https://utcc.utoronto.ca/~cks/space/blog/tech/ChangingComputerPerformance">I've been wrong before</a>.
If my current (five year old) desktops have reached the point where
CPU performance mostly doesn't matter to me, then I'd probably
prefer to get a midrange CPU with decent thermal performance and
perhaps no funny slow 'efficiency' cores that can give you and
Linux's kernel CPU scheduling various sorts of heartburn. On the
other hand, my Firefox build times keep getting slower and slower,
so I suspect that the world of software just assumes current CPUs
and current good performance.</p>
<p>PS: I have no plans to do GPU computation on my desktops, for a
variety of reasons including that I don't want to deal with Nvidia
GPUs in my machines. If I need to do GPU stuff for work, <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/SlurmHowWeUseIt">our SLURM
cluster</a> has GPUs, and I don't have to
care how much power they use, how noisy they are, and how much heat
they put out because they're in the machine room (and I'm not).</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/MyMachineDesires2024?showcomments#comments">7 comments</a>.) </div>What I'd like in a hypothetical new desktop machine in 20242024-02-26T21:43:53Z2024-02-08T04:50:44Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/NFSv4MaxConnectEffectscks<div class="wikitext"><p>Suppose, not hypothetically, that you've converted your fleet from
using NFS v3 to using <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/NFSv4BasicsJustWork">basic Unix security NFS v4 mounts</a> when they mount their hordes of NFS filesystems
from <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSFileserverSetupIII">your NFS fileservers</a>. When your NFS
clients boot or at some other times, you notice that you're getting
a bunch of copies of a new kernel message:</p>
<blockquote><pre style="white-space: pre-wrap;">
SUNRPC: reached max allowed number (1) did not add transport to server: <IP address>
</pre>
</blockquote>
<p>Modern NFS uses TCP, which means that the NFS client needs to make
some number of TCP connections to each NFS server. In NFS v3, <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/NFSOneTCPConnectionToAServer">Linux
normally only makes one connection to each server</a>. The same is sort of true in NFS v4
as well, but NFS v4 is more complex about what is 'a server'. In
NFS v3, servers are identified by at least their IP address (and
perhaps their name; I'm not sure if two different names that map
to the same IP will share the same connection). In NFS v4.1+, servers
have some sort of intrinsic identity that is visible to clients
even if you're talking to them by multiple IP addresses.</p>
<p>This new 'reached max allowed number (<N>) did not add transport
to server' kernel message is reporting about this case. You (we)
have a single NFS server that for historical reasons has two different
IPs, one for most of its filesystems and one for <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/OurPasswordPropagation">our central
administrative filesystem</a>,
and now NFS v4 considers these the 'same' server and won't make an
extra connection to the second IP.</p>
<p>You might wonder if you can change this, and the answer is that you
can but it gets complex and I'm not quite sure how it all works to
distribute the actual NFS traffic. There appear to be two interlinked
things that you can control; how many connections a NFS v4 client
will make to a single NFS server, and how many different IPs of the
server that NFS v4 client will connect to. How many connections NFS
v4 will make to a single server is mostly controlled by <a href="https://man7.org/linux/man-pages/man5/nfs.5.html">nfs(5)</a>'s <code>nconnect</code>
setting, <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/NFSv3NConnectEffects">sort of like <code>nconnect</code>'s behavior with NFS v3</a>. How many connections NFS v4 will make to
separate client IPs is controlled by '<code>max_connect</code>'. Both of
these default to 1. However, how they interact is confusing and I'm
not sure I fully understand it.</p>
<p>The easy case is not setting nconnect and setting max_connect
to at least as many different IP aliases as you have for each
fileserver. In this case you'll get one TCP connection per server
IP (although don't ask me what traffic flows over what connection).
If you set nconnect without max_connect, you'll get however
many connections to the first IP address of each server (well, the
first IP address that the client finds), assuming that you mount
at least that many NFS filesystems from that server.</p>
<p>However, if you set both nconnect and max_connect, what seems
to happen (on Ubuntu 22.04) is that you get nconnect TCP
connections to each server's first (encountered) IP address, and
then one TCP connection to every other IP address (up to the
max_connect limit). This is why I described 'nconnect' as
controlling how many connections NFS v4 would make to a single
server, instead of a single server IP (or name). It would be a bit
more useful if you could set nconnect on a per-IP (or name) basis
in NFS v4, or otherwise make it so that the first IP didn't get all
of the connections.</p>
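<p>(For concreteness, here is roughly what setting both looks like as
an fstab line; the server name, export path, and NFS version are
made-up examples.)</p>
<blockquote><pre style="white-space: pre-wrap;">
# /etc/fstab
fileserver:/w/430  /w/430  nfs4  rw,vers=4.2,nconnect=2,max_connect=4  0 0
</pre>
</blockquote>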
<p>(This is apparently called 'trunking' in NFS v4, per <a href="https://datatracker.ietf.org/doc/html/rfc5661#section-2.10.5">RFC 5661
section 2.10.5</a>
(<a href="https://www.truenas.com/community/threads/nfsv4-1-session-trunking-multipath-support-not-nconnect-or-pnfs.112215/">via</a>).)</p>
</div>
What the <code>max_connect</code> Linux NFS v4 mount parameter seems to do2024-02-26T21:43:53Z2024-02-07T03:49:05Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/PSIIRQNumbersAndMeaningscks<div class="wikitext"><p>For some time, the Linux kernel has had both general and per-cgroup
'<a href="https://www.kernel.org/doc/html/latest/accounting/psi.html">Pressure Stall Information</a>', which
is intended to tell you something about when things on your system
are stalling on various resources. The initial implementation
provided this information for cpu usage, obtaining memory, and
waiting on IO, as I wrote up in <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/PSINumbersAndMeanings">my notes on PSI</a>.
In kernel 6.1, an additional PSI file was added, 'irq' (if your
kernel is built with CONFIG_IRQ_TIME_ACCOUNTING, which current
Fedora kernels are).</p>
<p>One important reference for this is <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=52b1364ba0b105122d6de0e719b36db705011ac1">the kernel commit that added
this feature</a>.
Another is Eva Lacy's <a href="https://www.lacy.ie/technology/2023/10/22/pressure-stall-information.html">Pressure Stall Information in Linux</a>.
However, both of these can be a little opaque about what's actually
being calculated and reported in 'irq'.</p>
<p>The /proc/pressure/irq file will typically look like the other pressure
files, with the exception that it only has a 'full' line:</p>
<blockquote><pre style="white-space: pre-wrap;">
full avg10=0.00 avg60=0.00 avg300=0.00 total=3753500244
</pre>
</blockquote>
<p><a href="https://utcc.utoronto.ca/~cks/space/blog/linux/PSINumbersAndMeanings">As usual</a>, the 'total=' number is the
cumulative time in microseconds that tasks have been stalled on IRQ
or soft IRQs. What 'stalled' means here is that at the end of every
round of IRQ and softirq handling, the kernel works out the total
amount of time that it spent doing this (the 'delta time' in the
commit message), looks to see if there's a meaningful current task
(I believe 'on this CPU'), and if there is, the time is added to
'total'.</p>
<p>There is no 'some' line for the inverse reason of <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/PSICpuWhyNoFull">why there's
no 'full' line in the global 'cpu' pressure file</a>.
In the CPU case, there's always something running (globally), so
you can't have a complete stall on CPU the way you can have on
memory or IO, where all tasks could be waiting to get more memory
or have their IO complete. In the case of IRQ handling, either there
was no task running (on the CPU), in which case nothing is impeded
by the IRQ handling time, or there was a task running at the time
the IRQ handling happened, in which case it completely stalled for
the duration.</p>
<p>If I'm understanding all of this correctly, one corollary is that
'irq' pressure only happens to the extent that your system is busy.
Given a fixed amount of time spent handling IRQs and softirqs, the
amount of that time that shows up in /proc/pressure/irq depends on
how often it's interrupting a (running) task, which depends on how
many running tasks you have. On an idle system, the IRQ and softirq
time isn't preempting anything and it's 'free', at least from the
perspective of the PSI system.</p>
<p>Based on reading <a href="https://man7.org/linux/man-pages/man5/proc.5.html">proc(5)</a>, you can get
the total amount of time that the system has spent handling IRQs
and softirqs from the 6th and 7th numbers on the first 'cpu' line
in /proc/stat (the 6th number will be zero if IRQ time accounting
isn't enabled for your kernel). On most machines, this will be in
units of 100ths of a second. You can then cross-compare this to the
total in /proc/pressure/irq. On my home Fedora machine (the one the
sample line comes from), the irq pressure time is about 3% of the
total IRQ handling time; on my work desktop, it's currently about
6%.</p>
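<p>(A sketch of doing that cross-comparison in one go; this assumes
the usual USER_HZ of 100 for /proc/stat, so each tick is 10,000
microseconds.)</p>
<blockquote><pre style="white-space: pre-wrap;">
# IRQ plus softirq time from /proc/stat, converted to microseconds
irq_us=$(awk 'NR == 1 { print ($7 + $8) * 10000 }' /proc/stat)
# cumulative stall time from the pressure file, already in microseconds
psi_us=$(awk -F'total=' '/full/ { print $2 }' /proc/pressure/irq)
awk -v p="$psi_us" -v i="$irq_us" 'BEGIN { printf "irq pressure is %.1f%% of IRQ handling time\n", 100 * p / i }'
</pre>
</blockquote>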
<p>(I suspect that all of this means that /proc/pressure/irq won't be
very interesting on many systems, which is good because tools like
<a href="https://github.com/prometheus/node_exporter">the Prometheus host agent</a>
may not have been updated to report it.)</p>
<p>PS: Ubuntu 22.04 kernels don't set CONFIG_IRQ_TIME_ACCOUNTING,
although they're too old to have /proc/pressure/irq. As far as I
can tell, this is still the case in the future 24.04 kernel (<a href="https://en.wikipedia.org/wiki/Ubuntu_version_history">'Noble
Numbat'</a>, and
thus 'noble' on places like <a href="https://packages.ubuntu.com/">packages.ubuntu.com</a>). This is potentially a little bit
unfortunate, but <a href="https://tanelpoder.com/posts/linux-hiding-interrupt-cpu-usage/">it's apparently been this way for some
time</a>.</p>
</div>
Notes on the Linux kernel's 'irq' pressure stall information and meaning2024-02-26T21:43:53Z2024-01-19T03:24:00Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/CgroupV2InterestingMetricscks<div class="wikitext"><p>In <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusExporters-2023">my roundup of what Prometheus exporters we use</a>, I mentioned that we didn't have
a way of generating resource usage metrics for systemd services,
which in practice means <a href="https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html">unified cgroups (cgroup v2)</a>. This
raises the good question of what resource usage and performance
metrics are available in cgroup v2 that one might be interested in
collecting for systemd services.</p>
<p>You can want to know about resource usage of systemd services (or
more generally, systemd units) for a variety of reasons. Our reason
is generally to find out what specifically is using up some resource
on a server, and more broadly to have some information on how much
of an impact a service is having. I'm also going to assume that all
of the relevant cgroup resource controllers are enabled, which is
increasingly the case on systemd based systems.</p>
<p>In each cgroup, you get the following:</p>
<ul><li><a href="https://utcc.utoronto.ca/~cks/space/blog/linux/PSINumbersAndMeanings">pressure stall information</a> for CPU,
memory, IO, and these days IRQs. This should give you a good idea of
where contention is happening for these resources.<p>
</li>
<li><a href="https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#cpu-interface-files">CPU usage information</a>,
primarily the classical count of user, system, and total usage.<p>
</li>
<li><a href="https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#io-interface-files">IO statistics (if you have the right things enabled)</a>,
which are enabled on some but not all of our systems. For us, this
appears to have the drawback that it doesn't capture information
for NFS IO, only local disk IO, and it needs decoding to create
useful information (ie, information associated with a named device,
which you find out the mappings for from /proc/partitions and
/proc/self/mountinfo).<p>
(This might be more useful for virtual machine slices, where it
will probably give you an indication of how much IO the VM is doing.)<p>
</li>
<li><a href="https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#memory">memory usage information</a>,
giving both a simple amount assigned to that cgroup ('<code>memory.current</code>')
and a relatively detailed breakdown of how much of what sorts of memory
has been assigned to the cgroup ('<code>memory.stat</code>'). As I've found out
repeatedly, the simple number can be misleading depending on what you
want to really know, because it includes things like <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/CgroupsMemoryUsageAccounting">inactive file
cache</a> and <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/CgroupsMemoryUsageAccountingII">inactive, reclaimable kernel
slab memory</a>.<p>
(You also get swap usage, in '<code>memory.swap.current</code>', and there's
also '<code>memory.zswap.current</code>'.)<p>
In a Prometheus exporter, I might simply report all of the entries in
<code>memory.stat</code> and sort it out later. This would have the drawback of
creating a bunch of time series, but it's probably not an overwhelming
number of them.</li>
</ul>
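<p>(To look at these files by hand for a particular systemd service,
something like the following should work; 'cron.service' is just an
example unit.)</p>
<blockquote><pre style="white-space: pre-wrap;">
# find the unit's cgroup, then read some of the files mentioned above
cg=$(systemctl show -p ControlGroup --value cron.service)
grep -H . /sys/fs/cgroup/$cg/cpu.stat /sys/fs/cgroup/$cg/memory.current
grep -H . /sys/fs/cgroup/$cg/cpu.pressure /sys/fs/cgroup/$cg/memory.pressure /sys/fs/cgroup/$cg/io.pressure
</pre>
</blockquote>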
<p>Although the cgroup doesn't directly tell you how many processes
and threads it contains, you can read '<code>cgroup.procs</code>' and
'<code>cgroup.threads</code>' to count how many entries they have. It's
probably worth reporting this information.</p>
<p>The root cgroup has some or many of these files, depending on your
setup. Interestingly, in Fedora and Ubuntu 22.04, it seems to have
an '<code>io.stat</code>' even when other cgroups don't have it, although I'm
not sure how useful this information is for the root cgroup.</p>
<p>Were I to write a systemd cgroup metric collector, I'd probably
only have it report on first level and second level units (so
'systemd.slice' and then 'cron.service' under systemd.slice). Going
deeper than that doesn't seem likely to be very useful in most cases
(and if you go into user.slice, you have cardinality issues). I
would probably skip '<code>io.stat</code>' for the first version and leave it
until later.</p>
<p>PS: I believe that some of this information can be visualized live
through <a href="https://www.freedesktop.org/software/systemd/man/systemd-cgtop.html">systemd-cgtop</a>.
This may be useful to see if your particular set of systemd services
and so on even have useful information here.</p>
</div>
Some interesting metrics you can get from cgroup V2 systems2024-02-26T21:43:53Z2024-01-18T03:40:46Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/ZFSZEDOurZedletUsecks<div class="wikitext"><p>One of the components of <a href="https://openzfs.org/wiki/Main_Page">OpenZFS</a>
is the <a href="https://openzfs.github.io/openzfs-docs/man/master/8/zed.8.html">ZFS Event Daemon ('zed')</a>.
Old ZFS hands will understand me if I say that it's the OpenZFS
equivalent of the Solaris/Illumos fault management system as applied
to ZFS; for other people, it's best described as ZFS's system for
handling (kernel) ZFS events such as ZFS pools experiencing disk
errors. Although the manual page obfuscates this a bit, what ZED
does is it runs scripts (or programs in general) from a particular
directory, normally /etc/zfs/zed.d, choosing what scripts to run
for particular events based on their names. OpenZFS ships with a
number of <em>zedlets</em> ('zedlet' is the name for these scripts), and
you can add your own, which we do in <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSFileserverSetupIII">our ZFS fileserver environment</a>.</p>
<p>The standard ZED setup supports a number of relatively standard
notification methods, including email; we enable this in our
/etc/zfs/zed.d/zed.rc. The email you get through these standard
notifications is a bit generic but it's a useful starting point
and fallback. Beyond this, we have three additional zedlets we
add:</p>
<ul><li>one zedlet simply syslogs full details about almost all events by doing
almost literally the following:<p>
<blockquote><pre style="white-space: pre-wrap;">
printenv | fgrep 'ZEVENT_' | sort | fmt -999 |
logger -p daemon.info -t 'cslab-zevents'
</pre>
</blockquote>
<p>
ZED has an 'all-syslog.sh' zedlet that's normally enabled, but it
doesn't capture absolutely everything this way and it believes in
reformatting information a bit. We wanted to capture full event
information so we could do as complete a reconstruction of things
as possible later.<p>
</li>
<li>one zedlet syslogs when vdev state changes happen (and what they
are) and immediately triggers <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSOurSparesSystemV">our ZFS status reporting and spares
handling system</a>. Because ZED treats individual
disks as vdevs, this is triggered for things like loss of disks and
disk read, write, or checksum errors. Our own system for this will
then email us a report about issues and start any sparing that's
necessary (which will probably result in more email).<p>
</li>
<li>one zedlet syslogs when resilvers complete and triggers a run of
<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSOurSparesSystemV">our ZFS status reporting and spares handling system</a>. This will report to us when a pool becomes
healthy again and possibly start another round of sparing if we
were holding back to not have too many resilvers happening at once.</li>
</ul>
<p>Because ZED has a hard-coded ten second timeout on zedlets, we have to
run our status reporting and spares handling in the background of the
zedlet, which means we need to use some straightforward shell locking.</p>
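<p>(The shape of such a zedlet is roughly the following sketch; the
lock file path and the status/spares command are hypothetical
stand-ins, not our actual names.)</p>
<blockquote><pre style="white-space: pre-wrap;">
#!/bin/sh
# return to ZED immediately; do the real work in the background,
# serialized with flock(1) so overlapping events don't collide
(
  flock -n 9 || exit 0              # another run is already in progress
  /our/zfs-status-and-spares-check  # hypothetical placeholder command
) 9>/var/run/zfs-status-zedlet.lock &
exit 0
</pre>
</blockquote>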
<p>The net effect of this setup is that we'll generally get at least
two emails if a disk has problems. One email will be generically
formatted and come from the standard ZED email notification generated
by the various '*-notify.sh' zedlets. The second email comes
from our own ZFS status reporting system, using our own tools to
report and summarize ZFS pool status with informative (for us) disk
names and so on.</p>
<h3>Sidebar: Why we have our own email reporting</h3>
<p>A typical status report can look something like this:</p>
<blockquote><pre style="white-space: pre-wrap;">
Subject: sanhealthmon: details of ZFS pool problems on sanshui
</pre>
<pre style="white-space: pre-wrap;">
Newly degraded pools:
fs16-matter-02 fs16-rahulgk-01 fs16-vision-02
[...]
pool: fs16-rahulgk-01
overall: problems
problems: disk(s) have repaired errors
config:
mirror ONLINE
disk01/0 ONLINE
disk09/0 REPAIRED (errors: 1 read/0 write/0 checksum)
[...]
</pre>
</blockquote>
<p>This is a lot more readable (for us) than decoding the equivalent
in the normal ZFS email, and it also often summarizes the state of
multiple pools if all of them have experienced errors simultaneously
(because, for example, they all use the same physical disk and that
physical disk has had a problem).</p>
</div>
What we use ZFS on Linux's ZED 'zedlets' for2024-02-26T21:43:53Z2024-01-13T03:46:04Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/SoftwareRaidSwitchingDiskscks<div class="wikitext"><p>Back at the start of this year I moved my (software RAID) root
filesystem on <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/HomeMachine2018">my home Fedora desktop</a> from a
mirrored pair of SATA SSDs to a pair of NVMe drives, and this time
I kept notes (although I didn't necessarily follow them). For my
future use, I'm going to write this up, complete with the steps
that I should have done but didn't.</p>
<p>(In this switch, my new disks are nvme0n1p3 and nvme1n1p3, my old
disks were sda3 and sdb3, and md10 was the official name of my root
filesystem's software RAID mirror.)</p>
<p>As is my custom with such disk switches, I first changed my root
filesystem software RAID to being a four way mirror, using both the
SATA SSDs and the NVMe drives. The process for this is to add the extra
devices and then increase the number of devices in the RAID:</p>
<blockquote><pre style="white-space: pre-wrap;">
mdadm -a /dev/md10 /dev/nvme0n1p3
mdadm -a /dev/md10 /dev/nvme1n1p3
mdadm -G -n 4 /dev/md10
</pre>
</blockquote>
<p>If you don't increase the number of devices, you've just added some
spares. This is definitely not what I want; when I do this, I want
the new drives to be in (full) use in parallel to the old ones, as
a burn-in test. (Often an extended one, as it was this time.)</p>
<p>(If you want you can add one device at a time then let your
system run that way for a bit, but I usually don't see any
reason to go through extra steps.)</p>
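<p>(Before declaring the new drives stable, it's worth confirming that
the resync has finished and all four devices are active, with something
like:)</p>
<blockquote><pre style="white-space: pre-wrap;">
cat /proc/mdstat
mdadm --detail /dev/md10 | grep -E 'State :|Active Devices|Working Devices'
</pre>
</blockquote>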
<p>In the past you needed to update /etc/mdadm.conf to have the new
number of drives in your software RAID array and rebuild your
initramfs (to update its embedded copy of mdadm.conf) or you'd have
boot failures (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/RaidGrowthGotcha">cf</a>). Currently this isn't (or
wasn't) necessary on Fedora, as things appear to accept software
RAID arrays that have more member devices than mdadm.conf specifies,
as I found out when there was an unplanned machine freeze and reboot
before I did the initramfs update.</p>
<p>(<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SoftwareRaidDiskCountEffects">Alternately you should take the count of devices out entirely
from your mdadm.conf</a>. Your initramfs
will have to be rebuilt before this takes full effect, but you can
perhaps wait for this to happen as part of your distribution's next
kernel update.)</p>
<p>Once you've decided that your new drives are stable, you transition
away from the old devices by marking them failed and then removing
them:</p>
<blockquote><pre style="white-space: pre-wrap;">
mdadm --fail /dev/md10 /dev/sda3
mdadm --fail /dev/md10 /dev/sdb3
mdadm --remove /dev/md10 /dev/sda3
mdadm --remove /dev/md10 /dev/sdb3
</pre>
</blockquote>
<p>You must use '--remove', not '-r'. After doing this there are two
essential things you need to do, neither of which I actually did,
to my eventual sorrow. First, <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SoftwareRaidRemovingDiskGotcha"><strong>you have to zero the RAID superblocks
on the old devices</strong></a> (this has
been an issue for <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SoftwareRaidShiftingMirrorII">a long time</a>):</p>
<blockquote><pre style="white-space: pre-wrap;">
mdadm --zero-superblock /dev/sda3
mdadm --zero-superblock /dev/sdb3
</pre>
</blockquote>
<p>If you don't zero the old superblocks, <a href="https://mastodon.social/@cks/109739661990640799">your system may well reboot
with their old version of your root filesystem instead of the current
one</a>, and you'll
have to immediately halt the system and physically pull the old
drives (you might as well dust it out while you have it open, if
this is a desktop). If you had other stuff on the old drives in
addition to the old software RAID mirrors, well, you would be in
some trouble.</p>
<p>Once you've removed the old disks (and zeroed their superblocks),
you then need to shrink the number of devices in the software RAID
array back down to two devices (otherwise various things will
complain about missing devices):</p>
<blockquote><pre style="white-space: pre-wrap;">
mdadm -G -n 2 /dev/md10
</pre>
</blockquote>
<p>However, unlike the case of adding drives, <strong>after shrinking the
number of devices in the array you have to update /etc/mdadm.conf
to have the new device count and then rebuild your initramfs</strong> so
that it includes your new mdadm.conf; on Fedora this is done with
'<code>dracut --force</code>'. Fedora's Dracut initramfs environment will
accept a software RAID array with more devices than specified, but
(perhaps reasonably) it will refuse to accept one with fewer devices.
Alternately, <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SoftwareRaidDiskCountEffects">you can completely remove <code>num-devices=</code> from your
mdadm.conf</a>, although you'll still
need to rebuild your initramfs if you haven't done this already.</p>
<p>(If you forget and the initramfs still expects more devices, I believe
you get dropped into an emergency rescue shell and are
left to fix things up yourself. I didn't keep notes on this process;
interested parties are encouraged to experiment in a virtual machine.)</p>
<p>When I moved away from the old SATA SSDs, I forgot to zero the old
RAID superblocks and then (after fixing that) I discovered that I'd
incorrectly assumed that Fedora's initramfs didn't care about all
drive number changes. Hopefully I'll remember next time around, or
at least re-read this entry, which is (or was) current as of my
experiences in early to mid 2023 (things keep changing in this area
of Linux).</p>
<p>As advice for my future self, what I should have done is <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/AlwaysMakeAChecklist">written
out a full checklist in advance</a>
and then ticked things off as I went through them. This would have
made sure that I didn't forget important steps (like zeroing the
old RAID superblocks), or let them slide with the excuse that they'd
happen as a side effect of my next kernel update (because my system
can always reboot by surprise before then).</p>
<p>(I've written entries about this in the past, <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SoftwareRaidShiftingMirror">1</a>, <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SoftwareRaidShiftingMirrorII">2</a>,
<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SoftwareRaidRemovingDiskGotcha">3</a>, as well as <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ShrinkingSoftwareRAIDSwap">shrinking a
mirrored swap partition</a>.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SoftwareRaidSwitchingDisks?showcomments#comments">2 comments</a>.) </div>Switching Linux software RAID disks around in (early) 20232024-02-26T21:43:53Z2024-01-01T03:52:23Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/ZFSPanicsNotKernelPanicscks<div class="wikitext"><p>Suppose that you have <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSFileserverSetupIII">a ZFS based server</a> and
one day its kernel messages contain the following:</p>
<blockquote><pre style="white-space: pre-wrap;">
VERIFY3(sa.sa_magic == SA_MAGIC) failed (1446876386 == 3100762)
PANIC at zfs_quota.c:89:zpl_get_file_info()
Showing stack for process 6711
CPU: 13 PID: 6711 Comm: dp_sync_taskq Tainted: P O 5.15.0-88-generic #98-Ubuntu
Hardware name: Supermicro Super Server/X11SPH-nCTF, BIOS 2.0 11/29/2017
Call Trace:
<TASK>
show_stack+0x52/0x5c
dump_stack_lvl+0x4a/0x63
dump_stack+0x10/0x16
spl_dumpstack+0x29/0x2f [spl]
spl_panic+0xd1/0xe9 [spl]
? dbuf_rele_and_unlock+0x134/0x540 [zfs]
[...]
</pre>
</blockquote>
<p>Obviously you've hit a ZFS kernel panic, where ZFS handles internal
problems in <a href="https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSPanicOnCorruptionFlaw">its traditional way</a>,
which is to say by panicing and crashing your server. Except that
is almost certainly a lie.</p>
<p>Unless you've changed a non-obvious ZFS kernel parameter, your Linux
kernel has not actually paniced; ZFS is merely pretending that it
has. We can actually see this in the kernel stack trace being shown
here, which lists <a href="https://github.com/openzfs/zfs/blob/master/module/os/linux/spl/spl-err.c#L41">spl_dumpstack()</a>
and especially <a href="https://github.com/openzfs/zfs/blob/master/module/os/linux/spl/spl-err.c#L49">spl_panic()</a>.
There's also a comment about this in <a href="https://github.com/openzfs/zfs/blob/master/module/os/linux/spl/spl-err.c#L30">the source file</a>:</p>
<blockquote><p>It is often useful to actually have the panic crash the node so you
can then get notified of the event, get the crashdump for later
analysis and other such goodies. <br>
But we would still default to the current default of not to do that.</p>
</blockquote>
<p>Let me be clear: I think this is a terrible choice for almost
everyone except ZFS developers themselves. This looks like a kernel
panic to non-experts, in that it has 'PANIC' in the message, it
dumps very similar information to a <a href="https://en.wikipedia.org/wiki/Linux_kernel_oops">Linux kernel OOPS</a> or other 'panic',
and so on. However, because it's not an actual panic it won't trigger
<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/RebootOnPanicSettings">any kernel settings you've made to force reboots on panics</a>. Instead it will likely leave your ZFS
fileserver with a steadily increasing number of ZFS kernel threads
hung waiting for locks, and then force you to reboot things by hand
when the problems get really bad (probably uncleanly). This can
leave you rather puzzled about what's going on and cause unclear
system problems for the proverbial some time (we had one fileserver
last for over an hour in this state before it became non-functional
enough to trigger alerts).</p>
<p>To force ZFS on Linux to actually panic the kernel when ZFS hits
one of these internal 'panics', you need to set the SPL module
parameter <a href="https://openzfs.github.io/openzfs-docs/man/master/4/spl.4.html#spl_panic_halt">spl_panic_halt</a>
to 1. On a live system, this is done with:</p>
<blockquote><pre style="white-space: pre-wrap;">
echo 1 >/sys/module/spl/parameters/spl_panic_halt
</pre>
</blockquote>
<p>To make this permanent, you'll need to create a suitable .conf file
in /etc/modprobe.d, for example:</p>
<blockquote><pre style="white-space: pre-wrap;">
$ cat /etc/modprobe.d/spl.conf
options spl spl_panic_halt=1
</pre>
</blockquote>
<p>I recommend including some comments about why this is necessary, so
in the future you can understand why you have this mysterious setting.</p>
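<p>(Since this is an ordinary module parameter, you can check its current state
at any time; on an unmodified system this will report 0, the default of not
halting.)</p>
<blockquote><pre style="white-space: pre-wrap;">
$ cat /sys/module/spl/parameters/spl_panic_halt
0
</pre>
</blockquote>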
<p>In an ideal world, the text 'PANIC' in these non-panics would be
replaced with something less misleading, like 'SPL-PANIC' or
'SPL-HALTING' (unless the system was actually panicing). That would
at least make it clear that this was not a regular kernel panic and
came from ZFS's SPL component, not the regular kernel. Better would
be to change the default of spl_panic_halt or to otherwise align
these SPL panics with normal Linux kernel bug handling.</p>
<p>PS: This doesn't happen in OpenZFS on FreeBSD, where <a href="https://github.com/openzfs/zfs/blob/master/module/os/freebsd/spl/spl_misc.c#L95">the FreeBSD
version of spl_panic()</a>
simply calls <a href="https://man.freebsd.org/cgi/man.cgi?query=vpanic&apropos=0&sektion=9&manpath=FreeBSD+11-current&format=html">vpanic(9)</a>
and so triggers FreeBSD's normal kernel panic behavior and
infrastructure.</p>
</div>
Your kernel panics in ZFS on Linux probably aren't actual kernel panics2024-02-26T21:43:53Z2023-12-30T02:46:26Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/SystemdStallAfterTooFastRestartscks<div class="wikitext"><p>Over on the Fediverse, <a href="https://mastodon.social/@cks/111619851045624217">I said something</a>:</p>
<blockquote><p>Recently I learned that if you manually restart a systemd service too
often (with 'systemctl restart ...'), systemd will by default stop
starting it:</p>
<pre style="white-space: pre-wrap;">
<x>.service: Start request repeated too quickly.
<x>.service: Failed with result 'start-limit-hit'.
Failed to start <x>.service - Whatever it is.
</pre>
<p>Why would you do that, you ask? Well, consider scripts that update
some data file and do a 'systemctl restart ...' to make the daemon
notice it. Now try to do a bunch of updates all at once.</p>
</blockquote>
<p>The traditional way to have systemd stop starting a service is <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SystemdRestartUseDelay">for
it to have a 'Restart=' setting with no restart delay, and then to
fail on startup</a>. Sometimes it's failing on
start because your machine is out of memory; sometimes it's because
you've made an error in its configuration files.
However, if you read the actual documentation for <a href="https://www.freedesktop.org/software/systemd/man/latest/systemd.unit.html#StartLimitIntervalSec=interval">StartLimitIntervalSec
and StartLimitBurst</a>,
they don't say they're limited to the 'Restart=' case. Here's what they
say, emphasis mine:</p>
<blockquote><p>Configure unit start rate limiting. <strong>Units which are started more
than <em>burst</em> times within an <em>interval</em> time span are not permitted to
start any more.</strong> [...]</p>
<p>These configuration options are particularly useful in conjunction
with the service setting <code>Restart=</code> (see <a href="https://www.freedesktop.org/software/systemd/man/latest/systemd.service.html">systemd.service(5)</a>);
however, <strong>they apply to all kinds of starts (including manual)</strong>, not
just those triggered by the Restart= logic.</p>
</blockquote>
<p>The way you clear this condition is also sort of mentioned in that
section of the manual page; '<code>systemctl reset-failed</code>' will reset
this counter and allow you to immediately (re)start the unit again.
If you want, you can restrict the resetting to just your particular
unit.</p>
<p>The default limits for this rate limiting are likely visible in the
commented out default values in <a href="https://www.freedesktop.org/software/systemd/man/latest/systemd-system.conf.html">/etc/systemd/system.conf</a>.
The normal standard values are five restarts in ten seconds (<a href="https://www.freedesktop.org/software/systemd/man/latest/systemd-system.conf.html#DefaultStartLimitIntervalSec=">cf</a>)
and it appears that neither Fedora nor Ubuntu change these defaults,
so that's probably what you'll see.</p>
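<p>(What those commented out defaults look like in /etc/systemd/system.conf,
at least on the versions I've looked at:)</p>
<blockquote><pre style="white-space: pre-wrap;">
#DefaultStartLimitIntervalSec=10s
#DefaultStartLimitBurst=5
</pre>
</blockquote>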
<p>You might wonder how you get yourself into this situation in the
first place. Suppose that <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/SystemEvolution">you have a script to add an entry to a
DHCP configuration file</a>, which as
part of activating the entry has to restart the DHCP server (because
it doesn't support on the fly configuration reloading). Now suppose
you have a bunch of entries to add; you might write a script (or a
for loop) to effectively bulk add them as fast as the commands can
run. When you run that script, you'll be restarting the DHCP server
repeatedly, as fast as possible, and it won't take too long before
you trigger systemd's default limit (since all you need with the
default limits is to go through the whole thing in less than two
seconds per invocation).</p>
<p>If you're doing this in a script, the two solutions I see are to
always make the script sleep for three seconds or so after a restart,
or to run 'systemctl reset-failed <service>' either at the end of
the script or before you start doing any 'systemctl restart's.</p>
<p>(I'm not sure which of these we'll adopt.)</p>
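<p>(A minimal sketch of the first option in a bulk-update script; the
'<code>add-dhcp-entry</code>' command and the service name are stand-ins for whatever
you actually use.)</p>
<blockquote><pre style="white-space: pre-wrap;">
#!/bin/sh
for host in "$@"; do
    add-dhcp-entry "$host"
    systemctl restart dhcpd.service
    # stay under systemd's default of five starts per ten seconds
    sleep 3
done
</pre>
</blockquote>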
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SystemdStallAfterTooFastRestarts?showcomments#comments">3 comments</a>.) </div>Systemd will block a service's start if you manually restart it too fast2024-02-26T21:43:53Z2023-12-22T03:44:58Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/GrubUnknownFilesystemWhycks<div class="wikitext"><p>Over on the Fediverse, <a href="https://mastodon.social/@cks/111604177856526444">I said something</a>:</p>
<blockquote><p>I hope that the Grub developers will someday fix grub-install so that
the "unknown filesystem" error is replaced with a better one, like
"Grub doesn't have the driver(s) necessary to use your / (or /boot)
filesystem" or even "Grub doesn't currently support some filesystem
features that are enabled on your / (or /boot) filesystem". Ideally
with the right filesystem name.</p>
<p>This has certainly been coming up and getting forum/etc answers for
long enough. But alas.</p>
</blockquote>
<p>Actually fixing the message to be accurate is difficult because of
how Grub's code is structured. The simplest improvement is to change
the text of the message to "unknown filesystem or filesystem with
unsupported features", which at least hints at the potential issue
(although the message would have to be re-translated into various
languages and so on, so perhaps the Grub developers would be
unenthused).</p>
<p>This message can be produced either by grub-install, running on a
booted system, or by the Grub bootloader code itself, as you boot
the system. Normally it's seen when you run grub-install, which is
somewhat puzzling; how is the filesystem unknown when the kernel
is using it? And why does grub-install care?</p>
<p>When Grub is booting your system, it doesn't (and can't) use the
Linux kernel's filesystem code and device drivers (or any Unix
kernel's code; Grub runs in non-Linux environments as well). At the
same time, Grub wants to read various things from your filesystems,
such as its menu file or your kernel (and on Linux, initramfs). To
do this, Grub has its own collection of <a href="https://git.savannah.gnu.org/gitweb/?p=grub.git;a=tree;f=grub-core/fs">filesystem code</a> and
<a href="https://git.savannah.gnu.org/gitweb/?p=grub.git;a=tree;f=grub-core/disk">software disk drivers</a>,
generally in a collection of loadable (Grub) modules. When grub-install
runs, one of its jobs is to prepare the set of filesystem and disk
driver modules Grub will need at boot time. Its report of "unknown
filesystem" means that it can't find a filesystem module that will
accept the filesystem that you have things on (generally either the
root filesystem or your /boot filesystem, depending on whether /boot
is on its own filesystem).</p>
<p>The specific message is generated in grub_fs_probe() in <a href="https://git.savannah.gnu.org/gitweb/?p=grub.git;a=blob;f=grub-core/kern/fs.c">kern/fs.c</a>.
This function is handed a 'grub device' and runs through grub's
list of known filesystem modules, asking each one of them in turn
if they can handle the filesystem on the 'grub device'. Currently,
filesystem modules return the same error code if the device isn't
their type of filesystem or if it's their type of filesystem but
it has filesystem features that the Grub module doesn't (yet)
support. The filesystem module can set a specific error message
here (in addition to its error code), but grub_fs_probe() doesn't
normally report the per filesystem error messages unless (the right
sort of) debugging is turned on (this can be done in grub-install
with '-vv', although that enables all debugging messages and produces
a lot of messages). Instead, if all filesystem modules say they
can't handle the filesystem, grub_fs_probe() reports a generic
"unknown filesystem" error. One level up, <a href="https://git.savannah.gnu.org/gitweb/?p=grub.git;a=blob;f=util/grub-install.c">grub-install.c</a>
calls grub_fs_probe() (in a couple of different places) and
then reports the error message that it's produced (if it failed).</p>
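<p>(If you want to see the per filesystem messages for yourself, the debugging
route mentioned above looks something like the following; the target device is
a placeholder and the output is voluminous.)</p>
<blockquote><pre style="white-space: pre-wrap;">
grub-install -vv /dev/sda 2>&1 | less
</pre>
</blockquote>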
<p>Fixing this to return an exact error message about what's wrong is
at least a little bit tricky and would make the code more complicated.
It also touches a relatively critical piece of Grub, since this
code is also run during boot (and must properly accept the filesystem
then). So I suspect the most that Grub developers would do is change
the message to a longer version that mentions the possibility of
feature flag mismatches.</p>
</div>
Why grub-install can give you an "unknown filesystem" error2024-02-26T21:43:53Z2023-12-20T03:44:41Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/SystemdResolvedSingleNamesDNScks<div class="wikitext"><p>Suppose, not hypothetically, that you use <a href="https://www.freedesktop.org/software/systemd/man/systemd-resolved.service.html">systemd-resolved</a>
and you have a long-standing practice of using a specific DNS search path
so that people can use short domain names. In this environment <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SystemdResolvedNotFor">you
probably need to use systemd-resolved purely through /etc/resolv.conf</a>, and if you do this you may experience an
oddity:</p>
<blockquote><pre style="white-space: pre-wrap;">
$ ping nosuchname
ping: nosuchname: Temporary failure in name resolution
</pre>
</blockquote>
<p>If you try '<code>resolvectl query nosuchname</code>' it will tell you that the name
is not found, but if you directly query the systemd-resolved DNS server at
127.0.0.53 you will see that you get a DNS SERVFAIL response for the bare
name:</p>
<blockquote><pre style="white-space: pre-wrap;">
$ dig a nosuchname. @127.0.0.53
[...]
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 52471
[...]
</pre>
</blockquote>
<p>(You will wind up querying for the bare name when you've exhausted
all of the domains in your DNS search path.)</p>
<p>This is not what a normal DNS server like Unbound will return for
the same query; Unbound will return NXDOMAIN for this query, which
will cause programs like ping to tell you 'Name or service not
known', which is probably what you want. If you know what's going
on you can translate, but why worry yourself about the possibility
that something is really going wrong.</p>
<p>What is going on here is systemd-resolved's interpretation of how
to behave for DNS queries if <a href="https://www.freedesktop.org/software/systemd/man/latest/resolved.conf.html#ResolveUnicastSingleLabel="><code>ResolveUnicastSingleLabel</code></a>
is unset in your <a href="https://www.freedesktop.org/software/systemd/man/latest/resolved.conf.html">resolved.conf</a>.
How the documentation describes it is:</p>
<blockquote><p>Takes a boolean argument. When false (the default), systemd-resolved
will not resolve A and AAAA queries for single-label names over
classic DNS. [...]</p>
</blockquote>
<p>Since ping's attempts to find the IP address of 'nosuchname'
eventually wind up making a single-label name query to systemd-resolved,
with this setting in its default state systemd-resolved will not
try to resolve this query by sending it to an upstream DNS resolver
(where it would fail). When queried as a DNS server, resolved's
interpretation of 'will not (try to) resolve' is to return SERVFAIL
instead of NXDOMAIN. This is in some sense technically correct, but
it's usually not as useful as returning NXDOMAIN would be (and it's
not how Unbound or Bind behave).</p>
<p>If you have local DNS resolvers that systemd-resolved on your systems
is pointing to, you can safely set <code>ResolveUnicastSingleLabel=yes</code>
to work around this. Systemd-resolved will dutifully send these
queries to your local DNS resolvers, your local DNS resolvers will
NXDOMAIN them, and systemd-resolved will pass this NXDOMAIN back
to you so that ping tells you there's no such host. I'm probably
going to do this on my desktops (and any of <a href="https://support.cs.toronto.edu/">our</a> machines that wind up using
systemd-resolved).</p>
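<p>(One way to set this is through a small drop-in file rather than editing
resolved.conf itself; the file name here is arbitrary.)</p>
<blockquote><pre style="white-space: pre-wrap;">
$ cat /etc/systemd/resolved.conf.d/single-label.conf
[Resolve]
ResolveUnicastSingleLabel=yes
</pre>
</blockquote>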
<p>(A lot of my understanding of this comes from finding and reading
<a href="https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/2024320">Ubuntu systemd bug #2024320</a>
and <a href="https://github.com/systemd/systemd/issues/28310">systemd issue #28310</a>.)</p>
<h3>Sidebar: Some thoughts on SERVFAIL versus NXDOMAIN here</h3>
<p>If you have upstream DNS servers that will actually return something
for A and AAAA queries for single-label names for some (local)
reason, systemd-resolved returning SERVFAIL and ping reporting it
as a 'temporary' failure in name resolution is probably doing you
a favour because it's signalling that something weird is going on
in your (DNS) name resolution. Systemd-resolved returning NXDOMAIN
might lead you to suspect that your upstream DNS servers didn't
have the data you expected them to.</p>
<p>However, this is a rare case. A much more usual case is going to
be what we saw here; you have a DNS search path, you type a name
that you implicitly expect to be in your local domain or not present
at all, and instead of 'name or service not known' because it's not
in your local domain you get some odd 'temporary failure' (which
doesn't happen if you use resolvectl to theoretically check directly).</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SystemdResolvedSingleNamesDNS?showcomments#comments">2 comments</a>.) </div>Why systemd-resolved can give weird results for nonexistent bare hostnames2024-02-26T21:43:53Z2023-12-14T03:37:59Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/CgroupsMemoryUsageAccountingIIcks<div class="wikitext"><p>A while back I wrote a program I call 'memdu' to report a du-like
hierarchical summary of how much memory is being used by each logged
in user and each system service, based on systemd's <a href="https://www.freedesktop.org/software/systemd/man/systemd.resource-control.html#MemoryAccounting=">MemoryAccounting</a>
setting and <a href="https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#memory-interface-files">the general Linux cgroup (v2) memory accounting</a>.
Cgroups expose a number of pieces of information about this, starting
with <code>memory.current</code>, the current amount of memory 'being used by'
the cgroup and its descendants. What being used by means here is
that the kernel has attributed this memory to the cgroup, and it
counts all memory usage attributed to the cgroup, both user level
and in the kernel. <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/CgroupsMemoryUsageAccounting">As I very soon found out</a>, this number can be misleading if
what you're really interested in is how much user level memory the
cgroup is actively using.</p>
<p>My first encounter with this was for <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/CgroupsMemoryUsageAccounting">a bunch of memory used by
the kernel filesystem cache</a>, which
was attributed first to a running virtual machine and then to the
general 'machine.slice' cgroup when the virtual machine was shut
down and its cgroup went away. (Well, it was always attributed to
machine.slice as well as the individual virtual machine, but when
the virtual machine existed you could see that a lot of machine.slice's
memory usage was from the child VM.)</p>
<p>As I recently discovered, another source of this is reclaimable
(kernel) slab memory. It's possible to have an essentially inactive
user cgroup with small process memory usage but gigabytes of memory
attributed to it from memory.stat's '<code>slab_reclaimable</code>'. At
some point this slab memory was actively used, but it's now not,
and presumably it lingers around mostly because the overall system
hasn't been under enough memory pressure to trigger reclaiming it.
Having my memdu program report the memory usage of the cgroup
including this memory is in one sense honest, but it's not usually
useful and it can be alarming.</p>
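<p>(A sketch of how you can look at this by hand for a particular cgroup; the
cgroup path here is just an example of a logged in user's slice.)</p>
<blockquote><pre style="white-space: pre-wrap;">
cg=/sys/fs/cgroup/user.slice/user-1000.slice
cat $cg/memory.current
grep -E '^(anon|file|slab_reclaimable) ' $cg/memory.stat
</pre>
</blockquote>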
<p>(According to <a href="https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#memory-interface-files">the documentation</a>,
you can manually trigger a kernel reclaim against the cgroup by
writing an amount to '<code>memory.reclaim</code>'. But if there's no general
memory pressure, I think the only reason to do this is aesthetics.)</p>
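<p>(The documented way to trigger such a reclaim is to write a size to the
file; as above, the cgroup path is illustrative.)</p>
<blockquote><pre style="white-space: pre-wrap;">
echo 1G >/sys/fs/cgroup/user.slice/user-1000.slice/memory.reclaim
</pre>
</blockquote>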
<p>If I knew enough about the kernel memory systems in practice, I
could probably read through <a href="https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#memory-interface-files">the documentation about the cgroup
memory.stat file</a>
and work out what things I wanted to remove from memory.current to
get more or less 'current directly and indirectly used user memory'.
As it is, I don't have that knowledge so I suspect that I'm going
to find more cases like this over time.</p>
<p>(How I find these is that someday I run my memdu program and it
reports an absurd looking number for some cgroup, so I investigate
and then fix it up with more heuristics. These days <a href="https://utcc.utoronto.ca/~cks/space/blog/python/OsWalkChoiceParalysis">the program
is in Python</a> so it's pretty easy
to add another case.)</p>
<p>I suspect that one of the general issues I'm running into is that
what I want from my 'memdu' program isn't well specified and may
not be something that the kernel can really give me. The question
of how much memory a cgroup is using depends on what I mean by
'using' and what sort of memory I care about. The kernel is only
really set up to tell me how much memory has been attributed to a
cgroup, and where it is in potentially overlapping categories in
memory.stat.</p>
<p>(I assume that <code>memory.stat</code> is comprehensive, so all memory in
<code>memory.current</code> is accounted for somewhere in <code>memory.stat</code>, but
I'm not sure of that.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/CgroupsMemoryUsageAccountingII?showcomments#comments">2 comments</a>.) </div>Understanding another piece of per-cgroup memory usage accounting2024-02-26T21:43:53Z2023-12-07T04:28:05Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/DCacheGettingStatscks<div class="wikitext"><p>The Linux kernel's <em>dcache</em> subsystem is its implementation of a
<a href="https://utcc.utoronto.ca/~cks/space/blog/unix/KernelNameCachesWhy">name cache of directory entries</a>;
it holds <em>dentries</em>. As a (kernel) cache, it would be nice to know
some information about this cache and how effective it was being
for your workload. Unfortunately the current pickings appear to be
slim.</p>
<p>Basic information about the size of the dcache is exposed in
<a href="https://www.kernel.org/doc/html/latest/admin-guide/sysctl/fs.html#dentry-state">/proc/sys/fs/dentry-state</a>.
This reports the total number of dentries, how many are 'unused',
and how many are negative entries for files that don't exist (along
with some other numbers). There's no information on either the
lookup rate or the hit rate, and I believe that the kernel doesn't
track this information at all (it sizes the dcache based on other
things).</p>
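<p>(Reading it is simple. As I understand the documentation, the six numbers
are the total number of dentries, the number of unused ones, two fields that
are no longer used, the number of negative dentries, and a padding field.)</p>
<blockquote><pre style="white-space: pre-wrap;">
$ cat /proc/sys/fs/dentry-state
</pre>
</blockquote>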
<p>The <a href="https://github.com/iovisor/bcc">BCC tools</a> include a (BCC)
program called <a href="https://github.com/iovisor/bcc/blob/master/tools/dcstat_example.txt">dcstat</a>. As
covered in its documentation, this tool will print running dcache
stats (provided that it works right on your kernel). The <a href="https://github.com/iovisor/bcc#storage-and-filesystems-tools">Storage
and Filesystem Tools</a>
section of the BCC tools listings has additional tools that may be
of interest in this general area. Although <a href="https://github.com/iovisor/bpftrace">bpftrace</a> has bpftrace-based versions
of a lot of the BCC tools (see its <a href="https://github.com/iovisor/bpftrace/tree/master/tools">tools/</a> subdirectory),
it doesn't seem to have done a bpftrace version of dcstat.</p>
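<p>(Where dcstat lives depends on how you got the BCC tools; one way to run it
with a one second reporting interval is something like the following, adjusting
the path for your system.)</p>
<blockquote><pre style="white-space: pre-wrap;">
sudo /usr/share/bcc/tools/dcstat 1
</pre>
</blockquote>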
<p>(The other caution about dcstat is that based on comments in <a href="https://github.com/iovisor/bcc/blob/master/tools/dcstat.py">the
dcstat source code</a> I'm
not sure that it's still right for current kernels. I think the
overall usage rate is probably correct, but I'm not sure about the
'miss' numbers. I'd have to read <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/namei.c">fs/namei.c</a>
and <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/dcache.c">fs/dcache.c</a>
very carefully to have much confidence.)</p>
<p>As far as I can see, <a href="https://www.kernel.org/doc/html/latest/admin-guide/sysctl/fs.html#dentry-state">/proc/sys/fs/dentry-state</a> is not exposed
by <a href="https://github.com/prometheus/node_exporter">the Prometheus host agent</a>. It might be exposed
by the host agents for other metrics systems, or they might have
left it out because there's not much you can do about the dcache
anyway. If you wanted to export dcache hit and miss information,
you could use <a href="https://github.com/cloudflare/ebpf_exporter">the Cloudflare eBPF exporter</a> and write an appropriate
eBPF program for it, based on <a href="https://github.com/iovisor/bcc/blob/master/tools/dcstat.py">dcstat</a>.</p>
<p>Now that I've looked at this, I suspect that while using <a href="https://github.com/iovisor/bcc/blob/master/tools/dcstat.py">dcstat</a>
may be interesting if you're curious about how many file lookups
various operations do, it's probably not all that useful to monitor
on an ongoing basis.</p>
<p>(In its current state, dcstat won't tell you how many hits were for
negative dentries, which might be interesting to know so you can
see how many futile lookups are happening on the system.)</p>
</div>
Getting some information about the Linux kernel dentry cache (dcache)2024-02-26T21:43:53Z2023-12-05T04:36:25Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/CoreutilsTestPeculiaritycks<div class="wikitext"><p>Famously, '<code>[</code>' is a program, not a piece of shell syntax, and it's
also known as '<code>test</code>' (<a href="https://utcc.utoronto.ca/~cks/space/blog/unix/V7TestAndBourneShell">which was the original name for it</a>). On many systems, this was and is
implemented by '<code>[</code>' being a hardlink to '<code>test</code>' (generally 'test'
was the primary name for various reasons). However, <a href="https://infosec.exchange/@adb/111467421939302794">today I found
out that GNU Coreutils is an exception</a>. Although the
two names are built from the same source code (<a href="https://git.savannah.gnu.org/cgit/coreutils.git/tree/src/test.c">src/test.c</a>), they're
different binaries and the '<code>[</code>' binary is larger than the '<code>test</code>'
binary. What is ultimately going on here is a piece of '<code>test</code>'
behavior that I had forgotten about, that of the meaning of running
'<code>test</code>' with a single argument.</p>
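<p>(You can see the two separate binaries and their size difference for yourself
on a Linux system with Coreutils; the paths may differ on your machine.)</p>
<blockquote><pre style="white-space: pre-wrap;">
$ ls -l /usr/bin/test '/usr/bin/['
</pre>
</blockquote>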
<p><a href="https://pubs.opengroup.org/onlinepubs/9699919799/utilities/test.html">The POSIX specification for test</a> is
straightforward. A single argument is taken as a string, and the behavior
is the same as for -n, although POSIX phrases it differently:</p>
<blockquote><dl><dt><em>string</em></dt>
<dd>True if the string string is not the null string; otherwise,
false.</dd>
</dl>
</blockquote>
<p>The problem for GNU Coreutils is that GNU programs like to support
options like --help and --version. Support for these is specifically
disallowed for '<code>test</code>', where '<code>test --help</code>' and '<code>test --version</code>'
must both be silently true. However, this is not disallowed by POSIX
for '<code>[</code>' if '<code>[</code>' is invoked without the closing '<code>]</code>':</p>
<blockquote><pre style="white-space: pre-wrap;">
$ [ --version
[ (GNU coreutils) 9.1
[...]
$ [ foo
[: missing ‘]’
$ [ --version ] && echo true
true
</pre>
</blockquote>
<p>As we can see here, invoking 'test' as '<code>[</code>' without the closing
'<code>]</code>' as an argument is an error, and GNU Coreutils is thus allowed
to interpret the results of your error however it likes, including
making '<code>[ --version</code>' and so on work.</p>
<p>(There's <a href="https://git.savannah.gnu.org/cgit/coreutils.git/tree/src/test.c#n825">a comment about it in test.c</a>.)</p>
<p>The binary size difference is presumably because the '<code>test</code>' binary
omits the version and help text, along with the code to display it.
But if you look at <a href="https://git.savannah.gnu.org/cgit/coreutils.git/tree/src/test.c#n825">the relevant Coreutils test.c code</a>, the
relevant code isn't disabled with an #ifdef. Instead, LBRACKET is
#defined to 0 when compiling the '<code>test</code>' binary. So it seems that
modern C compilers are doing dead code elimination on the '<code>if
(LBRACKET) { ...}</code>' section, which is a well established optimization,
and then going on to notice that the called functions like '<code>usage()</code>'
are never invoked and dropping them from the binary. Possibly this is
set with some special link time magic flags.</p>
<p>PS: This handling of a single argument for <code>test</code> goes all the way
back to V7, <a href="https://utcc.utoronto.ca/~cks/space/blog/unix/TestIsQuiteSmart">where <code>test</code> was actually pretty smart</a>. If I'm reading <a href="https://www.tuhs.org/cgi-bin/utree.pl?file=V7/usr/man/man1/test.1">the V7 test(1) manual
page</a>
correctly, this behavior was also documented.</p>
<p>PPS: In theory <a href="https://www.gnu.org/software/coreutils/">GNU Coreutils</a>
is portable and you might find it on any Unix. In practice I believe
it's only really used on Linux.</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/CoreutilsTestPeculiarity?showcomments#comments">5 comments</a>.) </div>A peculiarity of the GNU Coreutils version of '<code>test</code>' and '<code>[</code>'2024-02-26T21:43:53Z2023-11-25T03:46:09Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/ZFSSortingOutPoolFeaturescks<div class="wikitext"><p>Pretty much every filesystem that wants to be around for a long
time needs some way to evolve its format, adding new things (and
stopping using old ones); ZFS is no exception. In the beginning,
<a href="https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSPoolVersionProblem">the format of ZFS pools (and filesystems) was set by a version
number</a>, but this stopped working
very well once Sun were no longer the only people evolving ZFS. To
handle the situation with multiple people developing different
changes to ZFS, ZFS created a system of what are called 'features',
where each feature is more or less some change to how ZFS pools
work. Most features are officially independent of each other (although
they may not be tested independently in practice). All of this is
documented today in the <a href="https://openzfs.github.io/openzfs-docs/man/master/7/zpool-features.7.html">zpool-features(7)</a>
manual page, which discusses the general system in detail and then
lists all of the current features.</p>
<p>(Your local copy of <a href="https://openzfs.github.io/openzfs-docs/man/master/7/zpool-features.7.html">zpool-features(7)</a> may well list fewer
features than the latest upstream development version does. For
instance, there's <a href="https://openzfs.github.io/openzfs-docs/man/master/7/zpool-features.7.html#raidz_expansion">a feature for RAID-Z expansion</a>,
which only just landed in the development version.)</p>
<p>Each release or version of ZFS supports some set of features,
increasing over time. The Ubuntu 22.04 version of ZFS supports more
ZFS features than the Ubuntu 18.04 version did, for example. Moving
to a new version of ZFS (for example by upgrading <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSFileserverSetupIII">your fileservers</a> from Ubuntu 18.04 to 22.04) deliberately
doesn't change the features your current ZFS pools have. Only manual
action such as '<code>zpool upgrade -a</code>' will update them to use new
features, and you may well hold off on this even though you've
updated ZFS versions.</p>
<p>(One reason to hold off is that perhaps you're worried about reverting
to your pre-upgrade state. Another reason is just that you haven't
gotten around to it. In the old Solaris 10 days, a 'zpool upgrade' of
a pool would cause some degree of service interruption, although I
don't think that's supposed to happen today.)</p>
<p>In the very old days, 'zpool status -x' would consider available
pool format updates to be an 'error' that made a pool worthy of
including in its output, which was kind of infuriating. Later,
'zpool status' downgraded this to <a href="https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSZpoolStatusAndUpgrades">merely nagging you all the time</a>. Finally, ZFS introduced a
pool property where you could specify what features you wanted your
pools to have, via <a href="https://openzfs.github.io/openzfs-docs/man/master/7/zpool-features.7.html#Compatibility_feature_sets">compatibility feature sets</a>
and setting the '<code>compatibility</code>' property to a suitable value. If
you set the pool's compatibility property to, say, 'openzfs-2.1-linux',
and your pool had all of those features, 'zpool status' now won't
claim that it's out of date. Unfortunately, '<code>zpool upgrade</code>' will
still report features that it claims can be upgraded to, although
<a href="https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSPartialUpgradeOption">any actual upgrade is supposed to be limited to the compatibility
features</a>.</p>
<p>As part of these compatibility sets, there are files that list all
of the features in each named set, normally found under
/usr/share/zfs/compatibility.d. The format of these files is
straightforward and can be used with diff to see that, for example,
the features that were added between OpenZFS 2.1 for Linux and
OpenZFS 2.2 were <a href="https://openzfs.github.io/openzfs-docs/man/master/7/zpool-features.7.html#blake3">blake3</a>,
<a href="https://openzfs.github.io/openzfs-docs/man/master/7/zpool-features.7.html#block_cloning">block_cloning</a>,
<a href="https://openzfs.github.io/openzfs-docs/man/master/7/zpool-features.7.html#head_errlog">head_errlog</a>,
<a href="https://openzfs.github.io/openzfs-docs/man/master/7/zpool-features.7.html#vdev_zaps_v2">vdev_zaps_v2</a>,
and <a href="https://openzfs.github.io/openzfs-docs/man/master/7/zpool-features.7.html#zilsaxattr">zilsaxattr</a>
(all of which you can read about in <a href="https://openzfs.github.io/openzfs-docs/man/master/7/zpool-features.7.html">zpool-features(7)</a>). Often
there are convenient symbolic links, so you can see the difference
in features that were present on Ubuntu 18.04 (where most of our
current ZFS pools were created) and that are now available on Ubuntu
22.04 (which we're now running, so we could update pools to have
the new features like zstd compression).</p>
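<p>(For example, with the file names in the usual location; the exact set of
files you have depends on your ZFS version.)</p>
<blockquote><pre style="white-space: pre-wrap;">
cd /usr/share/zfs/compatibility.d
diff openzfs-2.1-linux openzfs-2.2-linux
</pre>
</blockquote>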
<p>Basic information on what features each of your pools don't have
enabled yet can be seen with 'zpool upgrade'. Unfortunately there's
no convenient way to get this information for a single pool, because
'<code>zpool upgrade POOL</code>' upgrades the pool, not lists not yet enabled
features for just that pool. Also, 'zpool upgrade' will list all
features, ignoring the constraints of any '<code>compatibility</code>' property
you may have set on the pool. You can use '<code>zpool status POOL</code>' to
see if a specific pool is fully up to date to its <code>compatibility</code>
property (if any), but that's all it can tell you; if it says that
the pool hasn't enabled all supported features, there's nothing
that will readily tell you which compatible features aren't yet
enabled while excluding features you've said are incompatible.</p>
<p>(As far as I can see from the code, upgrading a pool's features
through 'zpool upgrade' does respect its '<code>compatibility</code>' setting,
as documented. The current 'zpool upgrade' code to list features
that aren't enabled doesn't have any code to cross-check them against
your '<code>compatibility</code>', although I think it would be simple to add.)</p>
<p>Pool features are exposed as 'feature@<name>' ZFS pool properties,
so you can see a complete list of the features your version of ZFS
supports and their state for any particular pool with '<code>zpool get
all POOL</code>' (this comes for free with all other pool properties, so
if you want just the features you'll have to throw in a '<code>| grep
feature@</code>'). This is the detailed state, so a feature can be
'<code>disabled</code>', '<code>enabled</code>', or '<code>active</code>'; however, whether or not
the feature is <a href="https://openzfs.github.io/openzfs-docs/man/master/7/zpool-features.7.html#Read-only_compatibility">read-only compatible</a>
isn't listed. You can check a specific feature's state with, for
example, '<code>zpool get feature@block_cloning</code>', which can be
reassuring if <a href="https://github.com/openzfs/zfs/releases/tag/zfs-2.2.1">there are reports that a particular feature might
cause ZFS pool corruption, prompting a new OpenZFS release with the
feature disabled in the kernel code</a>.</p>
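<p>(Putting those commands together, with '<code>tank</code>' as a stand-in pool name:)</p>
<blockquote><pre style="white-space: pre-wrap;">
zpool get all tank | grep 'feature@'
zpool get feature@block_cloning tank
</pre>
</blockquote>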
<p>(The <a href="https://github.com/openzfs/zfs/releases/tag/zfs-2.2.1">OpenZFS 2.2.1 release</a> prompted
my sudden interest in this area, since I run the ZFS development
versions, and caused me to realize that I had once again forgotten
how to get a full list of pool features and their state. Maybe I'll
remember '<code>zpool get all POOL</code>' this time around.)</p>
<p>PS: ZFS pool features and pool upgrades are a different thing from
<a href="https://openzfs.github.io/openzfs-docs/man/master/8/zfs-upgrade.8.html">ZFS filesystem (format) upgrades</a>.
Filesystem format upgrades are still version number based, and I
believe the last one was done back when Sun was still a going
concern.</p>
<h3>Sidebar: Some code trivia</h3>
<p>Although ZFS features are represented in the pool by name, the
current OpenZFS code has a big numbered list of all of the features
it knows about, in <a href="https://github.com/openzfs/zfs/blob/master/include/zfeature_common.h">include/zfeature_common.h</a>.
These are the features that, for example, 'zpool upgrade' will tell
you that your pool doesn't have enabled. At the moment it appears
that there are 41 of them (<a href="https://github.com/openzfs/zfs/blob/master/lib/libzfs/libzfs.abi">cf</a>).</p>
<p>According to comments in <a href="https://github.com/openzfs/zfs/blob/master/module/zfs/zfeature.c">module/zfs/zfeature.c</a>,
enabling a feature shouldn't have any effect, unlike what happened
to us with pool version upgrades back in the Solaris days. This
should mean that upgrading a pool is a low-impact operation, since
unless you have a very old pool all it's doing is enabling a number
of features (many of which may not even become active any time soon,
such as zstd compression).</p>
</div>
Understanding and sorting out ZFS pool features2024-02-26T21:43:53Z2023-11-23T03:50:25Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/ModernProxyIPv6AndARPcks<div class="wikitext"><p>Suppose, not hypothetically, that you have a remote system (on the
other side of some tunnel or other connection) that wants to pretend
to be on the local network, for either or both of IPv4 and IPv6.
To make this work smoothly, this remote system's gateway (on the
local network) needs to answer <a href="https://en.wikipedia.org/wiki/Address_Resolution_Protocol">ARP</a> requests
for this remote system's IPv4 address and/or <a href="https://en.wikipedia.org/wiki/Neighbor_Discovery_Protocol">NDP</a> requests
for the remote system's IPv6 address. This is called 'proxy ARP'
or 'proxy NDP', because the gateway is acting as an ARP or NDP proxy
for the remote system.</p>
<p>At this point my memories are vague, but I think that in the old
days, configuring proxy ARP on Linux was somewhat challenging and
obscure, requiring you to add various magic settings in various
places. These days it has gotten much easier and more uniform, and
there are at least two approaches, the by hand one and the systemd
one, although it turns out I don't know how to make systemd work
for the IPv4 proxy ARP case.</p>
<p>The by hand approach is with the <a href="https://man7.org/linux/man-pages/man8/ip-neighbour.8.html"><code>ip neighbour</code></a>
(sub)command. This can be used to add IPv4 or IPv6 proxy
announcements to some network, which is normally the network
the remote machine is pretending to be on:</p>
<blockquote><pre style="white-space: pre-wrap;">
ip neigh add proxy 128.X.Y.Z dev em0
ip neigh add proxy 2606:fa00:.... dev em0
# apparently necessary
echo 1 >/proc/sys/net/ipv6/conf/em0/proxy_ndp
</pre>
</blockquote>
<p>Here em0 is the interface that the 128.X.Y.0/24 and 2606:fa00:.../64
networks are on, where we want other machines to see 128.X.Y.Z (and
its IPv6 version) as being on the network.</p>
<p>You can see these proxies (if any) with '<code>ip neigh show proxy</code>'.
To actually be useful, the system doing proxy ARP also generally
needs to have IP forwarding turned on and to have appropriate routes
or other ways to get packets to the IP it's proxying for.</p>
<p>Although there is a /proc/sys/net/ipv4/conf/*/proxy_arp setting
(<a href="https://www.kernel.org/doc/html/latest/networking/ip-sysctl.html">cf</a>),
it appears to be unimportant in today's modern 'ip neighbour' based
setup. One of my machines is happily doing proxy ARP with this at
the default of '0' on all interfaces. IPv6 has a similar
ipv6/conf/*/proxy_ndp, but unlike with IPv4, the setting here
appears to matter and you have to turn it on on the relevant
interface; it's on for the relevant interface on <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/Ubuntu2204WireGuardIPv6Gateway">my IPv6 gateway</a> and turning it off makes external
pings stop working.</p>
<p>(It's possible that other settings are affecting my lack of need
for proxy_arp in my IPv4 case.)</p>
<p>The systemd way is to set up <a href="https://www.freedesktop.org/software/systemd/man/latest/systemd.network.html">a systemd-networkd .network file</a>
that has the relevant settings. You set this on the interface where
you want the proxy ARP or NDP to be on, not on the tunnel interface
to the remote machine (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/Ubuntu2204WireGuardIPv6Gateway">as I found out</a>).
For IPv6, you want to set <a href="https://www.freedesktop.org/software/systemd/man/latest/systemd.network.html#IPv6ProxyNDP=">IPv6ProxyNDP=</a>
and at least one <a href="https://www.freedesktop.org/software/systemd/man/latest/systemd.network.html#IPv6ProxyNDPAddress=">IPv6ProxyNDPAddress=</a>,
although it's not strictly necessary to explicitly set IPv6ProxyNDP
(I'd do it for clarity). I was going to write something about how
to do this for IPv4, but I can't actually work out how to do the
equivalent of 'ip neigh add proxy ...' in systemd .network files;
all they appear to do is support turning on proxy ARP in general,
and I'm not sure what this does these days.</p>
<p>(If it's like eg <a href="https://unix.stackexchange.com/questions/250800/linux-does-not-proxy-arp-for-me-despite-the-documentation-suggesting-that-it-do">this old discussion</a>,
then it may cause Linux to do proxy ARP for anything that it has routes
for. There's also <a href="https://wiki.debian.org/BridgeNetworkConnectionsProxyArp">this Debian Wiki page</a> suggesting the
same thing.)</p>
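<p>(Going back to the IPv6 side, a minimal sketch of such a .network file might
look like the following; the interface name and the address are placeholders.)</p>
<blockquote><pre style="white-space: pre-wrap;">
[Match]
Name=em0

[Network]
IPv6ProxyNDP=yes
IPv6ProxyNDPAddress=2606:fa00::10
</pre>
</blockquote>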
<p>I don't know if <a href="https://networkmanager.dev/">NetworkManager</a> has
much support for proxy ARP or proxy NDP, since both seem somewhat
out of scope for it.</p>
<p>PS: The systemd-networkd approach for IPv6 proxy NDP definitely
results in an appropriate entry in 'ip -6 neigh show proxy', so
it's not just turning on some form of general proxy NDP and calling
it a day. That's certainly what I'd expect given that you list one
or more proxy NDP addresses, but I like to verify these things.</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ModernProxyIPv6AndARP?showcomments#comments">One comment</a>.) </div>Modern proxy (IPv4) ARP and proxy IPv6 NDP on Linux2024-02-26T21:43:53Z2023-11-22T04:32:40Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/Ubuntu2204WireGuardIPv6Gatewaycks<div class="wikitext"><p>Recently we enabled IPv6 on one of our networks here (for initial
testing purposes), but not the network that <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/WorkMachine2017">my office workstation</a> is on. Naturally I decided that I wanted my office
workstation to have IPv6 anyway, by using WireGuard to tunnel IPv6
to it from an IPv6 enabled Ubuntu 22.04 server on that network. For
my sins, I also decided to do this the more or less proper way,
which is to say through <a href="https://www.freedesktop.org/software/systemd/man/systemd-networkd.service.html">systemd-networkd</a>,
instead of through hand-rolled scripts.</p>
<p>(The absolutely proper way would be through <a href="https://netplan.io/">Canonical's netplan</a>, but netplan doesn't currently support WireGuard
or some of the other features that I need, so I have to use
systemd-networkd directly.)</p>
<p>The idea of the configuration is straightforward. My office workstation
has an IPv6-only WireGuard connection to the Ubuntu server, a static
IPv6 address in the subnet's regular /64 that's on the WireGuard
interface, and a default IPv6 route through the WireGuard interface.
The server does proxy <a href="https://en.wikipedia.org/wiki/Neighbor_Discovery_Protocol">NDP</a> for my
office workstation's static IPv6 address and then forwards traffic
back and forth as applicable.</p>
<p>On the server, we have three pieces of configuration. First, we
need to configure the WireGuard interface itself, in a <a href="https://www.freedesktop.org/software/systemd/man/latest/systemd.netdev.html">networkd
.netdev file</a>:</p>
<blockquote><pre style="white-space: pre-wrap;">
[NetDev]
Name=ipv6-wg0
Kind=wireguard
[WireGuard]
PrivateKey=[... no ...]
ListenPort=51821
[WireGuardPeer]
PublicKey=[... also no ...]
AllowedIPs=<workstation IPv6>/128,fe80::/64
Endpoint=<workstation IPv4>:51821
</pre>
</blockquote>
<p>We have to allow fe80::/64 as well as the global IPv6 address because
<a href="https://utcc.utoronto.ca/~cks/space/blog/tech/WireGuardAndLinkLocalIPv6">in the end I decided to give this interface some IPv6 link local
IPs</a>.</p>
<p>The second thing we need is a <a href="https://www.freedesktop.org/software/systemd/man/latest/systemd.network.html">networkd .network file</a>
to configure the server's side of the WireGuard interface. This must
both set our local parameters and configure a route to the global IPv6
address of my workstation:</p>
<blockquote><pre style="white-space: pre-wrap;">
[Match]
Name=ipv6-wg0
[Network]
# Or make up a random 64 bit address
Address=fe80::1/64
IPForward=yes
# Disable things we don't want
# Some of this may be unnecessary.
DHCP=no
IPv6AcceptRouterAdvertisements=no
LLMNR=false
[Route]
Destination=<workstation IPv6>/128
[Link]
# Not sure of this value, safety precaution
MTUBytes=1359
RequiredForOnline=no
</pre>
</blockquote>
<p>(If I was doing this for multiple machines, I think I would need
one [Route] section per machine.)</p>
<p>The one thing left to do is make the server do proxy <a href="https://en.wikipedia.org/wiki/Neighbor_Discovery_Protocol">NDP</a>, which
has to be set on the Ethernet interface, not the WireGuard interface.
In Ubuntu 22.04, server Ethernet interfaces are managed through
netplan, but netplan has no support for setting up proxy NDP,
although <a href="https://www.freedesktop.org/software/systemd/man/latest/systemd.network.html#IPv6ProxyNDP=">networkd .network files support this</a>.
So we must go behind netplan's back. In Ubuntu 22.04, netplan on
servers creates systemd-networkd control files in /run/systemd/network,
and these files have standard names; for example, if your server's
active network interface is 'eno1', netplan will write a
'10-netplan-eno1.network' file. Armed with this we can create a
networkd dropin file in /etc/systemd/network/10-netplan-eno1.network.d
that sets up proxy NDP, which we can call, say, 'ipv6-wg0-proxyndp.conf':</p>
<blockquote><pre style="white-space: pre-wrap;">
[Network]
IPForward=yes
IPv6ProxyNDP=yes
IPv6ProxyNDPAddress=<workstation IPv6>
</pre>
</blockquote>
<p>With all of this set up (and appropriate configuration on my office
workstation), everything appears to work fine.</p>
<p>(On my office workstation, the WireGuard interface is configured
with both the workstation's link local IPv6 address, with a peer
address of the server's link-local address, and its global IPv6
address.)</p>
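<p>(For what it's worth, here is a rough sketch of the workstation's side
expressed as plain '<code>ip</code>' and '<code>wg</code>' commands instead of however you actually
manage your interfaces; the keys, addresses, and the fe80::2 link local address
are all placeholders or assumptions.)</p>
<blockquote><pre style="white-space: pre-wrap;">
ip link add ipv6-wg0 type wireguard
wg set ipv6-wg0 private-key /etc/wireguard/ipv6-wg0.key listen-port 51821 \
   peer <server public key> endpoint <server IPv4>:51821 allowed-ips ::/0
ip -6 addr add fe80::2/64 dev ipv6-wg0
ip -6 addr add <workstation IPv6>/128 dev ipv6-wg0
ip link set ipv6-wg0 up
# default IPv6 route through the tunnel, via the server's link local address
ip -6 route add default via fe80::1 dev ipv6-wg0
</pre>
</blockquote>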
<p>All of this is pretty simple once I write it out here, but getting
to this simple version took a surprising amount of experimentation
and a number of attempts. Although it didn't help that I decided
to switch to link local addresses after I'd already gotten a version
without them working.</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/Ubuntu2204WireGuardIPv6Gateway?showcomments#comments">6 comments</a>.) </div>Setting up an IPv6 gateway on an Ubuntu 22.04 server with WireGuard2024-02-26T21:43:53Z2023-11-17T03:59:19Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/DebianHoldPackageAptMarkcks<div class="wikitext"><p>In Debian (and thus Ubuntu), apt-get itself has no support for
selectively upgrading packages, unlike DNF based distributions. In
DNF, you can say 'dnf update package' or 'dnf update --exclude
package' (with wildcards) to only update the package or to temporarily
exclude package(s) from being updated. In apt-get, 'apt-get upgrade'
upgrades everything. In order to selectively upgrade packages in
modern apt-get, you can do 'apt-get install --only-upgrade package'
(although I believe this marks the package as manually installed).
In order to selectively exclude packages from upgrades, you need
to hold them.</p>
<p>When <a href="https://support.cs.toronto.edu/">we</a> started using Ubuntu,
holding and un-holding packages was an awkward process that involved
piping things into 'dpkg --set-selections' and filtering the output
of 'dpkg --get-selections'. Modern versions of Debian's apt suite
have improved this drastically with the addition of the <a href="https://manpages.debian.org/bookworm/apt/apt-mark.8.en.html">apt-mark</a>
command. Apt-mark provides straightforward sub-commands to hold and
unhold packages and to list held packages; 'apt-mark hold package'
(or a list of packages), 'apt-mark unhold package', and 'apt-mark
showhold'. For extra convenience, the package names can include
wildcards and apt-mark will do the right thing, or more or less the
right thing depending on your tastes:</p>
<blockquote><pre style="white-space: pre-wrap;">
apt-mark hold amanda-*
</pre>
</blockquote>
<p>Holding a package name with a wild card will hold everything that
the wildcard matches, whether or not it's installed on your system.
The wildcard above will match and hold the amanda-server package,
which we don't have installed in very many places, along with the
amanda-common and amanda-client packages. This is what you want in
some cases, but may be at least unaesthetic since you wind up holding
packages you don't have installed.</p>
<p>If you want to only hold packages you actually have installed you
need a dab of awk and probably you want to use 'dpkg --set-selections'
directly. What we use is:</p>
<blockquote><pre style="white-space: pre-wrap;">
dpkg-query -W 'amanda-*' |
awk 'NF == 2 {print $1, "hold"}' |
dpkg --set-selections
</pre>
</blockquote>
<p>(You can contrive a version that uses apt-mark but since apt-mark
wants the packages to hold on the command line it feels like more
work. Also, as an important safety tip, don't accidentally write
this with 'dpkg' instead of 'dpkg-query' and then quietly overlook
or throw away the resulting error message.)</p>
<p>Holding Debian packages is roughly equivalent to but generally
better than DNF's version-lock plugin. It's explicitly specified
as holding things regardless of version and will hold even uninstalled
packages if you want that, which is potentially useful to stop
things from getting dragged in. I have some things version-locked in
DNF on my Fedora machines and I always feel a bit nervous about it;
we feel no similar concerns on our Ubuntu machines, which routinely
have various packages held.</p>
<p>If you normally have various sensitive packages held to stop
surprise upgrades, the one thing to remember is that pretty much
anything you do to manually upgrade them is going to require you
to re-hold them again. If you want to use 'apt-get upgrade', you
need to un-hold them explicitly; if you 'apt-get install' them to
override the hold, the hold is removed. After one too many accidents,
we wound up automating having some standard holds applied to things
like kernels.</p>
<p>(Apt-mark can also be used to inspect and change the 'manually
installed' status of packages, in case you want to fix this status
for something you ran 'apt-get install' on to force an upgrade.)</p>
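<p>(Concretely, inspecting and changing that status looks like the following;
the package name is just an example.)</p>
<blockquote><pre style="white-space: pre-wrap;">
apt-mark showmanual | grep amanda
apt-mark auto amanda-client
</pre>
</blockquote>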
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/DebianHoldPackageAptMark?showcomments#comments">One comment</a>.) </div>Holding packages in Debian (and Ubuntu) has gotten easier over the years2024-02-26T21:43:53Z2023-11-09T02:39:20Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/NFSv4ServerLockClientscks<div class="wikitext"><p>A while back I wrote an entry about <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/NFSServerLockClients">finding which NFS client owns
a lock on a Linux NFS server</a>, which turned
out to be specific to NFS v3 (which I really should have seen coming,
since it involved NLM and lockd). Finding the NFS v4 client that
owns a lock is, depending on your perspective, either simpler or
more complex. The simpler bit is that I believe you can do it all
in user space; the more complex is that as far as I've been able
to dig, you have to.</p>
<p>Our first stop for NFS v4 locks is <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ProcLocksNotesIII">the NFS v4 information in
/proc/locks</a>. When you hold a (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/FlockFcntlAndNFS">POSIX, NFS</a>) lock, you will see an entry that looks like
this:</p>
<blockquote><pre style="white-space: pre-wrap;">
46: POSIX ADVISORY READ 527122 00:36:3211286 0 EOF
</pre>
</blockquote>
<p>This may be READ or WRITE, and it might have a byte range instead
of being 0 to EOF. The '00:36:3211286' is the filesystem identifier
(the '00:36' part, which is in hex) and then the inode number (in
decimal, '3211286'). The other number, 527122, is the process ID
of what is holding the lock. For a NFS v4 lock, this will always
be some nfsd process, where /proc/<pid>/comm will be 'nfsd'. You'll
have a number of nfsd processes (threads), and I don't know if it's
always the same PID in <a href="https://man7.org/linux/man-pages/man5/proc.5.html">/proc/locks</a>.</p>
<p>(In addition, read locks can sometimes appear only as DELEG READ
entries in /proc/locks, so they look exactly like simple client
opens. It's possible to see multiple DELEG entries for the same
file, if multiple NFS v4 clients have it open for reading and/or
shared locking. If some NFS v4 client then attempts to get an
exclusive lock to the file, the /proc/locks entry can change to
a POSIX READ lock.)</p>
<p>To find the client (or clients) with the lock, our starting point
is <a href="https://man7.org/linux/man-pages/man7/nfsd.7.html">/proc/fs/nfsd</a>/clients, which
contains one subdirectory for each client. In these subdirectories,
the file '<code>info</code>' tells you what the client's IP is (and the name
it gave the server), and '<code>states</code>' tells you about what things the
particular NFS client is accessing in various ways, including
locking. Each entry in '<code>states</code>' has a type, and this type can
include '<code>lock</code>', and in an ideal world all NFS v4 locks would show
up as a states entry of this type. Life is not so nice for us,
because the state entry for held locks can also be 'type: deleg',
and not all 'type: deleg' entries represent held locks, even for
a file that is locked.</p>
<p>A typical states entry for a NFS v4 client may look like this:</p>
<blockquote><pre style="white-space: pre-wrap;">
- 0x...: { type: lock, superblock: "00:36:3211286", filename: "locktest/fred", owner: "lock id:\x..." }
</pre>
</blockquote>
<p>A 'type: lock' entry can appear for either a shared lock or an exclusive
one. Alternately a states entry can look like this:</p>
<blockquote><pre style="white-space: pre-wrap;">
- 0x...: { type: deleg, access: r, superblock: "00:36:3211286", filename: "locktest/fred" }
</pre>
</blockquote>
<p>It's also possible to see both a 'type: deleg' and a 'type: lock'
states entries for a file that has been opened and locked only once
from a single client.</p>
<p>In all cases, the important thing is the 'superblock:' field, because
this is the same value that appears in /proc/locks.</p>
<p>So as far as I can currently tell, the procedure to find the probable
owners of NFS v4 locks is that first you go through /proc/locks and
accumulate all of the POSIX locks that are owned by a nfsd process,
remembering especially their combined filesystem and inode
identification. Then you go through the /proc/fs/nfsd/clients states
files for all clients, looking for any matching superblock: values
for 'type: lock' or 'type: deleg' entries. If you find a 'type:
lock' entry, that client definitely has the file locked. If you
find a 'type: deleg' entry, the client might have the file locked,
especially if it's a shared POSIX READ lock instead of an exclusive
WRITE lock; however, the client might merely have the file open.</p>
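<p>To make the procedure concrete, here is a minimal shell sketch of
it. This is not our eventual tool, just an illustration; it assumes
the /proc/locks line format shown above (ignoring blocked waiters),
assumes you run it as root on the fileserver, and simply prints each
candidate client's '<code>info</code>' file:</p>
<blockquote><pre style="white-space: pre-wrap;">
#!/bin/sh
# fields: $2 is the lock type, $5 the owning PID, $6 the 'fsid:inode' identifier
awk '$2 == "POSIX" { print $5, $6 }' /proc/locks |
while read pid fsid; do
    # only consider locks nominally held by kernel nfsd threads
    [ "$(cat /proc/$pid/comm 2>/dev/null)" = "nfsd" ] || continue
    # any client with a matching lock or delegation entry may hold the lock
    for st in /proc/fs/nfsd/clients/*/states; do
        if grep -q "superblock: \"$fsid\"" "$st"; then
            echo "== $fsid possibly locked by client:"
            cat "$(dirname "$st")/info"
        fi
    done
done
</pre>
</blockquote>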
<p>If you want to see what a given NFS v4 client (might) have locked, you
can do this process backward. Read the client's /proc/fs/nfsd/clients
states file, record all superblock: values for 'type: lock' or 'type:
deleg' entries, and then see if they show up as POSIX locks in
/proc/locks. This won't necessarily get all shared locks (which
may show up as merely delegations in both the client's states file
and in /proc/locks).</p>
<p>(Presumably the information necessary to locate the locking client
or clients with more certainty is somewhere in the kernel data
structures. However, so far I've been unable to figure it out in
the way that <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/NFSServerLockClients">I was able to pull out the NFS v3 lock owner information</a>.)</p>
<p>PS: I'm going to be writing a Python tool for our use based on this
digging, so I may get to report back later with corrections to this
entry. For our purposes we care more about exclusive locks than
shared locks, which makes this somewhat easier.</p>
<h3>Sidebar: /proc/locks filesystem identifiers</h3>
<p>The '00:36' subfield in /proc/locks that identifies the filesystem
is the major and minor device numbers of the <a href="https://man7.org/linux/man-pages/man2/stat.2.html">stat(2)</a> <code>st_dev</code> field
for files, directories, and so on on the filesystem. To determine
these without stat'ing something on every filesystem, you can look
at the third field of every line in /proc/self/mountinfo, with the
proviso that /proc/self/mountinfo gives these values in decimal
while /proc/locks has them in hex.</p>
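<p>For example, here is one way to turn mountinfo's decimal 'major:minor'
values into the hex form that /proc/locks uses, mapping each identifier
to its mount point (a quick sketch; the field positions are as documented
in <a href="https://man7.org/linux/man-pages/man5/proc.5.html">proc(5)</a>):</p>
<blockquote><pre style="white-space: pre-wrap;">
; awk '{ split($3, mm, ":"); printf "%02x:%02x %s\n", mm[1], mm[2], $5 }' /proc/self/mountinfo
</pre>
</blockquote>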
<p>(Unfortunately <a href="https://man7.org/linux/man-pages/man1/stat.1.html">stat(1)</a> doesn't provide
the major and minor numbers separately, and its unified reporting
doesn't match /proc/locks if the 'minor' number gets large enough.)</p>
</div>
Finding which NFSv4 client owns a lock on a Linux NFS(v4) server2024-02-26T21:43:53Z2023-10-31T03:02:04Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/GetActiveNetworkInterfacescks<div class="wikitext"><p>Suppose, not entirely hypothetically, that <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SystemdResolvedWithDNSResolvers">we want to start using
systemd-resolved on our Ubuntu 22.04 machines</a>. One of the challenges of this
is that the whole networking environment is configured through
<a href="https://netplan.io/">netplan</a>, and in order for systemd-resolved
to work well this means that your netplan configuration must have
your full list of DNS resolvers and DNS search domains. We don't
normally set these in netplan, because it's kind of a pain; instead
we copy in an /etc/resolv.conf afterward.</p>
<p>It is possible to make automated changes to your netplan setup
through <a href="https://netplan.readthedocs.io/en/stable/netplan-set/">netplan set</a>. However,
this needs to know the name of your specific Ethernet device, which
varies from system to system in these modern days. This opens up
the question of how do you get this name, and how do you get the
right name on multi-homed machines (you want the Ethernet device
that already has a 'nameservers:' line).</p>
<p>Netplan has <a href="https://netplan.readthedocs.io/en/stable/netplan-get/">netplan get</a> but by
itself it's singularly unhelpful. There are probably clever ways
to get a list of fully qualified YAML keys, so you could grep for
'ethernets.<name>.nameservers' and fish out the necessary name
there. Since netplan in our Ubuntu 22.04 server setup is relying on
systemd-networkd, we could ask it for information through <a href="https://www.freedesktop.org/software/systemd/man/networkctl.html">networkctl</a>,
but there's no straightforward way to get the necessary information.</p>
<p>(Networkctl does have a JSON output for 'networkctl list', but it's
both too much and too little information. The 'networkctl status'
output is sort of what you want but it's clearly intended for human
consumption, not scripts.)</p>
<p>In practice our best bet is probably to look at where the default
route points, which we can find with '<a href="https://man7.org/linux/man-pages/man8/ip-route.8.html">ip route</a> show default':</p>
<blockquote><pre style="white-space: pre-wrap;">
; ip route show default
default via 128.100.X.Y dev enp68s0f0 proto static
</pre>
</blockquote>
<p>Alternately, we could ask for the route to one of our resolvers,
especially if they're all on the same network:</p>
<blockquote><pre style="white-space: pre-wrap;">
; ip route get 128.100.X.M
128.100.X.M dev enp68s0f0 src 128.100.X.Q uid ...
cache
</pre>
</blockquote>
<p>In both cases we can pluck the 'dev <what>' out with something (for
example awk, or 'egrep -o' if you feel conservative). This will
give us the device name and we can then 'netplan set ethernets.<name>...'
as appropriate.</p>
<p>If you have JSON-processing tools handy, modern versions of
<a href="https://man7.org/linux/man-pages/man8/ip.8.html">ip</a> support
JSON output via '-json'. This reduces things to:</p>
<blockquote><pre style="white-space: pre-wrap;">
; ip -json route show default | jq -r .[0].dev
enp68s0f0
; ip -json route get 128.100.X.M | jq -r .[0].dev
enp68s0f0
</pre>
</blockquote>
<p>These days, I think it's increasingly safe to assume you have <a href="https://jqlang.github.io/jq/">jq</a> or some equivalent installed, and
this illustrates why.</p>
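<p>Putting the pieces together, a sketch of the whole operation might
look like the following. The nameserver and search domain values are
placeholders, and I haven't verified the exact list syntax that
'<code>netplan set</code>' wants for them, so check that part against
the netplan documentation before using it in anger:</p>
<blockquote><pre style="white-space: pre-wrap;">
dev="$(ip -json route show default | jq -r .[0].dev)"
netplan set "ethernets.$dev.nameservers.addresses=[128.100.X.M, 128.100.X.N]"
netplan set "ethernets.$dev.nameservers.search=[example.org]"
# and then 'netplan apply' when you're ready for it to take effect
</pre>
</blockquote>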
<p>In <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SystemdResolvedNotes">the world of systemd-resolved</a>, we
probably want Netplan's 'nameservers:' section attached to the
Ethernet interface that we use to talk to the DNS resolvers even
if our default route goes elsewhere. Fortunately in <a href="https://support.cs.toronto.edu/">our</a> environment it generally doesn't
matter because our Ubuntu servers almost never have more than one
active network interface.</p>
<p>(The physical servers generally come with at least two, but most
machines only use one.)</p>
<p>If we want all interfaces, we can reach for either '<a href="https://man7.org/linux/man-pages/man8/ip-address.8.html">ip -br addr</a>' or '<a href="https://man7.org/linux/man-pages/man8/ip-link.8.html">ip
-br link</a>',
although in both cases we'll need to screen out DOWN links and 'lo',
the loopback interface. If we know that all interesting interfaces
have an IPv4 (or IPv6) address, we can use this to automatically
exclude down interfaces:</p>
<blockquote><pre style="white-space: pre-wrap;">
; ip -4 -br addr
lo UNKNOWN 127.0.0.1/8
enp68s0f0 UP 128.100.X.Q/24
</pre>
</blockquote>
<p>(For IPv6, use -6.)</p>
<p>On some machines this may include a 'virbr1' interface that exists
due to (local) virtual machines.</p>
<p>(In some environments the answer is 'your servers all get this
information through DHCP'. In our environment all servers have
static IPs and static network configurations, partly because that
way they don't need a DHCP server to boot and get on the network.)</p>
<h3>Sidebar: the weird option of looking at the networkd configuration</h3>
<p>Netplan writes its systemd-networkd configuration to /run/systemd/network
in files that in Ubuntu 22.04 are called '10-netplan-<device>.network'.
Generally, even on a multi-interface machine exactly one of those
files will have a 'Gateway=' line and some 'DNS=' and 'Domains='
lines. This file's name has the network device you want to 'netplan
set'.</p>
<p>Actually relying on this file naming pattern is probably a bad idea.
On the other hand, you could find this file and extract the interface
name from it (it appears as 'Name=' in the '[Match]' section, due to
how Netplan sets up basic fixed networking).</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/GetActiveNetworkInterfaces?showcomments#comments">2 comments</a>.) </div>Getting the active network interface(s) in a script on Ubuntu 22.042024-02-26T21:43:53Z2023-10-13T03:37:04Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/TOTPMFAWithOathtoolcks<div class="wikitext"><p><a href="https://en.wikipedia.org/wiki/Time-based_one-time_password">Time-Based One-time Passwords (TOTP)</a> are
one of <a href="https://utcc.utoronto.ca/~cks/space/blog/tech/MFABasicOptionsIn2023">the most common ways of doing multi-factor authentication
today</a> and are, roughly speaking,
the only one you can use if the machine you're authenticating on
is a Linux machine. Especially, I believe they're the only one you
can use if you want a command-line way of generating your MFA
authentication codes. While there are a number of programs to
generate TOTP codes, perhaps the most widely available one is
<a href="https://www.nongnu.org/oath-toolkit/oathtool.1.html"><code>oathtool</code></a>,
part of <a href="https://www.nongnu.org/oath-toolkit/index.html">OATH Toolkit</a>.</p>
<p>There are a variety of tutorials on using oathtool to generate
TOTP codes on the Internet, but the ones I read generally slid
into gpg, and gpg is about where I nope out in any instructions.
So here is the simple version:</p>
<blockquote><pre style="white-space: pre-wrap;">
oathtool -b --totp @private/asite/totp-seed
</pre>
</blockquote>
<p>(If you want more familiar syntax, oathtool accepts '-' to mean to
read from standard input, so you can redirect into it or use <code>cat</code>.)</p>
<p>Most websites give you the text form of their TOTP seed in base32,
so we need to tell oathtool that. The totp-seed file should be
unreadable by anyone but you, of course.</p>
<p>If we want somewhat more security we can encrypt the TOTP seed at
rest and pipe it to oathtool:</p>
<blockquote><pre style="white-space: pre-wrap;">
magic-decrypt private/asite/totp-seed | oathtool -b --totp -
</pre>
</blockquote>
<p>The 'magic-decrypt' bit is where common instructions drag in gpg
and I tune out. If I had to do this today, I would use <a href="https://age-encryption.org/">age</a>, which can encrypt (and decrypt) using
a symmetric key with no fuss or muss.</p>
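<p>A minimal sketch of the age version of this, using age's passphrase
mode and a hypothetical '<code>.age</code>' file name:</p>
<blockquote><pre style="white-space: pre-wrap;">
# encrypt the seed once (age will prompt for a passphrase)
age -p -o private/asite/totp-seed.age private/asite/totp-seed
# later, generate a code (age prompts for the passphrase again)
age -d private/asite/totp-seed.age | oathtool -b --totp -
</pre>
</blockquote>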
<p>Some TOTP clients have a 'follow' mode where they will print out a
new TOTP code when the clock advances enough to require it. I don't
think oathtool can do this, but it can print out extra TOTP codes
after the current one (with '--window').</p>
<p>And as a little side note, the oathtool in Ubuntu 20.04 appears to
be non-functional for generating TOTP codes from base32 input, for
at least the one website I tried. The version on Ubuntu 22.04 works.
I don't know if this is a bug or some feature that the 20.04 oathtool
doesn't have.</p>
<p>PS: Possibly there is a better command line tool for this that's
packaged in Debian and Ubuntu, but oathtool is what I found in
casual Internet searches. There are definitely other command line
tools, eg <a href="https://github.com/yitsushi/totp-cli">totp-cli</a> and
<a href="https://github.com/arcanericky/totp">totp</a>.</p>
</div>
Brief notes on doing TOTP MFA with <code>oathtool</code>2024-02-26T21:43:53Z2023-10-02T03:12:30Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/SystemdResolvedWithDNSResolverscks<div class="wikitext"><p>Probably like many people, we have some machines that are set up
as local DNS resolvers. Originally we had one set for everyone,
both our own servers and <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/CSLabNetworkLayout">other people's machines on our internal
networks</a>, but after some recent
issues we want to make DNS resolution on our own critical servers
more reliable and are doing that <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/SplittingDNSResolvers">partly by having a dedicated
private DNS resolver for our servers</a>.
Right now all of our servers do DNS in the old fashioned way, with
a <code>nsswitch.conf</code> that tells them to use DNS and an <code>/etc/resolv.conf</code>
that points to our two (now three on some servers) DNS resolvers.
One of the additional measures I've been considering is whether we
want to consider using <a href="https://www.freedesktop.org/software/systemd/man/systemd-resolved.service.html">systemd-resolved</a>
on some servers.</p>
<p>Systemd-resolved has two features that make it potentially attractive
for making server DNS more reliable. The obvious one is that it
normally has a small cache of name resolutions (the <a href="https://www.freedesktop.org/software/systemd/man/resolved.conf.html#Cache="><code>Cache=</code></a>
configuration directive). Based on '<code>resolvectl statistics</code>' on a
few machines I have that are running systemd-resolved, this cache
doesn't seem to get very big and doesn't get very high a hit rate,
even on machines that are just sitting there doing nothing (and so
are only talking to the same few hosts over and over again). I
certainly don't think we can count on this cache to do very much
if our normal DNS resolvers stop responding for some reason.</p>
<p>The second feature is much more interesting, and it's that
systemd-resolved will rapidly switch to another DNS resolver if
your initial one stops responding. In situations where you have
multiple DNS servers (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SystemdResolvedNotFor">for a given network link or global setting</a>, because systemd-resolved thinks in those
terms), systemd-resolved maintains a 'current DNS server' and will
send all traffic to it. If this server stops responding, resolved
will switch over and then latch on whichever of your DNS servers
is still working. This makes the failure of your 'primary' DNS
server much less damaging than in a pure /etc/resolv.conf situation.
In normal <code>resolv.conf</code> handling, every program has to handle the
failover itself (and I think some runtime environments may always keep trying
the first listed '<code>nameserver</code>' and waiting for it to time out).</p>
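<p>As a concrete sketch, if you were setting the DNS servers globally
in systemd-resolved itself instead of through netplan and
systemd-networkd, a drop-in might look like the following, with
placeholder resolver IPs; '<code>resolvectl status</code>' will then show you which
of them resolved has currently latched on to:</p>
<blockquote><pre style="white-space: pre-wrap;">
# /etc/systemd/resolved.conf.d/resolvers.conf (hypothetical)
[Resolve]
DNS=128.100.X.A 128.100.X.B 128.100.X.C
Domains=example.org
Cache=yes
</pre>
</blockquote>
<p>In our case the per-link equivalent of this would come from netplan's
'nameservers:' section, which is the whole point of the exercise.</p>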
<p>The generally slow switching of nameservers listed in your resolv.conf
means that you really want the first DNS resolver to stay responsive
(whatever it is). Systemd-resolved makes it much less dangerous to
add another DNS resolver alongside your regular ones, as long as
you can trust it to not give wrong answers. If it stops working,
those systems using it will switch over to other DNS resolvers fast
enough that very little will notice.</p>
<p>(Unfortunately getting those systems to switch back may be annoying,
but in a sense you don't care whether or not they're using your
special private DNS resolver that's just for them or one of your
public DNS resolvers. If your public DNS resolvers get flooded by
other people's traffic and stop responding, systemd-resolved will
switch the systems back to your private DNS resolver again.)</p>
<p>PS: Of course <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SystemdResolvedLLMNRDelay">there are configuration issues with systemd-resolved
that you may need to care about</a>, but
very little is flawless.</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SystemdResolvedWithDNSResolvers?showcomments#comments">2 comments</a>.) </div>Some reasons to combine systemd-resolved with your private DNS resolver2024-02-26T21:43:53Z2023-09-27T03:19:50Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/OOMFromCgroupStatisticWishcks<div class="wikitext"><p>Under <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/OOMKillerWhen">ertain circumstances</a>, Linux will trigger
the Out-Of-Memory Killer and kill some process. For some time, there
have been two general ways for this to happen, either a global OOM
kill because the kernel thinks it's totally out of memory, or a
per-<a href="https://en.wikipedia.org/wiki/Cgroups">cgroup</a> based OOM kill
where <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/CgroupsForMemoryLimiting">a cgroup has a memory limit</a>.
These days the latter is quite easy to set up through <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SystemdUserMemoryLimits">systemd
memory limits, especially user memory limits</a>.</p>
<p>The kernel exposes a vmstat statistic for total OOM kills from all
causes, as '<code>oom_kill</code>' in <code>/proc/vmstat</code>; this is probably being
surfaced in your local metrics collection agent under some name.
Unfortunately, as far as I know the kernel doesn't expose a simple
statistic for how many of those OOM kills are global OOM kills
instead of cgroup OOM kills. This difference is of quite some
interest to people monitoring their systems, because a global
OOM kill is probably important while a cgroup OOM kill may be
entirely expected.</p>
<p>Each cgroup does have information about OOM kills in its hierarchy
(or sometimes itself only, if you used the memory_localevents
cgroups v2 mount option, per <a href="https://man7.org/linux/man-pages/man7/cgroups.7.html">cgroups(7)</a>). This
information is in the '<code>memory.events</code>' file, but as covered in
<a href="https://docs.kernel.org/admin-guide/cgroup-v2.html">the cgroups v2 documentation</a>, this file is
only present in non-root cgroups, which means that you can't find
a system wide version of this information in one place. If you know
on a specific system that only one top level cgroup can have OOM
kills, you can perhaps monitor that, but otherwise you need something
more sophisticated (and in theory you might miss transient top level
cgroups, although in practice most are persistent).</p>
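<p>(As a sketch of what 'something more sophisticated' could look like
at its simplest, on a pure cgroup v2 system you can sum the counters
from the top level cgroups, which aggregate their hierarchies, and
compare that against the overall vmstat number:</p>
<blockquote><pre style="white-space: pre-wrap;">
; awk '$1 == "oom_kill" { total += $2 } END { print total }' /sys/fs/cgroup/*/memory.events
; grep oom_kill /proc/vmstat
</pre>
</blockquote>
<p>How much the two numbers tell you about global versus cgroup OOM
kills depends on the caveats above, which is rather the point of this
entry.)</p>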
<p>The kernel definitely knows this information; the kernel log messages
for global OOM kills are distinctly different from the kernel log
messages for cgroup OOM kills. So the kernel could expose this
information, for example as a new /proc/vmstat field or two; it
just doesn't (currently, as of fall 2023).</p>
<p>(Someday <a href="https://support.cs.toronto.edu/">we</a> may add a Prometheus
cgroups metrics exporter to our host agents in <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusGrafanaSetup-2019">our Prometheus
environment</a> and so collect
this information, but so far I haven't found a cgroup exporter that
I like and that provides the information I want to know.)</p>
</div>
I wish Linux exposed a 'OOM kills due to cgroup limits' kernel statistic2024-02-26T21:43:53Z2023-09-25T03:25:25Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/NFSServerRestartLosesNFSv3Lockscks<div class="wikitext"><p>A while back I wrote an article on <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/Ubuntu2204EnableNFSv4">enabling NFS v4 on an Ubuntu
22.04 fileserver (instead of just NFS v3)</a>,
where one of the final steps was to restart 'nfsd', the NFS server
daemon (sort of), with '<code>systemctl restart nfs-server</code>'. In that
article I said that as far as I could tell this entire process was
transparent to NFS v3 clients that were talking to the NFS server.
Unfortunately I have to take that back. <strong>Restarting '<code>nfs-server</code>'
will cause the NFS server to discard locks obtained by NFS v3
clients</strong>, without telling the NFS v3 clients anything about this.
This results in the NFS v3 clients thinking that they hold locks
while the NFS server believes that everything is unlocked and so
will allow another client to lock it.</p>
<p>(What happens with NFS v4 clients is more uncertain to me; they
may more or less ride through things.)</p>
<p>On Linux, the NFS server is in the kernel and runs as kernel
processes, generally visible in process lists as '<code>[nfsd]</code>'. You
might wonder how these processes are started and stopped, and the
answer is through a little user-level shim, <a href="https://man7.org/linux/man-pages/man8/nfsd.8.html"><code>rpc.nfsd</code></a>. What this
program actually does is write to some files in <a href="https://man7.org/linux/man-pages/man7/nfsd.7.html">/proc/fs/nfsd</a> that control
the portlist, the NFS versions offered, and the number of kernel
nfsd threads that are running. To restart (kernel) NFS service, the
nfs-server.service unit first stops it with 'rpc.nfsd 0', telling
the kernel to run '0' nfsd threads, and then starts it again by
writing some appropriate number of threads into place, which starts
NFS service. The nfs-server.service systemd unit also does some
other things.</p>
<p>(As a side note, you can see what NFS versions your NFS server is
currently supporting by looking at /proc/fs/nfsd/versions. Sadly
this can't be changed while there are NFS server threads running.)</p>
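<p>For illustration, the by-hand equivalent of what the nfs-server unit
does, minus the other things it takes care of, is roughly the following.
The thread count of 8 is just an example, and you shouldn't do this
casually on a production fileserver for exactly the lock-dropping
reasons this entry is about:</p>
<blockquote><pre style="white-space: pre-wrap;">
; cat /proc/fs/nfsd/versions
; cat /proc/fs/nfsd/threads
# stop NFS service (this is what 'rpc.nfsd 0' does) ...
; echo 0 > /proc/fs/nfsd/threads
# ... and start it again with some number of threads
; echo 8 > /proc/fs/nfsd/threads
</pre>
</blockquote>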
<p>If you restart the kernel NFS server either with '<code>systemctl restart
nfs-server</code>' or by hand by writing '0' and then some number to
/proc/fs/nfsd/threads, the kernel will completely drop knowledge
of all locks from NFS v3 clients. Unfortunately running '<a href="https://man7.org/linux/man-pages/man8/sm-notify.8.html"><code>sm-notify</code></a>' doesn't
seem to recover them; they're just gone.
Locks from NFS v4 clients suffer a somewhat less predictable and
certain fate. If the NFS v4 client is actively doing NFS operations
to the server, its locks will generally be preserved over a '<code>systemctl
restart nfs-server</code>'. If the client isn't actively doing NFS
operations and doesn't do any for a while, I'm not certain that its
locks will be preserved, and certainly they aren't immediately there
(they seem to only come back when the NFS v4 client re-attaches to
the server).</p>
<p>Looked at from the right angle, this makes sense. The kernel has
to release locks from NFS clients when it stops being an NFS server,
and a sensible signal that it's no longer an NFS server is when
it's told to run zero NFS threads. However, it does seem to lead
to an unfortunate result for at least NFS v3 clients.</p>
</div>
Restarting nfs-server on a Linux NFS (v3) server isn't transparent2024-02-26T21:43:53Z2023-09-21T03:12:58Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/UserIOCanBeSystemTimecks<div class="wikitext"><p>Recently, our <a href="https://en.wikipedia.org/wiki/Internet_Message_Access_Protocol">IMAP</a>
server had unusually high CPU usage and was increasingly close to
saturating its CPU. When I investigated with 'top' it was easy to
see the culprit processes, but when I checked what they were doing
with the <code>strace</code> command, they were all busy madly doing IO, in
fact <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/DovecotIndexesAndLIST">processing recursive IMAP <code>LIST</code> commands</a> by walking around in the
filesystem. Processes that intensely do IO like this normally wind
up in <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/IowaitAndCPUUtilization">"iowait"</a>, not in
active CPU usage (whether user or system CPU usage). Except here
these processes were, using huge amounts of system CPU time.</p>
<p>What was happening is that these IMAP processes trying to do recursive
IMAP LISTs of all available 'mail folders' had managed to escape
into '<code>/sys</code>'. The processes were working away more or less endlessly
because <a href="https://www.dovecot.org/">Dovecot</a> (the IMAP server
software we use) makes the entirely defensible but less common
decision to follow symbolic links when <a href="https://utcc.utoronto.ca/~cks/space/blog/unix/DirectoryTraversalAndSymlinks">traversing directory trees</a>, and Linux's <code>/sys</code> has a
lot of them (and may have ones that form cycles, so a directory
traversal that follows symbolic links may never terminate). Since
<code>/sys</code> is a virtual filesystem that is handled entirely inside the
Linux kernel, traversing it and reading directories from it does
no actual IO to actual disks. Instead, it's all handled in kernel
code, and all of the work to traverse around it, list directories,
and so on shows up as system time.</p>
<p>Operating on a virtual filesystem isn't the only way that a program
can turn a high IO rate into high system time. You can get the same
effect if you're repeatedly re-reading the same data that the kernel
has cached in memory. Since the kernel can satisfy your IO requests
without going to disk, all of the effort required turns into system
CPU time inside the kernel. This is probably easiest to have happen
with reading data from files, but you can also have programs that
are repeatedly scanning the same directories or calling <code>stat()</code>
(or <code>lstat()</code>) on the same filesystem names. All of those can wind
up as entirely in-kernel activities because the modern Linux kernel
is very good at caching things.</p>
<p>(Most people's IMAP servers don't have the sort of historical
configuration issues we have that create these exciting adventures.)</p>
</div>
A user program doing intense IO can manifest as high system CPU time2024-02-26T21:43:53Z2023-09-14T02:11:06Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/NFSv4IdmapdDomaincks<div class="wikitext"><p>As mentioned in the <a href="https://man7.org/linux/man-pages/man5/nfsidmap.5.html">nfsidmap(5)</a> manual page,
NFS v4 represents UIDs and GIDs as 'id@domain' strings in contexts
like stat(2) results and thus, for example, 'ls -l' output (this
was explained to me in a comment on <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/NFSv4BasicsJustWork">this entry</a>).
If you want your NFS v4 mounts to look like your NFS v3 mounts and
work transparently, the server and the client need to agree on the
domain, although the exact domain probably doesn't matter. As I
mentioned in <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/Ubuntu2204EnableNFSv4">my entry on enabling NFS v4</a>,
I feel that you might want to set this explicitly rather than count
on Linux getting it right (on both the server and all of the clients).</p>
<p>The <a href="https://man7.org/linux/man-pages/man5/nfsidmap.5.html">nfsidmap(5)</a> manual page more or less documents how Linux
determines the NFS v4 domain. First, it checks for an explicit
setting in /etc/idmapd.conf:</p>
<blockquote><pre style="white-space: pre-wrap;">
[General]
Domain = cs.toronto.edu
</pre>
</blockquote>
<p>If NFS v4 finds no explicit setting, it starts going through a
lookup process, which you can find the code for in
'<code>domain_from_dns()</code>' in <a href="http://git.linux-nfs.org/?p=steved/nfs-utils.git;a=blob;f=support/nfsidmap/libnfsidmap.c">support/nfsidmap/libnfsidmap.c</a>.
We get the system's hostname, look up its IP with <a href="https://man7.org/linux/man-pages/man3/gethostbyname.3.html">gethostbyname(3)</a> (which
doesn't necessarily actually do DNS), take the '<code>h_name</code>' field
of the '<code>struct hostent</code>' returned by <a href="https://man7.org/linux/man-pages/man3/gethostbyname.3.html">gethostbyname(3)</a> as the
host's fully qualified name, and take the 'domain' to be everything
after the first dot. If <a href="https://man7.org/linux/man-pages/man3/gethostbyname.3.html">gethostbyname(3)</a> failed and the hostname
has a dot in it, everything after that dot is the 'domain' (otherwise
we fail). Having determined the domain, we do a DNS TXT lookup for
'<code>_nfsv4idmapdomain.<domain></code>'. If that returns a result, it's
taken as our NFS v4 domain; otherwise, the previously determined
'DNS' domain is our NFS v4 domain.</p>
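<p>If you want to check what a particular host will end up with, I
believe the <code>nfsidmap</code> program from nfs-utils can print the
effective NFS v4 domain with its '-d' option, and you can query the
TXT record by hand (with a placeholder domain here):</p>
<blockquote><pre style="white-space: pre-wrap;">
; nfsidmap -d
; dig +short TXT _nfsv4idmapdomain.dev.example.org
</pre>
</blockquote>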
<p>I'm not quite sure exactly when idmapd or other programs act to
determine the NFS v4 domain, but it seems to happen no later than
when they first have to do this translation for NFS v4 requests and
responses. The current code caches the result, rather than redoing
the lookup every time, so I believe the first result obtained will
be sticky until the relevant daemon is restarted (I think usually
<a href="https://man7.org/linux/man-pages/man8/idmapd.8.html">idmapd(8)</a>).</p>
<p>(As far as I can tell, all of this is the same on both the NFS
server and all of the NFS clients. As far as I know they all run
idmapd and so run this code, although obviously each of them has
their own idmapd.conf, their own hostname, and so on.)</p>
<p>Given this, we can make some observations. First, the default
(DNS-based) NFS v4 domain determination gives the same result for
all hosts under a particular subdomain. If you have two groups
separately operating NFS v4 fileservers under a common DNS subdomain,
such as 'dev.example.org', they're both going to get the same NFS
v4 domain name even if they have different (or even conflicting)
Unix login names. Probably you want one or both of them to set an
explicit NFS v4 domain name in /etc/idmapd.conf.</p>
<p>If you use the _nfsv4idmapdomain DNS TXT lookup feature to
provide your NFS v4 domain name, you're obviously dependent on DNS
working. Otherwise it depends on how you have your hostnames set
up. If your hostnames are the system's fully qualified domain name
(so 'fred.dev.example.org'), then even if DNS isn't working you'll
get the same NFS v4 domain name because the current code will fall
back to the 'everything after the first dot in the hostname' case.
If you have your hostnames set to either the bare hostname (so
'fred') or the hostname with your subdomain only (so 'fred.dev'),
the hostname fallback will either fail or generate a different
result than normal, and so you're dependent on DNS to get the right
NFS v4 domain name.</p>
<p>(If you have NFS v4 clients in different DNS subdomains and they
all mount from a NFS v4 server, you definitely need to set the
domain explicitly on some or all of them. The same is true if
your NFS v4 server(s) are in a different DNS subdomain than all
of the clients.)</p>
<p>If the default of your DNS domain is good enough as your NFS v4
domain name, setting an explicit domain in /etc/idmapd.conf is only
insurance against odd DNS issues or outright DNS failures. It also
insulates you against accidents with /etc/hosts and other <a href="https://man7.org/linux/man-pages/man5/nsswitch.conf.5.html">nsswitch.conf</a> fun
and games (for example, with <a href="https://www.freedesktop.org/software/systemd/man/nss-myhostname.html">nss-myhostname</a>).</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/NFSv4IdmapdDomain?showcomments#comments">One comment</a>.) </div>Linux NFS v4 idmapd domain handling and server/client agreement2024-02-26T21:43:53Z2023-08-27T02:33:29Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/Ubuntu2204EnableNFSv4cks<div class="wikitext"><p>In Ubuntu 22.04 and other modern Linux distributions, the way you
prevent <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSFileserverSetupIII">your fileservers</a> from doing <a href="https://en.wikipedia.org/wiki/Network_File_System">NFS</a> v4 is to set
some options in <a href="https://man7.org/linux/man-pages/man5/nfs.conf.5.html">/etc/nfs.conf</a>. The default
is to support NFS v4, so you need to change that:</p>
<blockquote><pre style="white-space: pre-wrap;">
[nfsd]
vers4=n
vers4.0=n
vers4.1=n
vers4.2=n
</pre>
</blockquote>
<p>In theory, <a href="https://man7.org/linux/man-pages/man8/rpc.nfsd.8.html">rpc.nfsd(8)</a> says that
setting all of these is overkill and all you need is '<code>vers4=n</code>'.
I haven't actually tested that; we turned everything off. You can
set this through /etc/nfs.conf.d/ drop in files if you want to,
instead of directly changing /etc/nfs.conf. As I found out, <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/MountdNoNFSv4SwitchPointless">it's
not necessary to tell mountd to not do NFS v4</a>, and in any case it also turns out
that <a href="https://man7.org/linux/man-pages/man8/mountd.8.html">mountd</a>
reads /etc/nfs.conf too for the 'vers...' settings.</p>
<p>But suppose, not hypothetically, that you want to take your NFS v3
only fileservers and make them support basic non-Kerberos NFS v4
as well, because you're in the process of moving to NFS v4 for
<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/Ubuntu2204NFSv3LockingProblem">reasons</a>. What do you need to do,
and how disruptive is it? Based on our experimentation, here are
the answers for Ubuntu 22.04 fileservers and NFS clients.</p>
<ul><li>Probably, explicitly configure your NFS v4 ID to name domain
in /etc/idmapd.conf. This is normally deduced from DNS information,
as covered in <a href="https://man7.org/linux/man-pages/man8/idmapd.8.html">idmapd(8)</a>, but you may
want to set it to be sure that NFS servers (and clients) aren't
confused if you have DNS problems for some reasons (<a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/DNSResolverQueryLimitsIssue">which we've
had happen in the past</a>).</li>
<li>If you changed idmapd.conf, restart the server: '<code>systemctl restart
nfs-idmapd</code>'. It's harmless to do this while you're not doing
NFS v4, even if it changes your NFS v4 domain.<p>
(Eventually you'll want to make this change on your NFS clients
as well.)<p>
</li>
<li>Reverse your changes to nfs.conf so that NFS v4 is enabled; this
is the default state, so just removing your changes is enough.<p>
</li>
<li>Make your new nfs.conf NFS version settings take effect by
'<code>systemctl restart nfs-server</code>'.</li>
</ul>
<p>A normal Ubuntu 22.04 NFS server already has all of the NFS v4 services
you need enabled and started by default (such as nfs-idmapd). It's just
that until you enabled NFS v4, they probably weren't doing anything.
This makes for a pleasantly minimal change.</p>
<p>You might wonder how systemd restarts the kernel NFS server, since
it's not a daemon process that can be stopped and started conventionally.
How this works is that stopping the kernel NFS server is done by
running '<code>rpc.nfsd 0</code>', which tells the kernel to have '0' NFS kernel
threads and thus shuts down the kernel NFS server. When rpc.nfsd starts
the kernel NFS server up again (with some number of threads), it passes
in your new 'support v4' information.</p>
<p>(Except it turns out that rpc.nfsd does this by writing information
to /proc/fs/nfsd/versions, which is sadly not covered in <a href="https://man7.org/linux/man-pages/man7/nfsd.7.html">nfsd(7)</a>. This file has
an obvious format, but can only be written to if there are no kernel
NFS server threads.)</p>
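<p>You can check the result of all of this by looking at what the
kernel now reports. On a server with NFS v4 enabled I would expect
something like the following, although I haven't verified the exact
formatting across kernel versions; with NFS v4 disabled the '4'
entries show up with '-' instead of '+':</p>
<blockquote><pre style="white-space: pre-wrap;">
; cat /proc/fs/nfsd/versions
-2 +3 +4 +4.1 +4.2
</pre>
</blockquote>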
<p>As far as I can tell, this process is transparent to a NFS (v3)
client with filesystems mounted or even active at the time you do
this on the fileserver. The client may experience a brief pause in
NFS server response when the server restarts, but if so it wasn't
enough to cause client kernel messages to get logged about it. As
you would expect, the client's NFS mounts don't magically get changed
from NFS v3 mounts to NFS v4 mounts; instead this will only happen
when you unmount the filesystem and mount it again.</p>
<p>(A discussion of NFS v4 ID mapping and why you probably want to
explicitly configure your domain is beyond the scope of this entry.)</p>
</div>
Enabling NFS v4 on an Ubuntu 22.04 fileserver (instead of just NFS v3)2024-02-26T21:43:53Z2023-08-26T03:19:39Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/MountdNoNFSv4SwitchPointlesscks<div class="wikitext"><p>We're long time users specifically of NFS v3, not NFS v4, and so for a
long time we also did everything we could to disable NFS v4 on <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSFileserverSetupIII">our
NFS servers</a>. When our NFS servers became Linux
ones in 2018, one of the things we did was to run <a href="https://man7.org/linux/man-pages/man8/mountd.8.html"><code>mountd</code></a> with a command
line option to disable NFS v4:</p>
<blockquote><pre style="white-space: pre-wrap;">
/usr/sbin/rpc.mountd --no-nfs-version 4
</pre>
</blockquote>
<p>(This required a customized nfs-mountd.service systemd unit.)</p>
<p>It's likely that we're going to move to NFS v4 <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/Ubuntu2204NFSv3LockingProblem">due to NFS v3
locking problems</a>, and as part of
that we're testing the changes we need to make to our existing
fileservers. As part of that, today I noticed that all of our
fileservers were running mountd with this argument. Including <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/NFSv4BasicsJustWork">the
fileservers that we'd enabled NFS v4 on</a> and
had made NFS v4 mounts from.</p>
<p>Based on looking at the mountd code (in <a href="http://git.linux-nfs.org/?p=steved/nfs-utils.git;a=blob;f=utils/mountd/mountd.c">mountd.c</a>),
turning NFS v4 off (or on) in mountd has absolutely no effect. It
doesn't even stop a modern mountd from logging NFS v4 client
connections and disconnections, which look like this:</p>
<blockquote><pre style="white-space: pre-wrap;">
v4.2 client attached: 0x41a6accf64e644e2 from "128.100.X.X:840"
v4.2 client detached: 0x41a6accf64e644e2 from "128.100.X.X:840"
</pre>
</blockquote>
<p>At one level this is unsurprising, because in <a href="https://en.wikipedia.org/wiki/Network_File_System#NFSv4">NFS v4</a> the process
for mounting exports was moved into the main NFS protocol and so
the mountd daemon is no longer involved in it. Whether or not you
can do NFS v4 mounts from a Linux fileserver is purely up to the
kernel NFS server operating on port 2049, which depends on whether
NFS v4 was enabled in the kernel server (which in turn is configured
when <a href="https://man7.org/linux/man-pages/man8/nfsd.8.html">rpc.nfsd</a>
runs, and is normally set and controlled in <a href="https://man7.org/linux/man-pages/man5/nfs.conf.5.html">/etc/nfs.conf</a>).</p>
<p>At another level I wish mountd had told me at some point, even in
a warning message. We've been setting a pointless option and slightly
complicating our fileserver installs for years.</p>
<p>(You might wonder how mountd notices NFS v4 clients attaching and
detaching. The answer is in <a href="http://git.linux-nfs.org/?p=steved/nfs-utils.git;a=blob;f=support/export/v4clients.c">support/export/v4clients.c</a>,
and is that mountd monitors /proc/fs/nfsd/clients for things
appearing, disappearing, and changing. Mountd could skip doing this
if you told it that NFS v4 was disabled, but currently it doesn't.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/MountdNoNFSv4SwitchPointless?showcomments#comments">One comment</a>.) </div>Giving Linux mountd a '--no-nfs-version 4' argument does nothing2024-02-26T21:43:53Z2023-08-24T01:48:00Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/Ubuntu2204NFSv3LockingProblemcks<div class="wikitext"><p>I've recently written about things like <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/NFSServerLockClients">finding who owns NFS v3
locks on a Linux server</a>, <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/NFSServerBreakingLocks">breaking NFS locks
on 22.04</a>, and <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/NFSv4BasicsJustWork">experimenting with NFS v4</a>, where I mentioned in an aside that NFS v4
seemed better regarded for file locking. All of this work has been
quietly motivated by it becoming obvious to us that we have some
sort of NFS (v3) file locking problem on <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSFileserverSetupIII">our Ubuntu 22.04 ZFS
fileservers</a>.</p>
<p>Specifically, what we're seeing is <a href="https://utcc.utoronto.ca/~cks/space/blog/unix/NFSLocksStuckWorkaround">stuck NFSv3 locks</a>, where the NFS fileserver thinks
that a NFS client holds a lock but the NFS client's kernel disagrees.
This problem is new in Ubuntu 22.04 (we didn't see it on 18.04),
and seems to occur mostly for our IMAP server as it accesses people's
home directories. When it happens, <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSFileserverSetupIII">our fileservers</a> will claim that the IMAP server has a lock
on some mailbox in someone's home directory, but the IMAP machine
has no idea of it. At this point all further attempts to access the
mailbox in question hang, because Dovecot attempts to get a lock
first and that will fail.</p>
<p>(It's possible that other NFS clients are also seeing this issue
but the symptoms are less obvious on them. On the other hand, I
believe most of our NFS clients do very little NFS locking, and
presumably the volume of lock activity is a factor in triggering
this.)</p>
<p>Our habit with all of <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSFileserverSetupIII">our NFS fileservers</a>
is to freeze their kernel version unless there's a compelling reason
to go through the risks of an upgrade, so they're all behind the
current 22.04 kernels; these days, <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/LocalVarMailImprovement">this includes our IMAP server</a>. On the other hand, the Ubuntu
kernel source doesn't seem to have any changes to the relevant
sections of code from the kernel versions we're running, and I
didn't see anything in the changelogs. If upgrading the kernel fails
to resolve the problem (and I suspect that it won't help), then the
only other option I can see is moving to NFS v4 in the hopes that
its locking won't have the same issues. This is a rather bigger
change, and correspondingly is riskier, but at some point we may
have no real choice.</p>
<p>There are no kernel messages being logged on either the IMAP machine
or the ZFS fileservers. It's probably possible to use kernel
instrumentation to trace NFS lock and unlock operations on both the
server and clients in order to try to spot the point where an unlock
either fails or isn't done, but since very few lock operations go
wrong this would be a very high volume activity with relatively
little signal.</p>
<p>(And in general NFS v3 locks aren't very inspectable on the NFS
server; <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/NFSServerLockClients">you have to resort to diving into the kernel internals</a> to get what should be straightforward system
management information. <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ProcLocksNotesIII">It's rather easier to get information about
NFS v4 locks on the fileserver</a>.)</p>
<p>(Our use of ZFS may be a contributing factor here, per <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSOnLinuxRisksWithNFS">the potential
risks of using (Open)ZFS on Linux</a>.)</p>
</div>
We have a NFS v3 locking problem on our Ubuntu 22.04 servers2024-02-26T21:43:53Z2023-08-23T03:19:07Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/PackagingTakesWorkcks<div class="wikitext"><p>This should not be news, but <a href="https://mastodon.social/@cks/110631003280644912">I feel like saying it anyway</a> in light
of some recent events in the Linux distribution ecology:</p>
<blockquote><p>Hot take: packaging open source software is actual work (and is
sometimes what we demurely call 'non-trivial' in this field). I say
this as a sysadmin who has sometimes had to deal with the results of
not packaging software and then not keeping up with the state of the
software we didn't package but installed anyway.</p>
<p>(Sure, sometimes you get lucky and the packaging instructions are easy
to write (Debian rules, RPM specfiles, whatever Arch uses, etc). And
sometimes they aren't.)</p>
</blockquote>
<p>Part of the work of packaging software is in identifying and
collecting all of its dependencies, in the right version, and making
sure that the versions are all coherent with the rest of the system.
Some of it is in testing the resulting whole system. A certain
amount of it is in making a particular piece of software conform
to the standards you've set up for a particular environment or
distribution; for example, <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/DebianRightApacheConfig">Debian has a specific scheme for how
Apache is configured</a>, and the general
idea is used by Debian for a bunch of other software. Sometimes
you fix bugs or pull in as yet unreleased changes.</p>
<p>(Debian also writes a bunch of manpages.)</p>
<p>But a significant amount of packaging software is in turning a 'make
install' into something that is both confined and reversible, so
that people can reliably uninstall the package again. <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PackageSystemImportance">Reliable
and easy uninstalls is one of the big features of a good package
system</a>. Sometimes this is
simple, because the software only installs a few files in a few
places and you can easily list them all in the packaging instructions.
Sometimes this is a lot of work because the software normally wants
all sorts of things from the system, like new accounts and changes
to other software and so on.</p>
<p>(Various forms of containers all more or less punt on this, right
up to the extreme version of shipping a little Unix user space
system image. I take this as an illustration of how hard it can be
to do this; shipping an entire mini-Unix that's had your 'make
install' executed in it is certainly one brute force way out.)</p>
<p>PS: A related issue is that <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/PackagingMustBeInformed">distribution packaging needs to
consider how the upstream handles things</a>.
This is a subset of how packaging isn't a one-time blind thing; to
do it well, you need to know a certain amount about the upstream
and then to keep track of it (per my 'keeping up with the state of
the software' above). That's work too, and work that necessarily
stretches into the future.</p>
</div>
Packaging software is something that takes work2024-02-26T21:43:53Z2023-08-22T02:05:45Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/SSHToOldHostsOnFedora37cks<div class="wikitext"><p>Suppose, not entirely hypothetically, that you have an old embedded
device that's accessible over SSH (as well as other methods) and you
want to SSH into its console to see if you can get more information
than it exposes in its web interface. You've done this before, but
not for a while, and now when you try it on your Fedora 37 desktop:</p>
<blockquote><pre style="white-space: pre-wrap;">
; ssh root@dsl-modem
Unable to negotiate with dsl-modem port 22: no matching host key type found. Their offer: ssh-rsa
</pre>
</blockquote>
<p>This is <a href="https://utcc.utoronto.ca/~cks/space/blog/tech/OpenSSHAndSHA1DeprecationII">OpenSSH's deprecation of the SHA1-based 'ssh-rsa' signature
scheme</a> in action. This
particular device is so old that it only supports ssh-rsa (and some
obsolete <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/OpenSSHUnderstandingKeyOptions">key exchange algorithms and ciphers</a>, which I had already
had to re-enable earlier).</p>
<p>So I stuck '<code>HostKeyAlgorithms +ssh-rsa</code>' in my .ssh/config stanza
for this host and tried again, only to get the same error. It turns
out that this was incomplete and I also needed to add
'<code>PubkeyAcceptedAlgorithms +ssh-rsa</code>' (despite not doing user public
key authentication with this host). At this point I got another
error:</p>
<blockquote><pre style="white-space: pre-wrap;">
; ssh root@dsl-modem
Bad server host key: Invalid key length
</pre>
</blockquote>
<p>This is because Fedora raised the minimum RSA key size to 2048 bits,
and this old device also has an old, smaller key (probably 1024
bits, I haven't checked exactly). To set this, you need '<code>RSAMinSize
1024</code>'.</p>
<p>So, for this particular old device, I need all of the following
in my .ssh/config stanza for it:</p>
<blockquote><pre style="white-space: pre-wrap;">
KexAlgorithms +diffie-hellman-group1-sha1
HostKeyAlgorithms +ssh-rsa
PubkeyAcceptedAlgorithms +ssh-rsa
RSAMinSize 1024
Ciphers +3des-cbc
</pre>
</blockquote>
<p>I've listed these options in the order that I would discover that
I needed them if I was starting from scratch. First I'd need a key
exchange algorithm that both sides supported, then I would need
support for ssh-rsa keys, and finally I'd need a cipher that both
sides supported. The only mysterious one is the ssh-rsa case, where
I don't know why I need two configuration settings to enable this.</p>
<p>(This is the kind of entry I write because I never want to have to
work this out again, and maybe if it happens with a different key
type I'll remember that I needed to fiddle two options, not just
the obvious one.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SSHToOldHostsOnFedora37?showcomments#comments">2 comments</a>.) </div>What I need to SSH to old hosts on Fedora 37 (and probably later)2024-02-26T21:43:53Z2023-08-18T02:20:18Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/BlueToothForBackupInternetcks<div class="wikitext"><p>Suppose, <a href="https://mastodon.social/@cks/110873478011842341">not entirely hypothetically</a>, that your normal
DSL Internet connection is down (for example, because the local
phone company did something to your line and hasn't fixed it yet),
and you need to get Internet by tethering <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/HomeMachine2018">your Linux desktop
machine</a> to <a href="https://utcc.utoronto.ca/~cks/space/blog/tech/SmartphoneWhyIPhone">your smartphone</a>. The easiest way to do this is to be
using a modern Linux desktop along with NetworkManager and so on;
at that point you can basically click through the various GUIs to
connect to your phone's hotspot through wifi, a direct USB connection,
or BlueTooth, depending on what you have available. This will
handle joining the phone's ad-hoc wifi network, pairing over USB
and/or BlueTooth, and all of the other setup you need. However,
<a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/MyDesktopTour">I don't use a modern Linux desktop</a>.</p>
<p>I suspect that the second easiest way is to at least have NetworkManager
around and running, so that you can configure and activate those
connections by hand, possibly with command line tools like <code>nmcli</code>.
<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/NetworkManagerWhyConsidering">Although I've considered this sort of use of NetworkManager</a>, I haven't yet set it up so I can't
say how well it works.</p>
<p>The third easiest way, <a href="https://mastodon.social/@cks/110873781211771568">at least for me</a>, is to use a
direct USB connection between my desktop and my phone. Fedora 37
(and previous versions) more or less magically makes this work (for
my phone) so that an '<code>eth0</code>' USB Ethernet device appears. Once the
phone's network connection is present, <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/DHCPForBackupInternet">I can start a DHCP client
by hand on that interface to get connected</a>
and then <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/WireGuardBehindNAT">bring up my WireGuard tunnel through the phone's NAT</a>. Connecting the phone via USB also is probably
the fastest and most reliable connection between the two, and it
has the side benefit of keeping the phone charged instead of running
down its battery.</p>
<p>However, USB has the small drawback of using a cable, which is a
potential problem if for some reason you need to walk around while
carrying your phone (for example, <a href="https://mastodon.social/@cks/110879818831215145">if you're waiting for a call
from your phone company</a>).
My home desktop doesn't have wifi, but it does have a USB BlueTooth
dongle I got for other reasons and my phone supports tethering over
BlueTooth, so today I gave it a shot. It works, but the command
line incantations are a bit obscure and I couldn't find much in
Internet searches.</p>
<p>I'm not sure how I got my phone and my desktop paired over BlueTooth,
but it appears I managed this trick at some point. Possibly I used
to GUI tools from <a href="https://wiki.archlinux.org/title/Blueman">Blueman</a>,
but it looks like I have a bunch of the tools listed in <a href="https://wiki.archlinux.org/title/bluetooth">the Arch
Wiki page on BlueTooth</a>,
so I'm not sure which one I used. In any case, once the two had
been paired in the past, I could enable the connection with
'<code>bluetoothctl devices</code>' to get the MAC and then '<code>bluetoothctl
connect <MAC></code>' to connect my desktop to my phone. However, this
doesn't automatically create any sort of network device. To do that,
I needed to run an additional command, '<code>bt-network -c <MAC> nap</code>';
this command stays in the foreground and creates a '<code>bnep0</code>' network
device. After that, I did <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/DHCPForBackupInternet">my manual DHCP client stuff</a> on '<code>bnep0</code>' instead of '<code>eth0</code>' and it all
worked the same.</p>
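<p>Putting the whole sequence together, what I wind up running looks
roughly like this (with the MAC elided, and with whatever manual DHCP
client invocation you normally use):</p>
<blockquote><pre style="white-space: pre-wrap;">
; bluetoothctl devices
Device <MAC> <phone name>
; bluetoothctl connect <MAC>
; bt-network -c <MAC> nap &
# bt-network stays in the foreground, so background it or use a second
# terminal, then run your DHCP client of choice on 'bnep0'
</pre>
</blockquote>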
<p>(In theory <a href="https://man.archlinux.org/man/bt-network.1.en">bt-network</a>
supports 'gn', 'panu', or 'nap' connection modes and I sort of
thought I wanted 'panu' (see for example <a href="https://bluez.sourceforge.net/contrib/HOWTO-PAN">this page</a>). In practice
trying either 'panu' or 'gn' caused bt-network to dump core, while
using 'nap' made everything work even if it's not supposed to have.
On Fedora, bt-network comes from the bluez-tools package.)</p>
<p>I haven't tried to measure the speed or latency of BlueTooth as
compared with the USB connection. I would expect it to be worse,
and in general I'd rather use USB if I have a choice (and on something
with wifi, such as my work laptop, I'd try to use wifi as a fallback,
even though it's awkward in my current specific situation).</p>
<p>(There's <a href="https://wiki.archlinux.org/title/IPhone_tethering">an Arch Wiki guide on this</a>, but its by
hand example uses something called '<code>pand</code>', which is apparently a
long obsolete and removed program from older versions of BlueZ.)</p>
<p>PS: I have no idea how much of this would apply to Android phones.</p>
<p>PPS: An interesting side effect of connecting my desktop to my phone
over BlueTooth is that my desktop's audio became the phone's default
audio output. Probably there are ways to control this so that my
desktop doesn't advertise to the phone that it's an audio device
and I'm only using the phone for its Internet connection, but right
now I can't be bothered to figure them out. And in the mean time
it's cute to find out that the phone even sends track information
to my desktop (which I saw by leaving 'bluetoothctl' running after
I'd (re)connected).</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/BlueToothForBackupInternet?showcomments#comments">2 comments</a>.) </div>Getting my backup Internet connection through BlueTooth on Linux2024-02-26T21:43:53Z2023-08-13T02:52:59Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/NFSv4MountsNewFilesystemTypecks<div class="wikitext"><p>For more or less historical reasons, we currently use NFS v3 to
mount filesystems from <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSFileserverSetupIII">our fileservers</a>.
We're likely going to (slowly) move over to NFS v4 (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/NFSv4BasicsJustWork">after successful
experiments</a>), so I've been working on various
preparations for that, such as making sure <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/AutomounterReplacement">our automounter system</a> doesn't specifically force NFS v3
but instead leaves it up to the system to pick a version. In the
process of this, I've discovered a surprise.</p>
<p>On Linux, <strong>NFS v4 mounts have the filesystem type 'nfs4', not
'nfs'</strong>; the 'nfs' filesystem type is for NFS v3 (and NFS v2, if
you're using that, which you probably shouldn't be). Linux's normal
'<code>mount</code>' program will accept '<code>mount -t nfs ...</code>' and do a NFS v4
mount if the server supports it, but listing mounts with '<code>mount
-t nfs</code>' will only list NFS v3 ones, and the actual filesystem type
in /proc/mounts (aka /proc/self/mounts) and other things is 'nfs4'.</p>
<p>This ripples through to all sorts of things. If you're listing 'all
NFS mounts', you need to use '<code>mount -t nfs,nfs4</code>'. If you're
configuring <a href="https://github.com/prometheus/node_exporter">the Prometheus host agent</a> to exclude NFS mounts
from the filesystems it reports on, now you need to exclude another
filesystem type ('nfs4'). If you have something that scans
<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/NFSMountstatsIndex">/proc/self/mountstats</a> to report 'NFS' mount
information, you need to accept both 'fstype nfs' and 'fstype nfs4'
(or possibly handle 'nfs4' differently, since it has different NFS
operations). Once you're collecting information on NFS v4 mounts,
you may then need to make additional changes to things like alerts,
metrics dashboards, and so on to either include or exclude NFS v4
filesystems as appropriate.</p>
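<p>As a small illustration, if you want to see every NFS mount of either
variety straight from the kernel's view of things (rather than via
'<code>mount -t nfs,nfs4</code>'), something like this works:</p>
<blockquote><pre style="white-space: pre-wrap;">
; awk '$3 == "nfs" || $3 == "nfs4" { print $3, $2 }' /proc/self/mounts
</pre>
</blockquote>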
<p>I'm sure the Linux kernel has good internal reasons for doing this,
regardless of how I find it inconvenient for our purposes. The two
different filesystem types are defined and seem to be used in
<a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/nfs/fs_context.c">fs/nfs/fs_context.c</a>,
where they seem to have mostly the same contents in their '<code>struct
file_system_type</code>' but the kernel code distinguishes between them
when parsing mount options (in <code>nfs_fs_context_parse_monolithic()</code>
and the code it calls). Possibly this leads to additional differences
later in things like what VFS operations are supported; I haven't
read the kernel code in that much detail.</p>
<p>(NFS v3 and NFS v4 mounts have kernel level mount options that give
you the NFS version involved. A NFS v3 mount will have '<code>vers=3</code>'
and I believe always '<code>mountvers=3</code>'; a NFS v4 mount has a '<code>vers</code>'
that has things like '4.2', depending on the NFS v4 sub-version
you wind up using.)</p>
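<p>(If you want to check what a particular machine actually wound up
with, 'nfsstat -m' reports each NFS mount along with these flags,
or you can dig the '<code>vers=</code>' values out of /proc/self/mounts:)</p>
<blockquote><pre style="white-space: pre-wrap;">
$ nfsstat -m
# or, more crudely:
$ tr ',' '\n' < /proc/self/mounts | grep '^vers='
</pre>
</blockquote>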
</div>
On Linux, NFS v4 mounts are a different filesystem type than NFS (v3) mounts2024-02-26T21:43:53Z2023-08-05T01:34:20Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/PrometheusSystemdRestartMetriccks<div class="wikitext"><p>I recently wrote about <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SystemdRestartHidesProblems">how systemd's auto-restart of units can
hide problems</a>, where we discovered
this was hiding failures of <a href="https://github.com/prometheus/node_exporter">the Prometheus host agent</a> itself. This raises
the question of how and if we can monitor for this sort of thing
happening with <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusGrafanaSetup-2019">our Prometheus setup</a>. The answer turns out to
be more or less yes.</p>
<p>The host agent has a systemd collector, which as of 1.6.1 isn't
enabled by default (you enable it with '--collector.systemd'). This
collector has several additional pieces of information it can collect
from systemd; with '--collector.systemd.enable-restarts-metrics'
it will collect metrics on 'restarts', and with
'--collector.systemd.enable-start-time-metrics' it will collect
metrics on the start times of units. The first option enables a
node_systemd_service_restart_total metric and the second
enables a node_systemd_unit_start_time_seconds metric.</p>
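<p>(Putting this together, turning all of it on looks something like
the following sketch of the host agent's command line:)</p>
<blockquote><pre style="white-space: pre-wrap;">
node_exporter --collector.systemd \
   --collector.systemd.enable-restarts-metrics \
   --collector.systemd.enable-start-time-metrics
</pre>
</blockquote>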
<p>The unit start time metric is pretty straightforward; it's the Unix
timestamp of when the unit was last started, or '0' if the unit has
never been started. This includes units that have started but exited,
so you'll see the start time of a whole bunch of boot time units.
For units that aren't supposed to restart, you can detect persistent
restarts by an alert rule like this, although you'll definitely
want to be selective about what units you alert on (which I've
omitted from this example):</p>
<blockquote><pre style="white-space: pre-wrap;">
- alert: AlwaysRecentRestarts
  expr: (time() - node_systemd_unit_start_time_seconds) < (60*2)
  for: 10m
</pre>
</blockquote>
<p>(This is a similar idea to <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusDoingRebootAlerts">detecting recent reboots</a>; I'm using <code>time()</code>
instead of node_time_seconds so I don't have to wrestle with
label issues.)</p>
<p>The node_systemd_service_restart_total metric counts the number
of times a systemd unit has been restarted by a 'Restart=' trigger
since the last time the unit was started or restarted normally. In
the terms of george's comment on <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SystemdRestartHidesProblems">my entry</a>, these
are 'involuntary' restarts as opposed to 'voluntary' ones, and the information
comes from the systemd 'NRestarts' unit property.</p>
<p>Because this metric is reset to zero if you manually restart a unit,
in Prometheus terms you may want to consider this a gauge, not a
counter. However for many purposes using <code>rate()</code> instead of
<a href="https://prometheus.io/docs/prometheus/latest/querying/functions/#delta"><code>delta()</code></a>
probably makes for an alert that's more likely to trigger if things
keep restarting. You might want to write a PromQL alert expression
like this:</p>
<blockquote><pre style="white-space: pre-wrap;">
rate( node_systemd_service_restart_total[10m] ) > 3
and ( node_systemd_service_restart_total > 0 )
</pre>
</blockquote>
<p>The second clause avoids triggering the alert if you've manually
restarted the service since the last automatic restart.</p>
<p>Looking at metrics for our Ubuntu machines, I see a small number
of services that appear to auto-restart as an expected thing,
particularly 'getty@' and 'serial-getty@' services. Your local
environment may have others, so you probably want to check your
local systems to see what your services are like.</p>
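<p>(You can also ask systemd directly for a unit's involuntary restart
count, which is a handy way to spot-check what the metric will report;
the unit name here is just an example:)</p>
<blockquote><pre style="white-space: pre-wrap;">
$ systemctl show -p NRestarts getty@tty1.service
NRestarts=0
</pre>
</blockquote>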
<p>Whether you want to alert on too many automatic restarts (whatever
'too many' is for you), frequent restarts, or the inability of a
service to stay up for long is something that you'll have to decide
yourself. <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SystemdRestartHidesProblems">Our particular case</a>
wouldn't have triggered either of the example rules I've given here,
because the Prometheus host agent wasn't crashing all that often
(probably less than once a day, although I didn't really check).
Only an alert on 'there have been too many automatic restarts of
this' would have picked up the problem.</p>
<p>(Our case is tricky because <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SystemdRestartUseDelay">the host agent can die and be restarted
in situations that are more or less expected</a>,
like the host being out of memory. We don't really want to get a
cascade of alerts about that.)</p>
</div>
The Prometheus host agent's metrics for systemd unit restarts2024-02-26T21:43:53Z2023-08-03T02:46:05Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/SystemdRestartHidesProblemscks<div class="wikitext"><p>Today, more or less by coincidence, I discovered that <a href="https://github.com/prometheus/node_exporter">the Prometheus
host agent</a> on our
Linux machines was periodically crashing with an internal Go runtime
error (which had already been noticed by other people and filed as
<a href="https://github.com/prometheus/node_exporter/issues/2705">issue #2705</a>).
You might wonder how we could not notice the host agent for <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusGrafanaSetup-2019">our
monitoring, metrics, and alerting system</a> doing this, and part of the
answer is that the systemd service has a setting of '<a href="https://www.freedesktop.org/software/systemd/man/systemd.service.html#Restart="><code>Restart=always</code></a>'.</p>
<p>(We inherited this setting from <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SystemdRestartUseDelay">the Ubuntu package's .service
unit, which got it from the Debian package</a>.
We don't use the Ubuntu package any more, but we used its .service
file as the starting point for ours, and it's broadly sensible to
automatically restart the host agent if something goes wrong.)</p>
<p>There are a surprisingly large number of things that you probably
won't notice going away briefly. If you don't look into the situation,
it might seem like a short connectivity blip, or even be hidden
from you by programs automatically retrying connections or operations.
Telling systemd to auto-restart these things will thus tend to hide
their crashes from you, which may be surprising. Still, auto-restarting
and hiding crashes is likely better than having the service be down
until you can restart it by hand. We certainly would rather have
intermittent, crash-interrupted monitoring of our machines than not
have monitoring for (potentially) some time.</p>
<p>Whether you want to monitor for this sort of thing (and how) is an
open question. It's certainly possible that this is <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/AlertsNeverComprehensive">one of the
times where your monitoring isn't going to be comprehensive</a>, because it's infrequent
enough, low impact enough, and hard enough to craft a specific
alert.</p>
<p>(I'm not certain if I'm going to bother trying to craft an alert
for this, partly because there's not quite enough information exposed
in the Prometheus host agent's systemd metrics to make it easy, or
at least for me to be confident that it's easy. You do get the
node_systemd_service_restart_total metric, which counts how
many times a Restart= is triggered, but that doesn't necessarily
say why and some things are restarted normally, such as 'getty'
services.)</p>
<p>Even if we don't add a specific alert, in the future I'm going to
want to remember to check for this when we're doing things like
rolling out a new version of a program (such as the Prometheus host
agent). It wouldn't hurt to look at the logs or the metrics, just
in case. Of course there's a near endless number of things you can
look at just in case, but <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/SysadminAphorism">having stubbed my toe on this once</a> I may be more twitchy here for a while.</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SystemdRestartHidesProblems?showcomments#comments">5 comments</a>.) </div>Systemd auto-restarts of units can hide problems from you2024-02-26T21:43:53Z2023-08-01T02:04:34Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/LongTermSupportNoMoreFreecks<div class="wikitext"><p>One of the things that's quite popular with people out in the world
is being able to set up a Linux server and then leave it be for the
better part of a decade without having to reinstall it or upgrade
the distribution. <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/CentOSUsageCases">I believe this is a significant reason people
used CentOS</a>, and it's popular enough to support
similar things in other distributions. <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZombieDistroVersions">I'm not fond of these old
zombie distribution versions</a>, but even we
have some of them (running CentOS 7). However, I'm broadly pessimistic
about people being able to get this for free in the future (<a href="https://mastodon.social/@cks/110695820822065790">cf</a>), and I'm also
pessimistic about even the current five year support period you get
for things like Canonical's Ubuntu LTS releases. To put it one way,
Red Hat's move is not unique; <a href="https://mastodon.social/@cks/110613666965100965">Canonical is monetizing Ubuntu too</a>.</p>
<p>The reality is that <a href="https://mastodon.social/@cks/110697536782034559">reliable backports of security fixes is
expensive</a> (partly
because <a href="https://utcc.utoronto.ca/~cks/space/blog/programming/BackportsAreHard">backports are hard in general</a>).
The older a distribution version is, generally the more work is
required. To generalize somewhat, this work does not get done for
free; someone has to pay for it.</p>
<p>To date, this public good has broadly been provided for free for
various periods of time by Debian developers, Red Hat, Canonical,
and so on. <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/CentOSStreamBigChanges">Red Hat's switch from 'CentOS' to 'CentOS Stream'</a> and now <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/CentOSUsageCases">their change to how Stream works</a> marks Red Hat ceasing to provide this public
good for free; it's now <a href="https://mastodon.social/@cks/110757392256002113">fairly likely</a> to be a more or
less private, for pay thing. Canonical has never provided this
public good beyond five years (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/UbuntuBugReportsUseless">and in practice only to a limited
extent</a>), and now <a href="https://mastodon.social/@cks/110613666965100965">there are signs they're
going to limit this in various ways</a> (<a href="https://mastodon.social/@cks/110614121219974149">also</a>). Debian has sort
of provided this only <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/DebianVsUbuntuForUs">semi-recently</a>, in the
form of <a href="https://wiki.debian.org/LTS">non-official five year support</a>
(and <a href="https://wiki.debian.org/LTS/Extended">extended paid support</a>).
I'm not sure about the practical state of openSUSE but <a href="https://en.opensuse.org/Lifetime">see their
lifetime page for the current claims</a>.</p>
<p>(Oracle claims to provide extended support for free but <a href="https://mastodon.social/@cks/110697536782034559">I don't
trust Oracle one bit</a>.)</p>
<p>People using Linux distributions have for years been in the fortunate
position that companies with money were willing to fund a lot of
painstaking work and then make the result available for free. One
of the artifacts of this was free distributions with long support
periods. My view is that this supply of corporate money is in the
process of drying up, and with it will go that free long term
support. This won't be a pleasant process.</p>
<p>The whole thing is <a href="https://mastodon.social/@cks/110695820822065790">why I said that people who wanted a decade of
free support would need good luck</a>. Maybe a way can
be found to squeeze through the roadblocks that the people providing
the money are trying to throw in the way (and the money will keep
flowing, because one end game is that Red Hat and Canonical exit
the long term Linux distribution business).</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/LongTermSupportNoMoreFree?showcomments#comments">2 comments</a>.) </div>On the future of free long term support for Linux distributions2024-02-26T21:43:53Z2023-07-28T03:02:38Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/CentOSUsageCasescks<div class="wikitext"><p>The news of the time interval is that <a href="https://www.redhat.com/en/blog/furthering-evolution-centos-stream">Red Hat has stopped making
Red Hat Enterprise Linux source code generally available</a>,
although just as with <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/CentOSStreamBigChanges">the switch to 'CentOS Stream' from CentOS</a> their article doesn't put it that way.
This created difficulties for at least two CentOS replacement
distributions, <a href="https://almalinux.org/blog/future-of-almalinux/">forcing AlmaLinux to change what they are</a>. I don't have
much to say on this specific topic, but it has sparked a series of
exchanges about, for example, <a href="https://dissociatedpress.net/category/clone-wars/">the history of RHEL rebuilds</a> (<a href="https://funnelfiasco.com/blog/2023/07/14/ended-the-clone-wars-have/">via</a>). As
it happens, I have some views on why people would want to use a
free 'clone' (rebuild) of RHEL, as CentOS was before it became
CentOS Stream, partly based on personal experience.</p>
<p>Here are some major reasons people could want or need CentOS, at least
back in the era of CentOS, before CentOS Stream became your only option
from RHEL 8 onward:</p>
<ul><li>They want significantly more than the (maximum) five years of free
security updates and maybe bugfixes you can get elsewhere. I have
<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZombieDistroVersions">negative views on old zombie distribution versions</a>, but for a fixed function machine that
sits quietly in a corner, this is tempting; we have some of them
ourselves (our <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/ConsoleServerSetup">console server</a>
and our <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/CentralizeSyslog">central syslog server</a>).<p>
(<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/WhyCentOSPlusUbuntuHere">We used to have more such machines</a>.)<p>
</li>
<li>They need something that is fully compatible with a Linux that
some commercial vendor's software works on, and the best option
the vendor has is Red Hat Enterprise Linux (<a href="https://mastodon.social/@cks/110695820822065790">for as long as that
lasts</a>).
Possibly they want something that is essentially guaranteed to
properly run any software that runs on RHEL, so they don't
have to worry about how specific software behaves.<p>
</li>
<li>They need something that is actively supported by a vendor for
some piece of commercial software, and the vendor will only qualify
and support exact duplicates of RHEL.<p>
We used to run a piece of commercial software for which the best
options listed and supported by the vendor were 'RHEL/CentOS 7'.<p>
</li>
<li>They want a free Red Hat Enterprise Linux.</li>
</ul>
<p>(<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/CentOSStreamWhoFor">CentOS Stream is suitable for a different group of people</a>, <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/CentOSStreamSuitability">also</a>.)</p>
<p>The first reason doesn't need either 1:1 equivalence or ABI
compatibility. All our syslog server and console server need
is security updates.</p>
<p>The second reason needs at least ABI compatibility, but hopefully not
1:1 equivalence unless the software you're trying to run is extremely
picky. The less guarantees of ABI compatibility you have, the riskier
it is to use anything other than RHEL.</p>
<p>The third reason needs whatever the commercial vendor will support,
but typically the vendor is trying to minimize its costs and doesn't
want to test and qualify something that would be 'another Linux
distribution'. The vendor might accept ABI compatibility as good
enough, or it might decide that only 1:1 equivalence was low enough
risk to let it provide official support.</p>
<p>One might argue that people should be willing to pay for RHEL (or
another Linux distribution with long term security updates) in some
or all of these situations. I'm in a somewhat unusual situation,
but in general <a href="https://utcc.utoronto.ca/~cks/space/blog/tech/UniversitiesFreeAttraction">'free' means that you don't need to get approval
for things</a>. If it costs your
organization something every time someone adds a new Linux server
or virtual machine running <distribution>, possibly even if it's a
temporary one, then there is far more friction than if such a machine
is free (<a href="https://mastodon.social/@cks/110617041808296577">for whatever reason, including a site license</a>).</p>
<p>To mildly react to a bit of Ben Cotton's <a href="https://funnelfiasco.com/blog/2023/07/14/ended-the-clone-wars-have/">Ended, the clone wars
have?</a>, my
view is that people who want the middle two reasons would still
want a 'CentOS' even in a world where RHEL development started off
in the new CentOS Stream model. It's possible that the new model
CentOS Stream would provide sufficient ABI compatibility and so on
for vendor software to work on it and even for vendors to support
it, but Red Hat doesn't seem to have promised that and I'm dubious
about it (<a href="https://mastodon.social/@cks/110757392256002113">even apart from the as far as I know open question of
CentOS Stream security updates</a>).</p>
<p>(And to the extent that some updates to CentOS Stream would be
rolled back or superseded before they appeared in RHEL, I think
there definitely would be people interested in 'CentOS Stream updates
but only once they've appeared in RHEL'.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/CentOSUsageCases?showcomments#comments">One comment</a>.) </div>There's more than one reason that people used (or use) CentOS2024-02-26T21:43:53Z2023-07-27T02:48:32Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/IntelHyperthreadingSurprisecks<div class="wikitext"><p>Today <a href="https://mastodon.social/@cks/110736301909474551">I said something on the Fediverse</a>:</p>
<blockquote><p>Today my co-worker discovered that the SLURM job scheduler requires
your hyperthreading to be uniform across your CPU cores. Our latest
SLURM GPU nodes have Intel hybrid CPUs, which aren't uniform; they
have 24 cores but 32 threads total, because only the 8 performance
cores are hyperthreaded.</p>
<p>I guess we'll turn off hyperthreading. Thanks, Intel and SLURM.</p>
<p>(I'm sure people are going to discover much more fun with this.)</p>
</blockquote>
<p>These new GPU machines have <a href="https://www.intel.com/content/www/us/en/products/sku/230496/intel-core-i913900k-processor-36m-cache-up-to-5-80-ghz/specifications.html">Intel i9-13900K CPUs</a>.
Modern higher end Intel desktop CPUs have a split core model, with
a mix of better 'performance' cores and more power efficient
'efficient' cores. The 'efficient' cores are lower performance and
don't have hyperthreading. In the case of the i9-13900K, the split
is 8 performance and 16 efficient cores; with hyperthreading on,
you have 8 performance cores, 8 extra logical CPUs from the
hyperthreads on those cores, and then 16 efficient cores, for a
total of 32.</p>
<p>(See <a href="https://utcc.utoronto.ca/~cks/space/blog/tech/IntelDesktopCPUsSMT">my entry on sorting out Intel desktop hyper-threading for more</a>. This Intel CPU quirk has actually been
around for some time.)</p>
<p>The <a href="https://man7.org/linux/man-pages/man1/lscpu.1.html">lscpu(1)</a>
information for this Intel CPU is a little hard to decode unless
you know what's going on:</p>
<blockquote><pre style="white-space: pre-wrap;">
CPU(s): 32
[...]
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 1
</pre>
</blockquote>
<p>According to 'lscpu -e', Linux logical CPUs 0 through 15 are the
performance cores, with successive logical CPUs being hyperthread
pairs (so 0 and 1 are the same core, 2 and 3 are the same core, and
so on). Logical CPUs 16 through 31 are 'efficient' cores with lower
maximum clock speeds. This pairing isn't always how (Intel)
hyperthreading is done; <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/HomeMachine2018">my home desktop</a> has a
6 core hyperthreaded CPU, with the pairs being CPU 0 and 6, 1 and
7, and so on.</p>
<p>(I don't know what decides how this pairing works.)</p>
<p>It's not news that this non-uniform CPU distribution is likely to
cause heartburn for software; this is just <a href="https://support.cs.toronto.edu/">our</a> first encounter with it. That's
partly because these are probably our first machines with Intel's
non-uniform core and CPU structure. Future versions of <a href="https://en.wikipedia.org/wiki/Slurm_Workload_Manager">SLURM</a> will probably
be updated to deal with both the non-uniform hyperthreading and
perhaps the non-uniform CPU speeds.</p>
<p>It's worth noting that in theory you can already have non-uniform
hyperthreading on a system even without Intel doing weird things
in their CPUs. On a multi-socket server, you might wind up with
hyperthreading enabled on only one socket for some reason. It's
also possible to have <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/CPUNumbersNotContiguous">non-contiguous Linux CPU numbers</a>, for example because you've offlined one
socket on a dual-socket machine and have hyperthreading on.</p>
<p>Since I looked it up, there are two ways to disable <a href="https://en.wikipedia.org/wiki/Simultaneous_multithreading">SMT (Simultaneous
multithreading), aka hyperthreads</a> in the
Linux kernel whether or not your BIOS supports doing so. First, you
can add '<code>nosmt</code>' to your <a href="https://www.kernel.org/doc/html/next/admin-guide/kernel-parameters.html">kernel command line parameters</a>.
Second, you can change it during startup by writing 'off' to
<a href="https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-devices-system-cpu">/sys/devices/system/cpu/smt/control</a>,
which will also tell you the state of SMT on your systems. I don't
know what either option does to Linux's logical CPU numbering; if
you need (or want) sequential CPU numbering with SMT off, you may
need to disable SMT in the BIOS.</p>
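<p>In concrete terms, checking and then turning off SMT on a running
system is just:</p>
<blockquote><pre style="white-space: pre-wrap;">
# see the current state ('on', 'off', and so on)
$ cat /sys/devices/system/cpu/smt/control
# turn it off (as root)
$ echo off > /sys/devices/system/cpu/smt/control
</pre>
</blockquote>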
<p>(This might be a sysfs file you want to check or monitor if for
some reason you need to be sure that SMT is disabled or not available
on your systems.)</p>
<p>PS: Another option on these i9-13900Ks might be to offline
the efficiency cores and see if SLURM will be happy calling the
result a good old fashioned 8/16 socket. Since we're using these
as SLURM GPU nodes, where we traditionally don't care about the
CPU, losing the efficiency cores may not really matter.</p>
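<p>(A sketch of that, assuming the 'lscpu -e' numbering above where
the efficiency cores are logical CPUs 16 through 31; I haven't
actually tried this:)</p>
<blockquote><pre style="white-space: pre-wrap;">
# offline the efficiency cores (as root)
for cpu in $(seq 16 31); do
    echo 0 > /sys/devices/system/cpu/cpu$cpu/online
done
</pre>
</blockquote>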
<p>(I'm aware that some GPU computation jobs still want plenty of CPU.
People with those sort of jobs probably won't be happy with <a href="https://support.cs.toronto.edu/">our</a>
SLURM GPU nodes in general, which are mostly not 'powerful machines
with GPUs' but instead 'a (once) decent GPU in any machine we can
put it in', although we did at least bring all of the GPU nodes up
to 32 GB of RAM.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/IntelHyperthreadingSurprise?showcomments#comments">One comment</a>.) </div>Non-uniform CPU hyperthreading is here and can cause fun issues2024-02-26T21:43:53Z2023-07-19T02:49:21Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/KernelSyscallTracingAndErrnocks<div class="wikitext"><p>Suppose, not entirely hypothetically, that you want to print out
some information about every <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/FlockFcntlAndNFS"><code>fcntl()</code> lock call</a>
that fails, system-wide. These days this is relatively easy to do
with <a href="https://github.com/iovisor/bpftrace">bpftrace</a>, especially
since there are <a href="https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md#6-tracepoint-static-tracing-kernel-level-arguments">system call entry and exit tracepoints</a>.
However, you might reasonably wonder how the <a href="https://man7.org/linux/man-pages/man2/fcntl.2.html">fcntl(2)</a> system call
actually returns <code>errno</code>, the error code, and how this manifests
at the level of the sys_exit_fcntl syscalls tracepoint. As it
turns out, there's some tribal knowledge and peculiarities here.</p>
<p>First off, in most contexts inside the Linux kernel, errno values
are represented as negative values. If a call returns an error,
it will return, eg, '-ELOOP' (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/PortingKernelCodeChallenging">this can be the source of interesting
bugs</a>). This is how errno is reported
in (most) system call exits, including for <code>fcntl()</code>. So the answer
is that in tracepoint:syscalls:sys_exit_fcntl in bpftrace,
args->ret will be below zero. You don't have the system call arguments
handy in the exit handler, but you can write something like this using
<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/BpftraceStashingData">the pattern of capturing data for later</a>:</p>
<blockquote><pre style="white-space: pre-wrap;">
tracepoint:syscalls:sys_enter_fcntl
/args->cmd > 4/
{
    @fd[tid] = args->fd;
    @cmd[tid] = args->cmd;
    @flag[tid] = 1;
}

tracepoint:syscalls:sys_exit_fcntl
/@flag[tid] != 0/
{
    if (args->ret < 0) {
        printf("FAIL: fcntl(%u, %u, ...) = %ld for '%s' PID %lu UID %lu\n", @fd[tid], @cmd[tid], args->ret, comm, pid, uid);
    }
    delete(@fd[tid]);
    delete(@cmd[tid]);
    delete(@flag[tid]);
}
</pre>
</blockquote>
<p>You can turn ordinary errno numbers into the relevant errno name with
the '<code>errno</code>' command, although you'll have to make them positive again:</p>
<blockquote><pre style="white-space: pre-wrap;">
$ errno 9
EBADF 9 Bad file descriptor
</pre>
</blockquote>
<p>However, if you run a bpftrace program like this for long enough
you may begin to see very odd reported errnos that are, for example
'-512'. The <code>errno</code> command will not tell you about these and you
won't find them listed in sources like <a href="https://man7.org/linux/man-pages/man3/errno.3.html"><code>errno(3)</code></a>. The reason
for this is that these are basically internal use errno codes, which
you can find listed in the kernel's <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/errno.h">include/linux/errno.h</a>.
The most common one I've seen is -512, which is ERESTARTSYS. As for
why I'm seeing them, I'll quote the comment in the file:</p>
<blockquote><p>These should never be seen by user programs. To return one of
ERESTART* codes, signal_pending() MUST be set. Note that ptrace
can observe these at syscall exit tracing, but they will never be left
for the debugged user process to see.</p>
</blockquote>
<p>Unsurprisingly, if <a href="https://man7.org/linux/man-pages/man2/ptrace.2.html">ptrace()</a> can see them,
so can kernel tracepoints. Whether or not you make your bpftrace
code skip over reporting them is up to you, but I'm probably going
to do that (since these values are never returned to user level).</p>
<p>As a side note, if I'm reading the kernel source code correctly,
ERESTARTSYS is handled basically by moving the user process's
instruction pointer back to the start of the system call, so that
when the kernel returns to the process, the process just makes the
system call again. See <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/signal.c#n299">arch_do_signal_or_restart()</a>
in <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/signal.c">arch/x86/kernel/signal.c</a>.
This strikes me as simultaneously elegant and terrifying.</p>
<p>(This elaborates on <a href="https://mastodon.social/@cks/110707740921290456">a Fediverse post of mine</a>.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/KernelSyscallTracingAndErrno?showcomments#comments">One comment</a>.) </div>Some notes on errno when tracing Linux kernel system call results2024-02-26T21:43:53Z2023-07-14T02:51:53Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/ProcLocksNotesIIIcks<div class="wikitext"><p><a href="https://utcc.utoronto.ca/~cks/space/blog/linux/NFSv4BasicsJustWork">Now that I've tried out NFS v4 on Ubuntu 22.04</a>,
I have some additional notes on NFS v4 locks and /proc/locks, to go with
<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ProcLocksNotesII">my earlier notes on NFS (v3) locks</a>. It turns out
that there are some changes from NFS v3 to NFS v4, at least on your NFS
server.</p>
<p>Here is what an exclusive POSIX lock looks like on an Ubuntu 22.04
NFS v4 server:</p>
<blockquote><pre style="white-space: pre-wrap;">
4: POSIX ADVISORY WRITE 704 fc:02:669294 0 EOF
5: DELEG ACTIVE READ 704 fc:02:669294 0 EOF
</pre>
</blockquote>
<p>The equivalent information from <code>lslocks</code> is:</p>
<blockquote><pre style="white-space: pre-wrap;">
nfsd 704 DELEG READ 0 0 0 /...
nfsd 704 POSIX WRITE 0 0 0 /...
</pre>
</blockquote>
<p>There are two obvious differences in this. First, the process ID
owning the lock is for one of your nfsd processes, not lockd. This
is likely because in NFS v4, locking is integrated into the NFS
protocol instead of being <a href="https://utcc.utoronto.ca/~cks/space/blog/unix/NFSLocksStuckWorkaround">an additional set of protocols with
separate daemons</a>. The second is
that there is this strange 'DELEG ACTIVE' (pseudo-)lock. I believe
that this is a <a href="https://lwn.net/Articles/898262/">NFS v4 read delegation</a>,
where the server promises the client that it will be notified before
anyone else is allowed to write to the file. Read delegations
(usually) appear when a file is opened on the NFS client, not just
when you take a lock, but since Unix locking works on file descriptors,
you necessarily have to have an open file before you can get a lock.
These delegations may linger after the relevant process on the NFS
client has closed the file in question.</p>
<p>(Looking at the source code, it appears that a 'LEASE' type is also
possible in general, although I don't know if it appears in NFS v4.
The 'ACTIVE' status can also be 'BREAKING' or 'BREAKER'. All of
this is in the kernel's <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/locks.c">lock_get_status() in fs/locks.c</a>.)</p>
<p>The process ID may be less predictable than it is for NFS v3 locks,
since there are generally multiple nfsd processes and I'm not sure
if there is any consistency about which one becomes the owner of
any particular lock. You may need to check against every nfsd process
ID if you want to identify NFS v4 server locks.</p>
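<p>(A sketch of doing that check by hand, assuming your nfsd kernel
threads show up under the name 'nfsd' the way they do here:)</p>
<blockquote><pre style="white-space: pre-wrap;">
# all of the nfsd kernel thread PIDs
$ pgrep -x nfsd
# any /proc/locks entries owned by one of them
$ grep -E " ($(pgrep -x nfsd | paste -sd'|')) " /proc/locks
</pre>
</blockquote>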
<p>Although I'm not completely sure what's going on in NFS v4, getting
a shared lock on a file opened for reading with either <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/FlockFcntlAndNFS"><code>fcntl()</code>
or <code>flock()</code></a> produces no visible POSIX lock on
the NFS v4 server, just a DELEG READ entry. If you open the file
for writing and then take a shared lock, you do get a POSIX lock
visible on the NFS server. The shared lock still works (ie, it
prevents an exclusive lock from being acquired), so something is
going on.</p>
<p>Unlike the situation with NFS v3 locks, where <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/NFSServerLockClients">you have to dig
into the kernel data structures to find the client who owns a lock</a>, it appears that the NFS v4 server directly
exposes this information in files under <a href="https://man7.org/linux/man-pages/man7/nfsd.7.html">/proc/fs/nfsd</a>. Based on casual
inspection, 'clients/<id>/states' appears to contain information
on delegations and locks from that client, while 'clients/<id>/info'
identifies the client. Actual locks in the '<code>states</code>' file are
'type: lock', as opposed to the other types (which may appear in
quantity, due to delegations).</p>
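<p>(A sketch of poking at this by hand; you probably need to be root
to read these files:)</p>
<blockquote><pre style="white-space: pre-wrap;">
# which NFS v4 clients the server currently knows about
$ sudo cat /proc/fs/nfsd/clients/*/info
# just the actual locks, as opposed to opens and delegations
$ sudo grep 'type: lock' /proc/fs/nfsd/clients/*/states
</pre>
</blockquote>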
<p>(A number of states show up in the 'states' file and I don't know
enough about NFS v4 right now to understand them, or how the states
change as you do various things.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ProcLocksNotesIII?showcomments#comments">One comment</a>.) </div>Notes on Linux's <code>/proc/locks</code> and NFS v4 locks as of Ubuntu 22.042024-02-26T21:43:53Z2023-07-11T02:42:35Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/NftablesUbuntu2204Experiencecks<div class="wikitext"><p>A while back I wrote about <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/NFTablesInoffensive">how I'd now used nftables on a new
machine and it was okay</a>. This came about
because Ubuntu 22.04's default setup is that the 'iptables' command
is actually a frontend for <a href="https://wiki.nftables.org/wiki-nftables/index.php/Main_Page">nftables</a>, and
when I noticed that I decided that I might as well write nftables
rules directly for this. Today <a href="https://mastodon.social/@cks/110675784421138391">I had cause to remember this</a>, and also to
reflect on our other uses of nftables on Ubuntu 22.04. These other
uses came about because we have various machines (such as <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSFileserverSetupIII">our
fileservers</a>) that use firewall rules that
are set up with the 'iptables' command, and in some cases also
removed by it. Since in 22.04 the iptables command is actually using
nftables, that means those machines silently started using nftables
when we upgraded them to 22.04.</p>
<p>The good news is that everything just worked. Until I was thinking
about it today, it didn't even strike me that these various machines
were now using nftables; absolutely nothing changed that we'd
noticed. All of our setup and management scripts kept working as-is,
and the actual rules kept working. Our 'iptables' rules include
both straight firewall access control rules and some NAT rewriting
rules (on different machines); some of the firewall rules use ipsets,
a few use <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/LinuxIpFwmarkMasks">firewall marks and masks</a> along
with sub-chains, and others are applied only temporarily and then
deleted later. This doesn't cover all of the various iptables command
line options and rules, but it's a reasonable large amount of what
I'd expect to use under normal circumstances.</p>
<p>However, this splits our experience into two separate and distinct
buckets. On the one hand, we've directly used nftables with a static
configuration written down in /etc/nftables.conf. On the other hand,
we've indirectly used nftables through the iptables command with
dynamic configurations. We haven't tried to do dynamic things
directly with the <a href="https://www.netfilter.org/projects/nftables/manpage.html">'nft'</a> command,
or to mix a static initial configuration from /etc/nftables.conf
with later dynamic modifications from either 'nft' or 'iptables',
so I have no idea how well either would work. Although since the
22.04 'iptables' command is just a compatibility layer over nftables,
you can clearly do dynamic rule modifications with nftables in
general.</p>
<p>My current view is that if I was to write rules for some system
from scratch in an environment like Ubuntu 22.04, I would directly
use nftables and /etc/nftables.conf for a static configuration that
I expected to reload if I ever changed things. However, if I had a
dynamic configuration where I had to add and delete rules on the
fly, I would stick with using the 'iptables' command (and its syntax
and handling of rules, sub-chains, and so on) rather than try to
master using <a href="https://www.netfilter.org/projects/nftables/manpage.html">'nft'</a> for this. I'm sure that someday I'll need
to learn dynamic use of 'nft', but not today.</p>
<p>(In theory we have some completely static firewall rules created
through 'iptables', so we could run the iptables commands, use '<code>nft
list ruleset</code>' to dump the nftables translation, create an
/etc/nftables.conf from that dump, and switch over to setting up
the rules natively through nftables. In practice we're not going
to do this for already-installed machines, and we may not remember
to do this even when we next have to rebuild them under a new Ubuntu
version.)</p>
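<p>(The mechanical part of that conversion would be something like the
following sketch, done after the iptables rules have been loaded as
usual; the last step assumes the standard nftables.service unit that
loads /etc/nftables.conf at boot:)</p>
<blockquote><pre style="white-space: pre-wrap;">
# dump the nftables translation of the currently loaded rules
$ sudo nft list ruleset > /tmp/rules.nft
# inspect the dump, then make it the static boot time configuration
$ sudo cp /tmp/rules.nft /etc/nftables.conf
$ sudo systemctl enable nftables.service
</pre>
</blockquote>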
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/NftablesUbuntu2204Experience?showcomments#comments">One comment</a>.) </div>Our experience with nftables and 'iptables' on Ubuntu 22.042024-02-26T21:43:53Z2023-07-08T03:33:54Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/NFSv4BasicsJustWorkcks<div class="wikitext"><p>I've been saying grumpy things about NFS v4 for a fairly long time
now, and in response for a while people have been telling me that
these days NFS v4 can look basically just like NFS v3. You can have
your traditional Unix permissions model (the <a href="https://utcc.utoronto.ca/~cks/space/blog/tech/NFSVsNFSWithKerberos">NFS without Kerberos
one</a>) and you don't have to reorganize
your exports and so on. Recently I decided to give it a try on some
scratch virtual machines running our standard Ubuntu 22.04 LTS
setup, and to my pleasant surprise it does seem to just work.</p>
<p>To test, I installed Ubuntu's NFS server package, made a scratch
directory in the same place we'd use for <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSFileserverSetupIII">a real ZFS filesystem
on a fileserver</a> (which is not under /exports),
put in exactly the same export options and permissions in
/etc/exports.d/<file>.exports (including 'sec=sys'), and NFS mounted
it on a test NFS client. Then I used it on the client as both a
regular user and as 'root', testing with root squashing on (our
normal setup) and off (used for some filesystems). All of this
worked, with none of the various glitches that have happened to us
in the past when we tried this sort of thing.</p>
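<p>(As a concrete illustration, with made-up hostnames, paths, and file
names standing in for our real ones, the basic shape of the server
side export entry and the client side mount is something like:)</p>
<blockquote><pre style="white-space: pre-wrap;">
# server: an entry in, say, /etc/exports.d/scratch.exports
/h/999   nfsclient.example.org(rw,sec=sys,no_subtree_check)
# then tell the NFS server to re-read its exports
$ sudo exportfs -ra

# client: mount it, letting the system pick the NFS version
$ sudo mount -t nfs nfsserver.example.org:/h/999 /mnt
</pre>
</blockquote>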
<p>Part of the reason it worked this transparently is that the client
and the server both had our standard /etc/resolv.conf and had their
hostnames in a standard format (and have fully qualified domain
names in the same subdomain). My understanding is that this matters
because for 'sec=sys', NFS v4 clients and servers need to agree on
a <em>NFS v4 domain name</em> to insure that login 'fred' on the client
is the same as login 'fred' on the server. This 'domain name' can
be set explicitly in <a href="https://linux.die.net/man/5/idmapd.conf">idmapd.conf(5)</a>, but if you don't do this
it's derived from the DNS domain names of the hosts involved. In a
production deployment, we'd probably want to set this specifically
in idmapd.conf just to avoid problems.</p>
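<p>(Pinning it down is a one-line change in idmapd.conf's '[General]'
section on both the clients and the server, something like this,
with a made-up domain:)</p>
<blockquote><pre style="white-space: pre-wrap;">
[General]
Domain = cs.example.org
</pre>
</blockquote>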
<p>I suspect that there are other traps in actual use. One thing I've
already noticed is that the kernel client code doesn't appear to
log any messages if a NFS v4 server stops responding, unlike with
NFS v3. These messages are useful for us for tracking NFS server
problems and seeing when they start to go away. Possibly there are
other signals we can tap into.</p>
<p>My interest is because NFS v4 seems to be better regarded in
general and especially for file locking (which is integrated
into the protocol in NFS v4 but is a separate thing in NFS v3).
My impression is that the Linux kernel NFS people would rather
you use NFS v4, and so NFS v4 is likely to get more bugs fixed
and so on in the future. (Possibly this is incorrect.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/NFSv4BasicsJustWork?showcomments#comments">2 comments</a>.) </div>Basic NFS v4 seems to just work (so far) on Ubuntu 22.042024-02-26T21:43:53Z2023-07-07T02:43:35Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/PackageBuildingTwoIsolationLevelscks<div class="wikitext"><p>Today I once again had to rebuild an Ubuntu package from source,
and <a href="https://mastodon.social/@cks/110533814706500891">once again it didn't go well</a>. This gives me a
good opening to talk about the two sorts of build isolation you
want when building or re-building packages for your Linux distribution.</p>
<p>The first sort of isolation is <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/DebianSourcePackageProblemsII">isolation of the binary build area
from the package source area</a>, which
the Debian package format doesn't have; <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/DebianSourcePackageBuildExplosion">the lack of this isolation
can easily cause explosions</a>.
Without this isolation, repeatedly building the package is dangerous
all by itself; a second build may fail outright or be quietly
contaminated by artifacts and changes from the previous build. The
Debian package build process at least checks for this and will abort
under the right circumstances, saving you from potential problems.
By contrast, the RPM build process normally separates these into
an area for the package source and a separate area where the package
is built, with the build area recreated from scratch every time.</p>
<p>(At the time I set up my RPM configuration, the default RPM setup
of package source wasn't ideal because it could comingle components
of all packages together. These defaults may have changed since
then.)</p>
<p>The second sort of isolation is isolation of the entire build process
from your regular user environment and your system's particular set
of installed packages (or packages that aren't installed). This is
sometimes called a <em>hermetic build environment</em> (or hermetic builds),
because the build is completely sealed away from the outside. Without
hermetic builds, your environment variables, the state of your $HOME
and any dotfiles or other configuration in it, the versions of
things on your particular $PATH, and so on may all influence the
package you build, for better or worse. A hermetic build environment
provides consistency and often makes it easier to re-do or reproduce
your (re)build later.</p>
<p>(As a side effect, hermetic builds force packages to relatively
accurately describe their build time dependencies and requirements,
because otherwise the dependency probably won't be there. I say
'probably' because sometimes a build dependency that you didn't
explicitly specify can be helpfully pulled in indirectly by a build
dependency that you did require.)</p>
<p>Neither RPM nor Debian packages provide hermetic builds out of the
box. For RPMs, <a href="https://linux.die.net/man/1/mock">mock</a> provides
an all-in-one solution that's generally very easy to use. Debian
has <a href="https://wiki.debian.org/sbuild">the sbuild collection of tools</a>
(<a href="https://wiki.debian.org/Packaging/sbuild">also</a>, <a href="https://manpages.debian.org/unstable/sbuild/sbuild.1.en.html">sbuild(1)</a>)
that, based on my reading, provide the tools you need to do this
(I only recently found out about sbuild and haven't tried to use
it). If there is a convenient mock-like front end to sbuild and its
other tools, I haven't spotted it in Internet searches so far.
Ubuntu does have a <a href="https://packaging.ubuntu.com/html/setting-up-sbuild.html">Setting up sbuild</a>
document that makes it look fairly straightforward.</p>
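<p>(For the RPM side, the basic mock usage is short enough to show;
the chroot configuration name and the source RPM here are just
examples:)</p>
<blockquote><pre style="white-space: pre-wrap;">
# rebuild a source RPM in a freshly created, throwaway build root
$ mock -r fedora-38-x86_64 --rebuild some-package-1.0-1.fc38.src.rpm
</pre>
</blockquote>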
<p>The ideal situation is that hermetic isolation is as fast and
convenient to use as the simpler source versus build area isolation,
so you can use it all the time. Otherwise, if you have both it's
not uncommon to first develop your change by repeatedly building
the package the fast way, and then do the final, for-real build
with the slower hermetic isolation.</p>
<p>(When working with RPMs, I've been known to not even build and
install the binary RPMs; instead I'll have <a href="https://man7.org/linux/man-pages/man8/rpmbuild.8.html">rpmbuild</a> stop after
compiling everything, and then I run the compiled binaries out of
the build area. This doesn't work for everything, but it can be
quite convenient.)</p>
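<p>(The usual rpmbuild option for stopping after the build stage is
'-bc'; with a default setup this leaves the compiled tree under
~/rpmbuild/BUILD. The spec file name here is just an example:)</p>
<blockquote><pre style="white-space: pre-wrap;">
$ rpmbuild -bc some-package.spec
</pre>
</blockquote>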
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/PackageBuildingTwoIsolationLevels?showcomments#comments">3 comments</a>.) </div>There are two levels of isolation when building Linux packages2024-02-26T21:43:53Z2023-06-13T02:47:20Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/ZFSOnLinuxRisksWithNFScks<div class="wikitext"><p>I've written in the past about <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/MaildirNotGoodWithNFS">how we've had problems with Maildir
format mail storage</a> and <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/LocalVarMailImprovement">how
making /var/mail local to our IMAP server was a significant improvement</a>. A common element to both of
these issues that <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSFileserverSetupIII">our NFS fileservers use (Open)ZFS On Linux</a>. Over time, I've come to feel that this
represents a potential risk factor for our environment.</p>
<p>Now, on the one hand we might have seen these issues with NFS
regardless of the underlying filesystem on the fileservers, even
with a well supported filesystem like ext4. On the other hand, we've
seen other NFS performance oddities with our NFS fileservers, and
ZFS is an unusual 'filesystem' that may interact with NFS IO in odd
ways. Unlike many filesystems, ZFS has large scale structures that
are used to aggregate IO (in the form of ZFS pools) and it doesn't
really present this in any way that's legible to the rest of the
kernel. And my guess is that NFS serving with ZoL is
less common than other uses of ZoL (partly because NFS is getting
less common in general).</p>
<p>With a conventional NFS server filesystem stack, such as ext4 on LVM
on software RAID, everything is in the kernel and you can ask kernel
people for help, report issues you see, and so on. If something is
going wrong that creates sub-par performance, the kernel people
will probably want to fix it. But (Open)ZFS On Linux is outside the
kernel, so Linux kernel people have little reason to particularly
help out and ZoL people may not have the capabilities to dig into
the kernel NFS and disk IO stacks to understand what's going on
(it's a bit out of scope), and even if a problem can be identified
there may not be any good fix. One reason for this is that the
actual code of ZFS On Linux is also mostly Solaris/Illumos code,
which creates a mismatch between the kernel and ZFS (one of the
areas where this is still quite visible is <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSOnLinuxARCTargetSizeChanges">memory issues with
ZFS's ARC</a>).</p>
<p>This sort of thing is probably not a big risk for most people. ZFS
On Linux is highly likely to always be functional and quite likely
to always perform well in ordinary circumstances, since the first
is absolutely necessary and the second is quite popular. Our issues
are performance issues under what appears to be significant load.
Most people don't push their systems that hard (I don't on my
desktops, where I use ZFS On Linux without particular performance
issues).</p>
<p>Even with this risk, ZFS On Linux is more than worth it for us in
<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSFileserverSetupIII">our environment</a>. We get various sorts of
benefits from using ZFS that would be hard to replicate with any
other setup, and the performance we get is good enough. Everything
is a tradeoff. But the risk is something I want to be honest about.
It's also something I want to keep in mind if we see performance
oddities in the future, or are planning something that needs high
IO performance.</p>
</div>
The potential risks of using (Open)ZFS On Linux with at least NFS2024-02-26T21:43:53Z2023-06-12T02:11:47Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/KernelArgvFixupcks<div class="wikitext"><p>Suppose, not entirely hypothetically, that your Linux kernel has
logged a kernel message to the effect of:</p>
<blockquote><pre style="white-space: pre-wrap;">
process 'syscall.test' launched '/dev/null' with NULL argv: empty string added
</pre>
</blockquote>
<p>(This one was triggered by building Go from source.)</p>
<p>In a conventional call to <a href="https://man7.org/linux/man-pages/man2/execve.2.html">execve(2)</a>, the <code>argv</code>
argument is a pointer to an array that will become the executed
program's <code>argv</code>, with the array terminated with a NULL element (in
the grand C fashion, there is no explicit 'length' parameter passed).
The first (0th) element of this array is the nominal name of the
program and the remainder are the command line arguments. Since all
programs have some name, this array is normally at least one element
long. However, the execve(2) interface (plus C) allows for two
additional variations on the value of argv here.</p>
<p>First, you can pass in a zero-length argv (ie, where argv[0] is
NULL), in which case the executed program will have an argv[0] that
is NULL. A variety of programs will then be unhappy with you, as
people discovered in <a href="https://blog.qualys.com/vulnerabilities-threat-research/2022/01/25/pwnkit-local-privilege-escalation-vulnerability-discovered-in-polkits-pkexec-cve-2021-4034">CVE-2021-4034</a>.
This option exists more or less <a href="https://utcc.utoronto.ca/~cks/space/blog/unix/Argv0IsEasy">because this API was easy back
in the old days of Unix</a>. Second, you can
pass in a NULL <code>argv</code> argument to execve(2). This has the same net
effect (the exec'd program has no arguments and no argv[0] name for
itself), but it's probably even more unexpected since you can't
even dereference argv to check argv[0]. Probably any number of
programs will fault at this point (although they have a chance if
they check argc first, since argc will be 0 here).</p>
<p>What this Linux kernel message is saying is that the kernel detected
an execve(2) with either a NULL argv or a zero-length argv, and
it's changing the situation by adding an empty string as argv[0].
The specific change dates to early 2022, in <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=dcd46d897adb70d63e025f175a00a89797d31a43">exec: Force single
empty string when argv is empty</a>,
and was first included (in the mainline) in kernel 5.18. The commit
message has a long and informative discussion, and in fact this is
a reaction to <a href="https://blog.qualys.com/vulnerabilities-threat-research/2022/01/25/pwnkit-local-privilege-escalation-vulnerability-discovered-in-polkits-pkexec-cve-2021-4034">CVE-2021-4034</a>.</p>
<p>This particular message is produced only once per kernel boot, so
you're probably not going to see it very often. Since I build Go
from source regularly, this is reassuring.</p>
<p>(Although the message talks about 'with NULL argv', it really means
that there are no arguments; you get the same message if you call
execve(2) with a zero-length argv array as if you call it with a
genuinely NULL argv.)</p>
</div>
The Linux kernel will fix some peculiar argv usage in execve(2)2024-02-26T21:43:53Z2023-06-06T02:48:09Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/GNUGrepVersusEcologycks<div class="wikitext"><p>One of the changes in GNU Grep 3.8 was, to quote <a href="https://lists.gnu.org/archive/html/info-gnu/2022-09/msg00001.html">this release
notice</a>
(also the <a href="https://savannah.gnu.org/news/?id=10191">GNU Grep 3.8 release NEWS</a>):</p>
<blockquote><p>The egrep and fgrep commands, which have been deprecated since release
2.5.3 (2007), now warn that they are obsolescent and should be
replaced by grep -E and grep -F.</p>
</blockquote>
<p>GNU Grep's <code>fgrep</code> and <code>egrep</code> commands were already shell scripts
that ran '<code>grep -F</code>' or '<code>grep -E</code>', so this change amounted to
adding an <code>echo</code> to them (to standard error). Many Linux distributions
immediately reverted this change (for example, Debian), but Fedora
did not and so Fedora 38 eventually shipped with Grep 3.8. Fedora
38 also shipped with any number of open source packages that contain
installed scripts that use '<code>fgrep</code>' and '<code>egrep</code>' (<a href="https://bugzilla.redhat.com/show_bug.cgi?id=2188430#c4">cf what I
found on my machine</a>), and
likely more of its packages use those commands in their build
scripts.</p>
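<p>(The cover scripts are tiny. Before this change, fgrep was more or
less the following shell script; 3.8 adds an echo of the obsolescence
warning to standard error before the exec:)</p>
<blockquote><pre style="white-space: pre-wrap;">
#!/bin/sh
# roughly what GNU Grep's fgrep wrapper amounts to
exec grep -F "$@"
</pre>
</blockquote>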
<p>(There are <a href="https://bugs.gentoo.org/show_bug.cgi?id=868384">reports of build failures in Gentoo</a> (<a href="https://bugzilla.redhat.com/show_bug.cgi?id=2188430">via</a>).)</p>
<p>Since <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/FatalWarnings">adding warnings and other new messages is a breaking API
change</a>, all of these packages are now
broken in Fedora and by extension any other Linux distribution that
packages them, uses GNU Grep 3.8 or later, and hasn't reverted this
change. Some of them are only minorly broken; others, either
inspecting their standard error or operating in a context where
other programs expect to see and not see some things, are more
seriously affected. To repair this breakage, all of these packages
need to be changed to use '<code>grep -F</code>' and '<code>grep -E</code>' instead of
<code>fgrep</code> and <code>egrep</code>.</p>
<p><strong>This change is pointless make-work inflicted on the broad open
source ecology by GNU Grep</strong>. GNU Grep's decision to cause <a href="https://utcc.utoronto.ca/~cks/space/blog/unix/EgrepFgrepStuckWith">these
long-standing commands</a> to emit new
messages requires everyone else to go through making changes in
order to return to the status quo. This is exactly the same kind
of make work as other pointless API changes, and just like them
it's hostile to the broad open source ecology.</p>
<p>(It's also hostile to actual people, but <a href="https://mastodon.social/@cks/110232679198344609">that's another topic</a>.)</p>
<p>You may be tempted to say 'but it's a small change'. There are two
answers. First, a small change multiplied by a large number of open
source projects is a lot of work overall. Second, that this is a
make-work change at all is GNU Grep deciding that other projects
don't matter that much. This decision is hostile to the wider open
source ecology as a matter of principle. It's especially hostile
given that any number of open source projects are at best dormant,
although still perfectly functional, and thus not likely to make
any changes, and other open source projects will likely tell GNU
Grep to get bent and not change (after all, even Linux distributions
are rejecting this GNU Grep change).</p>
<p>Due to how Linux distribution packaging generally works, it would
actually have been less harmful for the overall Linux distribution
ecology if GNU Grep had simply dropped their '<code>fgrep</code>' and '<code>egrep</code>'
cover scripts. If they had done so, Linux distributions would most
likely have shipped their own cover scripts (without warnings) as
additional packages; instead, GNU Grep has forced Linux distributions
to patch GNU Grep itself.</p>
<p>PS: While GNU Grep is in theory not Linux specific, in practice
only Linux uses GNU Grep. Other open source Unixes have their own
versions of the grep suite, and this GNU Grep change isn't going
to encourage them to switch.</p>
<p>(<a href="https://mastodon.social/@cks/110232377928840323">I had a string of Fediverse reactions to this change when I
upgraded to Fedora 38 on my work machine</a>. Also, when
GNU Grep released 3.8 last fall I wrote about <a href="https://utcc.utoronto.ca/~cks/space/blog/unix/EgrepFgrepStuckWith">how we're stuck
with egrep and fgrep</a>.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/GNUGrepVersusEcology?showcomments#comments">3 comments</a>.) </div>GNU Grep versus the (Linux) open source ecology2024-02-26T21:43:53Z2023-06-03T02:23:24Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/BpftraceStashingDatacks<div class="wikitext"><p>When using <a href="https://bpftrace.org/">bpftrace</a>, it's pretty common
that not all of the data you want to report on is available in one
spot, at least when you have to trace kernel functions instead of
tracepoints. When this comes up, there is a common pattern that you
can use to temporarily capture the data for later use. To summarize
this pattern, it's to save the information in an <a href="https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md#3--associative-arrays">associative array</a>
that's indexed by the thread id to create <a href="https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md#22-per-thread">a per-thread variable</a>.
If you have more than one piece of information to save, you use
more than one associative array.</p>
<p>Let's start with the simplest case; let's suppose that you need
both a function's argument (available when it's entered) and its
return value (so you can report only on successful functions).
Then the pattern looks like this:</p>
<blockquote><pre style="white-space: pre-wrap;">
kprobe:afunction
{
// record argument into @arg0
// under our thread id (tid)
@arg0[tid] = (struct something *)arg0;
}
// only act if we have the argument
// recorded
kretprobe:afunction
/@arg0[tid] != 0/
{
$arg = @arg0[tid];
printf(...., $arg); // or whatever
// clean up recorded argument
delete(@arg0[tid]);
}
</pre>
</blockquote>
<p>This example shows all of the common pieces. At the start, we capture
the function argument we care about into an <a href="https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md#3--associative-arrays">associative array</a>
that's indexed by the current thread ID (using <a href="https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md#1-builtins">the <code>tid</code> builtin
variable</a>),
then, provided that we have a recorded argument we use it when the
function returns. At the end, we clean up our associative array by
deleting our entry from it; if we didn't do this, we might have an
ever-growing associative array (or arrays) as different threads
called the function we're tracing. Incidentally, one time we might
invoke the kretprobe probe without the argument recorded is if we
start tracing while an existing invocation of the function is in
flight (which may be especially likely for functions that take a
while, such as handling a NFS request and reply).</p>
<p>(This pattern is so common it's mentioned in the documentation as
<a href="https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md#22-per-thread">a per-thread variable</a>. Note that the documentation's example
<code>delete()</code>s the per-thread entry just as I do here.)</p>
<p>The reason we didn't use a simple <a href="https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md#21-global">global variable</a>,
as I did when I was recording ZFS's idea of available memory (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/BpftraceGrabbingData">in
another bpftrace trick</a>), is that multiple
threads may be calling this function at the same time, and if they
are, using a single global variable is obviously going to give us
bad results.</p>
<p>Another case that often comes up is that the function we want to
trace directly or indirectly calls another function that looks up
important information, for example to map some opaque identifier
into a more useful piece of data (a string, a structure) and return
it. A variant of this is where the function will generate the
information we want through a process that we can't hook into, but
will then call another function to validate it or act on it, at which
point we can grab the data. The full version of this pattern looks
something like this:</p>
<blockquote><pre style="white-space: pre-wrap;">
// set a marker so we know to save info
kprobe:afunction
{
@aflag[tid] = 1;
}
// if we're marked, save the information
kprobe:subfunction
/@aflag[tid] != 0/
{
@magicarg[tid] = arg0;
}
// if we have saved information, use it
// and clear it
kretprobe:afunction
/@magicarg[tid] != 0/
{
.... do whatever ...
delete(@magicarg[tid]);
}
// clear the marker
kretprobe:afunction
/@aflag[tid] != 0/
{
delete(@aflag[tid]);
}
</pre>
</blockquote>
<p>One reason we need to set a marker and only save the subfunction's
information if we're marked is that the marker is our guarantee
that the saved information will be cleared later. If we unconditionally
saved the information when subfunction() was called but only cleared
it when subfunction() was called by afunction(), that would lead
to a slow growth of dead <code>@magicarg</code> entries if subfunction() is
ever called from anywhere else.</p>
<p>A variant on this is if our 'subfunction' is actually a peer function
to our function of interest (and gets called before it), with both
being called from a containing function. The pattern here is more
elaborate; the containing function sets the marker and must clean
up everything, with the subfunction and our function saving and
using the information.</p>
<h3>Sidebar: Tracking currently active requests/etc in bpftrace</h3>
<p>In DTrace, the traditional way to keep a running count of something
(such as how many threads were active inside <code>afunction()</code>) was to
use a map with a fixed key that was incremented with <code>sum(1)</code> and
decremented with <code>sum(-1)</code> (see <a href="https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md#map-functions">map functions</a>),
with the decrement generally guarded so that you knew a matching
increment had been done. Although I haven't tested it, the bpftrace
documentation on <a href="https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md#10--and----increment-operators">the ++ and -- operators</a>
seems to imply that these are safe to use on at least maps with
keys (including constant keys), and perhaps global variables in
general. Even if you have to use maps, this is at least clearer
than the <code>sum()</code> version.</p>
<p>(You'll want to guard the decrement even if you use --.)</p>
</div>
Capturing data you need later when using bpftrace2024-02-26T21:43:53Z2023-06-02T03:06:34Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/BpftraceGrabbingDatacks<div class="wikitext"><p>When I talked about <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/DrgnVersusEBPFTools">drgn versus bpftrace</a>,
I mentioned that one issue with bpftrace is that it doesn't have
much access to global variables in the kernel (and things that they
point to); at the moment it seems that bpftrace can only access
(some) global variables in the main kernel, and not global variables
in modules. However, often the information you may want to get is
in module global variables, for example <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/NFSServerLockClients">the NFS locks that the
kernel NFS server is tracking</a> or <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSOnLinuxARCTargetSizeChanges">important
state variables for changes in the ZFS ARC target size</a>. When you want to get at these,
you need to resort to a number of tricks, which all boil down to
one idea: <strong>you find a place where what you want to know is exposed
as a function argument or a function return value</strong>, because bpftrace
has access to both of those.</p>
<p>(All of this means that you're going to need to read the kernel
source, specifically the kernel source for the version of the kernel
you're using, since the internal kernel structure changes over time.)</p>
<p>If you're really lucky, a function or kernel tracepoint that you
already want to track will be passed the information you're interested
in. This is unfortunately relatively rare, probably because there's
usually no point in passing in an argument that's already available
as a global variable.</p>
<p>Sometimes, you'll be able to find something that is called once on
each item in a complex global data structure, which will let you
indirectly see that global data structure. This was the case with
<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/NFSServerLockClients">bpftrace dumping of NFS lock clients</a>, which
also illustrates that you may need to do something to trigger this
traversal (here, reading from /proc/locks). In general, files in
/proc often have a kernel function that produces one line of output
at a time and is given as an argument the thing that line reports on.</p>
<p>Some kernel code is generalized by calling a function to obtain
information that's effectively from a global variable (or something
close to it). For example, ZFS on Linux has an idea of 'memory
available to ZFS' that's a critical input to decisions on the ZFS
ARC size, and this number is obtained by calling the function
'<code>arc_available_memory()</code>'. If we want to know this value in
other functions (for example, the ZFS functions that decide about
shrinking the ARC target size), we can capture the information
for later use:</p>
<blockquote><pre style="white-space: pre-wrap;">
kretprobe:arc_available_memory
{
$rv = (int64) retval;
@arc_available_memory = $rv;
}
</pre>
</blockquote>
<p>Here I'm capturing this information in a global bpftrace value,
because it truly is a global piece of information. ZFS may call
this function in many contexts, not just when thinking about
shrinking the ARC target size, but all we care about is having
it available later so the extra times we'll update our bpftrace
global generally don't matter.</p>
<p>There are two unfortunate limitations of this approach, due to how
the kernel is structured. First, some of what look like function
calls in the kernel source code are actually #define'd macros in
the kernel header files; you obviously can't hook into these with
bpftrace. Second, some functions are inlined into their callers,
often because they've specifically been marked as 'always inline'.
These functions can't be traced either, which can be a pity because
they're often exactly the sort of access functions that'd give us
useful information.</p>
<p>(There are some general bpftrace techniques for picking up
information that you want, but they're for another entry.)</p>
<p>PS: I believe that bpftrace can access CPU registers (and thus the
stack) and can insert tracepoints inside functions, not just at
their start. In theory with enough work this would allow you to get
access to any value ever explicitly materialized at some point in
a function (either in a register or in a local on the stack). In
practice, this would be at best a desperation move; you'd have to
disassemble code in your specific kernel to determine instruction
offsets and other critical information in order to pull this off.</p>
<p>PPS: In theory with sufficient work you might be able to get access
to module global variables in bpftrace. Their addresses are in
/proc/kallsyms and I think you might be able to insert that address
into a bpftrace script, then cast it to the relevant (pointer) type
and dereference it. But this is untested and again I wouldn't want
to do this in anything real.</p>
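<p>(As a sketch of the first half of that idea, here's a little Python
that looks up a symbol's address in /proc/kallsyms, which you could then
paste into a bpftrace script by hand. The '<code>arc_sys_free</code>' symbol name
is only a placeholder; also note that without enough privileges the
kernel may report all addresses as zero.)</p>
<blockquote><pre style="white-space: pre-wrap;">
#!/usr/bin/env python3
# Untested-in-anger sketch: look up a (module) symbol's address in
# /proc/kallsyms. 'arc_sys_free' is only a placeholder symbol name.
# Run as root; otherwise kptr_restrict may give you all-zero addresses.
import sys

symbol = sys.argv[1] if len(sys.argv) > 1 else "arc_sys_free"
with open("/proc/kallsyms") as f:
    for line in f:
        addr, _symtype, rest = line.split(None, 2)
        name = rest.split()[0]   # the name may be followed by '[module]'
        if name == symbol:
            print("0x" + addr)
            break
    else:
        sys.exit("symbol not found: " + symbol)
</pre>
</blockquote>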
</div>
Some tricks for getting the data you need when using bpftrace2024-02-26T21:43:53Z2023-05-31T01:59:49Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/ModernDesktopEnvironmentscks<div class="wikitext"><p>Recently I read <a href="https://blog.nicco.love/kde-plasma-is-not-a-desktop-environment/">KDE Plasma is NOT a Desktop Environment</a> (<a href="https://lobste.rs/s/odpb2i/kde_plasma_is_not_desktop_environment">via</a>),
which maintains that it's more like an environment construction
kit, out of which one could build multiple environments. I have
some reactions to this, and also I have some opinions on what a
desktop environment even is on a modern Linux system (opinions
which may count as a bit heretical).</p>
<p>The <a href="https://utcc.utoronto.ca/~cks/space/blog/unix/DesktopsAlwaysThere">classical Unix vision of a desktop environment</a> is that it's basically a window manager
and a suite of graphical applications built around a common look
and feel, usually using a common GUI library/toolkit. These GUI
applications will usually include a file manager and often include
various other productivity applications. Although you sort of have
this in GNOME and KDE, this is not really what a desktop environment
needs to do today on Linux.</p>
<p>On modern Linux, a usable graphical experience has a lot of moving
parts, many of which the person using it expects to manage through
a GUI. It needs things like an audio system, a system to handle
removable media, widgets to log out, lock the screen, and reboot
the system, integration with network management, a central preferences
management system that applies to all of 'its' applications and
really wants to ripple through to applications using other toolkits,
and the ability to handle things like additional screens showing
up or people wanting to change the screen resolution (which you
need to auto-detect). As it happens, there are relatively well
defined systems to handle many of these jobs (and more), and often
relatively well defined means of talking to them through <a href="https://en.wikipedia.org/wiki/D-Bus">D-Bus</a>.</p>
<p>(For instance, the modern Linux audio experience is mostly based on
<a href="https://pipewire.org/">PipeWire</a>, at least at the moment.)</p>
<p>A modern desktop environment is something that supplies and integrates
all of those pieces and moving parts to provide an experience where
everything 'just works', where audio plays when you want it to and
you have an on-screen volume slider, where you can click on a widget
to control your VPN (or get the ability to configure a new one),
and so on. It probably comes with some applications of its own to,
for example, handle its preferences system and things like window
management keyboard shortcuts, but many applications that would
previously have been considered part of the desktop environment are
outsourced now. Almost everyone is going to use LibreOffice and
either Firefox or Chrome, for example, and there is broadly no need
to reimplement things like a terminal emulator (although a desktop
can if it wants to).</p>
<p>You can of course build such a desktop environment yourself, with
sufficient work. There are window managers, taskbars, status bars,
applets, launchers, things to parse .desktop files to create nice
launcher menus, and so on and so forth, and you can assemble them
into a working configuration. But there is an exhaustingly large
amount of work (and it keeps churning), so at a certain point most
people give up doing it themselves, <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/CustomLaptopEnvironmentIII">as I did when I started using
Cinnamon on my laptop</a>. <strong>A modern Linux
desktop environment is a system integrator</strong>; it collects all of
the pieces and connects them up so that you don't have to learn how
to do it yourself (and then find or write programs that do the
work).</p>
<p>For historical reasons, the two largest such integrators (GNOME and
KDE) come with their own GUI look and feel, implemented by their
own toolkits, and a variety of core and third party applications
that use their toolkits and thus their look and feel. But this is
not essential. <a href="https://en.wikipedia.org/wiki/Cinnamon_(desktop_environment)">Cinnamon</a>
reuses a lot of GNOME pieces, while <a href="https://www.xfce.org/">XFCE</a>
has a relatively modest <a href="https://www.xfce.org/projects">set of applications</a>, and although it has its own toolkit,
I don't think that toolkit is widely used by third party programs. But XFCE
is still a full scale modern desktop environment, because it does
all of that hard integration work for you, and you can just use it.</p>
<p>(As far as I know no one has attempted to write down in one place
(or maintain a set of links to) everything that you need to support,
connect together, run as part of your session, send D-Bus messages
to, listen to D-Bus messages from, and so on. Even if someone managed
that heroic feat, keeping it up to date would be an ongoing job,
never mind trying to suggest programs and configurations to implement
it all.)</p>
</div>
What a desktop environment is on modern Linux2024-02-26T21:43:53Z2023-05-19T02:44:18Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/KernelIntegersToTextThoughtcks<div class="wikitext"><p>Over on the Fediverse, <a href="https://mastodon.social/@cks/110345883523920791">I said something recently</a>:</p>
<blockquote><p>I sometimes think about all the CPU cycles that are used on Linux
machines to have the kernel convert integers to text for /proc and
/sys files and then your metrics system convert the text back to
integers. (And then sometimes convert the integers back to text when
it sends them to the metrics server, which is at least a different
machine using CPU cycles to turn text back into integers (or floats).)</p>
<p>It's accidents of history all the way down.</p>
</blockquote>
<p>We run <a href="https://github.com/prometheus/node_exporter">the Prometheus host agent</a> on all of our Linux
machines. Every fifteen seconds <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusGrafanaSetup-2019">our Prometheus server</a> pulls metrics from all
the host agents, which causes the host agent to read a bunch of
/proc files (for things like memory and CPU state information) and
/sys files (for things like hwmon information). These status files
are text, but they contain a lot of numbers, which means that the
kernel converted those integers into text for us. The host agent
then converts that text back into numbers internally (I believe a
mixture of 64-bit integers and 64-bit floats), only to turn around
and send them to the <a href="https://prometheus.io/">Prometheus</a> server
as text again (see <a href="https://prometheus.io/docs/instrumenting/exposition_formats/">Exposition Formats</a>,
<a href="https://github.com/OpenObservability/OpenMetrics/blob/main/legacy/markdown/protobuf_vs_text.md">also</a>).
On the Prometheus server these text numbers will be turned back
into floats. All of this takes CPU cycles, although perhaps not
many CPU cycles on modern machines.</p>
<p>(The host agent gets some information from the Linux kernel through
methods like <a href="https://man7.org/linux/man-pages/man7/netlink.7.html">netlink</a>, which I
believe transfers numbers in non-text form.)</p>
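<p>(To make the userspace half of this dance concrete, here's a minimal
Python illustration of the parse-the-text-back step. It's purely a
sketch and not how the Prometheus host agent actually does its parsing.)</p>
<blockquote><pre style="white-space: pre-wrap;">
#!/usr/bin/env python3
# Minimal sketch of turning the kernel's text back into numbers.
# This is an illustration, not the host agent's real parsing code.
meminfo = {}
with open("/proc/meminfo") as f:
    for line in f:
        name, rest = line.split(":", 1)
        fields = rest.split()
        value = int(fields[0])          # the kernel printed this as text
        if len(fields) > 1 and fields[1] == "kB":
            value *= 1024
        meminfo[name] = value

print(meminfo["MemTotal"], meminfo["MemAvailable"])
</pre>
</blockquote>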
<p>All of the steps of this dance are rational ones. Things in /proc
and /sys use text instead of some binary encoding because text is
a universal solvent on Unix systems, and that way no one had to
define a binary file format (or worse, try to get agreement on a
general binary system stats kernel to userspace API). Text formats
are usually easily augmented, upgraded, inspected, and so on, and
they are easy to provide (the kernel actually has a lot of
infrastructure for easily providing text in /proc files; <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/NFSServerLockClients">we saw
some of it in action recently</a>).</p>
<p>(These factors are especially visible in the case of some of the
statistics that <a href="https://zfsonlinux.org/">OpenZFS on Linux</a> exposes.
ZFS comes from Solaris, which has a native binary <a href="https://illumos.org/man/3KSTAT/kstat">'kstat'</a> system. ZoL exposes all of
these kstats in /proc/spl/kstat/zfs as text, rather than trying to get
Linux people to somehow consume them as binary kstats. <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSOnLinuxGettingPoolIostats">Other ZFS IO
statistics</a> are exposed in an entirely
different and more binary form.)</p>
<p>Changing the situation would require a lot of work by a lot of
people spread across a lot of projects, so it's unlikely to be done.
If it is ever done, it will probably be done piecemeal, maybe through
more and more kernel subsystems exposing information through
<a href="https://man7.org/linux/man-pages/man7/netlink.7.html">netlink</a> as well as /proc (perhaps exposing new metrics only
through netlink, with their /proc information frozen). But even
netlink is probably more work for kernel developers than putting
things in /proc, so I suspect that a lot of things will keep being
in /proc.</p>
<p>(In addition, lots of things in /proc aren't just pairs of names
and numbers, although that's the common case. Consider <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ProcLocksNotesII">/proc/locks</a>.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/KernelIntegersToTextThought?showcomments#comments">5 comments</a>.) </div>The time our Linux systems spend on integer to text and back conversions2024-02-26T21:43:53Z2023-05-16T02:19:01Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/DrgnVersusEBPFToolscks<div class="wikitext"><p>I talked recently about <a href="https://drgn.readthedocs.io/en/latest/index.html">drgn</a> and <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/DrgnKernelPokingPraise">using it
to poke around in the kernel</a>, and yesterday
I followed that up with <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/NFSServerLockClients">an example of finding out which NFS client
owns a file lock</a> that used bpftrace (and
also I discussed using <a href="https://drgn.readthedocs.io/en/latest/index.html">drgn</a> for this). As an outsider, you might
reasonably wonder when you'd use one and when you'd use the other
on the kernel. I won't claim that I have a complete answer, but
here's what I know so far.</p>
<p>(Both bpftrace and drgn can do things with user programs too, but
I haven't tried either for this.)</p>
<p>The simple version is that bpftrace is for doing things when events
happen in the kernel and drgn is for pulling information out of
kernel variables and data structures. Bpftrace has a crossover
ability to pull some information out of some data structures (that's
part of what makes it so useful), but often it's much more limited
than drgn.</p>
<p>Bpftrace will let you 'trace' kernel events, including events like
function calls, and do various things when they happen, such as
extracting information from arguments to the events (including
function arguments, <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/NFSServerLockClients">as we saw with the NFS locks example</a>). However, bpftrace has only limited support
for pretty-printing things, limited access to kernel global variables
(today it appears unable to access many module globals), and can't
do much with kernel data structures like linked lists or per-cpu
variables. Bpftrace will work out of the box on almost any modern
Linux kernel in its stock setup; at most you'll need the kernel
headers.</p>
<p>One painful example of a bpftrace limitation: many interesting
kernel data structures contain a 'struct path' that can be used to
give you the full path to the object involved, such as a file that's
locked, a file being accessed over NFS, or a NFS mount point.
Bpftrace generally has very limited ability to traverse these path
data structures to turn them into the actual path, while drgn has
<a href="https://drgn.readthedocs.io/en/latest/helpers.html#drgn.helpers.linux.fs.d_path">a simple helper for it</a>.</p>
<p>(One reason for this limitation is that the kernel won't allow eBPF
bytecode to have unpredictable, potentially unbounded runtime.)</p>
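<p>(To show what I mean by a simple helper, here's a tiny interactive
drgn sketch. It assumes you've already gotten your hands on some
'<code>struct file *</code>' Object, here called '<code>f</code>'; I believe drgn's
<code>d_path()</code> helper will take the file's <code>.f_path</code> directly.)</p>
<blockquote><pre style="white-space: pre-wrap;">
>>> from drgn.helpers.linux.fs import d_path
>>> # 'f' is assumed to be a 'struct file *' Object obtained earlier
>>> d_path(f.f_path)
</pre>
</blockquote>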
<p>So, for a non-hypothetical example, if you want to get a top-like
view of NFS server activity broken down by user or client, you need
bpftrace (see the very impressive <a href="https://github.com/FrauBSD/nfsdtop">nfsdtop</a>), even though some aspects are
rather awkward, because you need to 'trace' NFS requests.</p>
<p>Drgn is great for pretty-printing kernel data structures and
extracting relatively arbitrary information from them, both for
interactive exploration and to be automated in programs. However,
the data you're interested in mostly needs to be reachable from
some kernel global variable, and <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/NFSServerLockClients">figuring out how to get from
some global variable to the data you want can be an adventure</a>. In addition, drgn requires per-kernel setup
on any machine you want to use it on, because it requires kernel
debugging information that most distributions don't install by
default.</p>
<p>If both bpftrace and drgn can reach the kernel data structures
you're interested in, drgn in interactive mode is generally going
to be much more convenient for exploring them. It has much better
pretty-printing support, it will readily tell you about all of the
types involved, and its interactive mode is much faster than
repeatedly modifying and re-starting bpftrace programs to print a
few more things.</p>
<p>However, if you want to inspect short-lived objects, for example
ones that are only passed around as function arguments and are
deallocated when the operation is over, you need bpftrace. A short
lived, dynamically allocated object is beyond drgn's feasible reach.
As an example, if you want to snoop into the data structures that
NFS servers use to represent requests from NFS clients while the
requests are being processed, you're going to need bpftrace.</p>
<p>(If you have a hybrid situation where there is a long lived data
structure that isn't reachable from global variables, I suppose you
could get bpftrace to print its address as exposed during a function
call, then immediately turn to drgn to start dumping memory.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/DrgnVersusEBPFTools?showcomments#comments">3 comments</a>.) </div>When to use drgn instead of eBPF tools like bpftrace, and vice versa2024-02-26T21:43:53Z2023-05-09T03:14:43Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/NFSServerLockClientscks<div class="wikitext"><p>Suppose that you have <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSFileserverSetupIII">some Linux NFS servers</a>,
which have some <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ProcLocksNotesII">NFS locks</a>, and you'd like to
know which NFS client owns which lock. Since <a href="https://utcc.utoronto.ca/~cks/space/blog/unix/NFSv3LockRecovery">the NFS server can
drop a client's locks when it reboots</a>,
this information is in the kernel data structures, but it's not
exposed through public interfaces like /proc/locks. <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/DrgnKernelPokingPraise">As I mentioned
yesterday while talking about <code>drgn</code></a>, I've
worked out how to do this, so <a href="https://mastodon.social/@cks/110323200412866718">in case someone's looking for this
information</a>, here
are the details. This is as of Ubuntu 22.04, but I believe this
code is relatively stable (although where things are in the header
files has changed since 22.04's kernel).</p>
<p>In the rest of this I'll be making lots of references to kernel
data structures implemented as C structs in <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/fs.h">include/linux/fs.h</a>,
<a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/lockd/lockd.h">include/linux/lockd/lockd.h</a>,
and <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/filelock.h">include/linux/filelock.h</a>.
To start with, I'll introduce our cast of characters, which is to say
various sorts of kernel structures.</p>
<ul><li>'<code>struct nlm_host</code>' represents a NFS client (on an NFS server), or
more generally a NLM peer. It contains the identifying information
we want in various fields, and so our ultimate goal is to associate
(NFS) file locks with nlm_hosts. I believe that a given
nlm_host can be connected to multiple locks, since a NFS
client can have many locks on the server.</li>
<li>'<code>struct nlm_lockowner</code>' seems to represent the 'owner' of a lock.
It's only interesting to us because it contains a reference to
the nlm_host associated with the lock, in '<code>.host</code>'.<p>
</li>
<li>'<code>struct lock_manager_operations</code>' is a set of function pointers
for lock manager operations. There is a specific instance of this,
'<code>nlmsvc_lock_operations</code>', which is used for all lockd/NLM locks.<p>
</li>
<li>'<code>struct file_lock</code>' represents a generic "file lock", POSIX or
otherwise. It contains a '<code>.fl_lmops</code>' field that points to a
lock_manager_operations, a '<code>.fl_pid</code>' field of the nominal
PID that owns the lock, a '<code>.fl_file</code>' that points to the
'<code>struct file</code>' that this lock is for, and a special '<code>.fl_owner</code>'
field that holds a '<code>void *</code>' pointer to lock manager specific
data. For lockd/NLM locks, this is a pointer to the associated
'<code>struct nlm_lockowner</code>' for the lock, from which we can get
the nlm_host and the information we want.<p>
All lockd/NLM locks will have a '<code>.fl_lmops</code>' field that
points to '<code>nlmsvc_lock_operations</code>' and a '<code>.fl_pid</code>'
that has lockd's PID.<p>
(The POSIX versus flock versus whatever type of a lock is
not in '<code>.fl_type</code>' but is instead encoded as set bits
in '<code>.fl_flags</code>'. Conveniently, all NFS client locks are
POSIX locks so we don't have to care about this.)<p>
</li>
<li>'<code>struct inode</code>' represents a generic, in-kernel inode. It
contains an '<code>.i_sb</code>' pointer to its 'superblock' (really
its mount), its '<code>.i_ino</code>' inode number, and '<code>.i_flctx</code>',
which is a pointer to '<code>struct file_lock_context</code>', which
holds context for all of the locks associated with this inode;
'<code>.i_flctx->flc_posix</code>' is the list of POSIX locks associated
with this inode (there's also eg '<code>.flc_flock</code>' for flock locks).</li>
<li>'<code>struct file</code>' represents an open file in the kernel, including
files 'opened' by lockd/NLM in order to get locks on them for NFS
clients. It contains a '<code>.f_inode</code>' that points to the file's
associated '<code>struct inode</code>', among other fields.
If you want filename information about a struct file,
you also want to look at '<code>.f_path</code>', which points to the file's
'<code>struct path</code>'; see <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/path.h">include/linux/path.h</a>
and <a href="https://drgn.readthedocs.io/en/latest/helpers.html#drgn.helpers.linux.fs.d_path">drgn's '<code>d_path()</code>' helper</a>.<p>
</li>
<li>'<code>struct nlm_file</code>' is the lockd/NLM representation of a file
held open by lockd/NLM in order to get a lock on it, and for
obvious reasons has a pointer to the corresponding '<code>struct
file</code>'. For reasons I don't understand, this is actually stored
in a two-element array, '<code>.f_file[2]</code>'; which element is used
depends on whether the file was 'opened' for reading or writing.</li>
</ul>
<p>There are two paths into determining what NFS client holds what
(NFS) lock, the simple and the more involved. In the simple path,
we can start by traversing all generic kernel locks somehow, which
is to say we start with '<code>struct file_lock</code>'. For each one, we
check that '<code>.fl_lmops</code>' is '<code>nlmsvc_lock_operations</code>' or that
'<code>.fl_pid</code>' is lockd's PID, then cast '<code>.fl_owner</code>' to a '<code>struct
nlm_lockowner *</code>', dereference it and use its '<code>.host</code>' to reach
the '<code>struct nlm_host</code>'.</p>
<p>One way to do this is to use bpftrace to hook into
'<code>lock_get_status()</code>' in <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/locks.c">fs/locks.c</a>,
which is called repeatedly to print each line of /proc/locks and
is passed a '<code>struct file_lock *</code>' as its second argument (this
also conveniently iterates all current file locks for you). We also
have the <code>struct file</code> and thus the <code>struct inode</code>, which will
give us identifying information about the file (the major and minor
device numbers and its inode, which is <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ProcLocksNotes">the same information in
/proc/locks</a>). The '<code>struct nlm_host</code>' has several
fields of interest, including what seems to be the pre-formatted
IP address in <code>.h_addrbuf</code> and the client's name for itself in
<code>.h_name</code>.</p>
<p>So here's some bpftrace (not fully tested and you'll need to provide
the lockd PID yourself, and also maybe include some header files):</p>
<blockquote><pre style="white-space: pre-wrap;">
kprobe:lock_get_status
/((struct file_lock *)arg1)->fl_pid == <your lockd PID>/
{
$fl = (struct file_lock *)arg1;
$nlo = (struct nlm_lockowner *)$fl->fl_owner;
$ino = $fl->fl_file->f_inode;
$dev = $ino->i_sb->s_dev;
printf("%d: %02x:%02x:%ld inode %ld owned by %s ('%s')\n",
(int64)arg2,
$dev >> 20, $dev & 0xfffff, $ino->i_ino,
$ino->i_ino,
str($nlo->host->h_addrbuf),
str($nlo->host->h_name));
}
</pre>
</blockquote>
<p>(Now that I look at this a second time, you also want to look at
the fifth argument, arg4 (an int32), because if it's non-zero I
believe this is a pending lock, not a granted one. You may want to
either skip them or print them differently.)</p>
<p>This will print the same indexes and (I believe) the same
major:minor:inode information as /proc/locks, but add the NFS client
information. To trigger it you must read /proc/locks, either directly
or by using lslocks.</p>
<p>Another way is to use <a href="https://drgn.readthedocs.io/en/latest/index.html"><code>drgn</code></a> to go through
the global list of file locks, which is a per-cpu kernel hlist under
the general name '<code>file_lock_list</code>'. In interactive drgn, it
appears that you traverse these lists as follows:</p>
<blockquote><pre style="white-space: pre-wrap;">
for i in for_each_present_cpu(prog):
    fll_cpu = per_cpu(prog['file_lock_list'], i)
    for flock in hlist_for_each_entry('struct file_lock', fll_cpu.hlist, 'fl_link'):
        [do whatever you want with flock]
</pre>
</blockquote>
<p>I'm not quite sure if you want present CPUs, online CPUs, or possible
CPUs. Probably you don't have locks for CPUs that aren't online.</p>
<p>The second path in is that the NFS NLM code maintains a global data
structure of all '<code>struct nlm_file</code>' objects, in '<code>nlm_files</code>',
which is an array of hlists, per <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/lockd/svcsubs.c">fs/lockd/svcsubs.c</a>.
Starting with these '<code>nlm_file</code>' structs, we can reach the generic
file structs, then each file's inode, then the inode's lock context,
and finally the POSIX locks in that lock context (since we know
that all NFS locks are POSIX locks). This gives us a series of
'<code>file_lock</code>' structs, which puts us at the starting point above.</p>
<p>(The lock context '<code>.flc_posix</code>' is a plain list, not a hlist,
and they're chained together with the '<code>.fl_list</code>' field in
file_lock. Probably most inodes with NFS locks will have only
a single POSIX lock on them.)</p>
<p>So we have more or less:</p>
<blockquote><p>walk <code>nlm_files</code> to get a series of <code>struct nlm_file</code> → get one
<code>.f_file</code> <br>
→ <code>.f_inode</code> → <code>.i_flctx</code> → walk
<code>.flc_posix</code> to get a series of <code>struct file_lock</code>
(probably you usually get only one) <br>
→ check that <code>.fl_lmops</code>
is <code>nlmsvc_lock_operations</code> to know you have an NFS lock,
and then follow <code>.fl_owner</code> casting it as
a <code>struct nlm_lockowner *</code> <br>
→ .host → { <code>.h_addrbuf</code>,
<code>.h_name</code>, and anything else you want from <code>struct nlm_host</code> }</p>
</blockquote>
<p>If this doesn't make sense, sorry. I don't know a better way to
represent data structure traversal in something like plain text.</p>
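<p>(To make the chain a bit more concrete, here's an untested interactive
drgn sketch of the same traversal. All of the field and symbol names are
the ones covered above; treat this as an illustration of the path rather
than a polished program.)</p>
<blockquote><pre style="white-space: pre-wrap;">
from drgn import cast
from drgn.helpers.linux.list import hlist_for_each_entry, list_for_each_entry

nlmops = prog['nlmsvc_lock_operations'].address_of_()
for head in prog['nlm_files']:
    for nf in hlist_for_each_entry('struct nlm_file', head.address_of_(), 'f_list'):
        # .f_file is a two-element array (read vs write opens);
        # use whichever entry is actually set
        f = nf.f_file[0] if nf.f_file[0] else nf.f_file[1]
        flctx = f.f_inode.i_flctx
        if not flctx:
            continue
        for fl in list_for_each_entry('struct file_lock', flctx.flc_posix.address_of_(), 'fl_list'):
            if fl.fl_lmops != nlmops:
                continue        # not a lockd/NLM lock
            host = cast('struct nlm_lockowner *', fl.fl_owner).host
            print(int(f.f_inode.i_ino),
                  host.h_addrbuf.string_().decode(),
                  host.h_name.string_().decode())
</pre>
</blockquote>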
<p>(Also, having written this I've realized that you might need to
make sure you visit each given inode only once. In theory multiple
generic file objects can all point to the same inode, and so
repeatedly visit its list of locks. I'm not sure this can happen
with NFS locks; the lockd/NLM system may reuse nlm_file entries
across multiple clients getting shared locks on the same file.)</p>
<p>Since starting from <code>nlm_files</code> requires several walks of list-like
structures that will generate multiple entries and starting from a
<code>struct file_lock</code> doesn't, you can see why I called the latter
the simpler case. Now that I've found the '<code>file_lock_list</code>'
global and learned how to traverse it in drgn in the course of
writing this entry, I don't think I'll use the '<code>nlm_files</code>'
approach in the future; it's strictly a historical curiosity of
how I did it the first time around. And starting from the global
file lock list guarantees you're reporting on each file lock only
once.</p>
<p>(I was hoping to be able to spot a more direct path through the
<a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/lockd">fs/lockd</a>
code, but the path I outlined above really seems to be how lockd
does it. See, for example, '<code>nlm_traverse_locks()</code>' in
<a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/lockd/svcsubs.c">fs/lockd/svcsubs.c</a>, which starts with a '<code>struct nlm_file *</code>'
and does the process I outlined above.)</p>
</div>
Finding which NFS client owns a lock on a NFS server via Linux kernel delving2024-02-26T21:43:53Z2023-05-07T02:29:58Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/DrgnKernelPokingPraisecks<div class="wikitext"><p>I've been keeping my eyes on <a href="https://drgn.readthedocs.io/en/latest/index.html">drgn</a> (<a href="https://github.com/osandov/drgn">repository</a>, <a href="https://lwn.net/Articles/789641/">2019 LWN article</a>) for some time, because it held
promise for being a better way to poke around your Linux kernel
than the venerable <a href="https://man7.org/linux/man-pages/man8/crash.8.html">crash(8)</a> program (which
I've actually used in anger, and it was a lot of work). Today, for
the first time, I got around to using drgn and the experience was
broadly positive.</p>
<p>I used drgn on an Ubuntu 22.04 <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSFileserverSetupIII">test NFS server</a>,
by creating a Python 3 venv, installing drgn into the venv, and
then running it from there (after installing the necessary kernel
debugging information from Ubuntu); this worked fine and 'drgn'
gave me a nice interactive Python environment where with minimal
knowledge of drgn itself I could poke around the kernel. Specifically,
<a href="https://mastodon.social/@cks/110316823642495251">I could poke into the various data structures maintained by the
kernel NFS NLM system</a>,
with the goal of being able to see which NFS client owned each NFS
lock on the server (or in this case, a lock, since it was a test
server and I established only a single lock to it for simplicity).</p>
<p>Drgn in interactive mode works quite well for this sort of exploration
for a number of reasons. To start with it does a remarkably good
job of pretty-printing structures (and arrays) with type and content
information of all of the fields. Simply being able to see the
contents of various things (and type information for pointers) led
me to make some useful discoveries. However, sometimes you'll be
confronted with things like this:</p>
<blockquote><pre style="white-space: pre-wrap;">
>>> prog['nlm_files']
(struct hlist_head [128]){
[...]
{
.first = (struct hlist_node *)0xffff8974099ae600,
},
</pre>
</blockquote>
<p>This is a message from drgn to you that you're going to be reading
some kernel source code and kernel headers in order to figure out
your next step. The good news is that drgn supports all of the
kernel's normal ways of traversing these sorts of data structures,
in a way that's very similar to the kernel's own code for it, to
the point where an outsider like me can translate back and forth.
For instance, if you have kernel code that looks like:</p>
<blockquote><pre style="white-space: pre-wrap;">
hlist_for_each_entry_safe(file, next, &nlm_files[i], f_list) {
</pre>
</blockquote>
<p>Then the drgn equivalent you want is this (hard-coding the index by
experimentation, because this is exploration):</p>
<blockquote><pre style="white-space: pre-wrap;">
>>> r = list( hlist_for_each_entry('struct nlm_file', prog['nlm_files'][6].address_of_(), 'f_list') )
>>> r
[Object(prog, 'struct nlm_file *', value=0xffff8974099ae600)]
</pre>
</blockquote>
<p>(We use <code>list()</code> for the usual Python reason that drgn's helper
function returns a Python generator, and we want to poke at the
actual results in a simple way. Also, technically these helpers are in
drgn.helpers.linux, which you may want to import explicitly so that
you can read their help text. Or see <a href="https://drgn.readthedocs.io/en/latest/user_guide.html">the user guide</a> and <a href="https://drgn.readthedocs.io/en/latest/helpers.html">the
section on helpers</a>.)</p>
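<p>(Concretely, reading the help text might look something like this in
the interactive session; the module path here is the one the current
drgn documentation gives for the list helpers.)</p>
<blockquote><pre style="white-space: pre-wrap;">
>>> import drgn.helpers.linux.list as dlist
>>> help(dlist.hlist_for_each_entry)
</pre>
</blockquote>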
<p>You'll also need to read kernel source code and kernel headers in
order to <a href="https://mastodon.social/@cks/110317019840335032">dig your way through the kernel data structures to what
you want</a>. Drgn
won't (and can't) tell you how NLM data structures are linked
together and how you can go from, for example, the global '<code>nlm_files</code>'
to the '<code>struct nlm_host</code>' that tells you the NFS client that got
a particular lock. The path can be quite convoluted (<a href="https://mastodon.social/@cks/110317495214509757">cf</a>).</p>
<p>The good news is that if the kernel can do it, drgn probably can
do it too, although it may take you quite a bit of digging and
persistence to get there. The further good news is that if you can
do it in drgn's interactive mode, even painfully and with many
mis-steps, you can probably turn your worked out process into Python
code that uses drgn. Although <a href="https://mastodon.social/@cks/110317899106692511">I (temporarily) turned to other
tools for now</a>,
being able to explore and test ideas with drgn was essential to
getting there. Now that I've used drgn for this, I'll likely to be
turning to it for similar explorations and information extraction
in the future.</p>
<p>In addition to needing to know Python and be able to read kernel
code and headers, drgn's other drawback is that you need kernel
debugging information, and on most Linuxes these days that's not
installed by default. Installing it may be a bit annoying and it's
generally rather big; <a href="https://drgn.readthedocs.io/en/latest/getting_debugging_symbols.html">drgn's documentation has a guide</a>.
This means that drgn doesn't work out of the box the way tools like
bpftrace do.</p>
<p>(It would be great if drgn could use the kernel's <a href="https://www.kernel.org/doc/html/latest/bpf/btf.html">BPF Type Format
(BTF)</a>
information, which bpftrace and other eBPF tools already use, but
apparently there are various obstacles. I believe that drgn is
tracking this in <a href="https://github.com/osandov/drgn/issues/176">DWARFless Debugging #176</a>.)</p>
</div>
Some early praise for using drgn for poking into Linux kernel internals2024-02-26T21:43:53Z2023-05-06T03:31:20Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/FlockFcntlAndNFScks<div class="wikitext"><p>Unix broadly and Linux specifically has long had three functions
that can do file locks, <a href="https://man7.org/linux/man-pages/man2/flock.2.html">flock()</a>, <a href="https://man7.org/linux/man-pages/man2/fcntl.2.html">fcntl()</a>, and <a href="https://man7.org/linux/man-pages/man3/lockf.3.html">lockf()</a>. The latter
two are collectively known as 'POSIX' file locks because they appear
in the POSIX specification (and on Linux lockf() is just a layer
over fcntl()), while flock() is a separate thing with somewhat
different semantics (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/FlockFcntlChange">cf</a>), as it originated in
BSD Unix. In <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ProcLocksNotesII">/proc/locks</a>, flock() locks are
type 'FLOCK' and fcntl()/lockf() locks are type 'POSIX', and you
can see both on a local system.</p>
<p>(In one of those amusing things, in Ubuntu 22.04 crond takes a
flock() lock on /run/crond.pid while atd takes a POSIX lock on
/run/atd.pid.)</p>
<p>Because they're different types of locks, you can normally obtain
both an exclusive flock() lock and an exclusive fcntl() POSIX lock
on the same file. As a result of this, some programs adopted the
habit of normally obtaining both sorts of locks, just to cover their
bases for interacting with other unknown programs who might lock
the file.</p>
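<p>(You can see the 'both locks at once' behavior for yourself with
Python's standard fcntl module; this little sketch, run on a local
filesystem, will cheerfully hold both 'exclusive' locks at the same
time. The /tmp/locktest file is just a scratch file for illustration.)</p>
<blockquote><pre style="white-space: pre-wrap;">
#!/usr/bin/env python3
# Demonstration that an exclusive flock() lock and an exclusive POSIX
# (fcntl) lock on the same file don't conflict with each other.
# /tmp/locktest is just a scratch file used for illustration.
import fcntl

f1 = open("/tmp/locktest", "w")
f2 = open("/tmp/locktest", "w")

fcntl.flock(f1, fcntl.LOCK_EX | fcntl.LOCK_NB)   # BSD flock() lock
fcntl.lockf(f2, fcntl.LOCK_EX | fcntl.LOCK_NB)   # POSIX fcntl()-based lock
print("got both 'exclusive' locks on the same file")
</pre>
</blockquote>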
<p>In the beginning on Linux (before 2005), flock() locks didn't work
at all over NFS; they were strictly local to the current
machine, so two programs on two different machines could obtain
'exclusive' flock locks on the same file. Then 2.6.12's NFS client
code was modified to accept flock() locks and silently change them
into POSIX locks (that did work over NFS, in NFS v3 through the NLM
protocol). <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/NFSSambaLockingII">This caused heartburn for programs and setups that
were obtaining both sorts of (exclusive) locks on the same file</a>, because obviously two POSIX locks conflict
with each other and your NFS server will not let you have conflicting
locks like that. This change is effectively invisible to the NFS
client's kernel, so flock() locks on a NFS mounted filesystem will
show up in the client's /proc/locks (and <a href="https://man7.org/linux/man-pages/man8/lslocks.8.html">lslocks</a>) as type
'FLOCK'. However, on your NFS server all locks from NFS clients are
listed as type 'POSIX' in /proc/locks (and <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ProcLocksNotesII">these days they're all
'owned' by lockd</a>), because that is what they
are.</p>
<p>(One reason for this is that the NFS v3 NLM protocol doesn't have
an idea of different types of locks, apart from exclusive or
non-exclusive.)</p>
<p>Unfortunately, this change creates another surprising situation,
which is that <strong>the NFS server and a NFS client can both obtain an
exclusive flock() lock on the same file</strong>. Two NFS clients trying
to exclusively flock() the same file will conflict with each other
and only one will succeed, but the NFS server and an NFS client
won't, and both will 'win' the lock (and everyone loses). This is
the inevitable but surprising consequence of client side flock()
locks being changed to POSIX locks on the NFS server, and POSIX
locks not conflicting with flock() locks. From the NFS server's
perspective, it's not two flock() exclusive locks on a file; it's
one exclusive POSIX lock (from a NFS client) and one exclusive local
flock() lock, and that's nominally fine.</p>
<p>In my opinion, this makes using flock() locking dangerous in general,
which is unfortunate since <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/FlockUsageNotes">the flock command uses flock() and
it's pretty much your best bet for locking in shell scripts</a> (see also <a href="https://man7.org/linux/man-pages/man1/flock.1.html">flock(1)</a>). Flock() is
only safe as a potentially cross-machine locking mechanism if you
can be confident that your NFS server will never be doing anything
except serving files via NFS. If things may be running locally on
the NFS server, for example <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/LocalVarMailImprovement">because you moved a very active NFS
filesystem to the primary machine that uses it</a>, then flock() becomes dangerous.</p>
<p>It also means that if you have a lock testing program, <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/TestTheObvious">as I do</a>, you should make it default to either
fcntl() or lockf() locks, whichever you find easier, rather than
flock() locks. <a href="https://man7.org/linux/man-pages/man2/flock.2.html">Flock()</a> has the easiest
API out of the three locking functions, but it may give you results
that are between misleading and wrong if you're trying to use it
in <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/NFSServerBreakingLocks">a situation where you want to check locking behavior between
a NFS server and a NFS client</a>, as I did
recently.</p>
<p>(Per <a href="https://man7.org/linux/man-pages/man5/nfs.5.html">nfs(5)</a>,
you can use the <code>local_lock</code> mount option to make flock() locks
purely local again on NFS v3 clients, but this doesn't exactly solve
the problem.)</p>
<p>PS: Given the server flock() issue, I kind of wish there was a generic
mount option to change flock() locks to POSIX locks, so that you could
force this to happen to NFS exported filesystems even on your NFS
fileserver. That would at least make the behavior the same on clients
and the server.</p>
<p>(This elaborates on <a href="https://mastodon.social/@cks/110307942997483067">a learning experience I mentioned on the
Fediverse</a>.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/FlockFcntlAndNFS?showcomments#comments">4 comments</a>.) </div>Flock() and fcntl() file locks and Linux NFS (v3)2024-02-26T21:43:53Z2023-05-05T03:13:16Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/NFSServerBreakingLockscks<div class="wikitext"><p>As I discovered <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ProcLocksNotes">when I first explored /proc/locks</a>,
the Linux NFS server supports two special files in /proc/fs/nfsd
that will get it to break some of the locks it holds, '<code>unlock_ip</code>'
and '<code>unlock_filesystem</code>' (at least in theory). These files aren't
currently documented in <a href="https://man7.org/linux/man-pages/man7/nfsd.7.html">nfsd(7)</a>; the references
for them are <a href="https://www.spinics.net/lists/linux-nfs/msg57054.html">this 2016 linux-nfs message and thread</a> and <a href="https://people.redhat.com/rpeterso/Patches/NFS/NLM/004.txt">this
Red Hat document on them</a>. These
appear to have originally been intended for failover situations,
and one sign of this is that their names in <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/nfsd/nfsctl.c">fs/nfsd/nfsctl.c</a>
are '<code>NFSD_FO_UnlockIP</code>' and '<code>NFSD_FO_UnlockFS</code>'.</p>
<p>Each file is used by writing something to it. For '<code>unlock_filesystem</code>'
this is straightforward:</p>
<blockquote><pre style="white-space: pre-wrap;">
# echo /h/281 >unlock_filesystem
</pre>
</blockquote>
<p>When you do this, all of the NFS locks on that filesystem are
immediately dropped by the NFS server. Any NFS clients who think
they held locks aren't told about this; as far as they know they
have the lock. NFS clients that were waiting to get a lock (because
the file was already locked) seem to eventually be granted their
lock. Because existing lock holders get no notification, this is
only a safe operation to do if you're confident that there are no
real locks on the filesystem on any NFS clients, and any locks you
see on the NFS server are <a href="https://utcc.utoronto.ca/~cks/space/blog/unix/NFSLocksStuckWorkaround">stuck NFS locks</a>, where the NFS server thinks some
NFS client has the file locked, but the NFS client disagrees.</p>
<p>We've tested doing this on our Ubuntu 22.04 fileservers (both in
production and in a testing environment) and it appears to work and
not have any unexpected side effects. It turns out that contrary
to what I thought in <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ProcLocksNotesII">my /proc/locks update for 22.04</a>,
the Ubuntu 22.04 lslocks can still mis-parse /proc/locks under some
circumstances; this is util-linux <a href="https://github.com/util-linux/util-linux/issues/1633">issue #1633</a>, which will
only be fixed in v2.39 when it gets released. Until then, build
from source or bug your distribution to pull in a fix.</p>
<p>(I had forgotten I'd filed <a href="https://github.com/util-linux/util-linux/issues/1633">issue #1633</a> last year and it had
gotten fixed back then and only re-discovered it while writing this
entry.)</p>
<p>I was going to write a number of things about '<code>unlock_ip</code>', but
it turns out that all I can write about this file is that I can't
get it to do anything. The kernel source code is in conflict about
whether the IP address you write is supposed to be a client IP
address (comments in <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/nfsd/nfsctl.c">fs/nfsd/nfsctl.c</a>) or the server's IP address
as seen by clients (comments in <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/lockd/svcsubs.c">fs/lockd/svcsubs.c</a>);
the <a href="https://people.redhat.com/rpeterso/Patches/NFS/NLM/004.txt">Red Hat page</a> talks
about failover in a way that suggests the file was originally intended
for failover and to be given a (failover) server IP address. And
in practice on our testing <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSFileserverSetupIII">Ubuntu 22.04 NFS fileserver</a>, writing either IP address to '<code>unlock_ip</code>'
makes no difference in what /proc/locks says about locks (and how
other NFS clients waiting for locks react).</p>
<p>If '<code>unlock_ip</code>' worked, it would behave much the same for releasing
locks as <a href="https://utcc.utoronto.ca/~cks/space/blog/unix/NFSv3LockRecovery">rebooting the NFS client</a>,
but without the whole 'reboot' business. Obviously you'd need to
be very sure that the NFS client didn't actually think it had any
NFS locks on the particular NFS server. Unfortunately Linux has no
easy way to send an artificial 'I have rebooted' notification to a
particular NFS server; however, you can use <a href="https://man7.org/linux/man-pages/man8/sm-notify.8.html">sm-notify(8)</a> on an NFS
client to tell all of the NFS servers that the client talks to that
the client has 'rebooted', which will cause all of them to release
their locks.</p>
<p>(Temporarily shutting down everything on a NFS client that might
try to get a NFS lock may be easier than rebooting it entirely.
Also, with enough contortions you could probably make <a href="https://man7.org/linux/man-pages/man8/sm-notify.8.html">sm-notify(8)</a>
send notifications to only a single fileserver, but it's clearly
not how sm-notify is intended to be used.)</p>
</div>
Forcefully breaking NFS locks on Linux NFS servers as of Ubuntu 22.042024-02-26T21:43:53Z2023-05-04T03:12:24Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/ProcLocksNotesIIcks<div class="wikitext"><p>About a year ago, when we were still running <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSFileserverSetupIII">our NFS fileservers</a> on Ubuntu 18.04, I investigated <code>/proc/locks</code>
a bit (it's documented in the <a href="https://man7.org/linux/man-pages/man5/proc.5.html">proc(5) manual page</a>). Since then
we've upgraded our fileservers to Ubuntu 22.04 (which uses Ubuntu's
'5.15.0' kernel), and there's some things that are a bit different
now, especially on NFS servers.</p>
<p>(Update: oops, I forgot to link to <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ProcLocksNotes">the first entry on /proc/locks</a>.)</p>
<p>On our Ubuntu 22.04 NFS servers, two things are different from how
they were in 18.04. First, <code>/proc/locks</code> appears to be complete
now, in that it shows all current locks held by NFS clients on NFS
exported filesystems. Along with this, the process ID in <code>/proc/locks</code>
for such NFS client locks is now consistently the PID of the kernel
'lockd' thread. This gives you a <code>/proc/locks</code> that looks like this:</p>
<blockquote><pre style="white-space: pre-wrap;">
1: POSIX ADVISORY WRITE 13602 00:4f:2237553 0 EOF
2: POSIX ADVISORY WRITE 13602 00:2e:486322 0 EOF
3: POSIX ADVISORY WRITE 13602 00:2e:485496 0 EOF
4: POSIX ADVISORY WRITE 13602 00:2e:486562 0 EOF
5: POSIX ADVISORY WRITE 13602 00:2e:486315 0 EOF
6: POSIX ADVISORY WRITE 13602 00:2e:541938 0 EOF
7: POSIX ADVISORY WRITE 13602 00:4a:2602201 0 EOF
8: POSIX ADVISORY WRITE 13602 00:2b:7233288 0 EOF
9: POSIX ADVISORY WRITE 13602 00:4a:877382 0 EOF
10: POSIX ADVISORY WRITE 13602 00:4a:877913 0 EOF
11: FLOCK ADVISORY WRITE 9990 00:19:4993 0 EOF
[...]
</pre>
</blockquote>
<p>All of those locks except the last one are NFS locks 'held' by the
lockd thread. If you use <a href="https://man7.org/linux/man-pages/man8/lslocks.8.html">lslocks(8)</a> it shows
'lockd' (and the PID), making it easy to scan for NFS locks. Lslocks
is no more able to find out the actual names of the files than it was
before, because the kernel 'lockd' thread doesn't have them open
and so lslocks can't do its trick of looking in /proc/<pid>/fd for
them.</p>
<p>(Your /proc/locks on a 22.04 NFS server is likely to be bigger than
it was on 18.04, possibly a lot bigger.)</p>
<p>The Ubuntu 22.04 version of lslocks is not modern enough to list
the inode of these locks (which is available in /proc/locks).
However, more recent versions of <a href="https://en.wikipedia.org/wiki/Util-linux">util-linux</a> can; support for
listing the inode number was added in util-linux 2.38, and it's not
that difficult to build your own copy of lslocks on 22.04. The
version I built is willing to use the shared libraries from the
Ubuntu util-linux package, so you can just pull the built binary
out.</p>
<p>(Locally I wrote a cover script that runs our specially built modern
lslocks with '-u -o COMMAND,TYPE,MODE,INODE,PATH', because if we're
looking into NFS locks on a fileserver the other information usually
isn't too useful.)</p>
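<p>(A minimal sketch of such a cover script, with a made-up path for
the locally built lslocks binary:)</p>
<blockquote><pre style="white-space: pre-wrap;">
#!/bin/sh
# run our locally built util-linux 2.38+ lslocks with the options we care about
exec /opt/local/bin/lslocks -u -o COMMAND,TYPE,MODE,INODE,PATH "$@"
</pre>
</blockquote>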
<p>These two changes make it much easier to diagnose or rule out
<a href="https://utcc.utoronto.ca/~cks/space/blog/unix/NFSLocksStuckWorkaround">'stuck' NFS locks</a>, because now
you can reliably see all of the locks that the NFS server does or
doesn't hold, and verify if one of them is for the file that just
can't be successfully locked on your NFS clients. If you have access
to all of the NFS clients that mount a particular filesystem, you can
also check to be sure that none of them have a file locked that the
server lists as locked by lockd.</p>
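<p>(A quick sketch of the server side check: get the inode number of
the suspect file and look for it in /proc/locks. Inode numbers can
repeat across filesystems, so a match is a strong hint rather than
proof; the path here is made up and the inode number is just the
first one from the listing above.)</p>
<blockquote><pre style="white-space: pre-wrap;">
; ls -i /w/435/some/file
; grep -w 2237553 /proc/locks
</pre>
</blockquote>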
<p>(Actually dealing with such a stuck lock is beyond the scope of
this entry. There is <a href="https://utcc.utoronto.ca/~cks/space/blog/unix/NFSLocksStuckWorkaround">a traditional brute force option</a> and some other approaches.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ProcLocksNotesII?showcomments#comments">One comment</a>.) </div>More notes on Linux's <code>/proc/locks</code> and NFS as of Ubuntu 22.042024-02-26T21:43:53Z2023-04-29T01:51:33Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/ZFSOnLinuxSettingARCSizecks<div class="wikitext"><p>In the past <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSOnLinuxNeedsARCControl">I've grumbled about wanting a way to explicitly set
the (target) ARC size</a>. After all of
<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSOnLinuxARCTargetSizeChanges">my recent investigation into how the ARC grows and shrinks</a>, I now believe that this can be
safely done, at least some of the time. However, growing (or in
general resizing) the <a href="https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Workload%20Tuning.html#adaptive-replacement-cache">ZFS ARC</a>
comes with a number of caveats, because it's only going to be
effective some of the time.</p>
<p>The simple and brute force way to grow the ARC target size to a
given number is to briefly and temporarily raise <a href="https://openzfs.github.io/openzfs-docs/man/4/zfs.4.html#zfs_arc_min">zfs_arc_min</a>
to your desired value, which can be done through
/sys/module/zfs/parameters. After having spent some time going
through the ARC code, I'm relatively convinced that this is safe
and won't trigger immediate consequences. You can similarly reduce
the ARC target size by (temporarily) lowering <a href="https://openzfs.github.io/openzfs-docs/man/4/zfs.4.html#zfs_arc_max">zfs_arc_max</a>.
In both cases this has an immediate effect on '<code>c</code>', the ARC target
size; when you set either the maximum or the minimum, the ZFS code
immediately sets '<code>c</code>' if it's necessary.</p>
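<p>(To make this concrete, here's a rough sketch of the brute force
approach for pulling the ARC target size up to 64 GiB; the number is
purely an example, and you want to note your original zfs_arc_min so
you can put it back.)</p>
<blockquote><pre style="white-space: pre-wrap;">
# note the current setting (often 0, meaning 'use the default')
cat /sys/module/zfs/parameters/zfs_arc_min
# briefly raise the minimum; this immediately raises 'c' if it was lower
echo 68719476736 > /sys/module/zfs/parameters/zfs_arc_min
awk '$1 == "c"' /proc/spl/kstat/zfs/arcstats
# then restore the original value
echo 0 > /sys/module/zfs/parameters/zfs_arc_min
</pre>
</blockquote>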
<p>However, raising the ARC target size will only have a meaningful effect
if ZFS can actually use more memory. If the free memory situation is bad
enough that <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSOnLinuxARCMemoryStatistics">memory_available_bytes</a>
is negative, your newly set ARC target size will pretty much immediately
start shrinking, possibly significantly, and the ARC will have no chance
to actually use much more extra memory. If available memory is positive
but not very large, it may turn negative once the ARC's actual size
grows a bit more and then <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSOnLinuxARCTargetSizeChanges">ZFS will shrink your recently-raised ARC
target size back down</a>, along with
probably shrinking the ARC's actual memory use.</p>
<p>Given all of this, there seem to be two good cases to deliberately
raise the ARC target size. The first case is if you've seen an odd
collapse in the ARC target size and you have a lot of free memory.
Here the ARC target size will probably grow on its own, eventually,
but it will likely do that in relatively small increments (such as
128 KiB at a time), while you can yank it right up now. The second
case is if the ARC target size is already quite big but arc_no_grow
is stuck at '1' because <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSOnLinuxARCMemoryStatistics">ZFS wants an extra 1/32nd of your large
target size to be available</a>; this
is probably more likely to be an issue if you've raised <a href="https://openzfs.github.io/openzfs-docs/man/4/zfs.4.html#zfs_arc_max">zfs_arc_max</a>
(as we have on <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSFileserverSetupIII">our fileservers</a>).</p>
<p>(As far as I can tell from looking at the code, arc_no_grow
being 1 doesn't prevent the ARC from allocating extra memory to
grow up to the ARC target size; it just prevents the ARC target
size from growing further.)</p>
<p>In theory you can lock the ARC target size at a specific value by
boxing it in, setting <a href="https://openzfs.github.io/openzfs-docs/man/4/zfs.4.html#zfs_arc_min">zfs_arc_min</a> sufficiently close
to <a href="https://openzfs.github.io/openzfs-docs/man/4/zfs.4.html#zfs_arc_max">zfs_arc_max</a>. While this will keep ZFS from lowering the
target size, it won't keep either ZFS or the general kernel 'shrinker'
memory management feature from frantically trying to reclaim memory
from the ARC if actual available memory isn't big enough. Fighting
the kernel is probably not going to give you great results.</p>
</div>
Setting the ARC target size in ZFS on Linux (as of ZoL 2.1)2024-02-26T21:43:53Z2023-04-21T02:58:04Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/ZFSOnLinuxARCTargetSizeChangescks<div class="wikitext"><p>Previously I discussed <a href="https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSARCItsVariousSizes">the various sizes of the ARC</a>, <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSOnLinuxARCMemoryStatistics">some important ARC memory stats</a>, and <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSOnLinuxARCMemoryReclaimStats">ARC memory reclaim stats</a>. Today I can finally talk about
how the <a href="https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Workload%20Tuning.html#adaptive-replacement-cache">ZFS ARC</a>
target size shrinks, and a bit about how it grows, which is <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSOnLinuxNeedsARCControl">a
subject of significant interest and some frustration</a>. I will be citing ZoL function names because
tools like <a href="https://github.com/iovisor/bpftrace">bpftrace</a> mean you can
hook into them to monitor ARC target size changes.</p>
<p>(Changes in the actual size of the ARC are less interesting than changes
in the ARC target size. Generally the actual size promptly fills up to
the target size if you're doing enough IO, although metadata versus data
balancing can throw a wrench in the works.)</p>
<p>The ARC target size is shrunk by arc_reduce_target_size() (in
<a href="https://github.com/openzfs/zfs/blob/master/module/zfs/arc.c">arc.c</a>),
which takes as its argument the size (in bytes) to reduce arc_c
by and almost always does so (unless you've hit the minimum size).
There are two paths to calling it, through <em>reaping</em>, where ZFS
periodically checks to see if it thinks there's not enough memory
available, and <em>shrinking</em>, where the Linux kernel memory management
system asks ZFS to shrink its memory use.</p>
<p>Reaping is a general ZFS facility where a dedicated kernel thread
wakes up at least once every second to check if memory_available_bytes
is negative. If it is, ZFS sets arc_no_grow, kicks off reclaiming
memory, waits about a second, and then potentially shrinks the ARC
target size by:</p>
<blockquote><pre style="white-space: pre-wrap;">
( (arc_c - arc_c_min) / 128 ) - memory_available_bytes
</pre>
</blockquote>
<p>(The divisor will be different if you've tuned <a href="https://openzfs.github.io/openzfs-docs/man/4/zfs.4.html#zfs_arc_shrink_shift">zfs_arc_shrink_shift</a>.
This is done in arc_reap_cb(), and see also arc_reap_cb_check().)</p>
<p>Because reaping waits a second after starting the reclaim, this
number may not be positive (because the reclaim raised the amount
of available bytes enough); if this has happened, arc_c is left
unchanged. This reaping thread ticks once a second and may also be
immediately woken up by arc_adapt(), which is called when ZFS
is reading a new disk block into memory and which will check to see
if memory_available_bytes is below zero.</p>
<p>My bpftrace-based measurements so far suggest that when reaping
triggers, it normally makes relatively large adjustments in the ARC
target size; I routinely see 300 and 400 MiB reductions even on my
desktops. Since the ARC target size reduction starts out at 1/128th
of the difference between the current ARC target size and the minimum
size, a system with a lot of memory and a large ARC size may
experience very abrupt drops through reaping, especially if you've
raised the maximum ARC size and left the minimum size alone.</p>
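<p>(A sketch of the sort of bpftrace one-liner that can watch this,
assuming arc_reduce_target_size() is visible as a kprobe on your
kernel and that its first argument is the byte count, as it is in
the ZoL 2.1 code:)</p>
<blockquote><pre style="white-space: pre-wrap;">
# print every reduction of the ARC target size as it happens
bpftrace -e 'kprobe:arc_reduce_target_size { printf("%s: arc_c reduced by %lu bytes\n", comm, arg0); }'
</pre>
</blockquote>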
<p>The shrinking path is invoked through the Linux kernel's general
memory management feature of kernel subsystems having 'shrinkers'
that kernel memory management can invoke to reduce the subsystem's
memory usage (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSOnLinuxARCMemoryReclaimStats">this came up in memory reclaim stats</a>). When the kernel's memory
management decides that it wants subsystems to shrink, it will first
call arc_shrinker_count() to see how much memory the ARC can
return and then maybe call arc_shrinker_scan() to actually do
the shrinking. The amount of memory the ARC will claim it can return
is calculated in a complex way (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSOnLinuxARCMemoryReclaimStats">see yesterday's discussion</a>) and is capped at
<a href="https://openzfs.github.io/openzfs-docs/man/4/zfs.4.html#zfs_arc_shrinker_limit">zfs_arc_shrinker_limit</a>
pages (normally 4 KiBytes each). All of this is in <a href="https://github.com/openzfs/zfs/blob/master/module/os/linux/zfs/arc_os.c">arc_os.c</a>.
Shrinking, unlike reaping, always immediately reduces arc_c
by however much the kernel wound up asking it to shrink by.</p>
<p>Although you might expect otherwise, the kernel's memory subsystem
can invoke the ARC shrinker even without any particular sign of
memory pressure, and when it does so it often only asks the ARC to
drop 128 pages (512 KiB) of data instead of the full amount that
the ARC offers. It can also do this in rapid bursts, which obviously
adds up to much more than just 512 KiB of total ARC target size
reduction.</p>
<p>Every time shrinking happens, one or the other of memory_indirect_count
and memory_direct_count is increased. No statistic is increased
when reaping happens, or when reaping leads to the ARC target size being
reduced (which it doesn't always do). If you need that information,
you'll have to instrument things with something like <a href="https://github.com/cloudflare/ebpf_exporter">the EBPF
exporter</a>. Writing
the relevant BCC or bpftrace programs is up to you.</p>
<p>How and when the ARC target size is increased again is harder to
observe, although it's more centralized. The ARC target size is
grown in arc_adapt(), but unfortunately not all of the time;
it's only grown if the current ARC size is within 32 MiBytes of the
target ARC size (and the ARC can grow at all, ie arc_no_grow
is zero and there's no reclaim needed). As of ZoL 2.1, the ARC
target size is grown by however many bytes were being read from
disk, which may be as small as 4 KiB; in the current development
version, that's changed to a minimum of 128 KiB. As mentioned before,
arc_adapt() seems to be called only when ZFS wants to read new
things from disk (with a minor exception for some L2ARC in-RAM
structures).</p>
<p>(That the growth decision is buried away inside the depths of
arc_adapt() makes it hard to monitor even with bpftrace,
<a href="https://mastodon.social/@cks/110217068987764330">especially since arc_c itself isn't accessible to bpftrace</a>.)</p>
<p>One consequence of this is that even if the ARC target size can
grow, it only grows on ARC misses that trigger disk IO. If all of
your requests are being served from the current ARC, ZFS won't
bother growing the target size. This makes sense, but it's potentially
frustrating; I believe it can cause the ARC target size to 'stick'
at alarmingly low levels for a while on a system that still has
high ARC hit rates even with a reduced-size ARC, or that simply has
low IO levels.</p>
<h3>Sidebar: the shrinker call stack bpftrace has observed</h3>
<p>I had bpftrace print call stacks for arc_shrinker_scan(),
and what I got in my testing was:</p>
<blockquote><pre style="white-space: pre-wrap;">
arc_shrinker_scan+1
do_shrink_slab+318
shrink_slab+170
shrink_node+572
balance_pgdat+792
kswapd+496
[...]
</pre>
</blockquote>
<p>I lack the energy to try to decode why the kernel would go down this
particular path and what kernel memory metrics one would look at to
predict it.</p>
</div>
When and how ZFS on Linux changes the ARC target size (as of ZoL 2.1)2024-02-26T21:43:53Z2023-04-19T02:48:56Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/ZFSOnLinuxARCMemoryReclaimStatscks<div class="wikitext"><p>Yesterday I talked about <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSOnLinuxARCMemoryStatistics">some important ARC memory stats</a>, to go with <a href="https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSARCItsVariousSizes">stats on how
big the ARC is</a>. The ARC
doesn't just get big and have views on memory; it also has
information about when it shrinks and somewhat about why.
Most of these are exposed as event counters in
/proc/spl/kstat/zfs/arcstats, with arc_need_free as an
exception (it counts how many bytes ZFS thinks it currently
wants to shrink the ARC by).</p>
<p>The Linux kernel's memory management has 'shrinkers', which are
callbacks into specific subsystems (like ZFS) that the memory
management invokes to reduce memory usage. These shrinkers operate
in two steps; first the kernel asks the subsystem how much memory
it could possibly return, and then it asks the subsystem to do it.
The basic amount of memory that the ARC can readily return to the
system is the sum of <code>mru_evictable_data</code>, <code>mru_evictable_metadata</code>,
<code>mfu_evictable_data</code>, and <code>mfu_evictable_metadata</code> (the actual
answer is more complicated, see <a href="https://github.com/openzfs/zfs/blob/master/module/os/linux/zfs/arc_os.c#L135">arc_evictable_memory() in
arc_os.c</a>).
Normally this is limited by <a href="https://openzfs.github.io/openzfs-docs/man/4/zfs.4.html#zfs_arc_shrinker_limit">zfs_arc_shrinker_limit</a>,
so any single invocation will only ask the ARC to drop at most 160
MBytes.</p>
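<p>(As a rough sketch, you can add these up yourself straight from
arcstats; the field names are as in ZoL 2.1 and the value is the
third column.)</p>
<blockquote><pre style="white-space: pre-wrap;">
awk '$1 ~ /^(mru|mfu)_evictable_(data|metadata)$/ { total += $3 }
     END { printf "%.1f MiB readily evictable\n", total / (1024*1024) }' /proc/spl/kstat/zfs/arcstats
</pre>
</blockquote>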
<p>Every time shrinking happens, the ARC target size is reduced by
however much the kernel asked ZFS to shrink, arc_no_grow is set
to true, and either <code>memory_indirect_count</code> or <code>memory_direct_count</code>
is increased. If the shrinking is being done by Linux's kswapd, it
is an indirect count; if the shrinking is coming from a process
trying to allocate memory, finding it short, and directly triggering
memory reclaiming (a 'direct reclaim'), it is a direct count. Direct
reclaims are considered worse than indirect reclaims, because they
indicate that kswapd wasn't able to keep up with the memory demand
and other processes were forced to throttle.</p>
<p>(I believe the kernel may ask ZFS to drop less memory than ZFS
reported it could potentially drop.)</p>
<p>The ARC has limits on how much metadata it will hold, both general
metadata, arc_meta_limit versus arc_meta_used, and for
dnodes specifically, arc_dnode_limit versus dnode_size.
When the ARC shrinks metadata, it may need to 'prune' itself by
having filesystems release dnodes and other things they're currently
holding on to. When this triggers, arc_prune will count up by
some amount; I believe this will generally be one per currently
mounted filesystem (see <a href="https://github.com/openzfs/zfs/blob/master/module/os/linux/zfs/arc_os.c#L504">arc_prune_async() in arc_os.c</a>).</p>
<p>When the ARC is evicting data, it can increase two statistics,
<code>evict_skip</code> and <code>evict_not_enough</code>. The latter is the number of
times ARC eviction wasn't able to evict enough to reach its target
amount. For the former, let's quote <a href="https://github.com/openzfs/zfs/blob/master/include/sys/arc_impl.h#L567">arc_impl.h</a>:</p>
<blockquote><p>Number of buffers skipped because they have I/O in progress, are
indirect prefetch buffers that have not lived long enough, or are
not from the spa we're trying to evict from.</p>
</blockquote>
<p>ZFS can be asked to evict either a certain amount of a particular
class of evictable things (such as MRU metadata) or everything in
that class. Only the former case can cause <code>evict_not_enough</code>
to count up.</p>
<p>In addition to regular data, the ARC can store 'anonymous' data.
I'll quote <a href="https://github.com/openzfs/zfs/blob/master/include/sys/arc_impl.h#L60">arc_impl.h</a>
again:</p>
<blockquote><p>Anonymous buffers are buffers that are not associated with
a DVA. These are buffers that hold dirty block copies
before they are written to stable storage. By definition,
they are "ref'd" and are considered part of arc_mru
that cannot be freed. Generally, they will acquire a DVA
as they are written and migrate onto the arc_mru list.</p>
</blockquote>
<p>The size of these are the <code>anon_size</code> kstat. Although there are
<code>anon_evictable_data</code> and <code>anon_evictable_metadata</code> stats, I
believe they're always zero because anonymous dirty buffers probably
aren't evictable. Some of the space counted here may be 'loaned out'
and shows up in <code>arc_loaned_bytes</code>.</p>
<p>As part of setting up writes, ZFS will temporarily reserve ARC space
for them; the current reservation is reported in <code>arc_tempreserve</code>.
Based on the code, the total amount of dirty data in the ARC for
dirty space limits and space accounting is <code>arc_tempreserve</code> plus
<code>anon_size</code>, minus <code>arc_loaned_bytes</code>.</p>
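<p>(A sketch of computing that dirty figure from arcstats, on the
assumption that I'm reading the accounting right:)</p>
<blockquote><pre style="white-space: pre-wrap;">
awk '$1 == "arc_tempreserve"  { t = $3 }
     $1 == "anon_size"        { a = $3 }
     $1 == "arc_loaned_bytes" { l = $3 }
     END { printf "dirty: %.1f MiB\n", (t + a - l) / (1024*1024) }' /proc/spl/kstat/zfs/arcstats
</pre>
</blockquote>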
<p>Under some situations that aren't clear to me, ZFS may feel it needs
to throttle new memory allocations for writes. When this happens,
<code>memory_throttle_count</code> will increase by one. This seems to be
rare as it's generally zero on our systems.</p>
</div>
ARC memory reclaim statistics exposed by ZFS on Linux (as of ZoL 2.1)2024-02-26T21:43:53Z2023-04-18T02:47:21Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/ZFSOnLinuxARCMemoryStatisticscks<div class="wikitext"><p>The <a href="https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Workload%20Tuning.html#adaptive-replacement-cache">ZFS ARC</a>
is ZFS's version of a disk cache, and ZFS on Linux reports various
information about it in /proc/spl/kstat/zfs/arcstats. Some of this
is <a href="https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSARCItsVariousSizes">information on how big the ZFS ARC is and wants to be</a>, but other parts contain important
information on how ZFS views the system's overall memory situation.
The general meaning of this information is system independent (I
believe it exists on FreeBSD and Illumos, as well as ZFS on Linux),
but how it's determined and derived is system specific and I've only
looked into the situation on Linux.</p>
<p><a href="https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSARCItsVariousSizes">As covered</a>, the critical ARC
size parameter for determining if it will grow, shrink, or stay the
same size is '<code>c</code>', also known as '<code>arc_c</code>', which is what the
ARC considers the overall target size. ZFS also exposes three memory
sizes, <code>memory_all_bytes</code>, <code>memory_free_bytes</code>, and
<code>memory_available_bytes</code>. The 'all' number is how much total RAM
ZFS thinks the system has; the 'free' number is how much memory ZFS
thinks is free in general, and 'available' is how much memory ZFS
feels it has available to it at the moment, which can go negative.
If the 'available' number goes negative, the ARC shrinks; if it's
(enough) positive, the ARC can grow.</p>
<p>On Linux, the code that determines these is in <a href="https://github.com/openzfs/zfs/blob/master/module/os/linux/zfs/arc_os.c">arc_os.c</a>.
On most Linux systems, the 'free' is the number of free pages plus
the number of inactive file pages, which are visible in /proc/vmstat
as <code>nr_free_pages</code> and <code>nr_inactive_file</code>. On all Linux systems,
the 'available' number is the 'free' number minus '<code>arc_sys_free</code>',
which is normally somewhat over 1/32nd of your total RAM and doesn't
get adjusted on the fly by ZFS. You can set this through the
<a href="https://openzfs.github.io/openzfs-docs/man/4/zfs.4.html#zfs_arc_sys_free">zfs_arc_sys_free</a>
parameter.</p>
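<p>(All three of these are plain fields in arcstats, and the threshold
itself is a module parameter, so a quick look is just:)</p>
<blockquote><pre style="white-space: pre-wrap;">
grep -E '^memory_(all|free|available)_bytes' /proc/spl/kstat/zfs/arcstats
cat /sys/module/zfs/parameters/zfs_arc_sys_free
</pre>
</blockquote>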
<p>(The manual page says that arc_sys_free is normally 1/64th of
RAM, but <a href="https://github.com/openzfs/zfs/blob/master/module/os/linux/zfs/arc_os.c#L332">the actual code says 1/32nd plus stuff</a>.)</p>
<p>Whether or not the ARC can grow at the moment is shown in
'<code>arc_no_grow</code>', which is 1 if the ARC can't grow at the moment.
Generally, this will turn on and stay on if 'available' is less
than 1/32nd of '<code>arc_c</code>' (the 1/32nd bit is determined by
'<code>arc_no_grow_shift</code>', which is an internal variable and so not
subject to tuning in ZFS on Linux). One implication of this is that
it's harder and harder for the ARC target size to grow toward its
maximum because you need more and more free memory as '<code>arc_c</code>'
gets larger and larger. On <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSFileserverSetupIII">our ZFS fileservers with 192 GB of
RAM</a> we set the maximum ARC size to about
155 GB, so at the top end we need the 'free' memory number to reach
over 10 GB. It looks like we have gotten there sometimes, but it
doesn't happen very often.</p>
<p>(Most of our fileservers also spend 80% to 90% of their time with
'<code>arc_no_grow</code>' being 1.)</p>
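<p>(A sketch of checking the 1/32nd condition by hand, assuming that
memory_available_bytes is reported as a signed number:)</p>
<blockquote><pre style="white-space: pre-wrap;">
awk '$1 == "c" { c = $3 }
     $1 == "memory_available_bytes" { avail = $3 }
     END { printf "need %d bytes available, have %d\n", c / 32, avail }' /proc/spl/kstat/zfs/arcstats
</pre>
</blockquote>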
<p>The situation for '<code>arc_no_grow</code>' is checked once a second, so
even without explicit memory pressure ARC growth will turn off when
'available' drops low enough; once '<code>arc_c</code>' is large, this may
be most of the time because of the minimum requirement above. If
'available' becomes negative (ie, if the 'free' memory drops below
'<code>arc_sys_free</code>'), then ZFS will consider there to be a 'memory
pressure event' and ARC growth can't turn back on until at least
<a href="https://openzfs.github.io/openzfs-docs/man/4/zfs.4.html#zfs_arc_grow_retry">zfs_arc_grow_retry</a>
seconds later, which defaults to five seconds. It's likely but not
certain that this will trigger the ARC target size shrinking.</p>
<p>If '<code>arc_need_free</code>' is non-zero, this means that ZFS on Linux
is in the process of trying to shrink the ARC by (at least) that
amount of bytes. This statistic is not used inside ZFS on Linux;
it purely exposes some state information, and I think it can be
zero even if the ARC is currently reclaiming memory.</p>
<h3>Sidebar: The ARC's target size versus its actual size</h3>
<p>It's entirely possible for the ARC to drop its memory usage without
dropping its target size (for example, if you delete a big file
that's been cached in the ARC, I think the ARC may drop the cached
blocks for the file). Over the last week, <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSFileserverSetupIII">our fileservers</a> have had the target size be up to 40 GB
more than the current size.</p>
<p>Differences the other way (when the target size is below the actual
size) seem to be much smaller. Even going back four weeks, the
largest shortfall is only a little bit over a GB. The obvious guess
is that ZFS seems to be quite prompt at shrinking the ARC alongside
shrinking its target size.</p>
</div>
Some important ARC memory statistics exposed by ZFS on Linux (as of ZoL 2.1)2024-02-26T21:43:53Z2023-04-17T03:01:41Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/SystemSoundsShouldBeGranularcks<div class="wikitext"><p>A while back I wrote about <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/KDESilencingApps">silencing KDE application notification
sounds under fvwm</a>, where the solution was to
open up the desktop settings or volume control application of your
choice and turn off the volume of what is variously called 'Notification
Sounds' or 'System Sounds'. Initially I did this through KDE and
thought it was a KDE-specific setting, but as pointed out to me in
comments, this is actually a global (to the system) sound stream
for system sounds and events. This is an extremely blunt hammer to
deal with unwanted notification sounds, and Linux should do better.
In practice this means both programs and desktop toolkits, such as
KDE and GTK.</p>
<p>Modern mobile devices show what's possible and what we should be
able to get on Linux. Mobile OSes such as iOS support mandatory
system level control over whether each application can use sounds
(or other things) in their notifications. Better applications also
provide granular control over what noises they may optionally make,
including 'none'. Of course there's a place for a global control,
but it should be your last resort if your issue is with a specific
application making unwanted noises, especially for non-urgent things.</p>
<p>(Some people will want their computer to make no noises at all, but
others may want to preserve some system sounds for, for example, urgent
issues or alerts.)</p>
<p>The state in current desktops (in Fedora 37) seems somewhat mixed.
Cinnamon gives you control over what sounds its desktop shell makes,
but not over what notification sounds its applications generate
(never mind applications from other desktops). GNOME, in its
'Notifications' section, offers specific controls over sound usage
by notification by application, but only for GNOME applications.
If you dig deep enough in KDE settings, KDE seems to have quite
fine grained control over notification sounds from certain KDE
applications, but this doesn't seem to include applications that
are merely KDE-based, like <a href="https://invent.kde.org/sdk/kdiff3">kdiff3</a>.</p>
<p>If GNOME's notification sound setting worked for everything, it
would have basically the interface I'd want (and be very similar
to what iOS and I believe Android support, which may not be a
coincidence). However, achieving this would probably require everyone
to agree on some sort of standard for notifications, either in the
form of 'send your notification stuff here to a daemon that will
actually make them appear' (and then the daemon could filter things),
or in the form of exposing and configuring notification settings for
arbitrary programs.</p>
<p>Until such a beautiful day comes to pass, Linux sound notifications will
continue to be an irritating mess and probably irritate people when some
random program suddenly bings and bongs at them. The likely consequence
of this is that more people turn off system sounds entirely.</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SystemSoundsShouldBeGranular?showcomments#comments">5 comments</a>.) </div>Notification sounds and system sounds on Linux should be granular2024-02-26T21:43:53Z2023-04-12T02:43:27Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/LinuxStaticLinkingVsGlibccks<div class="wikitext"><p>In Linux (and other operating systems), <a href="https://en.wikipedia.org/wiki/Name_Service_Switch">NSS (Name Service Switch)</a> is a mechanism
that lets the system implement name resolution for various sorts
of name lookups through a system of dynamically loaded shared
objects, configured through <a href="https://man7.org/linux/man-pages/man5/nsswitch.conf.5.html">/etc/nsswitch.conf</a>. Also
in Linux, in theory, you can statically link programs through the
'-static' argument to various programs like GCC and <a href="https://www.arp242.net/static-go.html">the Go toolchain</a>. Statically linking program
executables is appealing because <a href="https://utcc.utoronto.ca/~cks/space/blog/programming/GoAndGlibcVersioning">this can avoid situations where you can't run
an executable on an older Linux version than the one it was built on</a>.</p>
<p>You might wonder how NSS plays together with statically linking your
executables. The answer is that it doesn't:</p>
<blockquote><pre style="white-space: pre-wrap;">
warning: Using 'getaddrinfo' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
</pre>
</blockquote>
<p>What you get is a statically linked executable that still requires the
glibc version you linked with.</p>
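<p>(A minimal way to see this warning for yourself, assuming gcc or
another glibc-targeting compiler; the file name is entirely
arbitrary:)</p>
<blockquote><pre style="white-space: pre-wrap;">
cat > nsstest.c <<'EOF'
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>
#include <stdio.h>

int main(void) {
    struct addrinfo *res = NULL;
    /* any NSS-backed lookup will do; getaddrinfo is the classic one */
    int rc = getaddrinfo("localhost", NULL, NULL, &res);
    printf("getaddrinfo: %d\n", rc);
    if (rc == 0)
        freeaddrinfo(res);
    return 0;
}
EOF
cc -static -o nsstest nsstest.c   # the link step emits the warning quoted above
</pre>
</blockquote>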
<p>On the one hand, this is unappealing. On the other hand, there is
basically no way out. NSS modules are shared objects, and they may
use more or less arbitrary bits of the C library. Since they may
well have been compiled against the current version of glibc, these
bits may require that glibc version in order to work right, through
symbol versioning or otherwise. They also have to be able to resolve
random symbols in the C library, symbols that may not have been
used in your program and so were omitted from your static executable.</p>
<p>(There are also more direct issues of how dynamic lookup and indirect
calls are implemented in shared objects, but let's handwave those and
assume that you could in theory build versions of those out of your
static program at run time.)</p>
<p>Pretty much the only good answer is that basically you may need the
entire shared glibc in order to make a NSS shared object happy. So
you'd better have that glibc available.</p>
<p>Since I mostly haven't been trying to statically link my (C)
executables for some time for various reasons, this probably won't
make much of a difference to me. Linux distributions have been
making static linking more and more of a hassle in general, with
things like missing static versions of dynamic libraries, or at
least hidden ones that live in obscure packages. Although I see
that there is a readily available libreadline.a in Ubuntu; <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ReadlineDistroVersionMess">libreadline
has been one of my pain points over the years</a>.</p>
<p>(This is something I could have known long ago if I really paid
attention, but I only had my nose rubbed into it today for reasons
beyond the scope of this entry.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/LinuxStaticLinkingVsGlibc?showcomments#comments">5 comments</a>.) </div>On Linux, you can't usefully statically link programs using NSS2024-02-26T21:43:53Z2023-04-10T03:25:58Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/ZFSOnLinuxNeedsARCControlcks<div class="wikitext"><p>Over on the Fediverse <a href="https://mastodon.social/@cks/110101574590648632">I said something about the ZFS ARC</a>:</p>
<blockquote><p>It has been '0' days since I wished for a way to directly set ZFS on
Linux's 'arc_c' internal parameter for the target size of the ZFS ARC.</p>
<p>Why yes, our ARCs are still collapsing for mysterious reasons on our
ZoL fileservers.</p>
</blockquote>
<p>(I was going to say that 'collapsing' is a relative term, but on
checking our metrics, we've seen some really remarkably low ARCs
for <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSFileserverSetupIII">fileservers</a> with 192 GB of RAM. It
looks like we had one drop as low as 56 GB in the past week.)</p>
<p>The <a href="https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Workload%20Tuning.html#adaptive-replacement-cache">ZFS ARC</a>
is ZFS's version of a disk cache. For various reasons, ZFS on Linux
keeps its ARC separate from the kernel's regular disk caches that
are used for other filesystems, and ZFS tunes the ARC size and other
parameters separately, instead of the whole thing being integrated
into the kernel's general memory tuning.</p>
<p>An important parameter for ARC sizing is arc_c, which is the
target size for the ARC, as opposed to its current size. The ARC's
current size may drop significantly under memory pressure, but it
will grow back to arc_c given time. If arc_c also drops,
the ARC will generally not grow its memory use very fast; first
ZFS has to decide to raise arc_c, and then it has to have the
ARC grow to that new size.</p>
<p>As you might guess from my Fediverse post, <a href="https://zfsonlinux.org/">ZFS On Linux</a> doesn't directly expose any way to set
arc_c. If your ARC target size has collapsed down to ridiculously
low numbers, there's no straightforward way to change it. Sometimes
you can change the ZFS module parameter zfs_arc_max and this
seems to give ZFS a kick; otherwise, there is at most the brute
force and potentially dangerous approach of temporarily setting a
high zfs_arc_min (which has the obvious side effect of raising
arc_c to this new minimum value if necessary). However,
historically <a href="https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSOverPrefetchingUpdateII">setting zfs_arc_min has been dangerous</a>.</p>
<p>In addition, there's an additional internal ARC variable of whether
or not the ARC can grow; this is arc_no_grow in
/proc/spl/kstat/zfs/arcstats. If this is '1', I'm not sure that
having a high arc_c does you any good. This too is not something
that you can control, and it's not even obvious how decisions are
made about it.</p>
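<p>(While you can't set either of them, you can at least watch them;
both show up as plain fields in arcstats:)</p>
<blockquote><pre style="white-space: pre-wrap;">
awk '$1 == "c" || $1 == "arc_no_grow" { print $1, $3 }' /proc/spl/kstat/zfs/arcstats
</pre>
</blockquote>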
<p>(You can read all about this sausage in <a href="https://github.com/openzfs/zfs/blob/master/module/zfs/arc.c">arc.c</a> and
<a href="https://github.com/openzfs/zfs/blob/master/module/os/linux/zfs/arc_os.c">arc_os.c</a>.)</p>
<p>ZFS On Linux having annoying issues with ARC size isn't a new issue;
<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSOnLinuxARCShrinkage">we had this problem on the 18.04 versions of our fileservers</a>, and I've had it periodically on my desktop
machines. Since the problems with ARC sizing keep not getting fixed
in ZFS On Linux, I've come around to the idea that system administrators
should at least have a hammer that we can use to tell ZFS On Linux that
it's wrong and the ARC target size should really be 'X', for some X.</p>
<p>(Alternately, we at least need better documentation on all of the ARC
related metrics and probably better metrics, so that we can understand
what it did and why. Yes, I know, I have bpftrace and other eBPF tools
so in theory I can instrument the kernel code. I should not have to be
<a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/OperatorsAndSystemProgrammers">a system programmer</a> here.)</p>
</div>
ZFS On Linux (still) needs better ways to control the ZFS ARC2024-02-26T21:43:53Z2023-04-07T03:11:29Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/GnomeTerminalBiggerMarginscks<div class="wikitext"><p>For reasons outside the scope of this entry, I've recently been
giving Gnome-Terminal more of a try as my secondary terminal program,
instead of urxvt (my primary terminal program remains xterm).
<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/DefaultTerminalProgram">Gnome-Terminal is a perfectly fine default terminal program</a> and so most of this has gone perfectly
well. But the more I used gnome-terminal, the more I found myself
having problems with how by default it runs the text almost right
up against the edge of the window (on the left) and the scrollbar
(on the right). I found that gnome-terminal's lack of margins made
it harder for me to read and scan text at either edge.</p>
<p>(This is an interesting effect because xterm also runs text up to
about the same few pixels from the left and right edges, but somehow
xterm comes off as more readable and I've never felt it was an
issue. Some of it may be that xterm has a quite different rendering
for its scrollbar, one that I think creates the impression of more
margin than actually exists, and by default puts it on the left
side instead of the right. And of course I usually have much more
text up against the left side than the right.)</p>
<p>Fortunately it turns out that this is fixable. Gnome-terminal is a
GTK application and modern GTK applications can be styled through
a fearsome array of custom <a href="https://developer.mozilla.org/en-US/docs/Web/CSS">CSS</a> (yes, the web
thing, it's everywhere and for good reason). Courtesy of <a href="https://askubuntu.com/questions/115762/increase-padding-in-gnome-terminal">this
AskUbuntu question and its answers</a>,
I discovered that all you need is ~/.config/gtk-3.0/gtk.css with
a little bit in it:</p>
<blockquote><pre style="white-space: pre-wrap;">
VteTerminal,
TerminalScreen,
vte-terminal {
padding: 4px 4px 4px 4px;
-VteTerminal-inner-border: 4px 4px 4px 4px;
}
</pre>
</blockquote>
<p>The 4px is experimentally determined; I started with the answer's
10px and narrowed it down until it felt about right (erring on the
side of more rather than less space).</p>
<p>I've only tested this with gnome-terminal, so it's possible that
it will make other <a href="https://wiki.gnome.org/Apps/Terminal/VTE">VTE</a>
based terminal programs unhappy (although I tried xfce4-terminal
briefly and it seemed okay). However, in gnome-terminal it makes
me happy.</p>
<p>(In the way of the modern world, you need this gtk.css on every
remote machine that you intend to run gnome-terminal on with X over
SSH. Conveniently, in my case this is effectively none of them; we
don't install gnome-terminal on <a href="https://support.cs.toronto.edu/">our</a>
Ubuntu servers any more. We do install xterm.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/GnomeTerminalBiggerMargins?showcomments#comments">One comment</a>.) </div>Giving Gnome-Terminal some margins makes me happier with it2024-02-26T21:43:53Z2023-03-31T03:24:39Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/LinuxBlockDiscardInPracticecks<div class="wikitext"><p>I'll put the summary up front. If you have SSD based systems installed
with a reasonably modern Linux, it's pretty likely that they are
quietly automatically discarding blocks from your SSDs on a regular
basis. This is probably true even if you use software RAID mirrors
(despite <a href="https://utcc.utoronto.ca/~cks/space/blog/tech/RAIDSSDBlockDiscardProblem">the potential problem RAID has with discarding blocks</a>).</p>
<p>To start with, you can see if your SSDs are capable of discarding
blocks with '<code>lsblk -dD</code>'. If block discard is possible, it will report
something like:</p>
<blockquote><pre style="white-space: pre-wrap;">
; lsblk -dD
NAME DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
sda 0 512B 2G 0
sdb 0 512B 2G 0
sr0 0 0B 0B 0
zram0 0 4K 2T 0
nvme0n1 0 512B 2T 0
nvme1n1 0 512B 2T 0
</pre>
</blockquote>
<p>But what about your software RAID arrays? You can check those too:</p>
<blockquote><pre style="white-space: pre-wrap;">
; lsblk -dD /dev/md*
NAME DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
md20 0 512B 2T 0
md25 0 512B 2G 0
md26 0 512B 2G 0
md31 0 512B 2T 0
</pre>
</blockquote>
<p>If you guessed that md20 and md31 are on the NVMe disks and md25 and
md26 are on the SATA SSDs, you're correct. All of these are mirrors.</p>
<p>On typical modern Linux systems, the actual ongoing trimming is
done by <a href="https://man7.org/linux/man-pages/man8/fstrim.8.html">fstrim</a>,
which is run from 'fstrim.service', which is triggered by 'fstrim.timer'
on a regular basis; see 'systemctl list-timers' to see if it's
enabled on your system. Typical setups have fstrim logging what it
did into the systemd journal, so you can see what it did with
'journalctl -u fstrim.service' (possibly with -r to see the most
recent runs first). Both Fedora and Ubuntu seem to enable fstrim
by default; my Fedora desktops and our 20.04 and 22.04 Ubuntu servers
all have it on.</p>
<p>Modern Linux kernels expose IO statistics about discards that have
happened on each device (since the system was last rebooted). These
are visible in /proc/diskstats, and are covered in
<a href="https://www.kernel.org/doc/Documentation/admin-guide/iostats.rst">Documentation/admin-guide/iostats.rst</a>.
Because these IO stats have been in diskstats for a while, things
that parse and extract information from diskstats may also report
them. In particular, this information is reported by <a href="https://github.com/prometheus/node_exporter">the Prometheus
host agent</a> and can
be used in a suitable Prometheus setup to see how much discarding
your various devices are doing and have been doing (including for
software RAID devices).</p>
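<p>(As a quick sketch, you can also pull the discard numbers straight
out of /proc/diskstats yourself; on kernels with discard accounting,
the sectors discarded are the 17th field.)</p>
<blockquote><pre style="white-space: pre-wrap;">
awk 'NF >= 17 && $17 > 0 { printf "%-12s %d sectors discarded\n", $3, $17 }' /proc/diskstats
</pre>
</blockquote>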
<p>Not all filesystems support the block discarding related features
that <a href="https://man7.org/linux/man-pages/man8/fstrim.8.html">fstrim</a> needs, although ext4 and btrfs both do (for btrfs,
see their <a href="https://btrfs.readthedocs.io/en/latest/Trim.html">Trim/discard</a>
page). In particular, <a href="https://zfsonlinux.org/">ZFS on Linux</a>
doesn't support them, and so the regular fstrim.timer won't TRIM
your ZFS pools. Instead, <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSOnLinuxTrimNotes">there are various options for doing this</a> and if you want you can do so more cautiously
than <a href="https://man7.org/linux/man-pages/man8/fstrim.8.html">fstrim</a> normally lets you. Looking at IO statistics for
discarding can confirm what filesystems do and don't support this,
especially since discard information is available for partitions.</p>
<p>Knowing that our SSDs have been TRIM'd for some time (probably
years) without any visible explosions makes me somewhat more confident
about using some sort of ZFS TRIM'ing on my desktops (our servers
don't need it right now for reasons outside the scope of this entry).
I'm still not fully confident for ZFS because <a href="https://mastodon.social/@cks/110073893690957520">while the SSDs and
regular filesystems may be well tested for TRIM, I'm not sure how
much production use ZFS TRIM has had</a>.</p>
<p>(<a href="https://mastodon.social/@cks/110070117304487616">I discovered this quiet, problem free TRIM'ing yesterday</a> and then did some
further investigation which led to discovering metrics and so on.)</p>
</div>
SSD block discard in practice on Linux systems2024-02-26T21:43:53Z2023-03-24T03:06:18Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/ZFSAndNFSFilesystemIDscks<div class="wikitext"><p>One part of a <a href="https://utcc.utoronto.ca/~cks/space/blog/unix/NFSFilehandleInternals">NFS filehandle</a> is
an identification of the filesystem (or more accurately the mount
point) on the server. As I've seen recently there are various forms
of <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/NFSServerFilesystemIDs">(NFS) filesystem IDs</a>, and <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/NFSFilehandlesClientSpecific">they can
even vary from client to client</a>
(although you shouldn't normally set things up that way). However,
all of this still leaves an open question for ZFS on Linux filesystems
in specific, which is where does the filesystem ID come from and
how can you work it out, or see if two filesystem IDs are going to
remain the same <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSAndNFSMountInvalidation">so you can substitute servers without invalidating
client NFS mounts</a>. As it happens I have
just worked out the answer to that question, so here it is.</p>
<p>All ZFS filesystems (datasets) have two identifiers, a <a href="https://openzfs.github.io/openzfs-docs/man/7/zfsprops.7.html#guid">'<code>guid</code>'</a> that
is visible as a dataset property, and a special '<code>fsid_guid</code>' (as
<a href="https://openzfs.github.io/openzfs-docs/man/8/zdb.8.html"><code>zdb</code></a>
calls it), that is the 'fsid'. There are two ways to find out the
fsid of a ZFS dataset. First, ZFS returns it to user level in the
'<code>f_fsid</code>' field that's part of what's returned by <a href="https://man7.org/linux/man-pages/man2/statfs.2.html"><code>statfs(2)</code></a>. Second, you
can use '<code>zdb</code>' to dump the objset object of a dataset, which you
may need to do if the filesystem isn't mounted. You find which
object you need to dump by getting the <a href="https://openzfs.github.io/openzfs-docs/man/7/zfsprops.7.html#objsetid">'<code>objsetid</code>'</a>
property of a ZFS filesystem (well, dataset):</p>
<blockquote><pre style="white-space: pre-wrap;">
# zfs get objsetid fs6-mail-01/cs/mail
[...]
fs6-mail-01/cs/mail objsetid 148 -
# zdb -dddd fs6-mail-01 148 | grep fsid_guid
fsid_guid = 40860249729586731
</pre>
</blockquote>
<p>For the statfs() version we can use Python, and conveniently report
the result in hex for reasons we're about to see:</p>
<blockquote><pre style="white-space: pre-wrap;">
>>> import os
>>> r = os.statvfs("/w/435").f_fsid
>>> print("%x" % r)
34ae17341d08c
</pre>
</blockquote>
<p>(These two approaches give the same answer for a given filesystem.)</p>
<p>In <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/NFSServerFilesystemIDs">my earlier exploration of NFS server filesystem IDs</a>, the NFS export '<code>uuid</code>' of this /w/435
test filesystem was '7341d08c:00034ae1:00000000:00000000', which
should look awfully familiar. It's the low 32-bit word and the
high 32-bit word of the '<code>f_fsid</code>', in that order, zero-padded.
The reason for this reversal is somewhat obscure and beyond the
scope of this entry (but it's probably <a href="https://github.com/openzfs/zfs/blob/master/module/os/linux/zfs/zfs_vfsops.c#L1139">this setting of the peculiar
f_fsid field</a>
in <a href="https://github.com/openzfs/zfs/blob/master/module/os/linux/zfs/zfs_vfsops.c#L1090">zfs_statvfs()</a>).</p>
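<p>(If you want to do the word swap yourself, a sketch in shell
arithmetic, using the f_fsid value from above:)</p>
<blockquote><pre style="white-space: pre-wrap;">
fsid=0x34ae17341d08c    # the hex f_fsid reported above
printf '%08x:%08x:00000000:00000000\n' $(( fsid & 0xffffffff )) $(( fsid >> 32 ))
# prints 7341d08c:00034ae1:00000000:00000000
</pre>
</blockquote>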
<p>(This is the uuid visible in /proc/fs/nfsd/exports. <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/NFSServerFilesystemIDs">As I
discovered earlier</a>, the version in
/proc/net/rpc/nfsd.fh/content will be different.)</p>
<p>One important thing here is that <strong>a filesystem's fsid is not copied
through ZFS send and receive</strong>, presumably because it's an invisible
attribute that exists at the wrong level. This means that if you do
<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSAndNFSMountInvalidation">ZFS fileserver upgrades by (filesystem) migration</a>, your new fileserver will normally
have ZFS filesystems with different ZFS fsids and thus different
NFS filesystem IDs than your old one, and your NFS clients will get
stale NFS handle errors. But at least you can now check this in
advance if you want to verify that this is so. You can't work around
this at the ZFS level, but you might be able to fix it at the NFS
export level by setting an explicit '<code>uuid=</code>' (of the old value)
for all of the exports of the moved filesystem. Locally, we're just
going to unmount and remount.</p>
<p>(I suspect that if you used '<a href="https://openzfs.github.io/openzfs-docs/man/8/zpool-split.8.html"><code>zpool split</code></a>' to
split a pool the two copies of the pool would have filesystems with
the same fsids and thus you could then do a migration from one to
the other. But I've never even come near doing a ZFS pool split,
so this is completely theoretical. For a server upgrade, presumably
you'd use some sort of remote disk system like iSCSI or maybe <a href="https://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device">DRBD</a> to
temporarily attach the new server's disks as additional mirrors,
then split them off.)</p>
</div>
ZFS on Linux and NFS(v3) server filesystem IDs2024-02-26T21:43:53Z2023-03-22T02:20:02Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/DefaultTerminalProgramcks<div class="wikitext"><p>Recently over on the Fediverse, someone who's coming back to Linux
asked what terminal program one should use these days. After thinking
about it, I realized that my answer had to be <a href="https://help.gnome.org/users/gnome-terminal/stable/">Gnome Terminal</a> (<a href="https://en.wikipedia.org/wiki/GNOME_Terminal">also</a>), at least for the
kind of person who's asking this question without any particular
additional qualifiers.</p>
<p>I'm personally very attached to the venerable xterm, but xterm
is an acquired taste with various issues in practice, so I can't
recommend it to a new person. If you want xterm, you already know
it (and you probably have several reasons why). Out of the various
alternatives, I think Gnome Terminal is the default choice for two
reasons.</p>
<p>First, Gnome Terminal is inoffensive and basically works. I believe that
it has all of the features you'd expect from a modern terminal emulator,
and if some aspect of its behavior or appearance isn't entirely to
your liking, it supports a reasonable amount of customization. If I
had to switch to Gnome Terminal for some reason I could probably get
by reasonably well, even though various differences from xterm would
irritate me for ages. Gnome Terminal is a perfectly reasonable and
functional terminal program, which is the basic requirement.</p>
<p>Second, Gnome Terminal is pervasive; it's the default Gnome terminal
program, and Gnome is more or less the default Linux desktop. Because
of this, pretty much everyone is going to test things with it
and make sure that they work okay, both in appearance and in
performance. For example, a Linux distribution is certainly going to
make sure that its choice of default monospace font works okay in
Gnome Terminal; <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/Fedora36FontconfigMystery">your mileage may vary in other terminal programs</a>. Similarly, <a href="https://utcc.utoronto.ca/~cks/space/blog/unix/TerminalColoursNotTheSame">text colours vary between
terminal programs</a> but people are
almost certainly going to make sure that their program's use of colours
looks decent in Gnome Terminal. This means that you're less likely to
run into irritations with Gnome Terminal. And if something does explode,
if you're using Gnome Terminal in a standard environment (such as
Gnome itself), then fixing it should be a high priority for your Linux
distribution since a lot of people will be affected.</p>
<p>You can make a case for KDE's konsole on much the same reasons, but
I think KDE and konsole are less widely used and so you're more
likely to run into issues in distributions and with programs. You
can get rid of the distribution issues by using a Linux distribution
(not an alternate 'spin' of a distribution) that focuses on KDE,
which will likely take care to make sure their choice of default
fonts works well with konsole and so on. I'm not sure there are very
many of these left, though.</p>
<p>(This elaborates on <a href="https://mastodon.social/@cks/110052173289545829">my Fediverse reply</a>.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/DefaultTerminalProgram?showcomments#comments">5 comments</a>.) </div>Today the default choice for a terminal program is Gnome Terminal2024-02-26T21:43:53Z2023-03-21T02:30:05Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/NFSFilehandlesClientSpecificcks<div class="wikitext"><p>Under normal circumstances, we assume that NFS servers give out the
same <a href="https://utcc.utoronto.ca/~cks/space/blog/unix/NFSFilehandleInternals">NFS filehandle</a> for a given
file (or directory or etc) to every NFS client. On Linux, this is
not necessarily the case, although it usually will be.</p>
<p>To illustrate this, I'm going to get a second filehandle for the
same NFS export as I did in <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/NFSServerFilesystemIDs">my entry on NFS filesystem IDs</a>, using /proc/fs/nfsd/filehandle (cf
<a href="https://man7.org/linux/man-pages/man7/nfsd.7.html">nfsd(7)</a>):</p>
<blockquote><pre style="white-space: pre-wrap;">
>>> f = open("filehandle", mode="r+")
>>> f.write("128.100.x.x /w/435 128\n")
>>> r = f.read(); print(r)
\x 01 00 01 00 efbeadde
</pre>
</blockquote>
<p>This is not what we got for /w/435's filehandle in <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/NFSServerFilesystemIDs">the earlier
entry</a>, which was '\x 01 00 06 00 7341d08c
00034ae1 00000000 00000000' (embedding the normal kernel NFS server
'uuid' of the filesystem).</p>
<p>The structure of this block of hex comes from <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/nfsd/nfsfh.h">fs/nfsd/nfsfh.h</a>.
This is a version 1 filehandle, an ignored '0' auth type byte, a
type 1 fsid, and a fileid that is '<code>FILEID_ROOT</code>' (0), with an
odd looking rest of the data. If we look at /proc/net/rpc/nfsd.fh/content
we can see another version of this:</p>
<blockquote><pre style="white-space: pre-wrap;">
#domain fsidtype fsid [path]
@nfs_ssh 6 0x8cd04173e14a03000000000000000000 /w/435
128.100.X.X 1 0xdeadbeef /w/435
</pre>
</blockquote>
<p>The actual fsid type is a clue as to what's going on here; it is
'<code>FSID_NUM</code>', meaning a four byte user specified identifier, also
known as the <code>fsid=</code> field in <a href="https://man7.org/linux/man-pages/man5/exports.5.html">exports(5)</a>. In this
case the user specified identifier is 0xdeadbeef (decimal 3735928559,
or -559038737 in /proc/fs/nfsd/export), encoded in the filehandle in
a peculiar way.</p>
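<p>(The sort of <a href="https://man7.org/linux/man-pages/man5/exports.5.html">exports(5)</a> line that produces a filehandle like this
would be something like the following; the client and the other
options are made up, and I've used the decimal form of the fsid
since I'm not sure the tools accept hex.)</p>
<blockquote><pre style="white-space: pre-wrap;">
/w/435  128.100.X.X(rw,no_subtree_check,fsid=3735928559)
</pre>
</blockquote>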
<p>The ultimate cause of this is <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/NFSExportPermsModel">Linux's NFS export permissions model</a>. In many NFS servers, export settings are
attached to the export point, such as /w/435, and these settings
include what clients have access and so on. In Linux, each client
(or group of clients, such as a netgroup) instead has its own
collection of export settings for a particular export point. This
creates a natural model for
giving different clients different sets of permissions and attributes,
but it also means that all export attributes are per-client, including
ones such as <code>fsid=</code>. And since the filesystem id is necessarily
part of the NFS filehandle, NFS filehandles as a whole can be
different between different clients.</p>
<p>It's probably not very sensible to give different clients a different
filesystem identifier for the same NFS export. But it's technically
allowed, and the Linux kernel NFS server will play along if you do
this. I haven't tested what happens if you give the NFS server back
the 'wrong' filehandle (ie, if a @nfs_ssh machine gives the
kernel a filehandle issued for 128.100.X.X).</p>
<p>(There are some operational reasons to accept such wrong filehandles,
for example if 128.100.X.X is initially not part of @nfs_ssh
but then gets added to it. On the other hand, not accepting the
wrong version of a filehandle is arguably more secure if you have
specifically set different filesystem IDs for different clients.)</p>
<p>PS: To make /proc/fs/nfsd/filehandle work, the relevant client or
sort of client has to have mounted the filesystem, or perhaps there's
some other way to push the necessary information from mountd into
the kernel (cf <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/NFSExportPermsHandling">how mountd and export handle NFS permissions</a> and <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/NFSFlushingServerAuthCache">how to see and flush the kernel's
NFS server authentication cache</a>).</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/NFSFilehandlesClientSpecific?showcomments#comments">2 comments</a>.) </div>NFS filehandles from Linux NFS servers can be client specific2024-02-26T21:43:53Z2023-03-17T03:11:28Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/SystemdJournalctlSearchingcks<div class="wikitext"><p>Yesterday I wrote about <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/GrafanaLokiWhatILikeItFor">how I like using Grafana Loki for narrow
searches of our logs</a>. In
the process of writing that entry, it occurred to me that systemd's
<a href="https://www.freedesktop.org/software/systemd/man/journalctl.html"><code>journalctl</code></a> might
have some search features too that I could use as an alternative
to Loki. The answer is that yes, a modern journalctl does (or
at least probably does, since some of this depends on its build
options).</p>
<p>By now, hopefully everyone knows about <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/JournalctlShowOneUnit">using '<code>journalctl -u
<what></code>' to show logs from only a single service</a>,
and also '<code>journalctl --since ..</code>', which takes both absolute and
relative times in a convenient syntax (there's also '<code>--until</code>',
to restrict to a time range, but generally I only use one and just
stop looking after a certain point). If you're fishing for the
systemd unit associated with a log message, you can use '<code>journalctl
-o <a href="https://www.freedesktop.org/software/systemd/man/journalctl.html#with-unit">with-unit</a></code>',
although this won't always show you the answer. If you're using
'with-unit', you may also want <a href="https://www.freedesktop.org/software/systemd/man/journalctl.html#--no-hostname">'--no-hostname'</a>
so the output is less cluttered.</p>
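<p>Put together, a typical narrow search might start like this (the unit name
and times are just examples):</p>
<blockquote><pre style="white-space: pre-wrap;">
journalctl -u ssh.service --since -2h --no-hostname
journalctl --since '2023-03-14 09:00' --until '2023-03-14 12:00' -o with-unit
</pre>
</blockquote>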
<p>The big additional option is '<a href="https://www.freedesktop.org/software/systemd/man/journalctl.html#-g"><code>journalctl -g</code></a>',
aka --grep, which does what you'd expect; it takes a (<a href="https://pcre.org/current/doc/html/pcre2pattern.html">Perl-compatible</a>) regular
expression and shows you logs where the message matches the regular
expression. This match can be case sensitive or case insensitive.</p>
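<p>For example (the unit and the pattern are illustrative), a deliberately
case-insensitive search within one unit's logs:</p>
<blockquote><pre style="white-space: pre-wrap;">
journalctl -u ssh.service --since -7d -g 'failed|timeout' --case-sensitive=false
</pre>
</blockquote>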
<p>In other selection options besides -u, you can get the kernel
messages with '<a href="https://www.freedesktop.org/software/systemd/man/journalctl.html#-k"><code>journalctl -k</code></a>',
logs for a particular syslog identifier with '<a href="https://www.freedesktop.org/software/systemd/man/journalctl.html#-t"><code>journalctl -t</code></a>',
logs for a particular syslog priority (or message priority) with
'<a href="https://www.freedesktop.org/software/systemd/man/journalctl.html#-p"><code>journalctl -p</code></a>', and
for a syslog facility with '<a href="https://www.freedesktop.org/software/systemd/man/journalctl.html#--facility="><code>journalctl --facility</code></a>'.
Conveniently, if you specify a single priority (aka log level), you
get that priority or more important (which is called 'lower' for
reasons to do with how syslog priorities are represented). These
can be combined, so you can write:</p>
<blockquote><pre style="white-space: pre-wrap;">
journalctl -r --facility daemon -p notice
</pre>
</blockquote>
<p>Journalctl can also match against specific message fields, although
it looks like there's little or no wild card support. If you match
on multiple fields, all fields must match. There are two ways to
find out what fields and field values you have available. First,
you can use '<a href="https://www.freedesktop.org/software/systemd/man/journalctl.html#-N"><code>journalctl -N</code></a>' to
get the names of all fields (which are returned in a random, unsorted
order), and then '<a href="https://www.freedesktop.org/software/systemd/man/journalctl.html#-F"><code>journalctl -F ...</code></a>' to
see all of the values of a particular field (again, unsorted). The
well known fields and their meanings are covered in
<a href="https://www.freedesktop.org/software/systemd/man/systemd.journal-fields.html#">systemd.journal-fields</a>.
Once you have the field and field value of interest, you can then do
eg:</p>
<blockquote><pre style="white-space: pre-wrap;">
journalctl -r _TRANSPORT=syslog
</pre>
</blockquote>
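<p>To see what you have to work with before writing such a match, you might do
something like this (sorting the unsorted output yourself):</p>
<blockquote><pre style="white-space: pre-wrap;">
journalctl -N | sort
journalctl -F _TRANSPORT | sort
</pre>
</blockquote>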
<p>The other way is to dump some journal entries in JSON format, run
them through jq, and see what you get. You can optionally restrict
this to certain fields:</p>
<blockquote><pre style="white-space: pre-wrap;">
journalctl -o json -r | jq . | less
journalctl -o json -r | jq '[._CMDLINE, .MESSAGE]' | less
</pre>
</blockquote>
<p>You can use any of the filtering options to cut down how many messages
you have to pick through in JSON format. Unfortunately, '<code>journalctl
-F ...</code>' doesn't accept any options to narrow things down, so you can't
do handy things like see what executables are recorded for a particular
service. If you want that, you can do something like:</p>
<blockquote><pre style="white-space: pre-wrap;">
journalctl -u crond.service --since -31d -o json |
jq -r '._EXE' | sort -u
</pre>
</blockquote>
<p>I don't know if the indexing information necessary to determine this
is part of the systemd journal index; if it's not, this sort of thing
may be the best you can do within systemd.</p>
<p>PS: It's possible to forward the systemd journal using things like
<a href="https://www.freedesktop.org/software/systemd/man/systemd-journal-remote.service.html">systemd-journal-remote</a>
and <a href="https://www.freedesktop.org/software/systemd/man/systemd-journal-gatewayd.html#">systemd-journal-gatewayd</a>,
but it seems much more work to set up, especially if you want some
security. <a href="https://support.cs.toronto.edu/">We</a> may experiment with
this someday (it would go well with <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/CentralizeSyslog">our central syslog server</a>), but probably not any time soon.</p>
<h3>Sidebar: (re)formatting journalctl output a bit</h3>
<p>In theory journalctl allows you to control what fields are printed,
with '<a href="https://www.freedesktop.org/software/systemd/man/journalctl.html#--output-fields="><code>journalctl --output-fields ...</code></a>'.
In practice this is not particularly useful for two reasons. First,
you have to use this with a special output format, generally either
<a href="https://www.freedesktop.org/software/systemd/man/journalctl.html#verbose">'verbose'</a>
or <a href="https://www.freedesktop.org/software/systemd/man/journalctl.html#cat">'cat'</a>. If
you use 'verbose', you get a conveniently formatted timestamp but
also a chunk of forced contents because of the extra fields that
are always included (in an encoded form). If you use 'cat', you get
nothing for free, including formatted timestamps; you need to include
the '<code>SYSLOG_TIMESTAMP</code>' field and sort of hope. Second, regardless
of what output format you choose you get each field on a line by
itself, with no option to format them all on one line.</p>
<p>As a result, under most situations I think you're probably better
off using JSON output and then reaching for <a href="https://stedolan.github.io/jq/"><code>jq</code></a> to reformat things into a useful
text format (see <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/JqFormattingTextNotes">my notes about formatting text with <code>jq</code></a>). You can probably use <code>jq</code>
to reformat journalctl's raw timestamps into useful time formats,
too. If you're doing very much with this you're probably going to
wind up putting the whole thing in a script, unless you're much
better at on the fly <code>jq</code> command lines than I am.</p>
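<p>As a sketch of that approach (the unit name and the exact <code>jq</code>
program are just examples, and this assumes <code>.MESSAGE</code> is an
ordinary string rather than binary data), something like this prints a
timestamp, hostname, and message on one line:</p>
<blockquote><pre style="white-space: pre-wrap;">
journalctl -u cron.service --since -1h -o json |
  jq -r '[(.__REALTIME_TIMESTAMP | tonumber / 1000000 | floor |
           gmtime | strftime("%Y-%m-%d %H:%M:%S")),
          ._HOSTNAME, .MESSAGE] | join(" ")'
</pre>
</blockquote>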
<p>PPS: I can't blame journalctl too much for not providing a general
facility for formatting its output lines. Formatting output is a
potentially complex subject, and journalctl exists in a world with
tools like <a href="https://stedolan.github.io/jq/"><code>jq</code></a>. In the Unix tradition, it's fine to defer
extensive reformatting to other programs.</p>
<p>(This is where I wish awk would read JSON.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SystemdJournalctlSearching?showcomments#comments">2 comments</a>.) </div>Some notes on searching the systemd journal with journalctl2024-02-26T21:43:53Z2023-03-15T01:26:35Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/NFSServerFilesystemIDscks<div class="wikitext"><p><a href="https://utcc.utoronto.ca/~cks/space/blog/unix/NFSFilehandleInternals">NFS(v3) filehandles</a> are how NFS
clients tell the NFS server what they're operating on, and one part
of the filehandle is a 'fsid', a filesystem ID (or UUID). The Linux
kernel NFS server normally automatically determines the fsid to use
itself, but you can explicitly set it in your NFS exports, per
<a href="https://man7.org/linux/man-pages/man5/exports.5.html"><code>exports(5)</code></a>,
and doing so may let you move filesystems from one actual server
to another without clients having to unmount and remount the
filesystem (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSAndNFSMountInvalidation">as I recently discovered we needed to do in one ZFS
fileserver upgrade situation</a>). This
makes the question of what the fsid of your NFS exported filesystems is
a matter of some potential interest, and also where it comes from.</p>
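<p>For reference, explicitly setting a fsid is just another option in
<a href="https://man7.org/linux/man-pages/man5/exports.5.html"><code>exports(5)</code></a>. A minimal sketch, with the path, client, and value
all examples:</p>
<blockquote><pre style="white-space: pre-wrap;">
/w/435  @nfs_ssh(rw,sync,no_subtree_check,fsid=435)
</pre>
</blockquote>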
<p>Based on a comment from Arnaud Gomes on <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSAndNFSMountInvalidation">yesterday's entry</a>, the answer to where you can observe
the fsid of existing NFS exports turns out to be <code>/proc/fs/nfsd/exports</code>.
Filesystems only show up here once someone actually mounts them, but when
they do you'll get lines like this (on Ubuntu 22.04):</p>
<blockquote><pre style="white-space: pre-wrap;">
# Version 1.1
# Path Client(Flags) # IPs
/w/435 @nfs_ssh(rw, root_squash, sync, wdelay, no_subtree_check, uuid=7341d08c:00034ae1:00000000:00000000, sec=1)
</pre>
</blockquote>
<p>(Except with no spaces after the commas, I put them in to allow your browser
to wrap the line.)</p>
<p>Similar contents can be found in <code>/proc/net/rpc/nfsd.export/content</code>,
if you prefer looking there instead. In this output, I believe the
'<code>uuid</code>' field is the fsid, although I don't know if <a href="https://man7.org/linux/man-pages/man8/exportfs.8.html"><code>exportfs</code></a> will accept
this name for it or if you have to write it as 'fsid=...' in your
exports.</p>
<p>(I believe that a plain numeric 'fsid' is quite different from a
UUID here; in fact, I believe they'll generate completely different
NFS filehandles. You really need to specify the exact UUID and have
the kernel accept it as a UUID in order to get the same NFS
filehandle.)</p>
<p>Based on a quick look at the kernel source, it appears that there
are a number of different types of fsids. To find out what type
your particular filesystem and mount point is using, you can look
in <code>/proc/net/rpc/nfsd.fh/content</code>:</p>
<blockquote><pre style="white-space: pre-wrap;">
#domain fsidtype fsid [path]
@nfs_ssh 6 0x8cd04173e14a03000000000000000000 /w/435
</pre>
</blockquote>
<p>The definitions of the fsidtype numbers are in <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/nfsd/nfsfh.h">fs/nfsd/nfsfh.h</a>,
and here '6' is '<code>FSID_UUID16</code>', a 16-byte UUID. As we see here,
this UUID appears to be flipped around from the version listed in
<code>exports</code> in a somewhat complex way. In full filehandles, there are
also different types of 'fileid', which are covered in
<a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/exportfs.h">include/linux/exportfs.h</a>.
At the moment, <a href="https://zfsonlinux.org/">ZFS on Linux</a> appears to
use only '<code>FILEID_INO32_GEN</code>' ('1'), which has a 32-bit inode
number and a 32-bit generation number.</p>
<p>(If you simply specify a '<code>fsid=</code>' plain number in your exports, I
suspect you get a fsidtype of '<code>FSID_NUM</code>', aka 1.)</p>
<p>With ZFS (and probably other filesystems), if you export a subdirectory
of a filesystem instead of the root of the filesystem, what you get is
a fsidtype of 7, '<code>FSID_UUID16_INUM</code>', which is the 16-byte UUID plus
an 8-byte inode number. The 8-byte inode number appears on the front
of the UUID in <code>/proc/net/rpc/nfsd.fh/content</code> and appears to be the
visible inode number that '<code>ls -ldi</code>' will tell you.</p>
<p>As documented in <a href="https://man7.org/linux/man-pages/man7/nfsd.7.html">nfsd(7)</a>, it's possible
to use a special interface in <code>/proc/fs/nfsd</code> to get the full
filehandle for a given file in an exported filesystem, using the
'<code>filehandle</code>' file. I'll show an example in interactive Python:</p>
<blockquote><pre style="white-space: pre-wrap;">
>>> f = open("filehandle", mode="r+")
>>> f.write("@nfs_ssh /w/435/cks 128\n")
>>> r = f.read()
>>> print(r)
\x01 00 06 01 7341d08c 00034ae1 00000000 00000000 0a000200000000001d000000
</pre>
</blockquote>
<p>The actual output has no spaces, but I've broken it up to show some
of the observable structure, which comes from <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/nfsd/nfsfh.h">fs/nfsd/nfsfh.h</a>.
This is version 1, an auth type that is ignored and may always be
0, a type 6 fsid, a type 1 fileid, the 16-byte filesystem UUID as
shown in the same format as in the exports, and then a blob that
I'm not going to try to decode into its component parts because
that would require too much digging in the ZFS code.</p>
<p>(Interested parties can start with the fact that the observable
inode number for this directory is '2'.)</p>
<p>You get an interestingly different filehandle for the root of the
exported filesystem. I'll show it decoded again:</p>
<blockquote><pre style="white-space: pre-wrap;">
\x01 00 06 00 7341d08c 00034ae1 00000000 00000000
</pre>
</blockquote>
<p>This is still a type 6 fsid, but now the fileid is '<code>FILEID_ROOT</code>'
(0), the root of the exported filesystem. Since the root is unique, we
only have the filesystem UUID; there's no extra information. Well,
more exactly we probably have the filesystem 'fsid' in the sense of
<code>/proc/net/rpc/nfsd.fh/content</code>, which here is the filesystem UUID.</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/NFSServerFilesystemIDs?showcomments#comments">5 comments</a>.) </div>Some bits on Linux NFS(v3) server filesystem IDs (and on filehandles)2024-02-26T21:43:53Z2023-03-11T04:03:49Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/ZFSAndNFSMountInvalidationcks<div class="wikitext"><p>Suppose that you have <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSFileserverSetupIII">ZFS based NFS servers</a>
that you're changing from Ubuntu 18.04 to 22.04. These servers have
a lot of NFS exported filesystems that are mounted and used by a
lot of clients, so it would be very convenient if you could upgrade
the ZFS fileservers without having to unmount and remount the
filesystems on all of your clients. Conversely, if a particular way
of moving from 18.04 to 22.04 is going to require you to unmount
all of its filesystems, you'd like to know that in advance so you
can prepare for it, rather than find out after the fact when clients
start getting 'stale NFS handle' errors. Since we've just been
through some experiences with this, I'm going to write down what
we've observed.</p>
<p>There are at least three ways to move a ZFS fileserver from Ubuntu
18.04 to Ubuntu 22.04. I'll skip upgrading it in place because we
don't have any experience with that; <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/WhyNotInplaceOSUpgrades">we upgrade machines by
reinstalling them from scratch</a>. That
leaves two approaches for a ZFS server, which I will call a <em>forklift
upgrade</em> and a <em>migration</em>. In a forklift upgrade, you build new
system disks, then swap them in by exporting the ZFS pools, changing
system disks, booting your new 22.04 system, and importing the pools
back.</p>
<p>(As a version of the forklift upgrade you can reuse your current
system disks, although this means you can't readily revert.)</p>
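<p>In ZFS terms the forklift swap itself is straightforward; a sketch of the
pool handling (the pool name is an example):</p>
<blockquote><pre style="white-space: pre-wrap;">
# on the old 18.04 system, before swapping system disks
zpool export fs1-pool
# ...swap in the new 22.04 system disks and boot...
# on the new 22.04 system
zpool import fs1-pool
</pre>
</blockquote>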
<p>Our experience with these in place 'export pools, swap system disks,
import pools' forklift upgrades is that client NFSv3 mounts survive
over them. Your NFS clients will stall while your ZFS NFS server
goes away for a while, but once it's back (under the right host
name and IP address), they resume their activities and things pick
right back up where they were. We've also had no problems with ZFS
pools when we reboot our servers with changed hostnames; ZFS on Linux
still brings the pools up on boot even though the server's hostname
has changed.</p>
<p>However, forklift upgrades can only be done on ZFS fileservers where
you have separate system disks and ZFS pool disks. <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/LocalVarMailImprovement">We have one
fileserver where this isn't possible</a>;
it has only four disks and shares all of them between system
filesystems and its ZFS pool. For this machine we did a <em>migration</em>,
where we built a new version of the system using new disks on new
hardware, then moved the ZFS data over with ZFS snapshots (<a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/UpgradingMachinesWithState">as I
thought we might have to</a>).
Once the data was migrated, we shut down the old server and made
the new hardware take over the name, IP address, and so on.</p>
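<p>A sketch of the data migration step (the pool, filesystem, and host names
are examples; a real migration would likely finish with an incremental send
while the filesystem is idle):</p>
<blockquote><pre style="white-space: pre-wrap;">
# on the old server: snapshot, then replicate to the new server
zfs snapshot -r oldpool/w/435@migrate
zfs send -R oldpool/w/435@migrate | ssh newserver zfs receive -F newpool/w/435
</pre>
</blockquote>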
<p>Unfortunately for us, when we did this migration, NFS clients got
stale NFS mounts. The new version of this fileserver had the same
filesystem with the exact same contents (ZFS snapshots and snapshot
replication ensures that), the same exports, and so on, but <a href="https://utcc.utoronto.ca/~cks/space/blog/unix/NFSFilehandleInternals">the
NFS filehandles</a> came out different.
It's possible that we could have worked around this if we had set
an explicit '<code>fsid=</code>' value in our NFS export for the filesystem
(as per <a href="https://man7.org/linux/man-pages/man5/exports.5.html"><code>exports(5)</code></a>), but it's
also possible that there were other differences in the NFS filehandle.</p>
<p>(ZFS has a notion of a 'fsid' and a 'guid' for ZFS filesystems
(okay, datasets), and zdb can in theory dump this information, but
right now I can't work out how to go from a filesystem name in a
pool to reading out its ZFS fsid, so I can't see if it's preserved
over ZFS snapshot replication or if the receiver generates a new
one.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSAndNFSMountInvalidation?showcomments#comments">One comment</a>.) </div>ZFS on Linux and when you get stale NFSv3 mounts2024-02-26T21:43:53Z2023-03-10T03:38:51Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/DebconfWhiptailVsXtermcks<div class="wikitext"><p>Every so often I install or upgrade a package by hand on one of
<a href="https://support.cs.toronto.edu/">our</a> Ubuntu servers and the
package stops to ask me questions, because <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/UbuntuUpdateProcessDislike">that is a thing that
Debian packages can do</a>. Usually this
is pretty close to fatal, because <a href="https://mastodon.social/@cks/109982648901372273">in my normal xterm environment,
the default interactive interface Debconf uses for this doesn't
work</a>. Specifically,
there is no way to see what the current selection theoretically is,
which leaves me flying blind in picking an answer.</p>
<p>The ultimate cause for this turns out to be that <strong>the <a href="https://manpages.debian.org/bullseye/whiptail/whiptail.1.en.html"><code>whiptail</code></a>
program doesn't work in an <a href="https://invisible-island.net/xterm/">xterm</a>
that has colour turned off</strong>. Whiptail is <a href="https://manpages.debian.org/bullseye/debconf-doc/debconf.7.en.html#Frontends">the default program
used for the default 'dialog' debconf frontend</a>
(<a href="https://kolektiva.social/@Anarcat/109982789634101272">thanks to @anarcat for telling me about this</a>). Contrary
to what I thought before I tried it, whiptail doesn't intrinsically
require colour, as it will work if you claim your xterm is, say, a
VT100 (with eg '<code>export TERM=vt100</code>'). The alternative <a href="https://manpages.debian.org/bullseye/dialog/dialog.1.en.html"><code>dialog</code></a>
program works fine if your xterm has had its colours forced off,
and <a href="https://manpages.debian.org/bullseye/debconf-doc/debconf.7.en.html#DEBCONF_FORCE_DIALOG">you can force debconf to use dialog instead of whiptail</a>.</p>
<p>(In a terminal environment that it thinks can do colour, whiptail relies
on colour to highlight your selection so you know what it is. If the
terminal is not actually displaying colour, this goes badly.)</p>
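<p>If you know in advance that a particular install may ask questions, the
one-off workarounds are simple (the package name is just an example):</p>
<blockquote><pre style="white-space: pre-wrap;">
# prefer dialog over whiptail for this run
DEBCONF_FORCE_DIALOG=yes apt-get install some-package
# or claim a terminal type without colour, so whiptail behaves
TERM=vt100 apt-get install some-package
</pre>
</blockquote>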
<p>Xterm is relatively unusual among X terminal programs in that it supports
text colours but allows you to turn them off at runtime as a command
line option (or an X resource setting, <a href="https://mastodon.social/@cks/109982924464399854">which is what I use</a>). I disable terminal
colours whenever I can because they're almost always hard for me to
read, especially in the generally rather intense colour set that xterm
uses (<a href="https://utcc.utoronto.ca/~cks/space/blog/unix/TerminalColoursNotTheSame">X terminal programs aren't consistent about what text colours
look like</a>, so the experiences of
people using Gnome Terminal are different here). Unfortunately, once
you've started xterm with colours off, as far as I know there's no way
to turn them back on.</p>
<p>(There is probably some escape sequences that can be used to query
xterm to see if it currently supports colours. I suspect that my odds
of getting the authors of <a href="https://manpages.debian.org/bullseye/whiptail/whiptail.1.en.html"><code>whiptail</code></a> to use them are functionally
zero.)</p>
<p>There are an assortment of manual workarounds, such as setting
various environment variables before running apt-get. The practical
problem is that, <a href="https://mastodon.social/@cks/109982824003312269">to quote myself from the Fediverse</a>:</p>
<blockquote><p>The broad problem is that Ubuntu and Debian package installs/updates
infrequently and irregularly ambush me with this and the default
configuration doesn't work. If I expect it I have many workarounds,
but generally I don't. And I'll never remember to always, 100% of the
time deploy the workarounds on all of our servers all of the time, no
matter what I'm doing.</p>
</blockquote>
<p>In theory debconf supports not even asking you questions, in the
form of <a href="https://manpages.debian.org/bullseye/debconf-doc/debconf.7.en.html#noninteractive">the <code>noninteractive</code> frontend</a>.
In practice I don't have enough confidence in Debian packages or
especially Ubuntu's version of them behaving sensibly when they're
forced into non-interactive mode. The very nature of being able to
ask questions means that people don't necessarily feel compelled
to make the default answer a sensible one.</p>
<p>Possibly the right answer for us is to deploy a general system
setting on our servers to prefer <a href="https://manpages.debian.org/bullseye/dialog/dialog.1.en.html"><code>dialog</code></a> over <a href="https://manpages.debian.org/bullseye/whiptail/whiptail.1.en.html"><code>whiptail</code></a>.
Unfortunately Ubuntu doesn't want you to remove the 'whiptail'
package itself; it's a dependency of the 'ubuntu-minimal' package,
and I don't really feel like finding out what effects stripping out
core looking 'ubuntu-<etc>' packages have. Another option is for
me to configure xterm to set the '<code>$TERM</code>' environment variable to
'xterm-mono', which I expect exists on most Unix systems I'm likely
to use (or perhaps the older name 'xtermm', which is also on OpenBSD).
This version of xterm's <a href="https://man7.org/linux/man-pages/man5/terminfo.5.html">terminfo</a> capabilities
lacks colour entries entirely, and <a href="https://manpages.debian.org/bullseye/whiptail/whiptail.1.en.html"><code>whiptail</code></a> works fine with
it.</p>
<p>(I'm not intrinsically opposed to colours, but I am opposed to
blinding or hard to read colour choices, and a great deal of the
colours that programs try to use in terminal windows wind up that
way. The default colour set used by GNU Emacs for code highlighting
generally comes across to me as fairly nice, for example.)</p>
<p>PS: One way to see if your current terminal type claims to support
colours is '<code>tput colors</code>' (<a href="https://unix.stackexchange.com/a/10065">cf</a>).
In my regular xterms, this reports '8' (the basic number of ANSI
colours), while '<code>tput -T xterm-mono colors</code>' reports '-1', ie 'no'.</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/DebconfWhiptailVsXterm?showcomments#comments">3 comments</a>.) </div>Debconf's questions, or really whiptail, doesn't always work in xterms2024-02-26T21:43:53Z2023-03-09T04:12:30Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/SystemdDynamicUserNFSAndGroupscks<div class="wikitext"><p>Today <a href="https://mastodon.social/@cks/109950128454996029">I used bind mounts in an odd way</a>:</p>
<blockquote><p>Today's crazy Linux bind mount usage: <br>
# cp [-a] /a/nfs/mount/special-file /root/special-file <br>
# mount --bind /root/special-file /a/nfs/mount/special-file</p>
<p>This was the easiest way to make a systemd service with
DynamicUser=yes and a supplementary group get access to special-file,
which is only accessible by said group. (The normal version of the
service runs with the file not on NFS.)</p>
<p>I assume something about filesystem visibility for systemd dynamic
users but meh, life is short and I have a hammer.</p>
</blockquote>
<p>(Oops, I see I left out a critical '-a' cp argument in my initial
Fediverse post, <a href="https://mastodon.social/@cks/109951546087080567">cf</a>.)</p>
<p>Linux's bind mounts are normally used with directories, but it's
equally valid to bind mount a file, as I'm doing here. By bind-mounting
the special file the service needs to access to a local file, I'm
taking NFS out of the picture. This turned out to be the right
answer (and in fact the only good one), but not for the reasons
that I thought.</p>
<p>This particular service uses <a href="https://www.freedesktop.org/software/systemd/man/systemd.exec.html#DynamicUser="><code>DynamicUser=yes</code></a>
because it's a combination of <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SystemdDynamicUserLike">more convenient and more secure</a>. Because things run by the service need
to read a private file, the service also has <a href="https://www.freedesktop.org/software/systemd/man/systemd.exec.html#SupplementaryGroups=">a supplementary group</a>;
the private file is owned by and restricted to that group, and the
service is made a member of it. In the production deployment, this file lives on a local
filesystem; here, I was running a test setup, where having it on
NFS is more convenient. At first, I assumed that <a href="https://www.freedesktop.org/software/systemd/man/systemd.exec.html#DynamicUser="><code>DynamicUser=yes</code></a>
was manipulating NFS mount related things so that the supplementary
group was ignored (it wasn't completely blocking NFS mount access,
because other things the service was using came from the same NFS
mount), but this isn't the problem. Instead, the problem is on <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSFileserverSetupIII">our
Linux NFS servers</a>.</p>
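<p>The relevant service settings are roughly the following sketch, where the
group name and command are hypothetical:</p>
<blockquote><pre style="white-space: pre-wrap;">
[Service]
DynamicUser=yes
SupplementaryGroups=special-group
ExecStart=/usr/local/sbin/some-service
</pre>
</blockquote>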
<p>Like many other people, our Linux NFS servers are configured to
allow people to (meaningfully) be in more than 16 groups, which is
<a href="https://utcc.utoronto.ca/~cks/space/blog/unix/GroupLimitState">the NFS v3 protocol limit</a>. On Linux
NFSv3 servers, how this works is that <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/NFSFlushingServerGroupCache">the NFS server throws away
the group list from the NFS client and does its own local lookup</a>. We have a synchronized password file,
so for regular logins and groups the NFS servers have the same UIDs
and GIDs as the NFS clients (including for the supplemental group
used here) and this all works out. However, when you set
<a href="https://www.freedesktop.org/software/systemd/man/systemd.exec.html#DynamicUser="><code>DynamicUser=yes</code></a>, systemd makes up a new UID (and GID) that
doesn't exist in your local /etc/passwd and so won't exist in the
NFS server's /etc/passwd either. When a process in the service makes
NFS requests, the NFS server takes the carefully curated list of
supplemental groups you set up in systemd, throws them away, looks
up the UID in its own /etc/passwd and /etc/group, finds nothing,
and concludes that this request has no group permissions at all.</p>
<p>(Indeed, now that I look I can see the telltale '<uid> 0:' line in
the NFS server's /proc/net/rpc/auth.unix.gid/content, <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/NFSServerUsingGroupCache">cf</a>. Along with a few other unknown UIDs
that we're seeing from somewhere.)</p>
<p>When I used a bind mount to make the special file a local file, not
a NFS file, I bypassed the NFS server and with that, the NFS server
ignoring the local supplemental group. Once all of the access control
for the file was being done locally, by the client's kernel, the
supplemental group worked to allow access. I believe this was the
only way to solve the problem without changing the service unit.</p>
<p>So the end moral is <strong>supplemental groups don't work over NFSv3 with
systemd dynamic users</strong>. More generally, supplemental groups with
anonymous UIDs don't work over NFS; systemd dynamic users are merely one
way to get anonymous UIDs. For our uses this isn't a fatal problem, but
I'll want to remember it for the future.</p>
<p>(The workaround would be to allocate an actual UID for this purpose,
set it in the systemd unit file, and then possibly duplicate all of
the additional things that <a href="https://www.freedesktop.org/software/systemd/man/systemd.exec.html#DynamicUser="><code>DynamicUser=yes</code></a> normally does that
increase security and isolation.)</p>
<p>(I realized <a href="https://mastodon.social/@cks/109950562210109751">what the answer was</a> due to <a href="https://mas.to/@srtcd424/109950531058623298">a
suggestion from Steven Reid</a>.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SystemdDynamicUserNFSAndGroups?showcomments#comments">One comment</a>.) </div>A gotcha with Systemd's <code>DynamicUser</code>, supplementary groups, and NFS (v3)2024-02-26T21:43:53Z2023-03-02T03:27:02Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/UbuntuCanonicalProductcks<div class="wikitext"><p>A while back I wrote that <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/UbuntuIsCanonical">from an outside perspective, Ubuntu is
Canonical's thing</a>, in that Canonical runs the
show despite having outside contributors. But in the wake of
<a href="https://mastodon.social/@cks/109944814025438841">wrestling with Canonical's advertisements in a stock 22.04 LTS
machine and losing</a>,
I want to amend that observation with an important additional one.
Ubuntu is not merely Canonical's, Ubuntu is a Canonical product.
Which is to say, <strong>Ubuntu exists to make money for Canonical</strong>.
Further, the current evidence suggests that Canonical feels it's
not making enough money for them; hence the steadily increasing
advertisements in Ubuntu, <a href="https://www.theregister.com/2023/02/23/ubuntu_remixes_drop_flatpak/">along with other moves</a>.</p>
<p>Broadly speaking, we've seen this show before, most recently with
Red Hat/IBM and CentOS, so we can make some guesses about where
this version will go. If Canonical is now making enough money from
Ubuntu, they might stop here, with annoying things in your message
of the day and so on. Otherwise, they will definitely take additional
steps to make more money, and they probably have a number of those.
Would Canonical reduce the free LTS support interval from five years
to two and a half years? Perhaps. And fundamentally Canonical is
unlikely to be that interested in the views of people who have
little or no chance of giving them money, people like <a href="https://support.cs.toronto.edu/">us</a>.</p>
<p>(A shortened free LTS support period wouldn't be the death knell of
personal use of Ubuntu LTS, since Canonical currently gives free
personal use licenses for their paid extra support.)</p>
<p>The good news is that the sky isn't falling today; there's no particular
need to move away from Ubuntu for current or future use. The other good
news is that because Ubuntu is so close to Debian, it will probably
be pretty easy to move to using Debian for future machines if the sky
does fall in. I'd expect almost all of <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/UbuntuOurInstallSystem">our local customizations to
the Ubuntu server installs</a> to drop right
in on top of Debian. The one area that will be different is the
installer itself, since <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/Ubuntu2004AutoinstFormat">Ubuntu uses a new installer since 20.04</a>.</p>
<p>(Energetic and concerned people might thus start building out a Debian
installer environment, or at least explore it to build up their
knowledge.)</p>
<p>Locally, <a href="https://support.cs.toronto.edu/">we</a>'re unlikely to
migrate away from Ubuntu LTS until we're forced to, because we
continue to like the predictable release schedule and five years
of support. However, I expect we'll be keeping in contact with
anyone else around here who's switched over to Debian, so we can
find out how they feel about the shift.</p>
<p>PS: The other thing that can happen with commercial products is
that they stop being made (or they get sold and drastically
transformed). On the sale front, I can imagine a future where Ubuntu
becomes, say, 'AWS Ubuntu' after Amazon buys out Canonical at a
suitably low price.</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/UbuntuCanonicalProduct?showcomments#comments">8 comments</a>.) </div>Ubuntu is a Canonical product2024-02-26T21:43:53Z2023-03-01T03:05:28Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/SystemdResolvedLLMNRDelaycks<div class="wikitext"><p><a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SystemdResolvedNotFor">I recently switched my work and home desktops over to systemd-resolved</a> from my previous tangle of an Unbound
configuration. Some time later, on my home desktop I accidentally
typo'd the name of a host I was trying to SSH to and discovered that
there was an appreciable pause and delay before SSH gave up with 'no
such host'. Some testing showed that I could reproduce this in other
programs for any non-existent name with no dots in it, and helpfully
it even reproduced with '<code>resolvectl query nosuchhost</code>'.</p>
<p><a href="https://man7.org/linux/man-pages/man1/resolvectl.1.html">Resolvectl</a>
itself doesn't have any sort of 'trace' or 'debug' option that will
explain what it's doing during name resolution, but you can gingerly
turn on debug logging for resolved with '<code>resolvectl log-level
debug</code>' (and then hastily turn it off afterward), and if you're
lucky not too many other name resolutions will be going on at the
same time. Eventually I was able to get lucky and track down <a href="https://mastodon.social/@cks/109921449751631936">what
was going on</a>,
which was that systemd-resolved was trying to resolve these names
by doing <a href="https://en.wikipedia.org/wiki/Link-Local_Multicast_Name_Resolution">Link-Local Multicast Name Resolution (LLMNR)</a> over
my home machine's DSL PPPoE link. Naturally there was nothing
responding to them, so resolved had to wait for a several-second
timeout before it could declare that there was no such name out
there. Turning LLMNR off on my PPPoE link made the delays go away,
so now nonexistent names fail more or less immediately.</p>
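<p>For the record, the commands involved look like this (the link name is an
example, and I believe a bare <code>resolvectl llmnr</code> change doesn't
persist across the link being reconfigured):</p>
<blockquote><pre style="white-space: pre-wrap;">
# turn debug logging on, watch it, then turn it back down
resolvectl log-level debug
journalctl -u systemd-resolved.service -f
resolvectl log-level info

# check the per-link LLMNR setting, then turn it off
resolvectl llmnr ppp0
resolvectl llmnr ppp0 no
</pre>
</blockquote>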
<p>It's possible that if you set up a DSL PPPoE link with NetworkManager,
NM will automatically tell resolved to not try LLMNR over the link.
I don't use NM here (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/NetworkManagerWhyConsidering">although I may need to switch someday</a>), so my PPPoE link still had LLMNR
enabled in resolved, although I'd turned off LLMNR for everything
else. On my work desktop I explicitly configured LLMNR off globally
in systemd-resolved, but I hadn't done that at home because it
seemed possible that maybe I'd want it someday (that's now changed).</p>
<p>(As a system administrator, the idea that something on the network
can just decide to start resolving names and get systems to listen
to its views is not exactly a good thought. But things designed for
home networks don't necessarily care about my opinions. On the other
hand, Wikipedia tells me that the big user of LLMNR is Microsoft,
and Microsoft is in process of phasing it out in favour of <a href="https://en.wikipedia.org/wiki/Multicast_DNS">mDNS</a>, which I already had
off.)</p>
<p>I'll probably want to keep my eyes open for this happening on any
machines I run systemd-resolved on. Although it doesn't seem to
happen on another machine that does have LLMNR resolution enabled
on its Ethernet link, so who knows; there may be other resolvectl
things I have set that affect this. Whatever it is, I'm just happy
that now my typos fail immediately.</p>
</div>
Systemd-resolved plus LLMNR can create delays in name non-resolution2024-02-26T21:43:53Z2023-02-26T03:15:40Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/Ext4DirectoriesMaximumSizecks<div class="wikitext"><p>Suppose, <a href="https://mastodon.social/@cks/109871594807466033">not entirely hypothetically</a>, that one day you
discover your (Linux) kernel logging messages like this:</p>
<blockquote><pre style="white-space: pre-wrap;">
EXT4-fs warning (device md1): ext4_dx_add_entry:2461: Directory (ino: 102236164) index full, reach max htree level :2
EXT4-fs warning (device md1): ext4_dx_add_entry:2465: Large directory feature is not enabled on this filesystem
</pre>
</blockquote>
<p>Congratulations, of a sort. You've managed to accumulate so many
files in the directory that it has filled up, in a logical sense.
Unfortunately, further attempts to create files will fail; in fact
they are already failing, because that's how you get the error
message. If you're lucky, your software is logging error messages
and you're noticing them. Also, since your directory got so large,
<a href="https://mastodon.social/@cks/109871594807466033">you may have an unpleasant surprise coming your way</a>.</p>
<p>('Large' here is relative. The directory that this happened to was
only about 575 MBytes as reported by 'ls -lh'; this is very large
for a directory, but not that large for a modern file. The filesystem
as a whole had tons of space free.)</p>
<p>You might reasonably ask how a directory can fill up when it's not
near even the 2 GByte 32-bit file size limit and the filesystem has
plenty of disk space and inodes left. What's going on is that <a href="https://utcc.utoronto.ca/~cks/space/blog/unix/UnixLinearDirectories">in
modern filesystems, (big) directories aren't just linear lists of
entries</a>; instead they're some sort
of tree structure. In ext4 these are called <a href="https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Hash_Tree_Directories">hash tree ('htree')
directories</a>.
When ext4 is adding an entry to such a htree, under some circumstances
it can need to 'split (the) index' (according to code comments) by
adding another level to the tree. However, ext4 has a maximum allowed
number of levels that can be in the tree. If ext4 needs to add a
level and can't because you're already at the maximum level, it
reports the kernel error we're seeing here.</p>
<p>(The direct error message is in <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/ext4/namei.c#n2461">fs/ext4/namei.c</a>,
in ext4_dx_add_entry(),
but to understand it you need to know something about ext4 htrees.)</p>
<p>I believe that this is a limit on the total number of entries you can
have in a directory (instead of, say, a limit on the number of entries
with some hash value or range of them). Some reading (<a href="https://www.phoronix.com/news/EXT4-Linux-4.13-Work">cf</a>) suggests that
the normal limit is about 10 million files in a single directory if
you don't have the ext4 'large directories' feature turned on.</p>
<p>(The documentation for large_dir in <a href="https://man7.org/linux/man-pages/man5/ext4.5.html">ext4(5)</a> doesn't give
specific numbers, and I believe it depends on your filesystem's
block size as well, <a href="https://lore.kernel.org/all/2111161753010.26337@stax.localdomain/t/">cf</a>; our
filesystem had 4K blocks, the default. Filesystems with large
directories can have a three-level htree, but filesystems without
the feature are limited to a two-level htree.)</p>
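<p>If you're wondering how close a directory is to the roughly ten million
entry mark, a quick count that avoids sorting (and so doesn't take forever)
is:</p>
<blockquote><pre style="white-space: pre-wrap;">
# count directory entries without sorting; subtract 2 for '.' and '..'
ls -f /some/big/directory | wc -l
</pre>
</blockquote>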
<p>If you're really running a system that should have that many files
in a single directory, you need to turn on the '<code>large_dir</code>' ext4
feature somehow (the <a href="https://man7.org/linux/man-pages/man8/tune2fs.8.html">tune2fs(8)</a> manual page
says this can be turned on without remaking the filesystem).
Otherwise, you need to figure out <a href="https://mastodon.social/@cks/109875312929124453">what's gone wrong with either
your system or your understanding of how it works</a>, then change
things so that it's not trying to put more than ten million files
in one directory. Even if you can turn on large directories, you'll
probably be happier fixing the underlying issue.</p>
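<p>A sketch of checking for and enabling the feature (the device name comes
from the kernel messages above; check your e2fsprogs and kernel versions
before doing this to a live filesystem):</p>
<blockquote><pre style="white-space: pre-wrap;">
# see whether large_dir is already in the feature list
tune2fs -l /dev/md1 | grep -i features
# turn it on in place
tune2fs -O large_dir /dev/md1
</pre>
</blockquote>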
</div>
Linux Ext4 directories have a maximum size (in entries)2024-02-26T21:43:53Z2023-02-17T03:58:58Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/LinuxIpFwmarkMaskscks<div class="wikitext"><p>The Linux kernel's general IP environment has a system for <a href="https://tldp.org/HOWTO/Adv-Routing-HOWTO/lartc.netfilter.html">marking
packets</a>
with what is generally called a <em>fwmark</em>, short for 'firewall mark'.
Fwmarks can be set through iptables, using the MARK target
(documented in <a href="https://man7.org/linux/man-pages/man8/iptables-extensions.8.html">iptables-extensions</a>), or
by facilities such as <a href="https://man7.org/linux/man-pages/man8/wg.8.html">WireGuard</a>, and can then be
used by firewall rules or by <a href="https://www.man7.org/linux/man-pages/man8/ip-rule.8.html">'ip rule'</a> policy
based routing. Fwmarks are how I solved <a href="https://utcc.utoronto.ca/~cks/space/blog/tech/IPRecursiveRoutingProblem">the general recursive
routing problem</a> when I set up
<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/WireGuardEarlyNotes">my WireGuard environment</a>. All of my uses
of fwmarks have been simply picking a value, setting it, and checking
for it. I was recently working with something that also uses fwmarks,
and I saw unusual things in '<code>ip rule</code>' and '<code>iptables</code>' output:
<blockquote><pre style="white-space: pre-wrap;">
# ip rule list
[...]
5210: from all fwmark 0x80000/0xff0000 lookup main
[...]
# iptables -nL
[...]
MARK all -- 0.0.0.0/0 0.0.0.0/0 MARK xset 0x40000/0xff0000
ACCEPT all -- 0.0.0.0/0 0.0.0.0/0 mark match 0x40000/0xff0000
[...]
</pre>
</blockquote>
<p>(<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/IptablesOutputAndInterfaces">This is actually not all of the rule because I didn't use -v</a>; the full MARK rule has an interface
limitation.)</p>
<p>The thing after the '/' in this new to me syntax is a <em>fwmark mask</em>,
which is sometimes called a <em>fwmask</em>. As suggested by its name, a
fwmark mask restricts what portion of the fwmark you're matching
or setting. A plain fwmark rule like 'from all fwmark 0x5151 ....'
matches any packet with a fwmark that is exactly 0x5151 and nothing
else; a rule with a fwmask matches if the portion of the fwmark selected
by the mask matches. So the fwmask here means that any fwmark of the
form '0x08xxxx' will match, regardless of lower order parts; a fwmark
of '0x085151' would match just as well as '0x080000'.
Similarly, setting a mark with a mask only affects the masked portion of
the fwmark, not all of it. If I just set a 0x5151 fwmark on a packet, I
overwrite any existing fwmark it had; if I use a mask, I could turn a
packet with fwmark 0x080000 into one with fwmark 0x085151.</p>
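<p>As a sketch of what using only part of the fwmark looks like (the values are
examples, chosen to match the output above):</p>
<blockquote><pre style="white-space: pre-wrap;">
# set only the 0xff0000 byte of the fwmark, leaving the rest alone
iptables -t mangle -A OUTPUT -j MARK --set-xmark 0x80000/0xff0000
# route on that byte only, ignoring whatever the low bits are
ip rule add fwmark 0x80000/0xff0000 lookup main priority 5210
</pre>
</blockquote>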
<p>As suggested by this description, fwmark masks are aimed at situations
where multiple pieces of your IP handling may all be trying to use
fwmarks for their own purposes. Since a packet can only have one
fwmark, you need some way to effectively combine multiple marks
together. That's what fwmasks let you do, by having each different
piece only use its own part of the fwmark. Coordinating who gets
which part of the fwmark is up to you, although some piece of
software may have just decided that it will take 0xff0000 and hope
that this doesn't collide with other things. This is actually not
a terrible approach and since there's no registry for fwmark usage
it's hard to do better.</p>
<p>If you're curious (I was), the 'MARK xset' is the way 'iptables -L'
prints what is created with '-j MARK --set-xmark 0x40000/0xff0000'.
Per <a href="https://man7.org/linux/man-pages/man8/iptables-extensions.8.html">iptables-extensions</a>, '--set-xmark value/mask' first zeroes the
bits covered by the mask and then XORs the value in, so here the whole
0xff0000 byte is cleared and 0x40000 is then set regardless of what the
packet's fwmark was before. Presumably the mask is there so that this
software's marking doesn't disturb fwmark bits used by other things.</p>
<p>(This handily illustrates that if you have a piece of software that uses
fwmarks, you can't necessarily tinker with its usage or change its rules
because you don't necessarily know what it's up to. This software reserves
a full byte for its fwmask (0xff), while only apparently using two bits of
it between 'ip rules' and iptables. Does it sometimes set and use the other
bits? Maybe. In this case <a href="https://github.com/tailscale/tailscale">the program in question is open source</a> so I can read the code if I need
to. People interested in this particular case are directed <a href="https://github.com/tailscale/tailscale/blob/main/wgengine/router/router_linux.go#L40">here</a>.)</p>
</div>
Learning about Linux fwmark masks2024-02-26T21:43:53Z2023-02-12T03:23:40Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/SystemdResolvedNotForcks<div class="wikitext"><p>Today, in a burst of enthusiasm, I converted my office and home
Fedora desktops to using systemd-resolved (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SystemdResolvedConsidering">as foreshadowed</a>). The result has taught me a lot, and
the general thing I've learned is that <a href="https://www.freedesktop.org/software/systemd/man/systemd-resolved.service.html">systemd-resolved</a>
is much more narrowly scoped than I thought it was. It is not a
general system for handling the potential complexity of name
resolution in a multi-faceted environment; instead, it's focused
on managing a world where network connections come and go, and each
network connection may come with a DNS server you're supposed to
use on it and some DNS names it's good for.</p>
<p>This leaves a lot of things that systemd-resolved is not for. Here
is my current list, as of Fedora's systemd 250 and 251.</p>
<ul><li>it's not for arbitrary mappings of names to DNS servers, independent
of network interfaces. If you simply use a random interface as
an attachment point to map a set of names to a DNS server, you
may get an unpleasant surprise, because under some circumstances
resolved will insist on trying to reach the DNS server through
that interface. This includes if the listed DNS server is
127.0.0.1:53. This makes sense in resolved's narrow scope of 'this
network comes with a DNS server (for some names)'.<p>
This means that if you have a local resolving DNS server, for
example <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/LibvirtMyNATStaticIPs">to implement some local names</a>,
the only safe place to specify it and its domains is in the
global DNS and Domains settings in <a href="https://www.freedesktop.org/software/systemd/man/resolved.conf.html">resolved.conf</a>.<p>
(Possibly you can also attach them to 'lo', the loopback
interface.)</li>
</ul>
<p>If you just want to add some local names, your best option is probably
to put them in /etc/hosts. Resolved will turn this into synthetic DNS
data for you without the contortions you need with, eg, <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/LibvirtMyNATStaticIPs">Unbound</a>. If you need to steer a bunch of names and zones
to a bunch of different servers, I believe you're probably going to
have to configure a local resolving DNS server (like Unbound) and then
point resolved at it in your <a href="https://www.freedesktop.org/software/systemd/man/resolved.conf.html">resolved.conf</a>.</p>
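<p>A minimal sketch of that global configuration (the server address and
domains are examples; the '~' prefix makes them routing-only domains):</p>
<blockquote><pre style="white-space: pre-wrap;">
# /etc/systemd/resolved.conf
[Resolve]
DNS=127.0.0.1
Domains=~sandbox ~internal.example.org
</pre>
</blockquote>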
<ul><li>it's not for having any particular order to what's in your
<a href="https://man7.org/linux/man-pages/man5/resolv.conf.5.html">resolv.conf</a>
<code>search</code> directive (or what order resolved uses in its native
DBus interface). Resolved can list domains there if you ask it
to, but there's no specific control to what order they're in.<p>
(If you have multiple search domains where the same name can
be in more than one, you may have strong opinions about what
order they should be tried in.)<p>
</li>
<li>it's not for specifically adding search domains separate from
how names are resolved. Resolved only adds search domains through
their association with some specific DNS server (and thus generally
with some specific interface that DNS server will be reached
over), and afterward that DNS server will (or may) be used for
them, no matter if you want to divert resolution to another DNS
server when it becomes available.</li>
</ul>
<p>If you want to specifically control your /etc/resolv.conf <code>search</code>
outside of resolved's constraints, you need to manually construct
a <a href="https://man7.org/linux/man-pages/man5/resolv.conf.5.html">resolv.conf</a>
to your tastes and then change your <a href="https://man7.org/linux/man-pages/man5/nsswitch.conf.5.html">nsswitch.conf</a> to not
use the <a href="https://www.freedesktop.org/software/systemd/man/nss-resolve.html">'resolve'</a>
mechanism (which goes straight to resolved and its unpredictable
search order).</p>
<ul><li>it's not for exactly duplicating traditional resolv.conf behavior
in name resolution. In particular, as more or less documented in
<a href="https://www.freedesktop.org/software/systemd/man/resolved.conf.html">resolved.conf</a>,
the search order is only used for names with no dots (a 'single-label'
hostname). In a normal DNS environment, if you have 'search
example.org' and you do a DNS lookup for 'fred.bar', it will
eventually look for 'fred.bar.example.org'. Not in resolved; the
<a href="https://www.freedesktop.org/software/systemd/man/resolved.conf.html#Domains=">Domains</a>
are specifically only for single-label hostnames.</li>
</ul>
<p>Using the resolv.conf 'search' for labels with dots has fallen out
of favour lately (especially as new top level domains have
proliferated), but in some environments it's historical practice
that people may be rather attached to. However, at least today if
you talk to resolved only via DNS, things still work, because the
libraries still have the traditional behavior. Only the resolved
DBus interface behaves this way, so now you may have two reasons
to remove <a href="https://www.freedesktop.org/software/systemd/man/nss-resolve.html">'resolve'</a> from the hosts: line in your <a href="https://man7.org/linux/man-pages/man5/nsswitch.conf.5.html">nsswitch.conf</a>.</p>
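<p>On a stock Fedora style setup, dropping resolved's DBus path from name
resolution is a one line change (a sketch; your distribution's default
hosts: line may differ):</p>
<blockquote><pre style="white-space: pre-wrap;">
# /etc/nsswitch.conf: go straight from files to plain DNS
hosts: files myhostname dns
</pre>
</blockquote>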
<ul><li>it's not for environments where networks are giving you DNS
resolution information you want or need to partially ignore (or
override). Resolved has no particularly strong idea of priorities
if two sources claim the same name and is not built to let you
say things like 'don't actually add this name to the search list
despite the network asking for it'. It's your job to fix that
sort of thing upstream in the program that's sending new settings
to resolved.<p>
(I think resolved generally assumes 'the last update wins', which
is perfectly reasonable for its narrow focus. If you bring up a
VPN, you probably want its opinions on DNS resolution to win.)</li>
</ul>
<p>Systemd-resolved's narrow focus is perfectly fine in general. Having
your DHCP client, your N VPN clients, and so on all fighting over your
resolv.conf was a real problem, and systemd-resolved has solved it. I
can't blame systemd-resolved for not being the all consuming center of
handling DNS resolution complexity that I wish it was, and so far I can
work around the things it's not for.</p>
<p>(And in some ways the resolved experience is better and easier to
manage than my previous set of Unbound manipulations; putting static
local names in /etc/hosts is simple and straightforward, for example.)</p>
<h3>Sidebar: My home search domain case as an illustration</h3>
<p>On <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/HomeMachine2018">my home desktop</a>, I almost always have a
Wireguard tunnel to work up, but I might not (for example, maybe
the remote end of the tunnel is unavailable). Because we have a
split horizon DNS setup, when the tunnel is up I need queries for
our subdomain to be sent to our internal resolvers. But when the
tunnel is down, some of our machines are still accessible and I
want to still be able to reach them by their short names, so I can
keep on typing 'ssh apps0' and 'ping apps0' instead of switching
to their long forms.</p>
<p>What this means is that I want a search domain set all of the
time, but I only want DNS queries diverted some of the time.
Systemd-resolved isn't for this, because it assumes I should
have the search domain bolted to DNS query redirection being
possible (ie, the tunnel is up).</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SystemdResolvedNotFor?showcomments#comments">5 comments</a>.) </div>Things that systemd-resolved is not for (as of systemd 251)2024-02-26T21:43:53Z2023-02-11T04:16:15Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/SystemdResolvedConsideringcks<div class="wikitext"><p>I've traditionally had generally lukewarm views on <a href="https://www.freedesktop.org/software/systemd/man/systemd-resolved.service.html">systemd-resolved</a>
for my own desktops and <a href="https://support.cs.toronto.edu/">our</a>
Ubuntu servers. Our Ubuntu servers directly use our local resolving
DNS servers and <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SystemdResolvedNotes">I don't use it on my normal desktops</a>, where I have a somewhat complex local
resolver that uses Unbound (I do things like <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/LibvirtMyNATStaticIPs">have local names</a> and <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/UnboundDNSforVPN">divert resolution for some domains</a>). But while all of this works today, I feel that
in the future my life may be made easier by switching to systemd-resolved,
much like <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/NetworkManagerWhyConsidering">I may want to start using a bit of NetworkManager someday</a>.</p>
<p>What makes systemd-resolved interesting, important, and perhaps
ultimately inescapable is that it has become the de facto way to
mediate among different parties who all want to influence your
DNS resolution. Without systemd-resolved, either you get to tell
everyone to keep their hands off /etc/resolv.conf and you have to
sort it out by hand, or you get systems overwriting each other's
work. There are <a href="https://mastodon.social/@cks/109831411715536691">some things I may need to use in the future</a> that very much
want to work through systemd-resolved if at all possible, and
things like libvirt can apparently be hooked up to resolved to
automatically resolve your virtual machine names (<a href="https://github.com/tprasadtp/libvirt-systemd-resolved">a project</a>, <a href="https://github.com/systemd/systemd/issues/18761">an issue with
approaches</a>,
<a href="https://www.stewarts.org.uk/post/libvirtdnsmasqresolved/">a blog post</a>).</p>
<p>My current Unbound resolver setup has a number of carefully maintained
hacks to divert queries for various DNS zones off to places like
our internal DNS resolvers (so that my desktop resolves our local
names properly, including split horizon ones). It would be nice to
use systemd-resolved to eliminate these, but unfortunately <a href="https://www.freedesktop.org/software/systemd/man/resolvectl.html">resolvectl</a>
has a frustrating limit; while it can selectively divert queries
to alternate DNS servers, this diversion must be tied to an interface.
As far as I know, you can't tell resolved 'send all .sandbox queries
to <our local DNS server>'; instead, you have to say 'due to link
X, send all ...'. The 'resolvectl dns' and 'domain' commands that
configure this all require a link.</p>
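<p>Concretely, the per-link commands look like this sketch (the link name,
server address, and domain are examples):</p>
<blockquote><pre style="white-space: pre-wrap;">
resolvectl dns vlan10 10.10.0.53
resolvectl domain vlan10 '~sandbox'
</pre>
</blockquote>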
<p>In a world where you're configuring these things because <a href="https://github.com/juanfont/headscale">your
overlay mesh networking</a>
came up and now you need to make DNS resolution understand some
special names, this makes perfect sense; the special resolution is
tied to the link and will go away when it does. But if you have
several sets of diversions to several different DNS servers that
are always there, you have to find interfaces to attach the extra
ones to. At work I do have these extra interfaces (in the form of
VLANs), but it feels ugly; the DNS diversions have nothing to do
with the interfaces, I just need something to pacify systemd-resolved.
I don't particularly blame resolved for this, because I'm doing
something rather outside of its model.</p>
<p>(The <a href="https://wiki.archlinux.org/title/systemd-resolved">Arch Wiki page on systemd-resolved</a> is quite worth
reading, as usual.)</p>
<p>PS: As far as I know, you can only attach one set of DNS servers
and one set of domain diversions to a given interface. Otherwise I
could attach several sets of them to my primary interface, ideally
in its systemd-networkd configuration.</p>
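<p>(For illustration, the static version of that in a .network file
would be something like the following, with invented values; as
mentioned, you only get one such set per link:)</p>
<blockquote><pre style="white-space: pre-wrap;">
[Network]
DNS=192.0.2.53
Domains=~sandbox ~internal.example.com
</pre>
</blockquote>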
</div>
I'm considering giving in to the systemd-resolved wave2024-02-26T21:43:53Z2023-02-10T04:30:49Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/TransparentHugepagesBadLuckcks<div class="wikitext"><p>Normally, pages of virtual memory are a relatively small size, such
as 4 Kbytes. <a href="https://wiki.debian.org/Hugepages">Hugepages</a> (<a href="https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt">also</a>) are
a CPU and Linux kernel feature which allows programs to selectively
have much larger pages, which generally improves their performance.
<a href="https://www.kernel.org/doc/html/latest/admin-guide/mm/transhuge.html">Transparent hugepage support</a> is
an additional Linux kernel feature where programs can be more or
less transparently set up with hugepages if it looks like this will
be useful for them. This sounds good but <a href="https://mastodon.social/@cks/109752064557651589">generally I haven't had
the best of luck with them</a>:</p>
<blockquote><p>It appears to have been '0' days since Linux kernel (transparent)
hugepages have dragged one of my systems into the mud for mysterious
reasons. Is my memory too fragmented? Who knows, all I can really do
is turn hugepages off.</p>
<p>(Yes they have some performance benefit when they work, but they're
having a major performance issue now.)</p>
</blockquote>
<p>This time around, the symptom was that Go's self-tests were timing
out while I was trying to build it (or in some runs, the build
itself would stall). While this was going on, top said that the
'<code>khugepaged</code>' kernel daemon process was constantly running (on
a single CPU).</p>
<p>(I'm fairly sure I've seen this sort of 'khugepaged at 100%
and things stalling' behavior before, partly because when I
saw top I immediately assumed THP were the problem, but I can't
remember details.)</p>
<p>One of the issues that can cause problems with hugepages is that
to have huge pages, you need huge areas of contiguous RAM. These
aren't always available, and not having them is <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/DecodingPageAllocFailures">one of the reasons
for kernel page allocation failures</a>.
To get these areas of contiguous RAM, the modern Linux kernel uses
(potentially) <a href="https://lwn.net/Articles/817905/">proactive compaction</a>,
which is normally visible as the 'kcompactd0' kernel daemon. Once
you have aligned contiguous RAM that's suitable for use as huge
pages, the kernel needs to turn runs of ordinary sized pages into
hugepages. This is the job of khugepaged; <a href="https://www.kernel.org/doc/html/latest/admin-guide/mm/transhuge.html">to quote</a>:</p>
<blockquote><p>Unless THP is completely disabled, there is [a] khugepaged daemon that
scans memory and collapses sequences of basic pages into huge pages.</p>
</blockquote>
<p>In the normal default kernel settings, this only happens for processes
that use the <a href="https://man7.org/linux/man-pages/man2/madvise.2.html"><code>madvise(2)</code></a> system call
to tell the kernel that a mmap()'d area of theirs is suitable for
this. Go can do this under some circumstances, although I'm not
sure what they are exactly (the direct code that does it is deep
inside the Go runtime).</p>
<p>If you look over the Internet, there are plenty of reports of
khugepaged using all of a CPU, often with responsiveness problems
to go along with it. Sometimes this stops if people quit and restart
some application; at other times, people resort to disabling
transparent hugepages or rebooting their systems. No one seems to
have identified a cause, or figured out what's going on to cause
the khugepaged CPU usage or system slowness (presumably the two
are related, perhaps through lock contention or memory thrashing).</p>
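<p>(If you want to see roughly how much THP activity your system has had
since boot, the kernel exposes counters in /proc/vmstat, and khugepaged's
own tunables live in sysfs:)</p>
<blockquote><pre style="white-space: pre-wrap;">
grep ^thp_ /proc/vmstat
ls /sys/kernel/mm/transparent_hugepage/khugepaged/
</pre>
</blockquote>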
<p>Disabling THP is done through sysfs:</p>
<blockquote><pre style="white-space: pre-wrap;">
echo never >/sys/kernel/mm/transparent_hugepage/enabled
</pre>
</blockquote>
<p>The next time around I may try to limit THP's 'defragmentation'
efforts:</p>
<blockquote><pre style="white-space: pre-wrap;">
echo never >/sys/kernel/mm/transparent_hugepage/defrag
</pre>
</blockquote>
<p>(These days the normal setting for both of these is 'madvise'.)</p>
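<p>(You can check what a machine is currently set to by reading these
files; the active value is the one in brackets. The exact list of values
varies a bit between kernel versions, but typical output looks like:)</p>
<blockquote><pre style="white-space: pre-wrap;">
$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
$ cat /sys/kernel/mm/transparent_hugepage/defrag
always defer defer+madvise [madvise] never
</pre>
</blockquote>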
<p>If I'm understanding <a href="https://www.kernel.org/doc/html/latest/admin-guide/mm/transhuge.html">the documentation</a>
correctly, this will only use a hugepage if one is available
at the time that the program calls madvise(); it won't try to get one
later and swap it in.</p>
<p>(Looking at the documentation makes me wonder if Go and khugepaged
were both fighting back and forth trying to obtain hugepages when
Go made a madvise() call to enable hugepages for some regions.)</p>
<p>I believe I've only really noticed this behavior on my desktops,
which are unusual in that I use ZFS on Linux on them. ZFS has its
own memory handling (the 'ARC'), and historically has had some odd
and uncomfortable interaction with the normal Linux kernel memory
system. Still, it doesn't seem to be just me who has khugepaged
problems.</p>
<p>(I don't think we've seen these issues on <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSFileserverSetupIII">our ZFS fileservers</a>, but then we don't run anything else on the
fileservers. They sit there handling NFS in the kernel and that's
about it. Well, <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/LocalVarMailImprovement">there is one exception these days in our IMAP
server</a>, but I'm not sure it
runs anything that uses madvise() to try to use hugepages.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/TransparentHugepagesBadLuck?showcomments#comments">One comment</a>.) </div>I've had bad luck with transparent hugepages on my Linux machines2024-02-26T21:43:53Z2023-02-01T04:04:27Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/ZFSOnLinuxTrimNotescks<div class="wikitext"><p>One of the things you can do to keep your SSDs performing well over
time is to <a href="https://utcc.utoronto.ca/~cks/space/blog/tech/SSDsAndBlockDiscardTrim">explicitly discard ('TRIM') disk blocks that are
currently unused</a>. <a href="https://zfsonlinux.org/">ZFS on Linux</a> has had support for <a href="https://en.wikipedia.org/wiki/Trim_(computing)">TRIM commands</a> for some time; the
development version got it in 2019, and it first appeared in <a href="https://github.com/openzfs/zfs/releases/tag/zfs-0.8.0">ZoL
0.8.0</a>. When
it was new, <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSNoTrimForMeYet">I was a bit nervous about using it immediately</a>, but it's been years since then and recently I
did some experimentation with it. Well, with one version of ZoL's
TRIM support, the manual one.</p>
<p>ZFS on Linux has two ways to periodically TRIM your pool(s), the
automatic way and the manual way. The automatic way is to set
'<a href="https://openzfs.github.io/openzfs-docs/man/7/zpoolprops.7.html#autotrim"><code>autotrim=on</code></a>'
for selected pools; this comes with various cautions that are mostly
covered in <a href="https://openzfs.github.io/openzfs-docs/man/7/zpoolprops.7.html">zpoolprops(7)</a>.
The manual way is to periodically run '<a href="https://openzfs.github.io/openzfs-docs/man/8/zpool-trim.8.html"><code>zpool trim</code></a>'
with suitable arguments. One significant advantage of explicitly
running 'zpool trim' is that you have a lot more control over the
process, and in particular <strong>manual trims let you restrict trimming
to a single device</strong>, instead of having trimming happen on all of
them at once. If you trim your pools for only one device at a time (or
only one device per vdev) and then scrub your pool afterward, you're
pretty well protected against something going wrong in the TRIM
process and the wrong disk blocks getting erased.</p>
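<p>(A sketch of this cautious per-device approach, with the pool and
device names made up:)</p>
<blockquote><pre style="white-space: pre-wrap;">
# trim only one device of the pool, then watch its progress
zpool trim tank sdb1
zpool status -t tank
# once the trim has finished, scrub to verify that nothing was damaged
zpool scrub tank
</pre>
</blockquote>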
<p>(My current experiments with 'zpool trim' are on Ubuntu 22.04 on
some test pools, and scrubs say that nothing has gotten damaged
in them afterward.)</p>
<p>The manual 'zpool trim' supports <a href="https://openzfs.github.io/openzfs-docs/man/8/zpool-trim.8.html#r">a -r command line option</a> that
controls how fast ZFS asks the disk to TRIM blocks. If you set this
to, for example, 100 MBytes (per second), ZoL will only ask your
SSD (or SSDs) to TRIM 100 MBytes of blocks every second. Sending
TRIM commands to the SSD doesn't use read or write bandwidth as
such, but it does ask the SSD to do things and that may affect other
things that the SSD is doing. I wouldn't be surprised if some SSDs
can TRIM at basically arbitrary rates with little to no impact on
IO, while other SSDs get much more visibly distracted. As far as I
can tell from some tests, this rate option does work (at least as
far as ZFS IO statistics report).</p>
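<p>(For example, a rate-limited manual trim might look like the
following, again with made up names. You can suspend or cancel an
in-progress trim with -s and -c respectively if it turns out to be too
disruptive:)</p>
<blockquote><pre style="white-space: pre-wrap;">
# ask the SSD to discard at most 100 MBytes of blocks a second
zpool trim -r 100M tank sdb1
# suspend or cancel it if necessary
zpool trim -s tank sdb1
zpool trim -c tank sdb1
</pre>
</blockquote>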
<p>I'm not sure how much information '<a href="https://openzfs.github.io/openzfs-docs/man/8/zpool-iostat.8.html">zpool iostat</a>' will
report about ongoing TRIMs (either automatic or manual), but various
information is available in the underlying statistics exported from
the kernel. <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSOnLinuxGettingPoolIostats">Your options for getting at this detailed information
aren't great</a>. At the moment, the
available IO statistics appear to be a per-vdev 'bytes trimmed'
number that counts up during TRIM operations (in <a href="https://github.com/openzfs/zfs/blob/master/include/sys/fs/zfs.h#L1151">sys/fs/zfs.h's
vdev_stat structure</a>),
which only appears to have non-zero values for per-disk IO statistics,
and histograms of the 'IO size' of TRIM operations (but <a href="https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSIndividualVsAggregatedIOs">'individual'
IO is not necessarily what you think it is</a>, and there are some
comments that individual TRIM 'IOs' of larger than 16 MBytes will
be counted as 16 MBytes in the histograms, as that's their largest
bucket). As with the 'rate' of trimming, all of these numbers are
really counting the amount of data that ZFS has told the SSD or
SSDs to throw away.</p>
<p>(All of these TRIM IO statistics are exposed by <a href="https://github.com/siebenmann/zfs_exporter">my version of the
ZFS exporter for Prometheus</a>.)</p>
<p>I'm not sure you can do very much with these IO statistics except
use them to tell when your TRIMs ran and on what vdev, and for that
there are other IO 'statistics' that are exposed by ZFS on Linux,
although probably 'zpool iostat' won't tell you about them.</p>
<p>(The 'vdev trim state' is the <a href="https://github.com/openzfs/zfs/blob/master/include/sys/fs/zfs.h#L1329">vdev_trim_state_t enum in
sys/fs/zfs.h</a>,
where 1 means a trim is active, 2 is it's been canceled, 3 is it's
been suspended, and 4 is that it has completed. A zero means that a
trim hasn't been done on this disk.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSOnLinuxTrimNotes?showcomments#comments">One comment</a>.) </div>Some notes on using using TRIM on SSDs with ZFS on Linux2024-02-26T21:43:53Z2023-01-27T03:59:00Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/SoftwareRaidDiskCountEffectscks<div class="wikitext"><p>Linux software RAID mirrors have a count of the number of active
disks that are in the array; this is what is set or changed by
<a href="https://man7.org/linux/man-pages/man8/mdadm.8.html">mdadm</a>'s
--raid-devices argument. Your <a href="https://man7.org/linux/man-pages/man5/mdadm.conf.5.html">mdadm.conf</a> may also
list how many active disks an array is supposed to have, in the
'num-devices=' setting (aka a 'tag') for a particular array. The
<a href="https://man7.org/linux/man-pages/man5/mdadm.conf.5.html">mdadm.conf</a> manual page dryly describes this as "[a]s with
<code>level=</code> this is mainly for compatibility with the output of <code>mdadm
--examine --scan</code>", which <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/RaidGrowthGotcha">historically</a> and
currently is not quite accurate, at least when booting (perhaps
only under systemd).</p>
<p>I will give my current conclusion up front; <strong>if you're currently
specifying <code>num-devices=</code> for any software RAID mirrors in your
mdadm.conf, you should probably take the setting out</strong>. I can't
absolutely guarantee that this is either harmless or an improvement,
but the odds seem good.</p>
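<p>(Concretely this means going from the first sort of ARRAY line here
to the second, with the UUID obviously invented, and then regenerating
your initramfs so that the boot-time copy of mdadm.conf matches:)</p>
<blockquote><pre style="white-space: pre-wrap;">
ARRAY /dev/md0 metadata=1.2 num-devices=2 UUID=f5a6bb74:9ff1a7a9:ba4e2e43:8b3c2f9a
ARRAY /dev/md0 metadata=1.2 UUID=f5a6bb74:9ff1a7a9:ba4e2e43:8b3c2f9a
</pre>
</blockquote>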
<p>Updating the device count in software RAID mirrors is required when
you add devices, for example to add your new disks alongside your
old disks, and recommended when you remove disks (removing your old
disks because you've decided that your new disks are fine). If you
don't increase the number of devices when you add extra disks, what
you're really doing is adding spares. If you don't decrease the
number of devices on removal, mdadm will send you error reports and
generally complain that there are devices missing. So let's assume
that your software RAID mirror has a correct count.</p>
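<p>(For reference, the usual sequence is something like the following
sketch, with the device names invented; the parts that matter for this
entry are --raid-devices and --zero-superblock:)</p>
<blockquote><pre style="white-space: pre-wrap;">
# add the new disks and grow the mirror from two to four active devices
mdadm /dev/md0 --add /dev/sdc1 /dev/sdd1
mdadm --grow /dev/md0 --raid-devices=4
# later, retire the old disks and shrink back down
mdadm /dev/md0 --fail /dev/sda1 --remove /dev/sda1
mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1
mdadm --grow /dev/md0 --raid-devices=2
# zero the old superblocks so the disks can't be picked up again at boot
mdadm --zero-superblock /dev/sda1 /dev/sdb1
</pre>
</blockquote>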
<p>Let's suppose that you have num-devices set in mdadm.conf and that
your root filesystem's mdadm.conf is the same as the version in
your initramfs (an important qualification because it's the version
in the initramfs that counts during boot). Then there are several
cases you may run into. The happy case is that the mdadm.conf disk
count matches the actual array's disk count and all disks are visible
and included in the live array. Congratulations, you're booting
fine.</p>
<p>If the mdadm.conf num-devices is higher than the number of devices
claimed by the software RAID array, and the extra disks you removed
are either physically removed or <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SoftwareRaidRemovingDiskGotcha">have had their RAID superblocks
zeroed</a>, then your boot will probably
stall and likely error out, or at least that's my recent experience.
This is arguably reasonable, especially if num-devices is a genuinely
optional parameter in mdadm.conf; you told the boot process this array
should have four devices but now it has two, so something is wrong.</p>
<p>If the mdadm.conf num-devices is higher than the number of devices
claimed by the array but the extra disks you removed are present
and didn't have their RAID superblock zeroed, <a href="https://mastodon.social/@cks/109739661990640799">havoc may ensue</a>. It seems quite
likely that your system will assemble the wrong disks into the
software RAID array; perhaps it prefers the first disk you failed
out and removed, because it still claims to be part of a RAID array
that has the same number of disks as mdadm.conf says it should have.</p>
<p>(The RAID superblocks on devices have both a timestamp and an event
count, so mdadm could in theory pick the superblocks with the highest
event count and timestamp, especially if it can assemble an actual
mirror out of them instead of only having one device out of four. But
mdadm is what it is.)</p>
<p>If the mdadm.conf num-devices is lower than the number of devices
claimed by the software RAID array and all of the disks are present
and in sync with each other, then your software RAID array will
assemble without problems during boot. This seems to make num-devices
a minimum for the number of disks your boot environment expects to
see before it declares the RAID array healthy; if you provide extra
disks, that's fine with mdadm. However, if you've removed some disks
from the array and not zeroed their superblocks, <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SoftwareRaidRemovingDiskGotcha">in the past I've
had the system assemble the RAID array with the wrong disk</a> even though the RAID superblocks
on the other disks agreed with mdadm.conf's num-devices. That may
not happen today.</p>
<p>A modern system with all the disks in sync will boot with an
mdadm.conf that doesn't have any num-devices settings. This is in
fact the way that our Ubuntu 18.04, 20.04, and 22.04 servers set
up their mdadm.conf for the root software RAID array, and it works
for me on Fedora 36 for some recently created software RAID arrays
(that aren't my root RAID array). However, I don't know how such a
system reacts when you remove a disk from the RAID array but don't
zero the disk's RAID superblock. On the whole I suspect that it
won't be worse than what happens when num-devices is set.</p>
</div>
Linux software RAID mirrors, booting, mdadm.conf, and disk counts for non-fun2024-02-26T21:43:53Z2023-01-25T03:37:33Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/Ubuntu2204ServerPhasedUpdatescks<div class="wikitext"><p>I was working on getting one of our 22.04 LTS servers up to date,
even for packages we normally hold, when <a href="https://mastodon.social/@cks/109677500664167117">I hit a mystery and
posted about it on the Fediverse</a>:</p>
<blockquote><p>Why does apt on this 22.04 Ubuntu machine want to hold back a bunch of
package updates even with '--with-new-pkgs --ignore-hold'? Who knows,
it won't tell me why it doesn't like any or all of:</p>
<p>open-vm-tools openssh-client openssh-server openssh-sftp-server
osinfo-db python3-software-properties software-properties-common</p>
<p>(Apt is not my favorite package manager for many reasons, this among
them.)</p>
</blockquote>
<p><a href="https://masto.ai/@snk/109677544506984321">Steve suggested that it was Ubuntu's "Phased Update" system</a>, which is what it turned
out to be. This set me off to do some investigations, and it turns
out that phased (apt) updates explain some other anomalies we've
seen with package updates on our Ubuntu 22.04 machines.</p>
<p>The basic idea of phased updates is explained in <a href="https://wiki.ubuntu.com/StableReleaseUpdates#Phasing">the "Phasing"
section of Ubuntu's page on Stable Release Updates (SRUs)</a>; it's a
progressive rollout of the package to more and more of the system
base. Ubuntu introduced phased updates in 2013 (<a href="https://lwn.net/Articles/563966/">cf</a>) but initially they weren't
directly supported by apt, only by the desktop upgrade programs.
<a href="https://discourse.ubuntu.com/t/phased-updates-in-apt-in-21-04/20345">Ubuntu 21.04 added apt support for phased updates</a> and
Ubuntu 22.04 LTS is thus the first LTS version to subject servers
to phased updates. More explanations of phased updates are in <a href="https://askubuntu.com/a/1431941">this
askubuntu answer</a>, which includes
one way to work around them.</p>
<p>(Note that as far as I know and have seen, security updates are not
released as phased updates; if it's a security update, everyone
gets it right away. Phased updates are only used for regular,
non-security updates.)</p>
<p>Unfortunately apt (or apt-get) won't tell you if an update is being
held back because of phasing. This user-hostile apt issue is tracked
in <a href="https://bugs.launchpad.net/ubuntu/+source/apt/+bug/1988819">Ubuntu bug #1988819</a> and
you should add yourself as someone it affects if this is relevant
to you. Ubuntu has a web page on <a href="https://people.canonical.com/~ubuntu-archive/phased-updates.html">what updates are currently in
phased release</a>,
although packages are removed from this page once they reach 100%.
Having reached 100%, such a package is no longer a phased update,
which will become relevant soon. If you can't see a reason for a
package to be held back, it's probably a phased update but you can
check <a href="https://people.canonical.com/~ubuntu-archive/phased-updates.html">the page</a>
to be sure.</p>
<p>(As covered in <a href="https://wiki.ubuntu.com/StableReleaseUpdates#Phasing">the "Phasing" section</a>, packages
normally move forward through the phased rollout every six hours,
so you can have a package held back on some server in the morning
and then be not-held in the afternoon. This is great fun for
troubleshooting why a given server didn't get a particular update.)</p>
<p>Your place in a phased update is randomized across both different
servers and different packages. If you have a fleet of servers,
they will get each phased update at different times, and the order
won't be consistent from package to package. This explains an anomaly
we've been seeing in our package updates for some time, where
different 22.04 servers would get updates at different times without
any consistent pattern.</p>
<p>The phased update related apt settings available and some of the
technical details are mostly explained in <a href="https://askubuntu.com/a/1246984">this askubuntu answer</a>. If you want to opt out of phased
updates entirely, you have two options; you can have your servers
install all phased updates right away (basically putting you at the
0% start line), or you can skip all phased updates and only install
such packages when they reach 100% and stop being considered phased
updates at all. Unfortunately, as of 22.04 there's no explicit
option to set your servers to have a particular order within all
updates (so that you can have, for example, a 'canary' server that
always installs updates at 0% or 10%, ahead of the rest of the
fleet).</p>
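<p>(In apt configuration terms these two opt-outs are a one-line
apt.conf.d snippet each; the file name here is just a convention I made
up:)</p>
<blockquote><pre style="white-space: pre-wrap;">
// /etc/apt/apt.conf.d/99phased-updates
// either install phased updates immediately:
APT::Get::Always-Include-Phased-Updates "true";
// or hold them all back until they stop being phased:
APT::Get::Never-Include-Phased-Updates "true";
</pre>
</blockquote>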
<p>For any given package update, machines are randomized based on the
contents of <a href="https://www.freedesktop.org/software/systemd/man/machine-id.html"><code>/etc/machine-id</code></a>, which
can be overridden for apt by setting <code>APT::Machine-ID</code> to a 32 hex
digit value of your choice (the current version of apt appears to
only use the machine ID for phased updates). If you set this to
the same value across your fleet, your fleet will update in sync
(although not at a predictable point in the phase process); you can
also set subsets of your fleet to different shared values so that
the groups will update at different times. The assignment of a
particular machine to a point in the phased rollout is done through
a relatively straightforward approach; the package name, version,
and machine ID are all combined into a seed for a random number
generator, and then the random number generator is used to produce
a 0 to 100 value, which is your position in the phased rollout. The
inclusion of the package name and version means that a given machine
ID will be at different positions in the phased update for different
packages. All of this turns out to be officially documented in the
"Phased Updates" section of <a href="https://manpages.ubuntu.com/manpages/jammy/man5/apt_preferences.5.html">apt_preferences(5)</a>,
although not in much detail.</p>
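<p>(Synchronizing a fleet is then another small apt.conf.d snippet on
every machine, with whatever 32 hex digit value you care to pick:)</p>
<blockquote><pre style="white-space: pre-wrap;">
// /etc/apt/apt.conf.d/99phase-machine-id
APT::Machine-ID "8ec6bbd321a94b23a9451da7b2d2ef38";
</pre>
</blockquote>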
<p>(There is a somewhat different mechanism for desktop updates, covered
in <a href="https://askubuntu.com/a/1246984">the previously mentioned askubuntu answer</a>.)</p>
<p>As far as I can see from looking at <a href="https://salsa.debian.org/apt-team/apt">the current apt source code</a>, apt doesn't log anything
at any verbosity if it holds a package back because the package is
a phased update and your machine doesn't qualify for it yet. The
fact that a package was a phased update the last time apt looked
may possibly be recorded in /var/log/apt/eipp.log.xz, but documentation
on this file is sparse.</p>
<p>Now that I've looked at all of this and read about <code>APT::Machine-ID</code>,
we'll probably set it to a single value across all of our fleet
because we find different machines getting updates at different
times to be confusing and annoying (and it potentially complicates
troubleshooting problems that are reported to us, since we normally
assume that all 22.04 machines have the same version of things like
OpenSSH). If we could directly control the position within a phased
rollout we'd probably set up some canary machines, but since we
can't I don't think there's a strong reason to have more than one
machine-id group of machines.</p>
<p>(We could set some very important machines to only get updates when
packages reach 100% and stop being phased updates, but Ubuntu has
a good record of not blowing things up with eg OpenSSH updates.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/Ubuntu2204ServerPhasedUpdates?showcomments#comments">7 comments</a>.) </div>Ubuntu 22.04 LTS servers and phased apt updates2024-02-26T21:43:53Z2023-01-14T03:56:18Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/KernelBindBugIn6016cks<div class="wikitext"><p>There's a common saying and rule of thumb in programming (possibly
originating in the C world) that it's never a compiler bug, it's
going to be a bug in your code even if it looks crazy or impossible.
Like all aphorisms it's not completely true, because compilers have
bugs, but it's almost always the case that you haven't actually
found a compiler bug and it's something else. You can say a similar
thing about weird system issues (not) being the fault of a kernel
bug, and so that's what I thought when the development version of
Go started failing a self test when I built it on my Fedora 37
office desktop:</p>
<blockquote><pre style="white-space: pre-wrap;">
--- FAIL: TestTCPListener (0.00s)
listen_test.go:72: skipping tcp test
listen_test.go:72: skipping tcp 0.0.0.0 test
listen_test.go:72: skipping tcp ::ffff:0.0.0.0 test
listen_test.go:72: skipping tcp :: test
listen_test.go:90: tcp 127.0.0.1 should fail
</pre>
</blockquote>
<p>Where <a href="https://go.googlesource.com/go/+/refs/heads/master/src/net/listen_test.go#61">this test in net/listen_test.go</a>
is failing is when it attempts to listen twice on the same localhost
IPv4 address and port. It first binds to and listens on 127.0.0.1
port 0 (that port causes the kernel to assign a free ephemeral port
for it), extracts the actual assigned port, and then attempts to
bind to 127.0.0.1 on the same port a second time.</p>
<p>(The Go networking API bundles the binding and listening together
in one Listen() API, but the socket API itself has them as two
operations; you <code>bind()</code> a socket to some address, then <code>listen()</code>
on it.)</p>
<p>This obviously should fail, except the development version of Go
was claiming that it didn't. At first I thought this had to be a
Go change, but soon I found that even older versions of Go didn't
pass this test (when I knew they had when I'd built them), and also
that this test passed on my Fedora 36 home desktop. Which I noticed
was running Fedora's 6.0.15 kernel, while my office machine was
running 6.0.16. That certainly looked like a kernel bug, and indeed
I was able to reproduce it in Python (which is when I eventually
realized this was an issue with bind() instead of listen()).</p>
<p>The Python version allows me to see more about what's going on:</p>
<blockquote><pre style="white-space: pre-wrap;">
>>> from socket import *
>>> s1 = socket(AF_INET, SOCK_STREAM)
>>> s2 = socket(AF_INET, SOCK_STREAM)
>>> s1.bind(('127.0.0.1', 0))
>>> s2.bind(('127.0.0.1', s1.getsockname()[1]))
>>> s1.getsockname()
('127.0.0.1', 54785)
>>> s2.getsockname()
('0.0.0.0', 0)
</pre>
</blockquote>
<p>Rather than binding the second socket or failing with an error, the
kernel has effectively left it unbound (the s2.getsockname() result
here is the same as when the socket is newly created (<a href="https://utcc.utoronto.ca/~cks/space/blog/unix/BindingOutgoingSockets">'0.0.0.0'
is usually known as INADDR_ANY</a>)).
Replacing <code>SOCK_STREAM</code> with <code>SOCK_DGRAM</code> causes things to fail
with 'address already in use' (errno 98), so this issue seems
specific to TCP.</p>
<p>This kernel error is in Fedora 37's 6.0.16 and 6.0.18, but is gone
in the Rawhide 6.2.0-rc2 and isn't present in the Fedora 6.0.15.
I don't know if it's in any version of 6.1, but I'll probably find
out soon when Fedora updates to it. Interested parties can try it
for themselves, and it's been filed as <a href="https://bugzilla.redhat.com/show_bug.cgi?id=2159802">Fedora bug #2159802</a>.</p>
<p>(This elaborates on <a href="https://mastodon.social/@cks/109666477877960733">a Fediverse thread</a>. I looked at
<a href="https://cdn.kernel.org/pub/linux/kernel/v6.x/ChangeLog-6.0.16">the 6.0.16 changelog</a>,
but nothing jumped out at me.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/KernelBindBugIn6016?showcomments#comments">2 comments</a>.) </div>Sometimes it actually is a kernel bug: bind() in Linux 6.0.162024-02-26T21:43:53Z2023-01-12T03:59:37Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/ZFSOurSparesSystemVIcks<div class="wikitext"><p>In <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSOurSparesSystemV">my entry on our Linux ZFS spares handling system</a>, I wrote about how we used spares in a two step
preference order, first on another disk connected the same way (SATA
or SAS) and then on any disk necessary. In a comment on the entry,
Simon asked:</p>
<blockquote><p>Doesn't this mean you could end up mirroring to the same (real)
disks? So the redundancy you can normally expect is severely
reduced. Mirroring to the same disk only helps with localized
read/write errors (like a bad sector), but not things like a failed
disk.</p>
</blockquote>
<p>This is a question with a subtle answer, which starts with how we
use disks and what that implies for available spares. We always use
disks in mirrored pairs, and the pairs are fixed; every partition
of every disk has a specific partner. The first partition of the
first SAS-connected disk is always mirrored with the first partition
of the first SATA-connected disk, and so on. This means that in
normal operation (when a disk hasn't failed), all spares also come
in pairs; if the last partition of the first 'SAS' disk isn't used,
neither will be the last partition of the first 'SATA' disk, so
both are available as spares. In addition, we spread our partition
usage across all disks, using the first partition on all pairs
before we start using the second partition on any of them, and so
on.</p>
<p>Since spares come in pairs, if we have as many pairs of spares as
we have partitions on a disk (so four pairs, eight spares in total,
with our current 2 TB disks with four partitions), we're guaranteed
to have enough spares on the same 'type' (SAS connected or SATA
connected) of disk to replace a failed disk. Since the other side
of every mirrored pair is on the different type, the replacement
spares can't wind up on the same physical disk as the other side.
Since we don't entirely allocate one disk before we mostly allocate
all of them, all disks have either zero partitions free or one
partition free and our spares are all on different disks.</p>
<p>(Now that I've written this down I've realized that it's only true
as long as we have no more partitions per disk than we have disks
of a particular type. We have eight disks per type so we're safe
with 4 TB disks and eight partitions per disk, but we'll need to
think about this again if we move beyond that.)</p>
<p>If we have fewer spares than that, we could be forced to use a spare
on the same type of disk as the surviving side of a pair. Even then
we can try to avoid using a partition on the same disk and often
we'll be able to. If the failed disk had no free partitions, its
pair also has no free partitions and we're safe. If it had one free
partition and we have more spares than the number of partitions per
disk (eg six spares with 2 TB disks), we can still find a spare on
another disk than its pair.</p>
<p>The absolute worst case in our current setup is if we're down to
four spares and we lose a disk with one of the spares. Here we need
three spares (for the used partitions on the disk), we only have
three spares left, and one of them is on the pair disk to the one
we lost, which is the disk that needs new mirroring. In this case
we'll mirror one partition on that disk with another partition on
that disk. This still gives us protection against ZFS checksum
errors, but it also means that we overlooked a case when we decided
it was okay to drop down to a minimum of only four spares.</p>
<p>I'll have to think about this analysis for our 4 TB disk, eight
partition case, but certainly for the 2 TB disk, four partition
case it means that the minimum number of spares we should be keeping
is six, not four. Fortunately we don't have any fileservers that
have that few spares at the moment. Also, I need to re-check our
actual code to see if it specifically de-prioritizes the disk of
the partition we're adding a spare to.</p>
<p>(One fileserver wound up at four spares before we upgraded its data
disks to 4 TB SSDs.)</p>
</div>
Our ZFS spares handling system sort of relies on our patterns of disk usage2024-02-26T21:43:53Z2023-01-08T02:14:48Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/FindingPython2UsesWithAuditcks<div class="wikitext"><p>Our Ubuntu systems have had a /usr/bin/python that was Python 2 for
more or less as long as we've had Ubuntu systems, which by now is
more than fifteen years. Over that time, our users have written a
certain amount of Python 2 programs that use '#!/usr/bin/python' to
get Python 2, because that's been the standard way to do it for a
relatively long time. However, <a href="https://utcc.utoronto.ca/~cks/space/blog/python/DebianNoMorePython2">Python 2 is going away on Ubuntu
since it has on Debian</a>, and as part
of that we're probably going to stop having a /usr/bin/python in our
future 24.04 LTS servers. It would be nice to find out which of
our users are still using '/usr/bin/python' so that we can contact
them in advance and get them either to move their programs to
Python 3 or at the very least start using '#!/usr/bin/python2'. One
way to do this is to use the Linux kernel's <a href="https://wiki.archlinux.org/title/Audit_framework">audit framework</a>. Or, really, two
ways, the broad general way and the narrow specific way. Unfortunately
neither of these are ideal.</p>
<p>The ideal option we would like is an audit rule for 'if /usr/bin/python
is being used as the first argument to execve()', or equivalently
'if the name /usr/bin/python is being accessed in order to execute
it'. Unfortunately, as far as I can tell you can't write either
of these potential audit rules, although it may appear that you can.</p>
<p>The narrow specific way is to set a file audit on '/usr/bin/python'
for read access, and then post-process the result to narrow it down
to suitable system calls. For example:</p>
<blockquote><pre style="white-space: pre-wrap;">
-w /usr/bin/python -p r -k bin-python-exec
</pre>
</blockquote>
<p>When you run a program that has a '#!/usr/bin/python', it will
result in an audit log line like:</p>
<blockquote><pre style="white-space: pre-wrap;">
type=SYSCALL msg=audit(1672884109.008:233812): arch=c000003e syscall=89 success=yes exit=7 a0=560a65335980 a1=7fffe74c59e0 a2=1000 a3=560a632ba4e0 items=1 ppid=183379 pid=184601 auid=915 uid=915 gid=1010 euid=915 suid=915 fsuid=915 egid=1010 sgid=1010 fsgid=1010 tty=pts1 ses=6837 comm="fred" exe="/usr/bin/python2.7" subj=unconfined key="bin-python-exec"
</pre>
</blockquote>
<p>Syscall 89 is (64-bit x86) <code>readlink()</code> (per <a href="https://filippo.io/linux-syscall-table/">this table</a>), which in this case is
being done inside the kernel as part of an <code>execve()</code> system call.
The 'exe=' being Python 2.7 means that this can't be a readlink()
call being done by some other program (ls, for example); however,
we can't tell this from someone running Python 2.7 themselves and
doing 'os.readlink("/usr/bin/python")'. This last case is probably
sufficiently uncommon that you can simply not worry about it and just
contact the person in question (obtained from the uid= value) to
let them know.</p>
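<p>(To put the narrow way into practice, you can load the rule
immediately with auditctl and then search by the filter key; a permanent
version of the rule would normally go in a file under
/etc/audit/rules.d/. This is a sketch:)</p>
<blockquote><pre style="white-space: pre-wrap;">
# load the watch rule right now
auditctl -w /usr/bin/python -p r -k bin-python-exec
# later, find and decode matching records (uids, timestamps, and so on)
ausearch -k bin-python-exec --interpret
</pre>
</blockquote>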
<p>(One drawback of this narrow specific way is that you may not be
able to tell people very much about what program of theirs is
still using '/usr/bin/python'. The comm= value tells you what
it's more or less called, but you don't have the specific path,
although often you can dig it out by decoding the associated
'type=PROCTITLE' audit line for this audit record.)</p>
<p>Using '-p x' to trigger this on 'execute' in the audit rule doesn't
work, because as far as the audit framework is concerned the symbolic
link here is not being executed, it's being read (this is the same
trap as I ran into when I worked out <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/Finding32BitProgramsWithAudit">how to use the audit framework
to find 32-bit programs</a>).</p>
<p>The other approach, the broad general way, is to start by auditing
execve(), possibly limited to execve() of /usr/bin/python2.7. I'm
using a filter key option as good general practice, but we're going
to see that it's not actually important:</p>
<blockquote><pre style="white-space: pre-wrap;">
-a always,exit -F arch=b64 -S execve -F path=/usr/bin/python2.7 -k bin-python-exec
-a always,exit -F arch=b32 -S execve -F path=/usr/bin/python2.7 -k bin-python-exec
</pre>
</blockquote>
<p>(You can leave out the second line if you don't have to worry about
32-bit x86 programs.)</p>
<p>This will get you a set of audit records every time Python 2 gets
executed, either directly or via some symlink. Starting from these,
you want to pick out the records where "/usr/bin/python" is the
initial argument to execve. The relevant lines from these records
will look like this:</p>
<blockquote><pre style="white-space: pre-wrap;">
type=SYSCALL msg=audit(1672886976.075:237851): arch=c000003e syscall=59 success=yes exit=0 a0=55fa1621dd90 a1=55fa1621ddf0 a2=55fa16220e40 a3=8 items=3 ppid=183379 pid=189501 auid=915 uid=915 gid=1010 euid=915 suid=915 fsuid=915 egid=1010 sgid=1010 fsgid=1010 tty=pts1 ses=6837 comm="fred" exe="/usr/bin/python2.7" subj=unconfined key="bin-python-exec"
type=EXECVE msg=audit(1672886976.075:237851): argc=2 a0="/usr/bin/python" a1="/tmp/fred"
</pre>
</blockquote>
<p>The 'type=EXECVE' record's 'a0=' value tells you that execve() was
called on the /usr/bin/python symlink, instead of eg /usr/bin/python2.
To get the user ID of the person doing this, you need to look back
to the corresponding 'type=SYSCALL' record for the execve, which
has a matching msg= value. Unfortunately as far as I know the audit
system can't directly match type=EXECVE records for you. The second
argument of the EXECVE record will generally tell you what Python
program is being run, which you can pass on to the user involved.</p>
<p>The advantage of the broad general way is that you may already
be tracing execve() system calls for general system auditing
purposes. If you are, you can exploit your existing auditing
logs by just searching for the relevant EXECVE lines.</p>
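<p>(One way of searching your existing logs is something like the
following sketch, using the filter key from the rule above; the trailing
quote in the grep pattern keeps '/usr/bin/python2' from matching too:)</p>
<blockquote><pre style="white-space: pre-wrap;">
ausearch -k bin-python-exec --raw | grep 'type=EXECVE' | grep -F 'a0="/usr/bin/python"'
</pre>
</blockquote>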
<p>Because Linux's audit framework is quite old by now, it's everywhere
and all of the programs and components work. However, these days
it's probably not the best tool for this sort of narrowly scoped
question. Instead, I suspect that something using eBPF tracing is
a better approach these days, even though various aspects of the
eBPF tools are still works in progress, even on relatively recent
Linux distributions.</p>
<p>(I'm still a little bit grumpy that both Ubuntu 22.04 LTS and Fedora
36 broke bits of bpftrace for a while, and I believe 22.04 LTS still
hasn't fixed them. <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/EBPFStillInProgress">We're better than we were in 2020, but still
not great</a>, and then there's <a href="https://twitter.com/thatcks/status/1522659050823077895">problems with
kernel lockdown mode in some environments</a>.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/FindingPython2UsesWithAudit?showcomments#comments">3 comments</a>.) </div>Finding people's use of /usr/bin/python with the Linux audit framework2024-02-26T21:43:53Z2023-01-05T03:57:50Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/SystemdCgroupsHierarchiescks<div class="wikitext"><p>Systemd organizes everything on your system into a hierarchy of
cgroups, or if you prefer a hierarchy of units that happen to be
implemented with cgroups. However, what this hierarchy is (or is
going to be) isn't always obvious, and sometimes what shows up
matters, for example <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusSystemdCardinality">because you're generating per-cgroup metrics
and might hit a cardinality explosion</a>. So here are some notes
on things you may see in, for example, <code>systemd-cgls</code> or '<code>systemctl
status</code>' (or <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/CgroupsMemoryUsageAccounting">if you're writing something to dump cgroup memory
usage</a>).</p>
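<p>(The quick ways to poke around are the ones I'm quoting from in this
entry; for example, assuming the usual cgroup v2 unified hierarchy with
memory accounting on:)</p>
<pre style="white-space: pre-wrap;">
systemd-cgls --no-pager
systemctl status user.slice
cat /sys/fs/cgroup/system.slice/memory.current
</pre>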
<p>At the top level, systemd has a -.slice (the root slice or cgroup).
Underneath that are up to three slices: user.slice, for all user
sessions, system.slice, for all system services, and machine.slice,
for your virtual machines that are started in ways that systemd
knows about (for example, libvirt). You'll probably always have a
system.slice and usually a user.slice if you're looking at a machine,
but many of your machines may not have a machine.slice. There's
also an init.scope, which has PID 1 in it, and possibly some
essentially empty .mount cgroups that systemd-cgls won't bother
showing you.</p>
<p>In a virtualization environment using libvirt, machine.slice will
have a 'machine-qemu-<N>-<name>.scope' for every virtual machine,
except that everything after the 'machine-qemu' bit will have bits
of hex encoding, such as '\x2d' for the '-'. Under each active VM
are some libvirt-created cgroups under 'libvirt', which isn't a
systemd unit (I'm going to skip inventorying them, since I don't
feel qualified to comment). If you've started some virtual machines
and then shut them all down again, 'systemd-cgls' probably won't
show you machine.slice any more, but it's still there as a cgroup
and may well have some amount of RAM usage still charged to it.</p>
<p>Under user.slice, there will normally be a hierarchy for any
individual user login that I'm going to present in a text
diagram form (from systemd-cgls):</p>
<pre style="white-space: pre-wrap;">
├─user.slice
│ └─user-<UID>.slice
│ ├─user@<UID>.service
│ │ ├─session.slice
│ │ │ ├─dbus-broker.service
[...]
│ │ │ └─pipewire.service
│ │ └─init.scope
│ └─session-<NNN>.scope
[...]
</pre>
<p>Depending on the system setup, things may also be in an 'app.slice'
and a 'background.slice' instead of a session.slice; see <a href="https://systemd.io/DESKTOP_ENVIRONMENTS/">Desktop
Environment Integration</a>.
What units you see started in the session and app slices depends
on your system and how you're logging in to it (and you may be
surprised by what gets started for a SSH login, even on a relatively
basic server install).</p>
<p>(The init.scope for a user contains their systemd user instance.)</p>
<p>Under system.slice, you will normally see a whole succession of
'<thing>.service', one for every active systemd service. You can
also see a two level hierarchy for some things, such as templated
systemd services:</p>
<pre style="white-space: pre-wrap;">
├─system-serial\x2dgetty.slice
│ └─serial-getty@ttyS0.service
[...]
├─system-getty.slice
│ └─getty@tty1.service
[...]
├─system-postfix.slice
│ └─postfix@-.service
[...]
</pre>
<p><a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SystemdSocketTemplateServiceNames">Templated systemd socket service units (with their long names)</a> will show up (possibly very
briefly) under a .slice unit for them, eg 'system-oidentd.slice'.
This slice won't necessarily show in 'systemd-cgls' unless there's
an active socket connection at the moment, but systemd seems to
leave it there in /sys/fs/cgroup/system.slice even when it's inactive.</p>
<p>You can also get nested system.slice cgroups for dbus services:</p>
<pre style="white-space: pre-wrap;">
├─system-dbus\x2d:1.14\x2dorg.freedesktop.problems.slice
│ └─dbus-:1.14-org.freedesktop.problems@0.service
</pre>
<p>Inspecting the actual cgroups in /sys/fs/cgroup may also show you
<thing>.mount, <thing>.socket, and <thing>.swap cgroups. Under rare
circumstances you may also see a 'system-systemd\x2dfsck.slice'
cgroup with one or more
.service cgroups for fscks of specific devices.</p>
<p>Now that I've looked at all of this, my view is that if I'm generating
resource usage metrics, I want to stop one level down from the top
level user and system slices in the cgroup hierarchy (which means I
will get 'system-oidentd.slice' but not the individually named socket
activations). This captures most everything interesting and mostly
doesn't risk cardinality explosions from templated units. Virtual
machines under machine.slice need extra handling for cardinality,
because the 'machine-qemu-<N>-[...]' is a constantly incrementing
sequence number; I'll need to take that out somehow.</p>
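<p>(A minimal by-hand sketch of that 'one level down' reporting for
memory usage, again assuming cgroup v2 with memory accounting enabled,
would be something like:)</p>
<pre style="white-space: pre-wrap;">
for d in /sys/fs/cgroup/system.slice/*/ /sys/fs/cgroup/user.slice/*/ /sys/fs/cgroup/machine.slice/*/; do
    if [ -e "$d/memory.current" ]; then
        printf '%12d %s\n' "$(cat "$d/memory.current")" "${d#/sys/fs/cgroup/}"
    fi
done | sort -rn | head -15
</pre>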
<p>If I'm reporting on the fly on resource usage, it's potentially
interesting to break user slices down into each session scope and
then the user@<UID>.service. Being detailed under the user service
runs into issues because there's so much potential variety in how
processes are broken up into cgroups. I'd definitely want to be
selective about what cgroups I report on so that only ones with
interesting resource usage show up in the report.</p>
<h3>Sidebar: User cgroups on GNOME and perhaps KDE desktops</h3>
<p>You may remember <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SystemdOomdNowDisabled">my bad experience with systemd-oomd, where it
killed my entire desktop session</a>. Apparently
one reason for systemd-oomd's behavior is that on a modern GNOME
desktop, a lot of applications are confined into separate cgroups,
so if (for example) your Firefox runs away with memory, systemd-oomd
will only kill its cgroup, not your entire session-<NNN>.scope
cgroup. On Fedora 36, this appears to look like this:</p>
<pre style="white-space: pre-wrap;">
│ ├─app.slice
[...]
│ │ ├─app-cgroupify.slice
│ │ │ └─cgroupify@app-gnome-firefox-2838.scope.service
│ │ │ └─ 2845 /usr/libexec/cgroupify app-gnome-firefox-2838.scope
[...]
│ │ ├─app-gnome-firefox-2838.scope
│ │ │ ├─3028
│ │ │ │ └─ 3028 /usr/lib64/firefox/firefox -contentproc [...]
│ │ │ ├─3024
│ │ │ │ └─ 3024 /usr/lib64/firefox/firefox -contentproc [...]
[...]
</pre>
<p>Gnome terminal sessions also have a complex structure:</p>
<pre style="white-space: pre-wrap;">
│ │ ├─app-org.gnome.Terminal.slice (#10028)
│ │ │ ├─vte-spawn-04ae3315-d673-47fc-a31e-f657648a0146.scope (#10774)
│ │ │ │ ├─ 2625 bash
│ │ │ │ ├─ 2654 systemd-cgls
│ │ │ │ └─ 2655 less
│ │ │ └─gnome-terminal-server.service (#10508)
│ │ │ └─ 2478 /usr/libexec/gnome-terminal-server
</pre>
<p>And then there's:</p>
<pre style="white-space: pre-wrap;">
│ │ ├─app-gnome\x2dsession\x2dmanager.slice (#5885)
│ │ │ └─gnome-session-manager@gnome.service
│ │ │ └─ 1663 /usr/libexec/gnome-session-binary [...]
</pre>
<p>So a GNOME desktop can have a lot of nested things in a session or
under an app.slice.</p>
<p>The Fedora 36 Cinnamon desktop doesn't seem to go as far as this,
with a bunch of things still running in the 'session-NNN.scope'
unit, but it does seem to do some things to split Firefox and other
processes off into their own systemd units and cgroups.</p>
<p>(Cgroupify apparently comes from the <a href="https://gitlab.freedesktop.org/benzea/uresourced">uresourced</a> RPM package.)</p>
</div>
Some practical notes on the systemd cgroups/units hierarchies2024-02-26T21:43:53Z2022-12-29T04:38:23Z