Chris's Wiki :: blog (https://utcc.utoronto.ca/~cks/space/blog/?atom)
Recently changed pages in Chris's Wiki :: blog.

<div class="wikitext"><p>Pretty much every modern system defaults to having data you write
to filesystems be buffered by the operating system and only written
out asynchronously or when you specifically request that it be flushed
to disk, which gives you <a href="https://utcc.utoronto.ca/~cks/space/blog/tech/WriteBufferingHowMuch">general questions about how much write
buffering you want</a>. Now suppose, not
hypothetically, that you're doing write IO that is pretty much
always going to be specifically flushed to disk (with <code>fsync()</code> or
the equivalent) before the programs doing it consider this write
IO 'done'. You might get this situation where you're writing and
rewriting mail folders, or where the dominant write source is
updating a <a href="https://en.wikipedia.org/wiki/Write-ahead_logging">write ahead log</a>.</p>
<p>In this situation where the data being written is almost always
going to be flushed to disk, I believe the tradeoffs are a bit
different than in <a href="https://utcc.utoronto.ca/~cks/space/blog/tech/WriteBufferingHowMuch">the general write case</a>.
Broadly, you can never actually write at a rate faster than the
write rate of the underlying storage, since in the end you have to
wait for your write data to actually get to disk before you can
proceed. I think this means that you want the OS to start writing
out data to disk almost immediately as your process writes data;
delaying the write out will only take more time in the long run,
unless for some reason the OS can write data faster when you ask
for the flush than before then. In theory and in isolation, you may
want these writes to be asynchronous (up until the process asks for
the disk flush, at which point you have to synchronously wait for them),
because the process may be able to generate data faster if it's not
stalling waiting for individual writes to make it to disk.</p>
<p>(In OS tuning jargon, we'd say that you want writeback to start
almost immediately.)</p>
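<p>As a rough way to see the difference for yourself (on a scratch
file you don't care about), you can compare purely buffered writes, a
single flush at the end, and fully synchronous writes with dd:</p>
<blockquote><pre style="white-space: pre-wrap;">
# Purely buffered; may report something close to RAM speed:
dd if=/dev/zero of=scratch bs=1M count=1024

# fsync() once at the end, like a program that flushes when done:
dd if=/dev/zero of=scratch bs=1M count=1024 conv=fsync

# Wait for the disk on every single write:
dd if=/dev/zero of=scratch bs=1M count=1024 oflag=dsync
</pre>
</blockquote>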
<p>However, journaling filesystems and concurrency add some extra
complications. Many journaling filesystems have the journal as a
central synchronization point, where only one disk flush can be in
progress at once and if several processes ask for disk flushes at
more or less the same time they can't proceed independently. If you
have multiple processes all doing write IO that they will eventually
flush and you want to minimize the latency that processes experience,
you have a potential problem if different processes write different
amounts of IO. A process that asynchronously writes a lot of IO and
then flushes it to disk will obviously have a potentially long
flush, and this flush will delay the flushes done by other processes
writing less data, because everything is running through the
chokepoint that is the filesystem's journal.</p>
<p>In this situation I think you want the process that's writing a lot
of data to be forced to delay, to turn its potentially asynchronous
writes into more synchronous ones that are restricted to the true
disk write data rate. This avoids having a large overhang of pending
writes when it finally flushes, which hopefully avoids other processes
getting stuck with a big delay as they try to flush. Although it
might be ideal if processes with less write volume could write
asynchronously, I think it's probably okay if all of them are forced
down to relatively synchronous writes with all processes getting an
equal fair share of the disk write bandwidth. Even in this situation
the processes with less data to write and flush will finish faster,
lowering their latency.</p>
<p>To translate this to typical system settings, I believe that you
want to aggressively trigger disk writeback and perhaps deliberately
restrict the total amount of buffered writes that the system can
have. Rather than allowing multiple gigabytes of outstanding buffered
writes and deferring writeback until a gigabyte or more has
accumulated, you'd set things to trigger writebacks almost immediately
and then force processes doing write IO to wait for disk writes to
complete once you have more than a relatively small volume of
outstanding writes.</p>
<p>(This is in contrast to typical operating system settings, which
will often allow you to use a relatively large amount of system RAM
for asynchronous writes and not aggressively start writeback. This
especially would make a difference on systems with a lot of RAM.)</p>
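<p>On Linux, one sketch of this sort of tuning is to clamp down the
kernel's dirty data limits; the specific numbers here are illustrative
assumptions, not tested recommendations:</p>
<blockquote><pre style="white-space: pre-wrap;">
# Start background writeback after only 16 MiB of dirty data and
# make writers wait once 64 MiB is outstanding. Setting the _bytes
# sysctls overrides the default _ratio based ones.
sysctl -w vm.dirty_background_bytes=16777216
sysctl -w vm.dirty_bytes=67108864
</pre>
</blockquote>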
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/tech/WriteBufferingAndSyncs?showcomments#comments">2 comments</a>.) </div>Disk write buffering and its interactions with write flushes2024-03-18T01:59:40Z2024-03-18T01:59:25Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/IoniceNotesIIcks<div class="wikitext"><p>In the long ago past, Linux gained some support for <a href="https://www.kernel.org/doc/Documentation/block/ioprio.txt">block IO
priorities</a>,
with some limitations that I noticed <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/IoniceNotes">the first time I looked into
this</a>. These days the Linux kernel has support for
more IO scheduling and limitations, for example in <a href="https://docs.kernel.org/admin-guide/cgroup-v2.html">cgroups v2</a> and <a href="https://docs.kernel.org/admin-guide/cgroup-v2.html#io">its IO
controller</a>.
However <a href="https://man7.org/linux/man-pages/man1/ionice.1.html"><code>ionice</code></a>
is still there and now I want to note some more things, since I
just looked at ionice again (for reasons outside the scope of this
entry).</p>
<p>First, <a href="https://man7.org/linux/man-pages/man1/ionice.1.html"><code>ionice</code></a> and the IO priorities it sets are specifically
only for read IO and synchronous write IO, per <a href="https://man7.org/linux/man-pages/man2/ioprio_set.2.html"><code>ioprio_set(2)</code></a> (this is
the underlying system call that <code>ionice</code> uses to set priorities).
This is reasonable, since IO priorities are attached to processes
and asynchronous write IO is generally actually issued by completely
different kernel tasks and in situations where the urgency of doing
the write is unrelated to the IO priority of the process that
originally did the write. This is a somewhat unfortunate limitation
since often it's write IO that is the slowest thing and the source
of the largest impacts on overall performance.</p>
<p>IO priorities are only effective with some <a href="https://wiki.ubuntu.com/Kernel/Reference/IOSchedulers">Linux kernel IO schedulers</a>, such as <a href="https://docs.kernel.org/block/bfq-iosched.html">BFQ</a>. For obvious reasons
they aren't effective with the 'none' scheduler, which is also the
default scheduler for NVMe drives. I'm (still) unable to tell if IO
priorities work if you're using software RAID instead of sitting your
(supported) filesystem directly on top of a SATA, SAS, or NVMe disk. I
believe that IO priorities are unlikely to work with ZFS, partly because
ZFS often issues read IOs through its own kernel threads instead of
directly from your process and those kernel threads probably aren't
trying to copy around IO priorities.</p>
<p>Even if they pass through software RAID, IO priorities apply at the
level of disk devices (of course). This means that each side of a
software RAID mirror will do IO priorities only 'locally', for IO
issued to it, and I don't believe there will be any global priorities
for read IO to the overall software RAID mirror. I don't know if
this will matter in practice. Since IO priorities only apply to
disks, they obviously don't apply (on the NFS client) to NFS read
IO. Similarly, IO priorities don't apply to data read from the
kernel's buffer/page caches, since this data is already in RAM and
doesn't need to be read from disk. This can give you an ionice'd
program that is still 'reading' lots of data (and that data will
be less likely to be evicted from kernel caches).</p>
<p>Since <a href="https://support.cs.toronto.edu/">we</a> mostly use some combination
of software RAID, ZFS, and NFS, I don't think <code>ionice</code> and IO priorities
are likely to be of much use for us. If we want to limit the impact a
program's IO has on the rest of the system, we need different measures.</p>
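<p>(One example of such a different measure is the cgroup v2 IO
controller mentioned earlier. As a hedged sketch, assuming systemd and
that the relevant filesystem sits on /dev/sdb:)</p>
<blockquote><pre style="white-space: pre-wrap;">
# Run a command in its own cgroup with its writes to /dev/sdb
# capped at 10 MB/s; systemd implements this through the cgroup
# v2 io.max controller setting.
systemd-run --scope -p "IOWriteBandwidthMax=/dev/sdb 10M" some-bulk-command
</pre>
</blockquote>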
</div>
Some more notes on Linux's <code>ionice</code> and kernel IO priorities (2024-03-17)

<div class="wikitext"><p>Suppose that you want to make sure that your DNS servers are working
correctly, for both your own zones and for outside DNS names that
are important to you. If you have your own zones you may also care
that outside people can properly resolve them, perhaps both within
the organization and genuine outsiders using public DNS servers.
The traditional answer to this is <a href="https://github.com/prometheus/blackbox_exporter">the Blackbox exporter</a>, which can send
the DNS queries of your choice to the DNS servers of your choice
and validate the result. Well, more or less.</p>
<p>What you specifically do with the Blackbox exporter is that you
configure some <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusBlackboxNotes"><em>modules</em></a> and then you
provide those modules <em>targets</em> to check (through your Prometheus
configuration). When you're probing DNS, the module's configuration
specifies all of the parameters of the DNS query and its validation.
This means that if you are checking N different DNS names to see
if they give you a <a href="https://en.wikipedia.org/wiki/List_of_DNS_record_types">SOA record</a> (or an A
record or an MX record), you need N different modules. Quite reasonably,
the metrics Blackbox generates when you check a target don't
(currently) include the actual DNS name or query type that you're
checking. Why this matters is that it makes it difficult to write a
generic alert that will create a specific message that says 'asking
for the X type of record for host Y failed'.</p>
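<p>For illustration, a module that checks one specific name for its
SOA record might look something like this (a sketch based on the
Blackbox exporter's documented DNS probe options; the module name,
DNS name, and validation regexp are made up):</p>
<blockquote><pre style="white-space: pre-wrap;">
modules:
  dns_soa_example_org:
    prober: dns
    timeout: 5s
    dns:
      query_name: "example.org"
      query_type: "SOA"
      validate_answer_rrs:
        fail_if_not_matches_regexp:
          - ".*IN.*SOA.*"
</pre>
</blockquote>
<p>Checking a second DNS name means another module that differs only in
its <code>query_name</code>, which is where the cut and paste comes from.</p>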
<p>You can somewhat get around this by encoding this information into
the names of your Blackbox modules and then doing various creative
things in your Prometheus configuration. However, you still have
to write all of the modules out, even though many of them may be
basically cut and paste versions of each other with only the DNS
names changed. This has a number of issues, including that it's a
disincentive to doing relatively comprehensive cross checks. (I
speak from experience with <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusGrafanaSetup-2019">our Prometheus setup</a>.)</p>
<p>There is a third party <a href="https://dns-exporter.readthedocs.io/latest/">dns_exporter</a> that can be set up
in a more flexible way where all parts of the DNS check can be
provided by Prometheus (although it exposes some metrics that risk
label cardinality explosions). However this still leaves you to
list in your Prometheus configuration a cross-matrix of every DNS
name you want to query and every DNS server you want to query
against. What you'll avoid is needing to configure a bunch of
Blackbox modules (although what you lose is the ability to verify
that the queries returned specific results).</p>
<p>To do better, I think we'd need to write a custom program (perhaps
run through <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusScriptExporterWhy">the script exporter</a>)
that contained at least some of this knowledge, such as what DNS
servers to check. Then our Prometheus configuration could just say
'check this DNS name against the usual servers' and the script would
know the rest. Unfortunately you probably can't reuse any of the
current Blackbox code for this, even if you wrote the core of this
script in Go.</p>
<p>(You could make such a program relatively generic by having it
take the list of DNS servers to query from a configuration file.
You might want to make it support multiple lists of DNS servers,
each of them named, and perhaps set various flags on each server,
and you can get quite elaborate here if you want to.)</p>
<p>(This elaborates on <a href="https://mastodon.social/@cks/112094720473776008">a Fediverse post of mine</a>.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusDNSMonitoringProblem?showcomments#comments">One comment</a>.) </div>The problem of using basic Prometheus to monitor DNS query results2024-03-16T02:38:31Z2024-03-16T02:37:29Ztag:cspace@cks.mef.org,2009-03-24:/blog/sysadmin/SerialNumbersMaybeSensitivecks<div class="wikitext"><p>Recently, a commentator on <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/PrometheusHostAgentNonRootLosses">my entry about what's lost when running
the Prometheus host agent as a non-root user on Linux</a> pointed out that if you
do this, one of the things omitted (that I hadn't noticed) is part
of the system <a href="https://en.wikipedia.org/wiki/Desktop_Management_Interface">DMI</a>
information. Specifically, you lose various serial numbers and the
'product UUID', which is potentially another unique identifier for
the system, because Linux makes the <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/DMIDataInSysfs">/sys/class/dmi/id</a> files that contain these readable only by root
(this appears to have been the case since support for these was
added to /sys in 2007). This got me thinking about whether serial
numbers are something we should consider sensitive in general.</p>
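<p>(You can see the restriction directly; these sysfs files are
readable only by root, so the reads below have to be done as root:)</p>
<blockquote><pre style="white-space: pre-wrap;">
# These are root-only, so reading them needs root:
sudo cat /sys/class/dmi/id/product_serial
sudo cat /sys/class/dmi/id/product_uuid
</pre>
</blockquote>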
<p>My tentative conclusion is that for <a href="https://support.cs.toronto.edu/">us</a>,
serial numbers probably aren't sensitive enough to do anything
special about. I don't think any of our system or component serial
numbers can be used to issue one time license keys or the like, and
while people could probably do some mischief with some of them,
this is likely a low risk thing in our academic environment.</p>
<p>(Broadly we don't consider any metrics to be deeply sensitive, or
to put it another way we wouldn't want to collect any metrics that
are, because in our environment it would take a lot of work to protect
them. And we do collect DMI information and put it into <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusGrafanaSetup-2019">our metrics
system</a>.)</p>
<p>This doesn't mean that serial numbers have no sensitivity even for
us; I definitely do consider them something that I generally wouldn't
(and don't) put in entries here, for example. Depending on the vendor,
revealing serial numbers to the public may let the public do things like
see your exact system configuration, when it was delivered, and other
potentially somewhat sensitive information. There's also more of a risk
that bored Internet people will engage in even minor mischief.</p>
<p>However, your situation is not necessarily like ours. There are
probably plenty of environments where serial numbers are potentially
more sensitive or more dangerous if exposed (especially if exposed
widely). And in some environments, people run semi-hostile software
that would love to get its hands on a permanent and unique identifier
for the machine. Before you gather or expose serial number information
(for systems or for things like disks), you might want to think about
this.</p>
<p>At the same time, having relatively detailed hardware configuration
information can be important, as in <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/DMIVendorPeculiarities">the war story that inspired
me to start collecting this information in our metrics system</a>. And serial numbers are a great way to
disambiguate exactly which piece of hardware was being used for
what, when. We deliberately collect disk drive serial number information
from SMART, for example, and put it into our metrics system (<a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/GrafanaInfiniteSerialNumber">sometimes
with amusing results</a>).</p>
</div>
You might want to think about if your system serial numbers are sensitive (2024-03-15)

<div class="wikitext"><p>Here's something that I learned recently: if <a href="https://www.freedesktop.org/software/systemd/man/latest/systemd-networkd.service.html#">systemd-networkd</a>
restarts, for example because of a package update for it that
includes an automatic daemon restart, it will clear your 'ip rules'
routing policies (and also I think your routing table, although you
may not notice that much). If you've set up policy based routing of
your own (or some program has done that as part of its operation),
this may produce unpleasant surprises.</p>
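<p>(As a concrete sketch, hand-built policy based routing of the sort
that gets cleared might look like this; the addresses and the table
number are made up:)</p>
<blockquote><pre style="white-space: pre-wrap;">
# Route traffic from a second network through its own routing
# table (table 100) with its own default gateway:
ip route add default via 192.168.100.1 table 100
ip rule add from 192.168.100.0/24 table 100
</pre>
</blockquote>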
<p>Systemd-networkd does this fundamentally because <a href="https://www.freedesktop.org/software/systemd/man/latest/systemd.network.html#%5BRoutingPolicyRule%5D%20Section%20Options">you can set ip
routing policies in .network files</a>.
When networkd is restarted, one of the things it does is re-set-up
whatever routing policies you specified; if you didn't specify any,
it clears them. This is a reasonably sensible decision, both to
deal with changes from previously specified routing policies and
to also give people a way to clean out their experiments and reset
to a known good base state. Similar logic applies to routes.</p>
<p>This can be controlled through <a href="https://www.freedesktop.org/software/systemd/man/latest/networkd.conf.html">networkd.conf</a>
and its drop-in files, by setting <a href="https://www.freedesktop.org/software/systemd/man/latest/networkd.conf.html#ManageForeignRoutingPolicyRules="><code>ManageForeignRoutingPolicyRules=no</code></a>
and perhaps <a href="https://www.freedesktop.org/software/systemd/man/latest/networkd.conf.html#ManageForeignRoutes="><code>ManageForeignRoutes=no</code></a>.
Without testing it through a networkd restart, I believe that the
settings I want are:</p>
<blockquote><pre style="white-space: pre-wrap;">
[Network]
ManageForeignRoutingPolicyRules=no
ManageForeignRoutes=no
</pre>
</blockquote>
<p>The minor downside of this for me is that certain sorts of route updates
will have to be done by hand, instead of by updating .network files and
then restarting networkd.</p>
<p>While having an option to do this sort of clearing is sensible, I
am dubious about the current default. In practice, coherently
specifying routing policies through .network files is so much of a
pain that I suspect that few people do it that way; instead I suspect
that most people either script it to issue the 'ip rule' commands
(as I do) or use software that does it for them (and <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/LinuxIpFwmarkMasks">I know that
such software exists</a>). It would be great if
networkd could create and manage high level policies for you (such
as isolated interfaces), but the current approach is both verbose
and limited in what you can do with it.</p>
<p>(As far as I know, networkd can't express rules for networks that
can be brought up and torn down, because it's not an event-based
system where you can have it react to the appearance of an interface
or a configured network. It's possible I'm wrong, but if so it
doesn't feel well documented.)</p>
<p>All of this is especially unfortunate on Ubuntu servers, which normally
configure their networking through netplan. Netplan will more or less
silently use networkd as the backend to actually implement what you
wrote in your Netplan configuration, leaving you exposed to this, and on
top of that Netplan itself has limitations on what routing policies you
can express (pushing you even more towards running 'ip rule' yourself).</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SystemdNetworkdResetsIpRules?showcomments#comments">2 comments</a>.) </div>Restarting systemd-networkd normally clears your 'ip rules' routing policies2024-03-14T02:19:13Z2024-03-14T02:18:11Ztag:cspace@cks.mef.org,2009-03-24:/blog/web/TLSCertsWhatIsManualcks<div class="wikitext"><p>Recently I casually wrote about <a href="https://utcc.utoronto.ca/~cks/space/blog/web/TLSCertsSomeStillManual">how even big websites may still
be manually managing TLS certificates</a>.
Given that we're talking about big websites, this raises a somewhat
interesting question of what we mean by 'manual' and 'automatic' TLS
certificate management.</p>
<p>A modern big website probably has a bunch of front end load balancers
or web servers that terminate TLS, and regardless of what else is
involved in their TLS certificate management it's very unlikely
that system administrators are logging in to each one of them to
roll over its TLS certificate to a new one (any more than they
manually log in to those servers to deploy other changes). At the
same time, if the only bit of automation involved in TLS certificate
management is deploying a TLS certificate across the fleet (once
you have it) I think most people would be comfortable still calling
that (more or less) 'manual' TLS certificate management.</p>
<p>As a system administrator who used to deal with TLS certificates
(back then I called them SSL certificates) the fully manual way, I
see three broad parts to fully automated management of TLS certificates:</p>
<ul><li><em>automated deployment</em>, where once you have the new TLS certificate
you don't have to copy files around on a particular server, restart
the web server, and so on. Put the TLS certificate in the right
place and maybe push a button and you're done.<p>
</li>
<li><em>automated issuance</em> of TLS certificates, where you don't have to
generate keys, prepare a <a href="https://en.wikipedia.org/wiki/Certificate_signing_request">CSR</a>, go to a
web site, perhaps put in your credit card information or some other
'cost you money' stuff, perhaps wait for some manual verification or
challenge by email, and finally download your signed certificate.
Instead you run a program and you have a new TLS certificate
(there's a sketch of this after the list).<p>
</li>
<li><em>automated renewal</em> of TLS certificates, where you don't have to
remember to do anything by hand when your TLS certificates are
getting close enough to their expiry time. (A lesser form of
automated renewal is automated reminders that you need to manually
renew.)</li>
</ul>
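<p>As the sketch promised above, here is roughly what automated issuance
and renewal can look like with certbot, a common ACME client; the domain
and webroot path are made up:</p>
<blockquote><pre style="white-space: pre-wrap;">
# Issuance: prove control of the name via an ACME HTTP-01 challenge
# served out of an existing webroot, then fetch the certificate:
certbot certonly --webroot -w /var/www/html -d www.example.org

# Renewal: re-issue anything close enough to expiry; typically run
# regularly from cron or a systemd timer:
certbot renew
</pre>
</blockquote>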
<p>As a casual thing, if you don't have fully automated management of TLS
certificates I would say you had 'manual management' of them, because
a human had to do something to make the whole process go. If I was
trying to be precise and you had automated deployment but not the other
two, I might describe you as having 'mostly manual management' of your
TLS certificates. If you had automated issuance (and deployment) but
no automated renewals, I might say you had 'partially automated' or
'partially manual' TLS certificate management.</p>
<p>(You can have automated issuance but not automated deployment or
automated renewal and at that point I'd probably still say you had
'manual' management, because people still have to be significantly
involved even if you don't have to wrestle with a TLS Certificate
Authority's website and processes.)</p>
<p>I believe that at least some TLS Certificate Authorities support
automated issuance of year-long certificates, but I'm not sure.
Now that I've looked, I'm going to have to stop assuming that a
website using a year-long TLS certificate is a reliable sign that
they're not using automated issuance.</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/web/TLSCertsWhatIsManual?showcomments#comments">One comment</a>.) </div>What do we count as 'manual' management of TLS certificates2024-03-13T02:30:16Z2024-03-13T02:29:15Ztag:cspace@cks.mef.org,2009-03-24:/blog/sysadmin/UsageDataWhyCarecks<div class="wikitext"><p>I recently wrote about <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/UsageDataSomeBits">some practical-focused thoughts on usage
data for your services</a>. But there's a broader
issue about usage data for services and having or not having it.
My sense is that for a lot of sysadmins, building things to collect
usage data feels like accounting work and likely to lead to unpleasant
and damaging things, like internal chargebacks (<a href="https://utcc.utoronto.ca/~cks/space/blog/tech/UniversityDeadlyCharging">which have created
various problems</a>, <a href="https://utcc.utoronto.ca/~cks/space/blog/tech/ChargingProblem">and also</a>). However, I think we should strongly
consider routinely gathering this data anyway, for fundamentally
the same reasons as <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/SSLLogConnectionInfo">you should collect information on what TLS
protocols and ciphers are being used by your people and software</a>.</p>
<p>We periodically face decisions both obvious and subtle about what
to do about services and the things they run on. Do we spend the
money to buy new hardware, do we spend the time to upgrade the
operating system or the version of the third party software, do we
need to closely monitor this system or service, does it need to be
optimized or be given better hardware, and so on. Conversely, maybe
this is now a little-used service that can be scaled down, dropped,
or simplified. In general, the big question is <strong>do we need to care
about this service, and if so how much</strong>. High level usage data is
what gives you most of the real answers.</p>
<p>(In some environments one fate for narrowly used services is to be made
the responsibility of the people or groups who are the service's big
users, instead of something that is provided on a larger and higher
level.)</p>
<p>Your system and application metrics can provide you some basic
information, like whether your systems are using CPU and memory and
disk space, and perhaps how that usage is changing over a relatively
long time base (if you keep metrics data long enough). But they
can't really tell you why that is happening or not happening, or
who is using your services, and deriving usage information from
things like CPU utilization requires either knowing things about
how your systems perform or assuming them (eg, assuming you can
estimate service usage from CPU usage because you're sure it uses
a visible amount of CPU time). Deliberately collecting actual
usage gives you direct answers.</p>
<p>Knowing who is using your services and who is not also gives you
the opportunity to talk to both groups about what they like about
your current services, what they'd like you to add, what pieces of
your service they care about, what they need, and perhaps what's
keeping them from using some of your services. If you don't have
usage data and don't actually ask people, you're flying relatively
blind on all of these questions.</p>
<p>Of course collecting usage data has its traps. One of them is that what
usage data you collect is often driven by what sort of usage you think
matters, and in turn this can be driven by how you expect people to use
your services and what you think they care about. Or to put it another
way, you're measuring what you assume matters and you're assuming what
you don't measure doesn't matter. You may be wrong about that, which is
one reason why talking to people periodically is useful.</p>
<p>PS: In theory, gathering usage data is separate from the question
of <a href="https://utcc.utoronto.ca/~cks/space/blog/tech/DangerousMetrics">whether you should pay attention to it</a>,
where the answer may well be that <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/MetricsAttractAttention">you should ignore that shiny
new data</a>. In practice, well, people are
bad at staying away from shiny things. Perhaps it's not a bad thing
to have your usage data require some effort to assemble.</p>
<p>(This is partly written to persuade myself of this, because maybe we
want to routinely collect and track more usage data than we currently
do.)</p>
</div>
Why we should care about usage data for our internal services (2024-03-12)

<div class="wikitext"><p>One of the things that I do on my desktops and <a href="https://support.cs.toronto.edu/">our</a> servers is collect metrics that
I hope will let me assess how responsive our systems are when people
are trying to do things on them. For a long time I've been collecting
<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/PrometheusLinuxDiskIOStats">disk IO latency histograms</a>, and
recently I've been collecting runqueue latency histograms (using
<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/EbpfExporterNotes">the eBPF exporter</a> and a modified version of
<a href="https://github.com/iovisor/bcc/blob/master/libbpf-tools/runqlat.bpf.c">libbpf/tools/runqlat.bpf.c</a>).
This has caused me to think about the various sorts of latency that
affects responsiveness and how I can measure it.</p>
<p>Run queue latency is the latency between when a task becomes able
to run (or when it got preempted in the middle of running) and when
it does run. This latency is effectively the minimum (lack of)
response from the system and is primarily affected by CPU contention,
since the major reason tasks have to wait to run is other tasks
using the CPU. For obvious reasons, high(er) run queue latency is
related to <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/PSINumbersAndMeanings">CPU pressure stalls</a>, but a
histogram can show you more information than an aggregate number.
I expect run queue latency to be what matters most for a lot of
programs that mostly talk to things over some network (including
talking to other programs on the same machine), and perhaps spend some
of their time burning CPU furiously. If your web browser can't get
its rendering process running promptly after the HTML comes in, or
if it gets preempted while running all of that Javascript, this
will show up in run queue latency. The same is true for your window
manager, which is probably not doing much IO.</p>
<p>Disk IO latency is the lowest level indicator of things having to
wait on IO; it sets a lower bound on how little latency processes
doing IO can have (assuming that they do actual disk IO). However,
direct disk IO is only one level of the Linux IO system, and the
Linux IO system sits underneath filesystems. What actually matters
for responsiveness and latency is generally how long user-level
filesystem operations take. In an environment with sophisticated,
multi-level filesystems that have complex internal behavior (such
as <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSGlobalZILInformation">ZFS and its ZIL</a>), the actual disk
IO time may only be a small portion of the user-level timing,
especially for things like <code>fsync()</code>.</p>
<p>(Some user-level operations may also not do any disk IO at all
before they return from the kernel (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/UserIOCanBeSystemTime">for example</a>).
A <code>read()</code> might be satisfied from the kernel's caches, and a
<code>write()</code> might simply copy the data into the kernel and schedule
disk IO later. This is where histograms and related measurements
become much more useful than averages.)</p>
<p>Measuring user level filesystem latency can be done through eBPF,
to at least some degree; <a href="https://github.com/iovisor/bcc/blob/master/libbpf-tools/vfsstat.bpf.c">libbpf-tools/vfsstat.bpf.c</a>
hooks a number of kernel vfs_* functions in order to just count
them, and you could convert this into some sort of histogram. Doing
this on a 'per filesystem mount' basis is probably going to be
rather harder. On the positive side for us, hooking the vfs_*
functions does cover the activity a NFS server does for NFS clients
as well as truly local user level activity. Because there are a
number of systems where we really do care about the latency that
people experience and want to monitor it, I'll probably build some
kind of vfs operation latency histogram <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/EbpfExporterNotes">eBPF exporter program</a>, although most likely only for selected VFS
operations (since there are a lot of them).</p>
<p>I think that the straightforward way of measuring user level IO
latency (by tracking the time between entering and exiting a top
level vfs_* function) will wind up including run queue latency
as well. You will get, basically, the time it takes to prepare and
submit the IO inside the kernel, the time spent waiting for it, and
then after the IO completes the time the task spends waiting inside
the kernel before it's able to run.</p>
<p>Because of <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/LinuxMultiCPUIowait">how Linux defines iowait</a>, the
higher your iowait numbers are, the lower the run queue latency
portion of the total time will be, because iowait only happens on
idle CPUs and idle CPUs are immediately available to run tasks when
their IO completes. You may want to look at <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/PSINumbersAndMeanings">io pressure stall
information</a> for a more accurate track of
when things are blocked on IO.</p>
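<p>(On kernels with pressure stall information enabled, you can read
these numbers directly:)</p>
<blockquote><pre style="white-space: pre-wrap;">
# The 'some' line is the share of time at least one task was
# stalled; for IO, the 'full' line is the share of time nothing
# could make progress:
cat /proc/pressure/cpu
cat /proc/pressure/io
</pre>
</blockquote>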
<p>A complication of measuring user level IO latency is that not all
user visible IO happens through <code>read()</code> and <code>write()</code>. Some of it
happens through accessing <code>mmap()</code>'d objects, and under memory
pressure some of it will be in the kernel paging things back in
from wherever they wound up. I don't know if there's any particularly
easy way to hook into this activity.</p>
</div>
Scheduling latency, IO latency, and their role in Linux responsiveness (2024-03-11)

<div class="wikitext"><p>Some day, you may be called on by decision makers (including yourself)
to provide some sort of usage information for things you operate so
that you can make decisions about them. I'm not talking about <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusGrafanaSetup-2019">system
metrics</a> such as how much CPU is being
used (although for some systems that may be part of higher level usage
information, for example for <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/SlurmHowWeUseIt">our SLURM cluster</a>);
this is more on the level of how much things are being used, by who,
and perhaps for what. In the very old days we might have called this
'accounting data' (and perhaps disdained collecting it unless we were
forced to by things like chargeback policies).</p>
<p>In an ideal world, you will already be generating and retaining the
sort of usage information that can be used to make decisions about
services. But internal services aren't necessarily automatically
instrumented the way revenue generating things are, so you may not
have this sort of thing built in from the start. In this case,
you'll generally wind up hunting around for creative ways to generate
higher level usage information from low level metrics and logs that
you do have. When you do this, my first suggestion is <strong>write down
how you generated your usage information</strong>. This probably won't be
the last time you need to generate usage information, and also if
decision makers (including you in the future) have questions about
exactly what your numbers mean, you can go back to look at exactly
how you generated them to provide answers.</p>
<p>(Of course, your systems may have changed around by the next time you
need to generate usage information, so your old ways don't work or
aren't applicable. But at least you'll have something.)</p>
<p>My second suggestion is to look around today to see if there's data you
can easily collect and retain now that will let you provide better usage
information in the future. This is obviously related to <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/KeepLogsLonger">keeping your
logs longer</a>, but it also includes making sure that
things make it to your logs (or at least to your retained logs, which
may mean setting things to send their log data to syslog instead of
keeping their own log files). At this point I will sing the praises of
things like 'end of session' summary log records that put all of the
information about a session in a single place instead of forcing you to
put the information together from multiple log lines.</p>
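<p>(As a small illustration, a hand-rolled 'end of session' summary
record sent to syslog could be as simple as the following; the tag and
the fields are invented for the example:)</p>
<blockquote><pre style="white-space: pre-wrap;">
# One self-contained line per session, instead of scattered start,
# transfer, and stop lines:
logger -t mysvc "session end: user=fred duration=142s bytes=1048576 files=3"
</pre>
</blockquote>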
<p>(When you've just been through the exercise of generating usage data
is an especially good time to do this, because you'll be familiar with
all of the bits that were troublesome or where you could only provide
limited data.)</p>
<p>Of course there are privacy implications of retaining lots of logs
and usage data. This may be a good time to ask around to get advance
agreement on what sort of usage information you want to be able to
provide and what sort you definitely don't want to have available
for people to ask for. This is also another use for arranging to log
your own 'end of session' summary records, because if you're doing it
yourself you can arrange to include only the usage information you've
decided is okay.</p>
</div>
Some thoughts on usage data for your systems and services (2024-03-10)

<div class="wikitext"><p>Over on the Fediverse, <a href="https://mastodon.social/@cks/112051065669048777">I had a realization</a>:</p>
<blockquote><p>This is my face when I realize that on a big multi-core machine, I
want to do 'sed ... | sed ... | sed ...' instead of the nominally more
efficient 'sed -e ... -e ... -e ...' because sed is single-threaded
and if I have several costly patterns, multiple seds will parallelize
them across those multiple cores.</p>
</blockquote>
<p>Even when doing on the fly shell pipelines, I've tended to reflexively
use 'sed -e ... -e ...' when I had multiple separate sed transformations
to do, instead of putting each transformation in its own 'sed'
command. Similarly I sometimes try to cleverly merge multi-command
things into one command, although usually I don't try too hard. In
a world where you have enough cores (well, CPUs), this isn't
necessarily the right thing to do. Most commands are single threaded
and will use only one CPU, but every command in a pipeline can run
on a different CPU. So splitting up a single giant 'sed' into several
may reduce a single-core bottleneck and speed things up.</p>
<p>(Giving sed multiple expressions is especially single threaded because
sed specifically promises that they're processed in order, and sometimes
this matters.)</p>
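<p>Concretely, the two forms look like this (the patterns and file
names are placeholders):</p>
<blockquote><pre style="white-space: pre-wrap;">
# One process; all three substitutions share a single CPU:
sed -e 's/pat1/rep1/' -e 's/pat2/rep2/' -e 's/pat3/rep3/' big.log >out

# Three processes; each stage can run on its own CPU:
sed 's/pat1/rep1/' big.log | sed 's/pat2/rep2/' | sed 's/pat3/rep3/' >out
</pre>
</blockquote>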
<p>Whether this actually matters may vary a lot. In my case, <a href="https://mastodon.social/@cks/112052064868483369">it only
made a trivial difference in the end</a>, partly because
only one of my sed patterns was CPU-intensive (but that pattern
alone made sed use all the CPU it could get and made it the bottleneck
in the entire pipeline). In some cases adding more commands may add
more in overhead than it saves from parallelism. There are no
universal answers.</p>
<p>One of my lessons learned from this is that if I'm on a machine
with plenty of cores and doing a one-time thing, it probably isn't
worth my while to carefully optimize how many processes are being
run as I evolve the pipeline. I might as well jam more pipeline
steps whenever and wherever they're convenient. If it's easy to
move one step closer to the goal with one more pipeline step, do
it. Even if it doesn't help, it probably won't hurt very much.</p>
<p>Another lesson learned is that I might want to look for single
threaded choke points if I've got a long-running shell pipeline.
These are generally relatively easy to spot; just run 'top' and
look for what's using up all of one CPU (on Linux, this is 100%
CPU time). Sometimes this will be as easy to split as 'sed' was,
and other times I may need to be more creative (for example, if
zcat is hitting CPU limits, maybe <a href="https://zlib.net/pigz/">pigz</a>
can help a bit).</p>
<p>(If I have the fast disk space, possibly un-compressing the files
in place in parallel will work. This comes up in system administration
work more than you'd think, since we can want to search and process
log files and they're often stored compressed.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/programming/ShellPipelineStepsAndCPUs?showcomments#comments">One comment</a>.) </div>A realization about shell pipeline steps on multi-core machines2024-03-09T03:28:42Z2024-03-09T03:27:42Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/EbpfExporterNotescks<div class="wikitext"><p>I've been a fan of <a href="https://github.com/cloudflare/ebpf_exporter">the Cloudflare eBPF Prometheus exporter</a> for some time, ever
since I saw their example of per-disk IO latency histograms. And
the general idea is extremely appealing; you can gather a lot of
information with eBPF (usually from the kernel), and the ability
to turn it into metrics is potentially quite powerful. However,
actually using it has always been a bit arcane, especially if you
were stepping outside the bounds of Cloudflare's <a href="https://github.com/cloudflare/ebpf_exporter/tree/master/examples">canned examples</a>.
So here's some notes on the current version (which is more or less
v2.4.0 as I write this), written in part for me in the future when
I want to fiddle with eBPF-created metrics again.</p>
<p>If you build the ebpf_exporter yourself, you want to use their
provided Makefile rather than try to do it directly. This Makefile
will give you the choice to build a 'static' binary or a dynamic
one (with 'make build-dynamic'); the static is the default. I put
'static' into quotes because of <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/LinuxStaticLinkingVsGlibc">the glibc NSS problem</a>; if you're on a glibc-using Linux, your
static binary will still depend on your version of glibc. However,
it will contain a statically linked libbpf, which will make your
life easier. Unfortunately, building a static version is impossible
on some Linux distributions, such as Fedora, because Fedora just
doesn't provide static versions of some required libraries (as far
as I can tell, libelf.a). If you have to build a dynamic executable,
a normal ebpf_exporter build will depend on the libbpf shared
library you can find in libbpf/dest/usr/lib. You'll need to set a
<code>LD_LIBRARY_PATH</code> to find this copy of libbpf.so at runtime.</p>
<p>(You can try building with the system libbpf, but it may not be
recent enough for ebpf_exporter.)</p>
<p>To get metrics from eBPF with ebpf_exporter, you need an eBPF
program that collects the metrics and then a YAML configuration
that tells ebpf_exporter how to handle what the eBPF program
provides. The original version of ebpf_exporter had you specify
eBPF programs in text in your (YAML) configuration file and then
compiled them when it started. This approach has fallen out of
favour, so now eBPF programs must be pre-compiled to special .o
files that are loaded at runtime. I believe these .o files are
relatively portable across systems; I've used ones built on Fedora
39 on Ubuntu 22.04. The simplest way to build either a provided
example or your own one is to put it in <a href="https://github.com/cloudflare/ebpf_exporter/tree/master/examples">the <code>examples</code> directory</a>
and then do 'make <name>.bpf.o'. Running 'make' in the examples
directory will build all of the standard examples.</p>
<p>To run an eBPF program or programs, you copy their <name>.bpf.o and
<name>.yaml to a configuration directory of your choice, specify
this directory in the ebpf_exporter '<code>--config.dir</code>' argument,
and then use '<code>--config.names=<name>,<name2>,...</code>' to say what
programs to run. The suffix of the YAML configuration file and the
eBPF object file are always fixed.</p>
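<p>Putting this together, a deployment might look like the following
sketch; the program name 'myprog' and the configuration directory are
my own placeholders, not anything official:</p>
<blockquote><pre style="white-space: pre-wrap;">
# Build the exporter (static by default) and your eBPF program:
make
make -C examples myprog.bpf.o

# Put the compiled program and its YAML where the exporter will look:
mkdir -p /etc/ebpf_exporter
cp examples/myprog.bpf.o examples/myprog.yaml /etc/ebpf_exporter/

# Run it:
./ebpf_exporter --config.dir=/etc/ebpf_exporter --config.names=myprog
</pre>
</blockquote>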
<p>The repository has <a href="https://github.com/cloudflare/ebpf_exporter#configuration-concepts">some documentation on the YAML (and eBPF) that
you have to write to get metrics</a>.
However, it is probably not sufficient to explain how to modify the
examples or especially to write new ones. If you're doing this (for
example, to revive an old example that was removed when the exporter
moved to the current pre-compiled approach), you really want to
read over existing examples and then copy their general structure
more or less exactly. This is especially important because the main
ebpf_exporter contains some special handling for at least
histograms that assumes things are being done as in their examples.
When reading examples, it helps to know that Cloudflare has a bunch
of helpers that are in various header files in the examples directory.
You want to use these helpers, not the normal, standard <a href="https://man7.org/linux/man-pages/man7/bpf-helpers.7.html">bpf helpers</a>.</p>
<p>(However, although not documented in <a href="https://man7.org/linux/man-pages/man7/bpf-helpers.7.html">bpf-helpers(7)</a>,
'<code>__sync_fetch_and_add()</code>' is a standard eBPF thing. It is not
so much documented as mentioned in <a href="https://docs.kernel.org/bpf/map_array.html">some kernel BPF documentation
on arrays and maps</a>
and in <a href="https://man7.org/linux/man-pages/man2/bpf.2.html">bpf(2)</a>.)</p>
<p>One source of (e)BPF code to copy from, generally similar
to what you'll write for ebpf_exporter, is <a href="https://github.com/iovisor/bcc/tree/master/libbpf-tools">bcc/libbpf-tools</a> (in the
<name>.bpf.c files). An eBPF program like <a href="https://github.com/iovisor/bcc/tree/master/libbpf-tools/runqlat.bpf.c">runqlat.bpf.c</a>
will need restructuring to be used as an ebpf_exporter program,
but it will show you what you can hook into with eBPF and how.
Often these examples will be more elaborate than you need for
ebpf_exporter, with more options and the ability to narrowly
select things; you can take all of that out.</p>
<p>(When setting up things like the number of histogram slots, be
careful to copy exactly what the examples do in both your .bpf.c
and in your YAML, mysterious '+ 1's and all.)</p>
</div>
Some notes about the Cloudflare eBPF Prometheus exporter for Linux (2024-03-08)

<div class="wikitext"><p>One of the interesting and convenient things about Ubuntu for
people like <a href="https://support.cs.toronto.edu/">us</a> is that they
provide pre-built and integrated ZFS kernel modules in their
mainline kernels. If you want ZFS on <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSFileserverSetupIII">your (our) ZFS fileservers</a>, you don't have to add any extra PPA
repositories or install any extra kernel module packages; it's just
there. However, this leaves us with <a href="https://mastodon.social/@cks/112041217999758599">a little mystery</a>, which is how
the ZFS modules actually get there. The reason this is a mystery
is that <strong>the ZFS modules are not in the Ubuntu kernel source</strong>,
or at least not in the package source.</p>
<p>(One reason this matters is that you may want to see what patches
Ubuntu has applied to their version of ZFS, because Ubuntu periodically
backports patches to specific issues from upstream OpenZFS. If you
go try to find ZFS patches, ZFS code, or a ZFS changelog in the
regular Ubuntu kernel source, you will likely fail, and this will not
be what you want.)</p>
<p>Ubuntu kernels are normally signed in order to work with <a href="https://wiki.debian.org/SecureBoot">Secure
Boot</a>. If you use 'apt source
...' on a signed kernel, what you get is not the kernel source but
a 'source' that fetches specific unsigned kernels and does magic
to sign them and generate new signed binary packages. To actually
get the kernel source, you need to follow the directions in <a href="https://wiki.ubuntu.com/Kernel/BuildYourOwnKernel">Build
Your Own Kernel</a>
to get the source of the unsigned kernel package. However, as
mentioned this kernel source does not include ZFS.</p>
<p>(You may be tempted to fetch the Git repository following the
directions in <a href="https://wiki.ubuntu.com/Kernel/Dev/KernelGitGuide#Kernel.2FAction.2FGitTheSource.Obtaining_the_kernel_sources_for_an_Ubuntu_release_using_git">Obtaining the kernel sources using git</a>,
but in my experience this may well leave you hunting around in
confusion trying to find the branch that actually corresponds to
even the current kernel for an Ubuntu release. Even if you have the
Git repository cloned, downloading the source package can be easier.)</p>
<p>How ZFS modules get into the built Ubuntu kernel is that during the
package build process, <strong>the Ubuntu kernel build downloads or copies
a specific <code>zfs-dkms</code> package version and includes it in the tree
that kernel modules are built from</strong>, which winds up including the
built ZFS kernel modules in the binary kernel packages. Exactly
what version of zfs-dkms will be included is specified in
<a href="https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/jammy/tree/debian/dkms-versions?h=Ubuntu-5.15.0-88.98">debian/dkms-versions</a>,
although good luck finding an accurate version of that file in the
Git repository on any predictable branch or in any predictable
location.</p>
<p>(The zfs-dkms package itself is the <a href="https://en.wikipedia.org/wiki/Dynamic_Kernel_Module_Support">DKMS</a> version
of kernel ZFS modules, which means that it packages the source code
of the modules along with directions for how DKMS should (re)build
the binary kernel modules from the source.)</p>
<p>This means that if you want to know what specific version of the
ZFS code is included in any particular Ubuntu kernel and what changed
in it, you need to look at the source package for zfs-dkms, which
is called <a href="https://code.launchpad.net/ubuntu/+source/zfs-linux">zfs-linux</a>
and has its Git repository <a href="https://git.launchpad.net/ubuntu/+source/zfs-linux">here</a>. Don't ask me
how the branches and tags in the Git repository are managed and how
they correspond to released package versions. My current view is
that I will be downloading specific zfs-linux source packages as
needed (using 'apt source zfs-linux').</p>
<p>The zfs-linux source package is also used to build the zfsutils-linux
binary package, which has the user space ZFS tools and libraries.
You might ask if there is anything that makes zfsutils-linux versions
stay in sync with the zfs-dkms versions included in Ubuntu kernels.
The answer, as far as I can see, is no. Ubuntu is free to release
new versions of zfsutils-linux and thus zfs-linux without updating
the kernel's dkms-versions file to use the matching zfs-dkms version.
Sufficiently cautious people may want to specifically install a
matching version of zfsutils-linux and then hold the package.</p>
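<p>(A sketch of that caution; the version string here is entirely
made up:)</p>
<blockquote><pre style="white-space: pre-wrap;">
# Install the zfsutils-linux that matches the kernel's zfs-dkms
# version, then stop apt from upgrading it:
apt install zfsutils-linux=2.1.5-1ubuntu6~22.04.1
apt-mark hold zfsutils-linux
</pre>
</blockquote>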
<p>I was going to write something about how you get the ZFS source for
a particular kernel version, but it turns out that there is no
straightforward way. Contrary to what the Ubuntu documentation
suggests, if you do 'apt source linux-image-unsigned-$(uname -r)',
you don't get the source package for that kernel version, you get
the source package for the current version of the 'linux' kernel
package, at whatever is the latest released version. Similarly,
while you can inspect that source to see what zfs-dkms version it
was built with, 'apt source zfs-dkms' will only give you (easy)
access to the current version of the zfs-linux source package. If
you ask for an older version, apt will probably tell you it can't
find it.</p>
<p>(Presumably Ubuntu has old source packages somewhere, but I don't
know where.)</p>
</div>
Where and how Ubuntu kernels get their ZFS modules (2024-03-07)

<div class="wikitext"><p>Every window system has windows, as an entity. Usually we think of these
as being used for, well, windows and window like things; application
windows, those extremely annoying pop-up modal dialogs that are always
interrupting you at the wrong time, even perhaps things like pop-up
menus. In its original state, X has more windows than that. Part of how
and why it does this is that X allows windows to nest inside each other,
in a window tree, which you can still see today with '<code>xwininfo -root
-tree</code>'.</p>
<p>One of the reasons that X has copious nested windows is that X was
designed with a particular model of writing X programs in mind, and
that model made everything into a (nested) window. Seriously,
everything. In an old fashioned X application, windows are everywhere.
Buttons are windows (or several windows if they're radio buttons
or the like), text areas are windows, menu entries are each a
window of their own within the window that is the menu, visible
containers of things are windows (with more windows nested inside
them), and so on.</p>
<p>This copious use of windows allows a lot of things to happen on the
server side, because various things (like mouse cursors) are defined
on a per-window basis, and also <a href="https://www.x.org/releases/X11R7.7/doc/xproto/x11protocol.html#requests:CreateWindow">windows can be created with things
like server-set borders</a>.
So the X server can render sub-window borders to give your buttons
an outline and automatically change the cursor when the mouse moves
into and out of a sub-window, all without the client having to do
anything. And often input events like mouse clicks or keys can be
specifically tied to some sub-window, so your program doesn't have
to hunt through its widget geometry to figure out what was clicked.
There are more tricks; for example, you can get 'enter' and 'leave'
events when the mouse enters or leaves a (sub)window, which programs
can use to highlight the current thing (ie, subwindow) under the
cursor without the full cost of constantly tracking mouse motion
and working out what widget is under the cursor every time.</p>
<p>The old, classical X toolkits like <a href="https://en.wikipedia.org/wiki/X_Toolkit_Intrinsics">Xt</a> and <a href="https://en.wikipedia.org/wiki/X_Athena_Widgets">the
Athena widget set (Xaw)</a>
heavily used this 'tree of nested windows' approach, and you can
still see large window trees with '<code>xwininfo</code>' when you apply it
to old applications with lots of visible buttons; one example is
'xfontsel'. Even the venerable xterm normally contains a nested
window (for the scrollbar, which I believe it uses partly to
automatically change the X cursor when you move the mouse into the
scrollbar). However, this doesn't seem to be universal; when I look
at <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/ToolsXrun">one Xaw-based application</a> I have handy,
it doesn't seem to use subwindows despite having <a href="https://www.x.org/releases/current/doc/libXaw/libXaw.html#List_Widget">a list widget
of things to click on</a>.
Presumably in Xaw and perhaps Xt it depends on what sort of widget
you're using, with some widgets using sub-windows and some not.
<a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/ToolsPyhosts">Another program</a>, written using <a href="https://www.tcl.tk/">Tk</a>, does use subwindows for its buttons (with
them clearly visible in '<code>xwininfo -tree</code>').</p>
<p>This approach fell out of favour for various reasons, but certainly
one significant one is that it's strongly tied to <a href="https://utcc.utoronto.ca/~cks/space/blog/unix/XRenderingVsWaylandRendering">X's server side
rendering</a>. Because these subwindows
are 'on top of' their parent (sub)windows, they have to be rendered
individually; otherwise they'll cover what was rendered into the
parent (and naturally they clip what is rendered to them to their
visible boundaries). If you're sending rendering commands to the
server, this is just a matter of what windows they're for and what
coordinates you draw at, but if you render on the client, you have
to ship over a ton of little buffers (one for each sub-window)
instead of one big one for your whole window, and in fact you're
probably sending extra data (the parts of all of the parent windows
that gets covered up by child windows).</p>
<p>So in modern toolkits, the top level window and everything in it
is generally only one X window with no nested subwindows, and all
buttons and other UI elements are drawn by the client directly into
that window (usually with client side drawing). The client itself
tracks the mouse pointer and sends 'change the cursor to <X>' requests
to the server as the pointer moves in and out of UI elements that
should have different mouse cursors, and when it gets events, the
client searches its own widget hierarchy to decide what should handle
them (possibly including <a href="https://en.wikipedia.org/wiki/Client-side_decoration">client side window decorations (CSD)</a>).</p>
<p>(I think toolkits may create some invisible sub-windows for event
handling reasons. Gnome-terminal and other Gnome applications appear to
create a 1x1 sub-window, for example.)</p>
<p>As a side note, another place you can still find this many-window
style is in some old-fashioned X window managers, such as
<a href="https://fvwm.org/">fvwm</a>. When fvwm puts a frame around a
window (such as the ones visible on windows on <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/MyDesktopTour">my desktop</a>), the specific elements of the frame
(the title bar, any buttons in the title bar, the side and corner
drag-to-resize areas, and so on) are all separate X sub-windows. One
thing I believe this is used for is to automatically show an appropriate
mouse cursor when the mouse is over the right spot. For example, if
your mouse is in the right side 'grab to resize right' border, the mouse
cursor changes to show you this.</p>
<p>(The window managers for modern desktops, like Cinnamon, don't handle
their window manager decorations like this; they draw everything as
decorations and handle the 'widget' nature of title bar buttons and so
on internally.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/unix/XWindowsAllTheWayDown?showcomments#comments">3 comments</a>.) </div>A peculiarity of the X Window System: Windows all the way down2024-03-06T02:27:32Z2024-03-06T02:26:30Ztag:cspace@cks.mef.org,2009-03-24:/blog/unix/XServerBackingStoreOptionalcks<div class="wikitext"><p>In a comment on <a href="https://utcc.utoronto.ca/~cks/space/blog/unix/XRenderingVsWaylandRendering">yesterday's entry talking about X's server side
graphics rendering</a>, B.Preston mentioned
that another reason for this was to conserve memory. This is very
true. In general, X is extremely conservative about requiring
memory, sometimes to what we now consider extreme lengths, and there
are specific protocol features (or limitations) related to this.</p>
<p>The modern approach to multi-window graphics rendering is that each
window renders into a buffer that it owns (often with hardware
assistance) and then the server composites (appropriate parts of)
all of these buffers together to make up the visible screen. Often
this compositing is done in hardware, enabling you to spin a cube
of desktops and their windows around in real time. One of the things
that clients simply don't worry about (at least for their graphics)
is what happens when someone else's window is partially or completely
on top of their window. From the client's perspective, nothing
happens; they keep drawing into their buffer and their buffer is
just as it was before, and all of the occlusion and stacking and
so on are handled by the composition process.</p>
<p>(In this model, a client program's buffer doesn't normally get changed
or taken away behind the client's back, although the client may
flip between multiple buffers, only displaying one while completely
repainting another.)</p>
<p>The X protocol specifically does not require such memory-consuming
luxuries as a separate buffer for each window, and early X
implementations did not have them. An X server might have only one
significant-sized buffer, that being screen memory itself, and X
clients drew right on to their portion of the screen (by sending
the X server drawing commands, because they didn't have direct
access to screen memory). The X server would carefully clip client
draw operations to only touch the visible pixels of the client's
window. When you moved a window to be on top of part of another
window, the X server simply threw away (well, overwrote) the 'under'
portion of the other window. When the window on top was moved back
away again, the X server mostly dealt with this by sending your
client <a href="https://www.x.org/releases/X11R7.7/doc/xproto/x11protocol.html#events:Expose">a notification</a>
that parts of its window had become visible and the client should
repaint them.</p>
<p>(X was far from alone with this model, since at the time almost everyone
was facing similar or worse memory constraints.)</p>
<p>The problem with this 'damage and repaint' model is that it can be
janky; when a window is moved away, you get an ugly result until
the client has had the time to do a redraw, which may take a while.
So the X server had some additional protocol level features, called
'backing store' and 'save-under(s)'. If a given X server supported
these (and it didn't have to), the client could request (usually
during <a href="https://www.x.org/releases/X11R7.7/doc/xproto/x11protocol.html#requests:CreateWindow">window creation</a>)
that the server maintain a copy of the obscured bits of the new
window when it was covered by something else (<a href="https://tronche.com/gui/x/xlib/window/attributes/backing-store.html">'backing store'</a>) and
separately that when this window covered part of another window,
the obscured parts of that window should be saved (<a href="https://tronche.com/gui/x/xlib/window/attributes/save-under.html">'save-under'</a>, which
you might set for a transient pop-up window). Even if the server
supported these features in general it could specifically stop doing
them for you at any time it felt like it, and your client had to
cope.</p>
<p>(The X server can also give your window backing store whether or not you
asked for it, at its own discretion.)</p>
<p>All of this was to allow an X server to flexibly manage the amount
of memory it used on behalf of clients. If an X server had a lot
of memory, it could give everything backing store; if it started
running short, it could throw some or all of the backing store out
and reduce things down to (almost) a model where the major memory
use was the screen itself. Even today you can probably arrange to
start an X server in a mode where it doesn't have backing store
(via the '<code>-bs</code>' command line option, cf <a href="https://www.x.org/releases/X11R7.7/doc/man/man1/Xserver.1.xhtml">Xserver(1)</a>; you
can try this in Xnest or the like today, and there is also '<code>-wm</code>'). I have
a vague memory that back in the day there were serious arguments
about whether or not you should disable backing store in order to
speed up your X server, although I no longer have any memory about
why that would be so (<a href="https://rainbow.ldeo.columbia.edu/documentation/sgi-faq/graphics/66.html">but see</a>).</p>
<p>As far as I know all X servers normally operate with backing store
these days. I wouldn't be surprised if some modern X clients would
work rather badly if you ran them on an X server that had backing
store forced off (much as I suspect that few modern programs will
cope well with <a href="https://utcc.utoronto.ca/~cks/space/blog/unix/X11PseudocolorAndWMs">PseudoColor displays</a>).</p>
<p>PS: Now that I look at '<a href="https://www.x.org/releases/X11R7.7/doc/man/man1/xdpyinfo.1.xhtml"><code>xdpyinfo</code></a>', my
X server reports 'options: backing-store WHEN MAPPED, save-unders
NO'. I suspect that this is a common default, since you don't really
need save-unders if everything has backing store enabled when it's
visible (well, in X <a href="https://tronche.com/gui/x/xlib/window/map.html">mapped</a> is not quite
'visible', <a href="https://utcc.utoronto.ca/~cks/space/blog/unix/XOffscreenWindowsUse">cf</a>, but close enough).</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/unix/XServerBackingStoreOptional?showcomments#comments">One comment</a>.) </div>An illustration of how much X cares about memory usage2024-03-05T03:06:37Z2024-03-05T03:02:53Ztag:cspace@cks.mef.org,2009-03-24:/blog/unix/XRenderingVsWaylandRenderingcks<div class="wikitext"><p>Recently, Thomas Adam (of <a href="https://fvwm.org/">fvwm</a> fame) pointed
out on the FVWM mailing list (<a href="https://marc.info/?l=fvwm&m=170741836229965&w=2">here</a>, <a href="https://marc.info/?l=fvwm&m=170699825227476&w=2">also</a>) a difference
between X and Wayland that I'd been vaguely aware of before but
hadn't actually thought much about. Today I feel like writing it
down in my own words for various reasons.</p>
<p>X is a very old protocol (dating from the mid to late 1980s), and
one aspect of that is that it contains things that modern graphics
protocols don't. From a modern point of view, it isn't wrong to
describe X as several protocols in a trenchcoat. Two of the largest
such protocols are one for what you could call window management
(including event handling) and a second one for graphics rendering.
In the original vision of X, clients used the X server as their
rendering engine, sending a series of 2D graphics commands to the
server to draw things like lines, rectangles, arcs, and <a href="https://utcc.utoronto.ca/~cks/space/blog/unix/ModernXFontDrawback">text</a>. In the days of 10 Mbit/second local area
networks and also slow inter-process communication on your local
Unix machine, this was a relatively important part of both <a href="https://utcc.utoronto.ca/~cks/space/blog/unix/XNetworkTransparencyFailure">X's
network transparency story</a> and X's
performance in general. We can call this <em>server (side) rendering</em>.</p>
<p>(If you look at the X server drawing APIs, you may notice that they're
rather minimal and generally lack features that you'd like to do modern
graphics. Some of this was semi-fixed in X protocol extensions, but in
general the server side X rendering APIs are rather 1980s.)</p>
<p>However, X clients didn't have to do their rendering in the server.
Right from the beginning they could render to a bitmap on the client
side and then shove the bitmap over to the server somehow (the exact
mechanisms depend on what X extensions are available). Over time,
more and more clients started doing more and more <em>client (side)
rendering</em>, where they rendered everything under their own control
using their own code (well, realistically a library or a stack of
them, especially for complex things like <a href="https://utcc.utoronto.ca/~cks/space/blog/unix/ModernXFontDrawback">rendering fonts</a>). Today, many clients and many common client
libraries are entirely or almost entirely using client side rendering,
in part to get modern graphics features that people want, and these
days clients even do <a href="https://en.wikipedia.org/wiki/Client-side_decoration">client side (window) decoration (CSD)</a>, where they
draw 'standard' window buttons themselves.</p>
<p>(This tends to make window buttons not so standard any more,
especially across libraries and toolkits.)</p>
<p>As a protocol designed relatively recently, Wayland is not several
protocols in a trenchcoat. Instead, the (core) Wayland protocol is
only for window management (including event handling), and it has
no server side rendering. Wayland clients have to do client side
rendering in order to display anything, using whatever libraries
they find convenient for this. Of course this 'rendering' may be a
series of OpenGL commands that are drawn on to a buffer that's
shared with the Wayland server (what is called <em>direct rendering</em>
(<a href="https://wayland.freedesktop.org/docs/html/ch03.html">cf</a>), which
is also the common way to do client side rendering in X), but this
is in some sense a detail. Wayland clients can simply render to
bitmaps and then push those bitmaps to a server, and I believe this
is part of how <a href="https://gitlab.freedesktop.org/mstoeckl/waypipe">waypipe</a>
operates under the covers.</p>
<p>(Since Wayland was more or less targeted at environments with
toolkits that already had their own graphics rendering APIs and
were already generally doing client side rendering, this wasn't
seen as a drawback. My impression is that these non-X graphics APIs
were already in common use in many modern clients, since they include
things like <a href="https://www.cairographics.org/">Cairo</a>. One reason
that people switched to such libraries and their APIs even before
Wayland is that the X drawing APIs are, well, very 1980s, and don't
have a lot of features that modern graphics programming would like.
And you can draw directly to a Wayland buffer if you want to, <a href="https://bugaevc.gitbooks.io/writing-wayland-clients/content/beyond-the-black-square/drawing.html">cf
this example</a>.)</p>
<p>One implication of this is that some current X programs are much
easier to port (or migrate) to Wayland than others. The more an X
program uses server side X rendering, the more it can't simply be
re-targeted to Wayland, because it needs a client side library to
substitute for the X server side rendering functionality. Generally
such programs are either old or were deliberately written to be
minimal X clients that didn't depend on toolkits like Gtk or even
Cairo.</p>
<p>(Substituting in a stand alone client side drawing library is
probably not a small job, since I don't think any of them so far
are built to be API compatible with the relevant X APIs. It also
means taking on additional dependencies for your program, although
my impression is that some basic graphics libraries are essentially
standards by now.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/unix/XRenderingVsWaylandRendering?showcomments#comments">2 comments</a>.) </div>X graphics rendering as contrasted to Wayland rendering2024-03-04T03:56:36Z2024-03-04T03:56:12Ztag:cspace@cks.mef.org,2009-03-24:/blog/tech/ServerCPUDensityAndRAMLatencycks<div class="wikitext"><p>When I wrote about <a href="https://utcc.utoronto.ca/~cks/space/blog/tech/ServersSpeedOfChangeDown">how the speed of improvement in servers may
have slowed down</a>, I didn't address CPU
core counts, which is one area where the numbers have been going
up significantly. Of course you have to keep those cores busy, but
if you have a bunch of CPU-bound workloads, the increased core count
is good for you. Well, it's good for you if your workload is genuinely
CPU bound, which generally means it fits within per-core caches.
One of the areas I don't know much about is how the increasing CPU
core counts interact with RAM latency.</p>
<p>RAM latency (for random requests) has been relatively flat for a
while (it's been flat in time, which means that it's been going up
in cycles as CPUs got faster). Total memory access latency has
apparently been 90 to 100 nanoseconds for several memory generations
(although <a href="https://en.wikipedia.org/wiki/DDR5_SDRAM">individual DDR5 memory module access is apparently only
part of this</a>, <a href="https://www.crucial.com/articles/about-memory/everything-about-ddr5-ram">also</a>).
Memory bandwidth has been going up steadily between the DDR
generations, so per-core bandwidth has gone up nicely, but this is
only nice if you have the kind of sequential workloads that benefit
from it. As far as I know, the kind of random access that you get
from things like pointer chasing is all dependent on latency.</p>
<p>(If the total latency has been basically flat, this seems to imply
that bandwidth improvements don't help too much. Presumably they
help for successive non-random reads, and my vague impression is
that reading data from successive addresses from RAM is faster than
reading random addresses (and not just because RAM typically transfers
an entire cache line to the CPU at once).)</p>
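<p>(As a rough back-of-the-envelope illustration, and ignoring any
memory-level parallelism within a single core: at about 100 nanoseconds
per dependent random read, one pointer-chasing core can complete at
most around 10 million reads a second, which at 64 bytes per cache
line is only about 640 Mbytes/sec of memory traffic. A hypothetical
96-core server full of such workloads would need nearly a billion
independent reads a second to keep every core busy, and whether the
memory system can sustain that is exactly the part I don't know.)</p>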
<p>So now we get to the big question: how many memory reads can you
have in flight at once with modern DDR4 or DDR5 memory, especially
on servers? Where the limit is presumably matters, since if you have
a bunch of pointer-chasing workloads that are limited by 'memory
latency' and you run them on a high core count system, at some point
it seems that they'll run out of simultaneous RAM read capacity.
I've tried to do some reading and gotten confused, which may be
partly because modern DRAM is a pretty complex thing.</p>
<p>(I believe that individual processors and multi-socket systems have
some number of memory channels, each of which can be in action
simultaneously, and then there are <a href="https://en.wikipedia.org/wiki/Memory_rank">memory ranks</a> (<a href="https://www.crucial.com/support/articles-faq-memory/what-is-a-memory-rank">also</a>)
and <a href="https://en.wikipedia.org/wiki/Memory_bank">memory banks</a>. How
many memory channels you have depends partly on the processor you're
using (well, its memory controller) and partly on the motherboard
design. For example, 4th generation AMD Epyc processors apparently
support 12 memory channels, although not all of them may be populated
in a given memory configuration (<a href="https://www.phoronix.com/review/ddr5-epyc-9004-genoa">cf</a>). I think
you need at least N (or maybe 2N) DIMMs for N channels. And <a href="https://chipsandcheese.com/2022/11/08/amds-zen-4-part-2-memory-subsystem-and-conclusion/">here's
a look at AMD Zen4 memory stuff</a>,
which doesn't seem to say much on multi-core random access latency.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/tech/ServerCPUDensityAndRAMLatency?showcomments#comments">2 comments</a>.) </div>Something I don't know: How server core count interacts with RAM latency2024-03-03T03:55:27Z2024-03-03T03:54:58Ztag:cspace@cks.mef.org,2009-03-24:/blog/sysadmin/GrafanaMetricsNameChangeOptionscks<div class="wikitext"><p>In an ideal world, your metrics never change their names; once you
put them into a Grafana dashboard panel, they keep the same name
and meaning forever. In the real world, sometimes a change in metric
name is forced on you, for example because you might have to move
from collecting a metric through <a href="https://github.com/cloudflare/ebpf_exporter">one Prometheus exporter</a> to collecting it with
<a href="https://github.com/prometheus/node_exporter">another exporter</a>
which naturally gives it a different name. And sometimes a metric
will be renamed by its source.</p>
<p>In a <a href="https://prometheus.io/">Prometheus</a> environment, the very
brute force way to deal with this is either a recording rule (creating
a duplicate metric with the old name) or renaming the metric during
ingestion. However I feel that this is generally a mistake. Almost
always, your Prometheus metrics should record the true state of
affairs, warts and all, and it should be on other things to sort
out the results.</p>
<p>(As part of this, I feel that Prometheus metric names should always
be honest about where they come from. There's a convention that the
name of the exporter is at the start of the metric name, and so you
shouldn't generate your own metrics with someone else's name on them.
If a metric name starts with '<code>node_*</code>', it should come from <a href="https://github.com/prometheus/node_exporter">the
Prometheus host agent</a>.)</p>
<p>So if your Prometheus metrics get renamed, you need to fix this in
your Grafana panels (which can be a pain but is better in the long
run). There are at least three approaches I know of. First, you can
simply change the name of the metric in all of the panels. This
keeps things simple but means that your historical data stops being
visible on the dashboards. If you don't keep historical data for
very long (or don't care about it much), this is fine; pretty soon
the metric's new name will be the only one in your metrics database.
In our case, we keep years of data and do want to be able to look
back, so this isn't good enough.</p>
<p>The second option is to write your queries in Grafana as basically
'<code>old_name or new_name</code>'. If your queries involve rate() and avg()
and other functions, this can be a lot of (manual) repetition, but
if you're careful and lucky you can arrange for the old and the new
query results to have the same labels as Grafana sees them, so your
panel graphs will be continuous over the metrics name boundary.</p>
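<p>(As an illustration with entirely made-up metric names, such a query
might look like the following, possibly with some 'ignoring' or
label_replace() adjustments to make the old and new label sets line
up:)</p>
<blockquote><pre style="white-space: pre-wrap;">
rate(old_disk_reads_total[5m])
  or
rate(new_disk_reads_total[5m])
</pre>
</blockquote>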
<p>The third option is to duplicate the query and then change the name
of the metric (or the metrics) in the new copy of the query. This
is usually straightforward and easy, but it definitely gives you
graphs that aren't continuous around the name change boundary. The
graphs will have one line for the old metric and then a new second
line for your new metric. One advantage of separate queries is that
you can someday turn the old query off in Grafana without having
to delete it.</p>
</div>
Options for your Grafana panels when your metrics change names2024-03-02T04:34:04Z2024-03-02T04:33:03Ztag:cspace@cks.mef.org,2009-03-24:/blog/tech/ServersSpeedOfChangeDowncks<div class="wikitext"><p>One of the bits of technology news that I saw recently was that AWS
was changing how long it ran servers, from five years to six years.
Obviously one large motivation for this is that it will save Amazon
a nice chunk of money. However, I suspect that one enabling factor
for this is that old servers are more similar to new servers than
they used to be, as part of what could be called the great slowdown
in computer performance improvement.</p>
<p>New CPUs and to a lesser extent memory are somewhat better than
they used to be, both on an absolute measure and on a performance
per watt basis, but the changes aren't huge the way they used to
be. SATA SSD performance has been more or less stagnant for years;
NVMe performance has improved, but from a baseline that was already
very high, perhaps higher than many workloads could take advantage
of. Network speeds are potentially better but it's already hard to
truly take advantage of 10G speeds, especially with ordinary workloads
and software.</p>
<p>(I don't know if SAS SSD bandwidth and performance has improved,
although raw SAS bandwidth has and is above what SATA can provide.)</p>
<p>For both AWS and people running physical servers (like <a href="https://support.cs.toronto.edu/">us</a>) there's also the question of how
many people need faster CPUs and more memory, and related to that,
how much they're willing to pay for them. It's long been observed
that a lot of what people run on servers is not a voracious consumer
of CPU and memory (and IO bandwidth). If your VPS runs at 5% or 10%
CPU load most of the time, you're probably not very enthused about
paying more for a VPS with a faster CPU that will run at 2.5% almost
all of the time.</p>
<p>(Now that I've written this it strikes me that this is one possible
motivation for cloud providers to push 'function as a service'
computing, because it potentially allows them to use those faster
CPUs more effectively. If they're renting you CPU by the second and
only when you use it, faster CPUs likely mean more people can be
packed on to the same number of CPUs and machines.)</p>
<p><a href="https://support.cs.toronto.edu/">We</a> have a few uses for very
fast single-core CPU performance, but other than those cases (and
<a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/SlurmHowWeUseIt">our compute cluster</a>) it's hard to
identify machines that could make much use of faster CPUs than they
already have. It would be nice if <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSFileserverSetupIII">our fileservers</a> had <a href="https://utcc.utoronto.ca/~cks/space/blog/tech/ServerNVMeU2U3AndOthers2022">U.2 NVMe drives</a> instead of SATA SSDs but I'm not sure
we'd really notice; the fileservers only rarely see high IO loads.</p>
<p>PS: It's possible that I've missed important improvements here
because I'm not all that tuned in to this stuff. One possible area
is PCIe lanes directly supported by the system's CPU(s), which
enable all of those fast NVMe drives, multiple 10G or faster network
connections, and so on.</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/tech/ServersSpeedOfChangeDown?showcomments#comments">2 comments</a>.) </div>The speed of improvement in servers may have slowed down2024-03-01T03:44:15Z2024-03-01T03:43:13Ztag:cspace@cks.mef.org,2009-03-24:/blog/sysadmin/PrometheusAbsentMetricsAndLabelscks<div class="wikitext"><p>When you have <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusGrafanaSetup-2019">a Prometheus setup</a>,
one of the things you sooner or later worry about is important
metrics quietly going missing because they're not being reported
any more. There can be many reasons for metrics disappearing on
you; for example, <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusCheckingNetworkInterfaces">a network interface you expect to be at 10G
speeds</a> may not be there at
all any more, because it got renamed at some point, so now you're
not making sure the new name is at 10G.</p>
<p>(This happened to us with one machine's network interface, although
I'm not sure exactly how except that it involves the depths of
PCIe enumeration.)</p>
<p>The standard Prometheus feature for this is the '<a href="https://prometheus.io/docs/prometheus/latest/querying/functions/#absent"><code>absent()</code></a>'
function, or sometimes <a href="https://prometheus.io/docs/prometheus/latest/querying/functions/#absent_over_time"><code>absent_over_time()</code></a>.
However, both of these have the problem that because of Prometheus's
data model, you need to know at least some unique labels that your
metrics are supposed to have. Without labels, all you can detect
is the total disappearance of the metric, when nothing at all
is reporting it. If you want to be alerted when some machine
stops reporting a metric, you need to list all of the sources that
should have the metric (<a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusCheckAFewMetrics">following a pattern we've seen before</a>):</p>
<blockquote><pre style="white-space: pre-wrap;">
absent(metric{host="a", device="em0"}) or
absent(metric{host="b", device="eno1"}) or
absent(metric{host="c", device="eth2"})
</pre>
</blockquote>
<p>Sometimes you don't know all of the label values that your metric
will be present with (or it's tedious to list all of them and keep them
up to date), and it's good enough to get a notification if a metric
disappears when it was previously there (for a particular set of
labels). For example, you might have an assortment of scripts that
report their success results somewhere and you don't want to have
to keep a list of all of the scripts, but you do want to detect
when a script stops reporting its metrics. In this case we can use
'<a href="https://prometheus.io/docs/prometheus/latest/querying/basics/#offset-modifier"><code>offset</code></a>'
to check current metrics against old metrics. The simplest pattern
is:</p>
<blockquote><pre style="white-space: pre-wrap;">
your_metric offset 1h
unless your_metric
</pre>
</blockquote>
<p>If the metric was there an hour ago and isn't there now, this will
generate the metric as it was an hour ago (with the labels it had
then), and you can use that to drive an alert (or at least <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusDoingRebootAlerts">a
notification</a>). If there are labels
that might naturally change over time in your_metric, you can
exclude them with 'unless ignoring (...)' or use 'unless on (...)'
for a very focused result.</p>
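<p>(For example, if your_metric carried a hypothetical '<code>version</code>'
label that naturally changes over time, the check might be written
as:)</p>
<blockquote><pre style="white-space: pre-wrap;">
your_metric offset 1h
  unless ignoring (version)
your_metric
</pre>
</blockquote>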
<p>As written this has the drawback that it only looks at what versions
of the metric were there exactly an hour ago. We can do better by
using an *_over_time() function, for example:</p>
<blockquote><pre style="white-space: pre-wrap;">
max_over_time( your_metric[4h] ) offset 1h
unless your_metric
</pre>
</blockquote>
<p>Now if your metric existed (with some labels) at any point between
five hours ago and one hour ago, and doesn't exist now, this
expression will give you a result and you can alert on that. Since
we're using *_over_time(), you can also leave off the 'offset
1h' and just extend the time range, and then maybe extend the other
time range too:</p>
<blockquote><pre style="white-space: pre-wrap;">
max_over_time( your_metric[12h] )
unless max_over_time( your_metric[20m] )
</pre>
</blockquote>
<p>This expression will give you a result if your_metric has been
present (with a given set of labels) at some point in the last 12
hours but has not been present within the last 20 minutes.</p>
<p>(You'd pick the particular *_over_time() function to use depending
on what use, if any, you have for the value of the metric in your
alert. If you have no particular use for the value (or you expect
the value to be a constant), either max or min are efficient for
Prometheus to compute.)</p>
<p>All of these clever versions have a drawback, which is that after
enough time has gone by they shut off on their own. Once the metric
has been missing for at least an hour or five hours or 12 hours or
however long, even the first part of the expression has nothing and
you get no results and no alert. So this is more of a 'notification'
than a persistent 'alert'. That's unfortunately the best you can
really do. If you need a persistent alert that will last until you
take it out of your alert rules, you need to use absent() and
explicitly specify the labels you expect and require.</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusAbsentMetricsAndLabels?showcomments#comments">One comment</a>.) </div>Detecting absent Prometheus metrics without knowing their labels2024-03-02T17:03:11Z2024-02-29T03:18:09Ztag:cspace@cks.mef.org,2009-03-24:/blog/sysadmin/WhyNoMachineInventorycks<div class="wikitext"><p>As part of <a href="https://mastodon.social/@cks/111921187565748942">thinking about how we configure machines to monitor
and what to monitor on them</a>, I mentioned in
passing that we don't generate this information from some central
machine inventory because <a href="https://mastodon.social/@cks/111921646723702160">we don't have a single source of truth
for a machine inventory</a>.
This isn't to say that we don't have any inventory of our machines;
instead, the problem is that we have too many inventories, each
serving somewhat different purposes.</p>
<p>The core reason that we have wound up with many different lists of
machines is that we use many different tools and systems that need
to have lists of machines and each of them has a different input
format and input sources. It's technically possible to generate all
of these different lists of machines for different programs and
tools from some single master source, but by and large you get to
build, manage, and maintain both the software for the master source and
the software to extract and reformat all of the machine lists for
the various programs that need them. In many cases (certainly in
ours), this adds extra work over just maintaining N lists of machines
for N programs and subsystems.</p>
<p>(It also generally means maintaining a bespoke custom system for
your environment, which is a constant ongoing expense in various
ways.)</p>
<p>So we have all sorts of lists of machines, for a broad view of
what a machine is. Here's an incomplete list:</p>
<ul><li>DNS entries (all of our servers have static IPs), but not all DNS
entries still exist as hardware, much less hardware that is turned
on. In addition, we have DNS entries for various IP aliases and other
things that aren't unique machines.<p>
(We'd have more confusion if we used virtual machines, but all of
our production machines are on physical hardware.)<p>
</li>
<li>NFS export permissions for hosts that can do NFS mounts from <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSFileserverSetupIII">our
fileservers</a>, but not all of our
active machines can do this and there are some listed host names
that are no longer turned on or perhaps no longer even in DNS.<p>
(NFS export permissions aren't uniform between hosts; some have
extra privileges.)<p>
</li>
<li>Hosts that we have established SSH host keys for. This includes
hosts that aren't currently in service and may never be in service
again.<p>
</li>
<li>Ubuntu machines that are updated by <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/UbuntuOurUpdateSystem">our bulk updates system</a>, which is driven by another 'list
of machines' file that is also used for some other bulk operations.
But this data file omits various machines we don't manage that
way (or at best only belatedly includes them), and while it tracks
some machine characteristics it doesn't have all of them.<p>
(And sometimes we forget to add machines to this data file, which
we at least get a notification about. Well, for Ubuntu machines.)<p>
</li>
<li>Unix machines that we monitor in various ways in <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusGrafanaSetup-2019">our Prometheus
system</a>. These machines may be ping'd,
have their SSH port checked to see if it answers, run the Prometheus
host agent, and run additional agents to export things like GPU
metrics, depending on what the machine is.<p>
Not all turned-on machines are monitored by Prometheus for various
reasons, including that they are test or experimental machines.
And temporarily turned off machines tend to be temporarily removed
to reduce alert and dashboard noise.<p>
</li>
<li>Our <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/ConsoleServerSetup">console server</a> has a whole configuration
file of what machines have a serial console and how they're configured
and connected up. Turned-off machines that are still connected to the
console server remain in this configuration file, and they can then
linger even after being de-cabled.</li>
<li>We mostly use 'smart'
<a href="https://en.wikipedia.org/wiki/Power_distribution_unit">PDUs</a> that
can selectively turn outlets off, which means that we track what
machine is on what PDU port. This is tracked both in a master file
and in the PDU configurations (they have menus that give text labels
to ports).<p>
</li>
<li>A 'server inventory' of where servers are physically located
and other basic information about the server hardware, generally
including a serial number. Not all racked physical servers are
powered on, and not all powered on servers are in production.</li>
<li>Some degree of network maps, to track what servers are connected
to what switches for troubleshooting purposes.<p>
</li>
<li>Various forms of server purchase records with details about the
physical hardware, including serial numbers, which <a href="https://utcc.utoronto.ca/~cks/space/blog/tech/UniversityDisposalProblem">we have to
keep in order to be able to get rid of the hardware later</a>. This doesn't include the
current host name (if any) that the hardware is currently being
used for, or where the hardware is (currently) located.</li>
</ul>
<p>If we assigned IPs to servers through DHCP, we'd also have DHCP
configuration files. These would have to track servers by another
identity, their Ethernet address, which would in turn depend on
what networking the server was using. If we switched a server from
1G networking to 10G networking by putting a 10G card in it, we'd
have to change the DHCP MAC information for the server but nothing
else about it would change.</p>
<p>There's also confusion over what exactly 'a machine' is, partly
because different pieces care about different aspects. We assign
DNS host names to roles, not to physical hardware, but the role is
implemented in some chunk of physical hardware and sometimes the
details of that hardware matter. This leads to more potential
confusion in physical hardware inventories, because sometimes we
want to track that a particular piece of hardware was 'the old <X>'
in case we have to fall back to that older OS for some reason.</p>
<p>(And sometimes we have pre-racked spare hardware for some important
role and so what hardware is live in that role and what is the spare
can swap around.)</p>
<p>We could put all of this information in a single database (probably in
multiple tables) and then try to derive all of the various configuration
files from it. But it clearly wouldn't be simple (and some of it would
always have to be manually maintained, such as the physical location of
hardware). If there is off the shelf open source software that will do
a good job of handling this, it's quite likely that setting it up (and
setting up our inventory schema) would be fairly complex.</p>
<p>Instead, the natural thing to do in our environment when you need a new
list of machines for some purpose (for example, when you're setting
up a new monitoring system) is to set up a new configuration file for
it, possibly deriving the list of machines from another, existing
source. This is especially natural if the tool you're working with
already has its own configuration file format.</p>
<p>(If our lists of machines had to change a lot it might be tempting
to automatically derive some of the configuration files from
'upstream' data. But generally they don't, which means that manual
handling is less work because you don't have to build an entire
system to handle errors, special exceptions, and so on.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/WhyNoMachineInventory?showcomments#comments">One comment</a>.) </div>Our probably-typical (lack of) machine inventory situation2024-02-28T04:03:34Z2024-02-28T04:03:18Ztag:cspace@cks.mef.org,2009-03-24:/blog/programming/EmacsMetaXRelevantCommandscks<div class="wikitext"><p>Today I learned about <a href="https://svbck.org/blog/2024-02-24-emacs-find-of-the-day-m-x.html">the M-X command (well, key binding)</a> (<a href="https://fosstodon.org/@svbck/111987384357353958">via</a>), which "[queries
the] user for a command relevant to the current mode, and then
execute it". In other words it's like M-x but it restricts what
commands it offers to relevant ones. What is 'relevant' here? To
quote the docstring:</p>
<blockquote><p>[...] This includes commands that have been marked as being specially
designed for the current major mode (and enabled minor modes), as well
as commands bound in the active local key maps.</p>
</blockquote>
<p>If you're someone like me who has written some Lisp commands to
customize your experience in a major mode like <a href="https://www.gnu.org/software/emacs/manual/html_node/mh-e/index.html">MH-E</a>, you
might wonder how you mark your personal Lisp commands as 'specially
designed' for the relevant major mode.</p>
<p>In modern Emacs, the answer is that this is an extended part of
'<code>(interactive ...)</code>', the normal Lisp form you use to mark your
Lisp functions as commands (things which will be offered in M-x and
can be run interactively). As mentioned in the Emacs Lisp manual
section <a href="https://www.gnu.org/software/emacs/manual/html_node/elisp/Using-Interactive.html">Using <code>interactive</code></a>,
'<code>interactive</code>' takes additional arguments to label what modes your
command is 'specially designed' for; more discussion is in <a href="https://www.gnu.org/software/emacs/manual/html_node/elisp/Command-Modes.html">Specifying
Modes For Commands</a>.
The basic usage is, say, '<code>(interactive "P" mh-folder-mode)</code>'.</p>
<p>If your commands already take arguments, life is simple and you can
just put the modes on the end. But not all commands do (especially
for quick little things you do for yourself). If you have just
'<code>(interactive)</code>', the correct change is to make it '<code>(interactive
nil mh-folder-mode)</code>'; a nil first argument is how you tell
<code>interactive</code> that there is no argument.</p>
<p>(Don't make my initial mistake and assume that '<code>(interactive ""
mh-folder-mode)</code>' will work. That produced a variety of undesirable
results.)</p>
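<p>Putting this together, a minimal sketch of a personal mode-specific
command might look like the following (the command name and body here
are made up for illustration):</p>
<blockquote><pre style="white-space: pre-wrap;">
(defun my-mh-show-folder-buffer ()
  "Report the current buffer; offered by M-X in MH folder buffers."
  (interactive nil mh-folder-mode)
  (message "Current MH folder buffer: %s" (buffer-name)))
</pre>
</blockquote>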
<p>Is it useful to do this, assuming you have personal commands that
are truly specific to a given mode (as I do for commands that operate
on MH messages and the MH folder display)? My views so far are a
decided maybe in my environment.</p>
<p>First, you don't need to do this if your commands have keybindings
in your major mode, because M-X (execute-extended-command-for-buffer)
will already offer any commands that have keybindings. Second, <a href="https://utcc.utoronto.ca/~cks/space/blog/programming/EmacsPackages-2023-11">my
assortment of packages</a> already gives me
quite a lot of selection power to narrow in on likely commands in
plain M-x, provided that I've named them sensibly. The combination
of <a href="https://github.com/minad/vertico">vertico</a>, <a href="https://github.com/minad/marginalia">marginalia</a>, and <a href="https://github.com/oantolin/orderless">orderless</a> let me search for commands
by substrings, easily see a number of my options, and also see part
of their descriptions. So if I know I want something to do with MH
forwarding I can type 'M-x mh forw' and get, among other things,
my function for <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/EmailThreeForwardingFormats">forwarding in 'literal plaintext' format</a>.</p>
<p>With that said, adding the mode to '(interactive)' isn't much work
and it does sort of add some documentation about your intentions
that your future self may find useful. And if you want a more minimal
<a href="https://utcc.utoronto.ca/~cks/space/blog/programming/EmacsUnderstandingCompletion">minibuffer completion</a> experience,
it may be more useful to have a good way to winnow down the selection.
If you use M-X frequently and you have commands you want to be able
to select in it in applicable modes without having them bound to
keys, you really have no choice.</p>
</div>
How to make your GNU Emacs commands 'relevant' for M-X2024-02-27T03:12:26Z2024-02-27T03:11:26Ztag:cspace@cks.mef.org,2009-03-24:/blog/tech/OpenSourceCultureAndPublicWorkcks<div class="wikitext"><p>A while back I wrote about how <a href="https://utcc.utoronto.ca/~cks/space/blog/tech/RequirementToScaleYourWork">doing work that scales requires
being able to scale your work</a>, which
in the open source world requires time, energy, and the willingness
to engage in the public sphere of open source regardless of the
other people there and your reception. Not everyone has this sort
of time and energy, and not everyone gets a positive reception by
open source projects even if they have it.</p>
<p>This view runs deep in open source culture, which valorizes public
work even at the cost of stress and time. Open source culture on
the one hand tacitly assumes that everyone has those available, and
on the other hand assumes that if you don't do public work (for
whatever reason) that you are less virtuous or not virtuous at all.
To be a virtuous person in open source is to contribute publicly
at the cost of your time, energy, stress, and perhaps money, and
to not do so is to not be virtuous (sometimes this is phrased as
'not being dedicated enough').</p>
<p>(Often the most virtuous public contribution is 'code', so people who
don't program are already intrinsically not entirely virtuous and lesser
no matter what they do.)</p>
<p>Open source culture has some reason to praise and value 'doing work
that scales', public work; if this work does not get done, nothing
happens. But it also has a tendency to demand that everyone do it and
to judge them harshly when they don't. This is the meta-cultural issue
behind things like <a href="https://utcc.utoronto.ca/~cks/space/blog/tech/BugReportExperienceObligation">the cultural expectations that people will file bug
reports</a>, often no matter what the bug
reporting environment is like or if filing bug reports does any good
(<a href="https://utcc.utoronto.ca/~cks/space/blog/tech/BugReportBenefit">cf</a>).</p>
<p>I feel that this view is dangerous for various reasons, including
because it blinds people to other explanations for a lack of public
contributions. If you can say 'people are not contributing because
they're not virtuous' (or not dedicated, or not serious), then you
don't have to take a cold, hard look at what else might be getting
in the way of contributions. Sometimes such a cold hard look might
turn up rather uncomfortable things to think about.</p>
<p>(Not every project wants or can handle contributions, because they
generally require work from existing project members. But not all
such projects will admit up front in the open that they either don't
want contributions at all or they gatekeep contributions heavily
to reduce time burdens on existing project members. And part of
that is probably because openly refusing contributions is in itself
often seen as 'non-virtuous' in open source culture.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/tech/OpenSourceCultureAndPublicWork?showcomments#comments">One comment</a>.) </div>Open source culture and the valorization of public work2024-02-26T21:43:52Z2024-02-26T04:21:12Ztag:cspace@cks.mef.org,2009-03-24:/blog/programming/GoRangefuncAndUserContainerscks<div class="wikitext"><p>In Go 1.22, the Go developers have made available a "range over
function" experiment, as described in <a href="https://go.dev/wiki/RangefuncExperiment">the Go Wiki's "Rangefunc
Experiment"</a>. Recently I
read a criticism of this, Richard Ulmer's <a href="https://rulmer.xyz/article/Questioning_Gos_range-over-func_Proposal.html">Questioning Go's
range-over-func Proposal</a>
(<a href="https://lobste.rs/s/e2aw1k/questioning_go_s_range_over_func_proposal">via</a>).
As I read Ulmer's article, it questions the utility of the range
over func (proposed) feature on the grounds that this isn't
a significant enough improvement in standard library functions like
strings.Split (which is given as an example in the "more motivation"
section of <a href="https://go.dev/wiki/RangefuncExperiment">the wiki article</a>).</p>
<p>I'm not unsympathetic to this criticism, especially when it concerns
standard library functionality. If the Go developers want to extend
various parts of the standard library to support streaming their
results instead of providing the results all at once, then there
may well be better, lower-impact ways of doing so, such as developing
a standard API approach or set of approaches for this and then using
this to add new APIs. However, I think that extending the standard
library into streaming APIs is by far the less important side of
the "range over func" proposal (although this is what the "more
motivation" section of <a href="https://go.dev/wiki/RangefuncExperiment">the wiki article</a> devotes the most space
to).</p>
<p>Right from the beginning, one of the criticisms of Go was that it
had some privileged, complex builtin types that couldn't be built
using normal Go facilities, such as maps. Generics have made it
mostly possible to do equivalents of these (generic) types yourself
at the language level (although the Go compiler still uniquely
privileges maps and other builtin types at the implementation level).
However, these complex builtin types still retain some important
special privileges in the language, and one of them is that they
were the only types that you could write convenient 'range' based
for loops over.</p>
<p>In Go today you can write, for example, a set type or a key/value
type with some complex internal storage implementation and make it
work even for user-provided element types (through generics). But
people using your new container types cannot write 'for elem :=
range set' or 'for k, v := range kvstore'. The best you can give
them is an explicit push or pull based iterator based on your type
(in a push iterator, you provide a callback function that is given
each value; in a pull iterator, you repeatedly call some function
to obtain the next value). The "range over func" proposal bridges
this divide, allowing non-builtin types to be ranged over almost
as easily as builtin types. You would be able to write types that
let people write 'for elem := range set.Forward()' or 'for k, v :=
range kvstore.Walk()'.</p>
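<p>(As a hedged sketch of what this could look like with the Go 1.22
experiment enabled (built with GOEXPERIMENT=rangefunc), using a made-up
generic set type:)</p>
<blockquote><pre style="white-space: pre-wrap;">
// A tiny generic set with a 'range'-able iterator.
type Set[E comparable] struct {
	m map[E]struct{}
}

// Forward returns a push-style iterator function. Under the
// rangefunc experiment, 'range' calls it with a yield callback
// and iteration stops cleanly when the loop body breaks (that
// is, when yield returns false).
func (s *Set[E]) Forward() func(yield func(E) bool) {
	return func(yield func(E) bool) {
		for e := range s.m {
			if !yield(e) {
				return
			}
		}
	}
}

// Which lets callers write:
//	for elem := range set.Forward() {
//		fmt.Println(elem)
//	}
</pre>
</blockquote>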
<p>This is an issue that can't really be solved without language
support. You could define a standard API for iterators and iteration
(and the 'iter' package covered in <a href="https://go.dev/wiki/RangefuncExperiment">the wiki article</a> sort of is
that), but it would still be more code and somewhat awkward code
for people using your types to write. People are significantly
attracted to what is easy to program; the more difficult it is to
iterate user types compared to builtin types, the less people will
do it (and the more they will use builtin types even when they
aren't a good fit). If Go wants to put user (generic) types on
almost the same level (in the language) as builtin types, then I
feel it needs some version of a "range over func" approach.</p>
<p>(Of course, you may feel that Go should not prioritize putting user
types on almost the same level as builtin types.)</p>
</div>
The Go 'range over functions' proposal and user-written container types2024-02-26T21:43:52Z2024-02-25T03:30:08Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/DnfFixingStuckUpdateinfocks<div class="wikitext"><p>I apply Fedora updates only by hand, and as part of this I like to
look at what '<code>dnf updateinfo info</code>' will tell me about why they're
being done. For some time, there's been an issue on my work desktop
where 'dnf updateinfo info' would report on updates that I'd already
applied, often drowning out information about the updates that I
hadn't. This was a bit frustrating, because my home Fedora machine
didn't do this but I couldn't spot anything obviously wrong (and at
various times I'd cleaned all of the DNF caches that I could find).</p>
<p>(Now that I look, it seems <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/FedoraNotReadingUpdateinfo">I've been having some variant of this
problem for a while</a>.)</p>
<p>Recently I took another shot at troubleshooting this. In <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/OperatorsAndSystemProgrammers">the
system programmer way</a>,
I started by locating the Python source code of the DNF updateinfo
subcommand and reading it. This showed me a bunch of subcommand
specific options that I could have discovered by reading 'dnf
updateinfo --help' and led me to find 'dnf updateinfo list', which
lists which RPM (or RPMs) a particular update will update. When I
used 'dnf updateinfo list' and looked at the list of RPMs, something
immediately jumped out at me, and it turned out to be the cause.</p>
<p><strong>My 'dnf updateinfo info' problems were because I had old Fedora 37
'debugsource' RPMs still installed</strong> (on a machine now running Fedora
39).</p>
<p>The '-debugsource' and '-debuginfo' RPMs for a given RPM contain
symbol information and then source code that is used to allow better
debugging (see <a href="https://docs.fedoraproject.org/en-US/packaging-guidelines/Debuginfo/">Debuginfo packages</a> and
<a href="https://fedoraproject.org/wiki/Changes/SubpackageAndSourceDebuginfo">this change to create debugsource as well</a>). I
tend to wind up installing them if I'm trying to debug a crash in
some standard packaged program, or sometimes code that heavily uses
system libraries. Possibly these packages get automatically cleaned
up if you update Fedora releases in <a href="https://docs.fedoraproject.org/en-US/quick-docs/upgrading-fedora-new-release/">one of the officially supported
ways</a>,
but I do a live upgrade using DNF (following <a href="https://docs.fedoraproject.org/en-US/quick-docs/upgrading-fedora-online/">this Fedora documentation</a>).
Clearly, when I do such an upgrade, these packages are not removed
or updated.</p>
<p>(It's possible that these packages are also not removed or updated
within a specific Fedora release when you update their base packages,
but since they were installed a long time ago I can't tell at this
point.)</p>
<p>With these old debugsource packages hanging around, DNF appears to
have reasonably seen more recent versions of them available and
duly reported the information on the 'upgrade' (in practice the
current version of the package) in 'dnf updateinfo info' when I
asked for it. That the packages would not be updated if I did a
'dnf update' was not updateinfo's problem. Removing the debugsource
packages eliminated this and now 'dnf updateinfo info' is properly
only reporting actual pending updates.</p>
<p>('dnf updateinfo' has various options for what packages to select,
but as covered in <a href="https://dnf.readthedocs.io/en/latest/command_ref.html#updateinfo-command-label">the updateinfo command documentation</a>
apparently they're mostly the same in practice.)</p>
<p>In the future I'm going to have to remember to remove all debugsource
and debuginfo packages before upgrading Fedora releases. Possibly
I should remove them after I'm done with whatever I installed them
for. If I needed them again (in that Fedora release) I'd have to
re-fetch them, but that's rare.</p>
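<p>(Concretely, something along these lines should find and then
remove them, although as usual it's wise to check the list output
before letting DNF remove anything:)</p>
<blockquote><pre style="white-space: pre-wrap;">
dnf list --installed '*-debuginfo' '*-debugsource'
dnf remove '*-debuginfo' '*-debugsource'
</pre>
</blockquote>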
<p>PS: In reading the documentation, I've discovered that it's really
'<code>dnf updateinfo --info</code>'; updateinfo just accepts 'info' (and
'list') as equivalent to the switches.</p>
<p>(This elaborates on <a href="https://mastodon.social/@cks/111967593217874645">a Fediverse post I made at the time</a>.)</p>
</div>
Fixing my problem of a stuck '<code>dnf updateinfo info</code>' on Fedora Linux2024-02-26T21:43:53Z2024-02-24T03:10:47Ztag:cspace@cks.mef.org,2009-03-24:/blog/sysadmin/SSHBruteForceAttacksAbruptlyDowncks<div class="wikitext"><p>It's general wisdom in the sysadmin community that if you expose a
SSH port to the Internet, people will show up to poke at it, and
by 'people' I mean 'attackers that are probably mostly automated'.
For several years, the pattern to this that I've noticed was an
apparent combination of two activities. There was a constant
background pitter-patter of various IPs each making a probe once a
minute or less (but for tens of minutes or longer), and then periodic
bursts where a single IP would be more active, sometimes significantly
so.</p>
<p>(Although I can't be sure, I think the rate of both the background
probes and the periodic bursts was significantly up compared to
<a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/SSHBruteForceAttacksNoMoreHere">how it was a couple of years ago</a>.
Unfortunately <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/GrafanaLokiStartupWALReplayIssue">making direct comparisons is a bit difficult due to
Grafana Loki issues</a>.)</p>
<p>Then there came this past Tuesday, and <a href="https://mastodon.social/@cks/111966826624626573">I noticed something that
I reported on the Fediverse</a>:</p>
<blockquote><p>This is my system administrator's "what is wrong" face when Internet
ssh authentication probes against our systems seem to have fallen off
a cliff, as reported by system logs. We shouldn't be seeing only <em>two</em>
in the last hour.</p>
<p>(The nose dive seems to have started at 6:30 am Eastern and hit
'basically nothing' by 9:30 am.)</p>
</blockquote>
<p>After looking at this longer, the pattern I'm now seeing on <a href="https://support.cs.toronto.edu/">our
systems</a> is basically that the
background low-volume probes seem to have gone away. Every so often
some attacker will fire up a serious bulk probe, making (for example)
400 attempts over a half an hour (often for a random assortment of
nonexistent logins); rarely there will be a burst where a dozen IPs
will each make an attempt or two and then stop (there are some signs
that a lot of the IPs are Tor exit nodes). But for a lot of the
time, there's nothing. We can go an hour or three with absolutely
no probes at all, which never used to happen; previously a typical
baseline rate of probes was around a hundred an hour.</p>
<p>Since the higher-rate SSH probes get through fine, this doesn't
seem to be anything in our firewalls or local configurations (I
initially wondered about things like a change in logging that came
in with an Ubuntu package update). Instead it seems to be a change
in attacker behavior, and since it took about two hours to take
full effect on Tuesday morning, I wonder if it was something getting
progressively shut down or reoriented.</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/SSHBruteForceAttacksAbruptlyDown?showcomments#comments">3 comments</a>.) </div>A recent abrupt change in Internet SSH brute force attacks against us2024-02-26T21:43:53Z2024-02-23T04:00:51Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/ZFSGlobalZILInformationcks<div class="wikitext"><p>The <a href="https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSTXGsAndZILs">ZFS Intent Log (ZIL)</a> is effectively
ZFS's version of a filesystem journal, writing out hopefully brief
records of filesystem activity to make them durable on disk before
their full version is committed to the ZFS pool. What the ZIL is
doing and how it's performing can be important for the latency (and
thus responsiveness) of various operations on a ZFS filesystem,
since operations like <code>fsync()</code> on an important file must wait for
the ZIL to write out (<em>commit</em>) their information before they can
return from the kernel. On Linux, <a href="https://openzfs.org/">OpenZFS</a>
exposes global information about the ZIL in <code>/proc/spl/kstat/zfs/zil</code>,
but this information can be hard to interpret without some knowledge
of ZIL internals.</p>
<p>(In OpenZFS 2.2 and later, each dataset also has per-dataset ZIL
information in its kstat file, /proc/spl/kstat/zfs/<pool>/objset-0xXXX,
for some hexadecimal '0xXXX'. There's no overall per-pool ZIL information
the way there is a global one, but for most purposes you can sum up the
ZIL information from all of the pool's datasets.)</p>
<p>The basic background here is <a href="https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSZILActivityFlow">the flow of activity in the ZIL</a> and also the comments in <a href="https://github.com/openzfs/zfs/blob/master/include/sys/zil.h">zil.h</a> about
the members of the <code>zil_stats</code> struct.</p>
<p>The (ZIL) data you can find in the "<code>zil</code>" file (and the per-dataset
kstats in OpenZFS 2.2 and later) is as follows:</p>
<ul><li><code>zil_commit_count</code> counts how many times a ZIL commit has been
requested through things like <code>fsync()</code>.</li>
<li><code>zil_commit_writer_count</code> counts how many times the ZIL has actually
committed. More than one commit request can be merged into the same ZIL
commit, if two people <code>fsync()</code> more or less at the same time.<p>
</li>
<li><code>zil_itx_count</code> counts how many <em>intent transactions</em> (itxs) have
been written as part of ZIL commits. Each separate operation (such
as a <code>write()</code> or a file rename) gets its own separate transaction;
these are aggregated together into <em>log write blocks</em> (lwbs) when
a ZIL commit happens.</li>
</ul>
<p>When ZFS needs to record file data into the ZIL, it has three options,
which it calls '<code>indirect</code>', '<code>copied</code>', and '<code>needcopy</code>' in ZIL
metrics. Large enough amounts of file data are handled with an
<em>indirect</em> write, <a href="https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSWritesAndZIL">which writes the data to its final location in the
regular pool</a>; the ZIL transaction only
records its location, hence 'indirect'. In a <em>copied</em> write, the data
is directly and immediately put in the ZIL transaction (itx), even
before it's part of a ZIL commit; this is done if ZFS knows that the
data is being written synchronously and it's not large enough to trigger
an indirect write. In a <em>needcopy</em> write, the data just hangs around in
RAM as part of ZFS's regular dirty data, and if a ZIL commit happens
that needs that data, the process of adding its itx to the log write
block will fetch the data from RAM and add it to the itx (or at least
the lwb).</p>
<p>There are ZIL metrics about this:</p>
<ul><li><code>zil_itx_indirect_count</code> and <code>zil_itx_indirect_bytes</code>
count how many indirect writes have been part of ZIL commits, and the
total size of the indirect writes of file data (not of the 'itx' records
themselves, per the comments in <a href="https://github.com/openzfs/zfs/blob/master/include/sys/zil.h">zil.h</a>).<p>
Since these are indirect writes, the data written is not part of
the ZIL (it's regular data blocks), although it is put on disk
as part of a ZIL commit. However, unlike other ZIL data, the data
written here would have been written even without a ZIL commit,
as part of ZFS's regular transaction group commit process. A ZIL
commit merely writes it out earlier than it otherwise would have
been.<p>
</li>
<li><code>zil_itx_copied_count</code> and <code>zil_itx_copied_bytes</code> count how
many 'copied' writes have been part of ZIL commits and the total size
of the file data written (and thus committed) this way.<p>
</li>
<li><code>zil_itx_needcopy_count</code> and <code>zil_itx_needcopy_bytes</code> count
how many 'needcopy' writes have been part of ZIL commits and the total
size of the file data written (and thus committed) this way.</li>
</ul>
<p>A regular system using ZFS may have little or no 'copied' activity.
Our NFS servers all have significant amounts of it, presumably
because some NFS data writes are done synchronously and so this
trickles through to the ZFS stats.</p>
<p>In a given pool, the ZIL can potentially be written to either the
main pool's disks or to a separate log device (a <em>slog</em>, which can
also be mirrored). The ZIL metrics have a collection of
<code>zil_itx_metaslab_*</code> metrics about data actually written to the
ZIL in either the main pool ('normal' metrics) or to a slog (the
'slog' metrics).</p>
<ul><li><code>zil_itx_metaslab_normal_count</code> counts how many ZIL <em>log
write blocks</em> (not ZIL records, itxs) have been committed to the
ZIL in the main pool. There's a corresponding 'slog' version of
this and all further zil_itx_metaslab metrics, with the same
meaning.<p>
</li>
<li><code>zil_itx_metaslab_normal_bytes</code> counts how many bytes have
been 'used' in ZIL log write blocks (for ZIL commits in the main
pool). This is a rough representation of how much space the ZIL
log actually needed, but it doesn't necessarily represent either
the actual IO performed or the space allocated for ZIL commits.<p>
As I understand things, this size includes the size of the intent
transaction records themselves and also the size of the associated
data for 'copied' and 'needcopy' data writes (because these are
written into the ZIL as part of ZIL commits, and so use space in log
write blocks). It doesn't include the data written directly to the
pool as 'indirect' data writes.</li>
</ul>
<p>If you don't use a slog in any of your pools, the 'slog' versions of
these metrics will all be zero. I think that if you have only slogs, the
'normal' versions of these metrics will all be zero.</p>
<p>In ZFS 2.2 and later, there are two additional statistics for
both normal and slog ZIL commits:</p>
<ul><li><code>zil_itx_metaslab_normal_write</code> counts how many bytes have
actually been written in ZIL log write blocks. My understanding
is that this includes padding and unused space at the end of a
log write block that can't fit another record.<p>
</li>
<li><code>zil_itx_metaslab_normal_alloc</code> counts how many bytes of space have
been 'allocated' for ZIL log write blocks, including any rounding up
to block sizes, alignments, and so on. I think this may also be the
logical size before any compression done as part of IO, although I'm
not sure if ZIL log write blocks are compressed.</li>
</ul>
<p>You can see some additional commentary on these new stats (and the
code) in <a href="https://github.com/openzfs/zfs/pull/14863">the pull request</a>
and <a href="https://github.com/openzfs/zfs/commit/b6fbe61fa6a75747d9b65082ad4dbec05305d496">the commit itself</a>.</p>
<p>PS: OpenZFS 2.2 and later has a currently undocumented '<code>zilstat</code>'
command, and its 'zilstat -v' output may provide some guidance on
what ratios of these metrics the ZFS developers consider interesting.
In its current state it will only work on 2.2 and later because it
requires the two new stats listed above.</p>
<h3>Sidebar: Some typical numbers</h3>
<p>Here is the "zil" file from <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/WorkMachine2017">my office desktop</a>,
which has been up for long enough to make it interesting:</p>
<blockquote><pre style="white-space: pre-wrap;">
zil_commit_count 4 13840
zil_commit_writer_count 4 13836
zil_itx_count 4 252953
zil_itx_indirect_count 4 27663
zil_itx_indirect_bytes 4 2788726148
zil_itx_copied_count 4 0
zil_itx_copied_bytes 4 0
zil_itx_needcopy_count 4 174881
zil_itx_needcopy_bytes 4 471605248
zil_itx_metaslab_normal_count 4 15247
zil_itx_metaslab_normal_bytes 4 517022712
zil_itx_metaslab_normal_write 4 555958272
zil_itx_metaslab_normal_alloc 4 798543872
</pre>
</blockquote>
<p>With these numbers we can see interesting things, such as that the
average number of ZIL transactions per commit is about 18 and
that my machine has never done any synchronous data writes.</p>
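<p>As an illustration, here is a minimal Go sketch (my own, with
made-up output formatting) that parses this kstat file and computes
that sort of ratio. It assumes the 'name type value' line format
shown above and simply skips header lines or anything else that
doesn't fit; I believe the per-dataset kstat files in OpenZFS 2.2
and later use the same general format, so the same parsing should
work there too:</p>
<blockquote><pre style="white-space: pre-wrap;">
package main

import (
    "bufio"
    "fmt"
    "os"
    "strconv"
    "strings"
)

func main() {
    f, err := os.Open("/proc/spl/kstat/zfs/zil")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    defer f.Close()

    stats := make(map[string]uint64)
    sc := bufio.NewScanner(f)
    for sc.Scan() {
        // Data lines look like 'zil_commit_count 4 13840';
        // header lines don't parse this way and get skipped.
        fields := strings.Fields(sc.Text())
        if len(fields) != 3 {
            continue
        }
        v, err := strconv.ParseUint(fields[2], 10, 64)
        if err != nil {
            continue
        }
        stats[fields[0]] = v
    }

    if c := stats["zil_commit_writer_count"]; c > 0 {
        fmt.Printf("avg itxs per ZIL commit: %.1f\n",
            float64(stats["zil_itx_count"])/float64(c))
    }
    fmt.Printf("'copied' (synchronous) itxs: %d\n",
        stats["zil_itx_copied_count"])
}
</pre>
</blockquote>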
<p>Here's an excerpt from one of <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSFileserverSetupIII">our Ubuntu 22.04 ZFS fileservers</a>:</p>
<blockquote><pre style="white-space: pre-wrap;">
zil_commit_count 4 155712298
zil_commit_writer_count 4 155500611
zil_itx_count 4 200060221
zil_itx_indirect_count 4 60935526
zil_itx_indirect_bytes 4 7715170189188
zil_itx_copied_count 4 29870506
zil_itx_copied_bytes 4 74586588451
zil_itx_needcopy_count 4 1046737
zil_itx_needcopy_bytes 4 9042272696
zil_itx_metaslab_normal_count 4 126916250
zil_itx_metaslab_normal_bytes 4 136540509568
</pre>
</blockquote>
<p>Here we can see the drastic impact of NFS synchronous writes (the
significant 'copied' numbers), and also of large NFS writes in
general (the high 'indirect' numbers). This machine has written
many times more data in ZIL commits as 'indirect' writes than it
has written to the actual ZIL.</p>
</div>
What ZIL metrics are exposed by (Open)ZFS on Linux2024-02-26T21:43:53Z2024-02-22T04:44:14Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/NetworkManagerDoesNotSharecks<div class="wikitext"><p>Today I upgraded my home desktop to Fedora 39. It didn't entirely
go well; specifically, <a href="https://mastodon.social/@cks/111965809776629255">my DSL connection broke because Fedora
stopped packaging some scripts with rp-pppoe</a> and <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/NetworkScriptsAndPPPoE">Fedora's
old <code>ifup</code>, which is used by my very old-fashioned setup</a>, still requires those scripts. After I got
back on the Internet, I decided to try an idea I'd toyed with,
namely <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/NetworkManagerWhyConsidering">using NetworkManager to handle (only) my DSL link</a>. Unfortunately this did not go well:</p>
<blockquote><p>audit: op="connection-activate" uuid="[...]" name="[...]" pid=458524
uid=0 result="fail" reason="Connection '[...]' is not available on
device em0 because device is strictly unmanaged"</p>
</blockquote>
<p>The reason that em0 is 'unmanaged' by NetworkManager is that it's
managed by systemd-networkd, <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SystemdNetworkdWhy">which I like much better</a>. Well, also I specifically told NetworkManager
not to touch it by setting it as 'unmanaged' instead of 'managed'.</p>
<p>Although I haven't tested, I suspect that NetworkManager applies
this restriction to all VPNs and other layered forms of networking,
such that you can only run a NetworkManager managed VPN over a
network interface that NetworkManager is controlling. I find this
quite unfortunate. There is nothing that NetworkManager needs to
change on the underlying Ethernet link to run PPPoE or a VPN over
it; the network is a transport (a low level transport in the case
of <a href="https://en.wikipedia.org/wiki/Point-to-Point_Protocol_over_Ethernet">PPPoE</a>).</p>
<p>I don't know if it's theoretically possible to configure NetworkManager
so that an interface is 'managed' but NetworkManager doesn't touch
it at all, so that systemd-networkd and other things could continue
to use em0 while NetworkManager was willing to run PPPoE on top of
it. Even if it's possible in theory, I don't have much confidence
that it will be problem free in practice, either now or in the
future, because fundamentally I'd be lying to NetworkManager and
networkd. If NetworkManager really had an 'I will use this interface
but not change its configuration' category, it would have a third
option besides 'managed' or '(strictly) unmanaged'.</p>
<p>(My current solution is a hacked together script to start pppd and
pppoe with magic options researched through <a href="https://github.com/leahneukirchen/extrace">extrace</a> and a systemd service
that runs that script. I have assorted questions about how this is
going to interact with <a href="https://mastodon.social/@cks/111966685915895435">various things</a>, but someday I
will get answers, or perhaps unpleasant surprises.)</p>
<p>PS: Where this may be a special problem someday is if I want to run
a VPN over my DSL link. I can more or less handle running PPPoE by
hand, but the last time I looked at a by-hand OpenVPN setup I rapidly
dropped the idea. NetworkManager is or would be quite handy for this
sort of 'not always there and complex' networking, but it apparently
needs to own the entire stack down to Ethernet.</p>
<p>(To run a NetworkManager VPN over 'ppp0', I would have to have
NetworkManager manage it, which would presumably require I have
NetworkManager handle the PPPoE DSL, which requires NetworkManager
not considering em0 to be unmanaged. It's NetworkManager all the
way down.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/NetworkManagerDoesNotShare?showcomments#comments">2 comments</a>.) </div>NetworkManager won't share network interfaces, which is a problem2024-02-26T21:43:53Z2024-02-21T03:55:01Ztag:cspace@cks.mef.org,2009-03-24:/blog/solaris/ZFSZILActivityFlowcks<div class="wikitext"><p>The <a href="https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSTXGsAndZILs">ZFS Intent Log (ZIL)</a> is a confusing thing
once you get into the details, and for reasons beyond the scope of
this entry I recently needed to sort out the details of some aspects
of how it works. So here is what I know about how things flow into the
ZIL, both in memory and then on to disk.</p>
<p>(As always, there is no single 'ZFS Intent Log' in a ZFS pool. Each
dataset (a filesystem or a zvol) has its own logically separate
ZIL. We talk about 'the ZIL' as a convenience.)</p>
<p>When you perform activities that modify a ZFS dataset, each activity
creates its own ZIL log record (a <em>transaction</em> in ZIL jargon,
sometimes called an 'itx', probably short for 'intent transaction')
that is put into that dataset's in-memory ZIL log. This includes
both straightforward data writes and metadata activity like creating
or renaming files. You can see a big list of all of the possible
transaction types in <a href="https://github.com/openzfs/zfs/blob/master/include/sys/zil.h">zil.h</a> as
all of the <code>TX_*</code> definitions (which have brief useful comments).
In-memory ZIL transactions aren't necessarily immediately flushed
to disk, especially for things like simply doing a <code>write()</code> to
a file. The reason that plain <code>write()</code>s to a file are (still) given
ZIL transactions is that you may call <code>fsync()</code> on the file later.
If you don't call <code>fsync()</code> and the regular ZFS transaction group
commits with your <code>write()</code>s, those ZIL transactions will be quietly
cleaned out of the in-memory ZIL log (along with all of the other now
unneeded ZIL transactions).</p>
<p>(All of this assumes that your dataset doesn't have '<code>sync=disabled</code>'
set, which turns off the in-memory ZIL as one of its effects.)</p>
<p>When you perform an action such as <code>fsync()</code> or <code>sync()</code> that
requests that in-memory ZFS state be made durable on disk, ZFS
gathers up some or all of those in-memory ZIL transactions and
writes them to disk in one go, as a sequence of <em>log (write) blocks</em>
('lwb' or 'lwbs' in ZFS source code), which pack together those ZIL
transaction records. This is called a <em>ZIL commit</em>. Depending on
<a href="https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSWritesAndZIL">various</a> <a href="https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSWritesAndZILII">factors</a>, the
flushed out data you <code>write()</code> may or may not be included in the
log (write) blocks committed to the (dataset's) ZIL. Sometimes your
file data will be written directly into its future permanent location
in the pool's free space (<a href="https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSZILSafeDirectWrites">which is safe</a>)
and the ZIL commit will have only a pointer to this location (<a href="https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSBlockPointers">its
DVA</a>).</p>
<p>(For a discussion of this, see the comments about the <code>WR_*</code>
constants in <a href="https://github.com/openzfs/zfs/blob/master/include/sys/zil.h">zil.h</a>. Also, <a href="https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSTXGsAndZILs">while in memory, ZFS transactions
are classified as either 'synchronous' or 'asynchronous'</a>.
Sync transactions are always part of a ZIL commit, but async
transactions are only included as necessary. See <a href="https://github.com/openzfs/zfs/blob/master/include/sys/zil_impl.h">zil_impl.h</a>
and also <a href="https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSTXGsAndZILs">my entry discussing this</a>.)</p>
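<p>To make the user-level side of this concrete, here is a minimal
Go sketch of the kind of activity that forces a ZIL commit; the file
path is a made-up example, and the file just needs to be on a ZFS
filesystem that doesn't have sync disabled:</p>
<blockquote><pre style="white-space: pre-wrap;">
package main

import (
    "log"
    "os"
)

func main() {
    // The path is a made-up example.
    f, err := os.Create("/tank/fs/example-file")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    // The write itself only creates in-memory ZIL transactions
    // (itxs); nothing is forced to disk yet.
    if _, err := f.Write(make([]byte, 64*1024)); err != nil {
        log.Fatal(err)
    }

    // fsync(): this triggers a ZIL commit, and doesn't return
    // until the log write blocks (or, for large writes, the data
    // in its final location plus a pointer) are durable on disk.
    if err := f.Sync(); err != nil {
        log.Fatal(err)
    }
}
</pre>
</blockquote>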
<p>It's possible for several processes (or threads) to all call <code>sync()</code>
or <code>fsync()</code> at once (well, before the first one finishes committing
the ZIL). In this case, their requests can all be merged together
into one ZIL commit that covers all of them. This means that <code>fsync()</code>
and <code>sync()</code> calls don't necessarily match up one to one with ZIL
commits. I believe it's also possible for a <code>fsync()</code> or <code>sync()</code>
to not result in a ZIL commit if all of the relevant data has already
been written out as part of a regular ZFS transaction group (or a
previous request).</p>
<p>Because of all of this, there are various different ZIL related
metrics that you may be interested in, sometimes with picky but
important differences between them. For example, there is a difference
between 'the number of bytes written to the ZIL' and 'the number
of bytes written as part of ZIL commits', since the latter would
include data written directly to its final space in the main pool.
You might care about the latter when you're investigating the overall
IO impact of ZIL commits but the former if you're looking at sizing
a separate log device (a 'slog' in ZFS terminology).</p>
</div>
The flow of activity in the ZFS Intent Log (as I understand it)2024-02-26T21:43:52Z2024-02-20T02:58:13Ztag:cspace@cks.mef.org,2009-03-24:/blog/web/TLSCertsSomeStillManualcks<div class="wikitext"><p>I've written before about <a href="https://utcc.utoronto.ca/~cks/space/blog/web/TLSCertRenewalTiming">how people's soon to expire TLS
certificates aren't necessarily a problem</a>,
because not everyone manages their TLS certificates the way <a href="https://letsencrypt.org/">Let's
Encrypt</a> encourages, with '30 days in advance' automated
renewal and perhaps short-lived TLS certificates. For example,
some places (like Facebook) have automation but seem to only deploy
TLS certificates that are quite close to expiry. Other places at
least look as if they're still doing things by hand, and recently
I got to watch an example of that.</p>
<p>As I mentioned <a href="https://utcc.utoronto.ca/~cks/space/blog/web/OutsourcedWebCMSSensible">yesterday</a>, <a href="https://utcc.utoronto.ca/~cks/space/blog/web/OutsourcedWebCMSSensible">the
department outsources its public website to a SaaS CMS provider</a>. While the website has <a href="https://web.cs.toronto.edu/">a name here</a> for obvious reasons, it uses various
assets that are hosted on sites under the SaaS provider's domain
names (both assets that are probably general and assets, like images,
that are definitely specific to us). For reasons beyond the scope
of this entry, <a href="https://support.cs.toronto.edu/">we</a> monitor the
reachability of these additional domain names with <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusGrafanaSetup-2019">our metrics
system</a>. This only checks
on-campus reachability, of course, but that's still important even
if most visitors to the site are probably from outside the university.</p>
<p>As a side effect of this reachability monitoring, we harvest the
TLS certificate expiry times of these domains, and because we haven't
done anything special about it, they get shown on our core status
dashboard alongside the expiry times of TLS certificates that we're
actually responsible for. The result of this was that recently I
got to watch their TLS expiry times count down to only two weeks
away, which is lots of time from one view while also alarmingly
little if you're used to renewals 30 days in advance. Then they
flipped over to a new year-long TLS certificate and our dashboard
was quiet again (except for the next such external site that has
dropped under 30 days).</p>
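<p>For illustration, here is a minimal Go sketch of where such an
expiry time comes from; this isn't how our metrics system collects
it, and the host name is a placeholder. You make a TLS connection
and look at the leaf certificate's Not-After time:</p>
<blockquote><pre style="white-space: pre-wrap;">
package main

import (
    "crypto/tls"
    "fmt"
    "log"
    "time"
)

func main() {
    // "example.com" is a placeholder for one of the actual
    // externally hosted domain names.
    conn, err := tls.Dial("tcp", "example.com:443", nil)
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()

    // The leaf certificate is the first peer certificate.
    cert := conn.ConnectionState().PeerCertificates[0]
    days := time.Until(cert.NotAfter).Hours() / 24
    fmt.Printf("expires %s (%.0f days from now)\n",
        cert.NotAfter.Format("2006-01-02"), days)
}
</pre>
</blockquote>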
<p>Interestingly, the current TLS certificate was issued about a week
before it was deployed, or at least its Not-Before date is February
9th at 00:00 UTC and it seems to have been put into use this past
Friday, the 16th. One reason for this delay in deployment is suggested
by our monitoring, which seems to have detected traces of a third
certificate sometimes being visible, this one expiring June 23rd,
2024. Perhaps there were some deployment challenges across the SaaS
provider's fleet of web servers.</p>
<p>(Their current TLS certificate is actually good for just a bit over
a year, with a Not-Before of 2024-02-09 and a Not-After of 2025-02-28.
This is presumably accepted by browsers, even though it's a bit over
365 days; I haven't paid attention to the latest restrictions from
places like Apple.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/web/TLSCertsSomeStillManual?showcomments#comments">One comment</a>.) </div>Even big websites may still be manually managing TLS certificates (or close)2024-02-26T21:43:53Z2024-02-19T03:06:08Ztag:cspace@cks.mef.org,2009-03-24:/blog/web/OutsourcedWebCMSSensiblecks<div class="wikitext"><p>I work for <a href="https://www.cs.toronto.edu/">a pretty large Computer Science department</a>, one where we have the expertise and
need to do a bunch of internal development and in general we maintain
plenty of things, including websites. Thus, it may surprise some
people to learn that <a href="https://web.cs.toronto.edu/">the department's public-focused web site</a> is currently hosted externally on a
SaaS provider. Even the previous generation of our <a href="https://utcc.utoronto.ca/~cks/space/blog/web/FacingDilemma">outside-facing</a> web presence was hosted and managed outside of the
department. To some, this might seem like the wrong decision for a
department of Computer Science (of all people) to make; surely we're
capable of operating our own web presence and thus should as a
matter of principle (and independence).</p>
<p>Well, yes and no. There are two realities. The first is that a
modern content management system is both a complex thing (to develop
and generally to operate and maintain securely) and a commodity,
with many organizations able to provide good ones at competitive
prices. The second is that both the system administration and the
publicity side of the department only have so many people and so
much time. Or, to put it another way, all of us have work to get
done.</p>
<p>The department has no particular 'competitive advantage' in running
a CMS website; in fact, we're almost certain to be worse at it than
someone doing it at scale commercially, <a href="https://utcc.utoronto.ca/~cks/space/blog/tech/UniversityEmailEnd">much like what happened
with webmail</a>. If the department decided
to operate its own CMS anyway, it would be as a matter of principle
(which principles would depend on whether the CMS was free or paid
for). So far, the department has not decided that this particular
principle is worth paying for, both in direct costs and in the
opportunity costs of what that money and staff time could otherwise
be used for.</p>
<p>Personally I agree with that decision. As mentioned, CMSes are a
widely available (but specialized) commodity. Were we to do it
ourselves, we wouldn't be, say, making a gesture of principle against
the centralization of CMSes. We would merely be another CMS operator
in an already crowded pond that has many options.</p>
<p>(And people here do operate plenty of websites and web content on
our own resources. It's just that the group here responsible for
<a href="https://web.cs.toronto.edu/">our public web presence</a> found it
most effective and efficient to use a SaaS provider for this
particular job.)</p>
</div>
We outsource our public web presence and that's fine2024-02-26T21:43:53Z2024-02-18T02:39:20Ztag:cspace@cks.mef.org,2009-03-24:/blog/tech/DesktopECCOptions2024cks<div class="wikitext"><p>A traditional irritation with building (or specifying) desktop
computers is <a href="https://utcc.utoronto.ca/~cks/space/blog/tech/UseECCIrritation">the issue of ECC RAM</a>, which for
a long time was either not supported at all or <a href="https://utcc.utoronto.ca/~cks/space/blog/tech/IntelCPUSegmentationIrritation">was being used by
Intel for market segmentation</a>.
First generation AMD Ryzens sort of supported ECC RAM with the right
motherboard, but <a href="https://utcc.utoronto.ca/~cks/space/blog/tech/ECCRAMSupportLevels">there are many meanings of 'supporting' ECC RAM</a> and questions lingered about how meaningful
the support was (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/AMDWithECCKernelMessages">recent information suggests the support was real</a>). Here in early 2024 the situation
is somewhat better and I'm going to summarize what I know so far.</p>
<p>The traditional option to getting ECC RAM support (along with a
bunch of other things) was to buy a 'workstation' motherboard that
was built to support Intel Xeon processors. These were available
from a modest number of vendors, such as SuperMicro, and were
generally not inexpensive (and then you had to buy the Xeon). If
you wanted a pre-built solution, vendors like Dell would sell you
desktop Xeon-based workstation systems with ECC RAM. You can still
do this today.</p>
<p>Update: I forgot AMD Threadripper and Epyc based systems, which you
can get motherboards for and build desktop systems around. I think
these are generally fairly expensive motherboards, though.</p>
<p>Back in 2022, Intel introduced their <a href="https://en.wikipedia.org/wiki/LGA_1700#Alder_Lake_chipsets_(600_series)">W680 desktop chipset</a>.
One of the features of this chipset is that it officially supported
ECC RAM with 12th generation and later (so far) Intel CPUs (or at
least apparently the non-F versions), along with official support
for memory overclocking (and CPU overclocking), which enables faster
'XMP' memory profiles than the stock ones (should your ECC RAM
actually support this). There are a modest number of W680 based
motherboards available from (some of) the usual x86 PC desktop
motherboard makers (and SuperMicro), but they are definitely priced
at the high end of things. Intel has not yet announced <a href="https://en.wikipedia.org/wiki/LGA_1700#Raptor_Lake_chipsets_(700_series)">a 'Raptor
Lake' chipset version of this</a>,
which would presumably be called the 'W780'. At this date I suspect
there will be no such chipset.</p>
<p>(The Intel W680 chipset was brought to my attention <a href="https://mastodon.social/@bshanks/111897549472732911">by Brendan
Shanks on the Fediverse</a>.)</p>
<p>As mentioned, AMD support for ECC on early generation Ryzens was a
bit lackluster, although it was sort of there. With the current
<a href="https://en.wikipedia.org/wiki/Socket_AM5">Socket AM5</a> and <a href="https://en.wikipedia.org/wiki/Zen_4">Zen
4</a>, a lot of mentions of ECC
seem to have (initially) been omitted from documentation, as discussed
in Rain's <a href="https://sunshowers.io/posts/am5-ryzen-7000-ecc-ram/">ECC RAM on AMD Ryzen 7000 desktop CPUs</a>, and <a href="https://www.tomshardware.com/pc-components/cpus/amd-confirms-ryzen-8000g-apus-dont-support-ecc-ram-despite-initial-claims">Ryzen
8000G series APUs don't support ECC at all</a>.
However, at least some AM5 motherboards do support ECC with recent
enough firmware (provided that you have recent BIOS updates and
enable ECC support in the BIOS, per Rain). These days, it appears
that a number of current AM5 motherboards list ECC memory as supported
(although <a href="https://utcc.utoronto.ca/~cks/space/blog/tech/ECCRAMSupportLevels">what supported means is a question</a>)
and it will probably work, especially if you find people who already
have reported success. It seems that even some relatively inexpensive
AM5 motherboards may support ECC.</p>
<p>(Some un-vetted resources are <a href="https://old.reddit.com/r/truenas/comments/10lqofy/ecc_support_for_am5_motherboards/">here</a>
and <a href="https://forum.level1techs.com/t/am5-consumer-motherboards-with-full-reporting-and-correcting-ecc/200543">here</a>.)</p>
<p>If you can navigate the challenges of finding a good motherboard,
it looks like an AM5, Ryzen 7000 system will support ECC at a lower
cost than an Intel W680 based system (or an Intel Xeon one). If you
don't want to try to thread those rapids and can stand Intel CPUs,
a W680 based system will presumably work, and a Xeon based system
would be even easier to purchase as a fully built desktop with ECC.</p>
<p>(Whether ECC makes a meaningful difference that's worth paying for
is a bit of an open question.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/tech/DesktopECCOptions2024?showcomments#comments">6 comments</a>.) </div>Options for genuine ECC RAM on the desktop in (early) 20242024-02-26T21:43:52Z2024-02-17T04:52:09Ztag:cspace@cks.mef.org,2009-03-24:/blog/unix/XOffscreenWindowsUsecks<div class="wikitext"><p>I mentioned recently that the <a href="https://en.wikipedia.org/wiki/X_Window_System">X Window System</a> allows you to
position (X) windows so that they're partially or completely off
the screen (<a href="https://utcc.utoronto.ca/~cks/space/blog/unix/XOffscreenIconMistake">when I wrote about how I accidentally put some icons
off screen</a>). Some window managers, such as
<a href="https://fvwm.org/">fvwm</a>, actually make significant use of this
X capability.</p>
<p>To start with, windows can be off screen in any direction, because X
permits negative coordinates for window locations (both horizontally and
vertically). Since the top left of the screen is 0, 0 in the coordinate
system, windows with a negative X are often said to be off screen to
the left, and ones with a negative Y are off screen 'above', to go with
a large enough positive X being 'to the right' and a positive Y being
'below'. If a window is completely off the screen, its relative location
is in some sense immaterial, but this makes it easier to talk about some
other things.</p>
<p>(Windows can also be partially off screen, in which case it does
matter that negative Y is 'above' and negative X is 'left', because
the bottom or the right part of such a window is what will be visible
on screen.)</p>
<p>Fvwm has a concept of a 'virtual desktop' that can be larger than
your physical display (or displays added together), normally expressed
in units of your normal monitor configuration; for example, my
virtual desktop is three wide by two high, creating six of what
Fvwm calls <em>pages</em>. Fvwm calls the portion of the virtual desktop
that you can see the <em>viewport</em>, and many people (me included) keep
the viewport aligned with pages. You can then talk about things
like <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/MyFvwmButtonBindings">flipping between pages</a>,
which is technically moving the viewport to or between pages.</p>
<p>When you change pages or in general move the viewport, Fvwm changes
the X position of windows so that they are in the right (pixel)
spot relative to the new page. For instance, if you have a 1280
pixel wide display and a window positioned with its left edge at
0, then you move one Fvwm page to your right, Fvwm changes the
window's X coordinate to be -1280. If you want, you can then use X
tools or other means to move the window around on its old page, and
when you flip back to the page Fvwm will respect that new location.
If you move the window to be 200 pixels away from the left edge,
making its X position -1080, when you change back to that page
Fvwm will put the window's left edge at an X position of 200 pixels.</p>
<p>This is an elegant way to avoid having to keep track of the nominal
position of off-screen windows; you just have X do it for you. If
you have a 1280 x 1024 display and you move one page to the left,
you merely add 1280 pixels to the X position of the (X) windows
being displayed. Windows on the old page will now be off screen,
while windows on the new page will come back on screen.</p>
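<p>Expressed as code, the coordinate arithmetic is simple. Here is a
tiny Go sketch of it (my own illustration, not fvwm's actual code),
assuming a single monitor of a given width and height:</p>
<blockquote><pre style="white-space: pre-wrap;">
package example

// pageShift gives a window's new position when the viewport moves
// dx pages right and dy pages down on a w x h pixel monitor. Moving
// one page right (dx=1) on a 1280-pixel wide display turns x=0 into
// x=-1280, exactly as described above.
func pageShift(x, y, dx, dy, w, h int) (newX, newY int) {
    return x - dx*w, y - dy*h
}
</pre>
</blockquote>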
<p>I think most X desktop environments and window managers have moved
away from this simple and brute force approach to handle windows
that are off screen because you've moved your virtual screen or
workspace or whatever the environment's term is. I did a quick test
in Cinnamon, and it didn't seem to change window positions this
way.</p>
<p>(There are other ways in X to make windows disappear and reappear,
so Cinnamon is using one of them.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/unix/XOffscreenWindowsUse?showcomments#comments">One comment</a>.) </div>(Some) X window managers deliberately use off-screen windows2024-02-26T21:43:52Z2024-02-16T03:51:41Ztag:cspace@cks.mef.org,2009-03-24:/blog/programming/GoReflectTypeForOptimizationcks<div class="wikitext"><p>Go's <a href="https://pkg.go.dev/reflect#TypeFor"><code>reflect.TypeFor()</code></a> is
a generic function that returns the <a href="https://pkg.go.dev/reflect#Type"><code>reflect.Type</code></a> for its type argument. It was
added in Go 1.22, and <a href="https://utcc.utoronto.ca/~cks/space/blog/programming/Go122ReflectTypeFor">its initial implementation was quite simple
but still valuable</a>, because it encapsulated
a complicated bit of <a href="https://pkg.go.dev/reflect"><code>reflect</code></a> usage.
Here is that implementation:</p>
<blockquote><pre style="white-space: pre-wrap;">
func TypeFor[T any]() Type {
return TypeOf((*T)(nil)).Elem()
}
</pre>
</blockquote>
<p>How this works is that it constructs a nil pointer value of the
type 'pointer to T', gets the <a href="https://pkg.go.dev/reflect#Type"><code>reflect.Type</code></a> of that pointer,
and then uses Type.Elem() to go from the pointer's Type to the Type
for T itself. This requires constructing and using this 'pointer
to T' type (and its <a href="https://pkg.go.dev/reflect#Type"><code>reflect.Type</code></a>) even though we only want
the <a href="https://pkg.go.dev/reflect#Type"><code>reflect.Type</code></a> of T itself. All of this is necessary for
<a href="https://utcc.utoronto.ca/~cks/space/blog/programming/GoNilIsTypedSortOf">reasons to do with interface types</a>.</p>
<p>Recently, <code>reflect.TypeFor()</code> was optimized a bit, in <a href="https://go-review.googlesource.com/c/go/+/555597">CL 555597,
"optimize TypeFor for non-interface types"</a>. The code for
this optimization is a bit tricky and I had to stare at it for a
while to understand what it was doing and how it worked. Here is
the new version, which starts with the new optimization and ends
with the old code:</p>
<blockquote><pre style="white-space: pre-wrap;">
func TypeFor[T any]() Type {
var v T
if t := TypeOf(v); t != nil {
return t
}
return TypeOf((*T)(nil)).Elem()
}
</pre>
</blockquote>
<p>What this does is optimize for the case where you're using TypeFor()
on a non-interface type, for example '<code>reflect.TypeFor[int64]()</code>'
(although you're more likely to use this with more complex things
like struct types). When T is a non-interface type, we don't need
to construct a pointer to a value of the type; we can directly
obtain the Type from <a href="https://pkg.go.dev/reflect#TypeOf"><code>reflect.TypeOf</code></a>. But how do we tell whether or
not T is an interface type? The answer turns out to be right there
in the documentation for <a href="https://pkg.go.dev/reflect#TypeOf"><code>reflect.TypeOf</code></a>:</p>
<blockquote><p>[...] If [TypeOf's argument] is a nil interface value, TypeOf
returns nil.</p>
</blockquote>
<p>So what the new code does is construct a zero value of type T, pass
it to TypeOf(), and check what it gets back. If type T is an interface
type, its zero value is a nil interface and TypeOf() will return
nil; otherwise, the return value is the <a href="https://pkg.go.dev/reflect#Type"><code>reflect.Type</code></a> of the
non-interface type T.</p>
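<p>Here is a small runnable illustration of this distinction (my own
example, not from the Go source): the zero value of an interface type
gives TypeOf() nothing to work with, while the zero value of a
concrete type still carries its type:</p>
<blockquote><pre style="white-space: pre-wrap;">
package main

import (
    "fmt"
    "io"
    "reflect"
)

func main() {
    var r io.Reader // nil interface value
    var n int64     // zero value of a concrete type

    fmt.Println(reflect.TypeOf(r)) // prints nil; no type info
    fmt.Println(reflect.TypeOf(n)) // prints int64

    // TypeFor handles both, only falling back to the pointer
    // trick for interface types such as io.Reader.
    fmt.Println(reflect.TypeFor[io.Reader]()) // io.Reader
    fmt.Println(reflect.TypeFor[int64]())     // int64
}
</pre>
</blockquote>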
<p>The reason that <a href="https://pkg.go.dev/reflect#TypeOf"><code>reflect.TypeOf</code></a> returns nil for a nil interface
value is because it has to. <a href="https://utcc.utoronto.ca/~cks/space/blog/programming/GoNilIsTypedSortOf">In Go, <code>nil</code> is only sort of typed</a>, so if a nil interface value is passed to
TypeOf(), there is effectively no type information available for
it; its old interface type is lost when it was converted to <a href="https://utcc.utoronto.ca/~cks/space/blog/programming/GoAvoidingAnyAsAType">'<code>any</code>',
also known as the empty interface</a>. So all
TypeOf() can return for such a value is the nil result of 'this
effectively has no useful type information'.</p>
<p>Incidentally, the TypeFor() code is also another illustration of
how in Go, <a href="https://utcc.utoronto.ca/~cks/space/blog/programming/GoNilNotNil">interfaces create a difference between two sorts of
nils</a>. Consider calling '<code>reflect.TypeFor[*os.File]()</code>'.
Since this is a pointer type, the zero value '<code>v</code>' in TypeFor() is
a nil pointer. But <a href="https://pkg.go.dev/os#File"><code>os.File</code></a> isn't
an interface type, so TypeOf() won't be passed a nil interface and
can return a Type, even though the underlying value in the interface
that TypeOf() receives is a nil pointer.</p>
</div>
Understanding a recent optimization to Go's <code>reflect.TypeFor</code>2024-02-26T21:43:52Z2024-02-15T04:12:03Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/ZFSPoolTXGsInformationcks<div class="wikitext"><p>As part of (Open)ZFS's general 'kstats' system for reporting
information about ZFS overall and your individual pools and datasets,
there is a per-pool /proc file that reports information about the
most recent N <a href="https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSTXGsAndZILs">transaction groups ('txgs')</a>, /proc/spl/kstat/zfs/<pool>/txgs.
How many N is depends on the <a href="https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Module%20Parameters.html#zfs-txg-history">zfs_txg_history</a>
parameter, and defaults to 100. The information in here may be quite
important for diagnosing certain sorts of performance problems but
I haven't found much documentation on what's in it. Well, let's try
to fix that.</p>
<p>The overall format of this file is:</p>
<blockquote><pre style="white-space: pre-wrap;">
txg birth state ndirty nread nwritten reads writes otime qtime wtime stime
5846176 7976255438836187 C 1736704 0 5799936 0 299 5119983470 2707 49115 27910766
[...]
5846274 7976757197601868 C 1064960 0 4702208 0 236 5119973466 2405 48349 134845007
5846275 7976762317575334 O 0 0 0 0 0 0 0 0 0
</pre>
</blockquote>
<p>(This example is coming from a system with four-way mirrored vdevs,
which is going to be relevant in a bit.)</p>
<p>So let's take these fields in order:</p>
<ol><li><code>txg</code> is the transaction group number, which is a steadily increasing
number. The file is ordered from the oldest txg to the newest, which
will be the current open transaction group.<p>
(In the example, txg 5846275 is the current open transaction group
and 5846274 is the last one that committed.)<p>
</li>
<li><code>birth</code> is the time when the transaction group (txg) was 'born', in
<em>nanoseconds</em> since the system booted.<p>
</li>
<li><code>state</code> is the current state of the txg; this will most often be either
'C' for committed or 'O' for open. You may also see 'S' for
syncing, 'Q' (being quiesced), and 'W' (waiting for sync). An
open transaction group will most likely have 0s for the rest of
the numbers, and will be the last txg (there's only one open txg
at a time). <strike>Any transaction group except the second last will be
in state 'C', because you can only have one transaction group in
the process of being written out.</strike><p>
Update: per the comment from Arnaud Gomes, you can have multiple
transaction groups at the end that aren't committed. I believe you
can only have one that is syncing ('S'), because that happens in a
single thread for only one txg, but you may have another that is
quiescing or waiting to sync.<p>
A transaction group's progress through its life cycle is open,
quiescing, waiting for sync, syncing, and finally committed. In
the open state, additional transactions (such as writing to files
or renaming them) can be added to the transaction group; once a
transaction group has been quiesced, nothing further will be added
to it.<p>
(See also <a href="https://www.delphix.com/blog/zfs-fundamentals-transaction-groups">ZFS fundamentals: transaction groups</a>,
which discusses how a transaction group can take a while to sync;
the content has also been added as a comment in the source
code in <a href="https://github.com/openzfs/zfs/blob/master/module/zfs/txg.c">txg.c</a>.)<p>
</li>
<li><code>ndirty</code> is how many bytes of directly dirty data had to be written
out as part of this transaction; these bytes come, for example, from
user <code>write()</code> IO.<p>
It's possible to have a transaction group commit with a '0' for
<code>ndirty</code>. I believe that this means no IO happened during the
time the transaction group was open, and it's just being closed
on the timer.<p>
</li>
<li><code>nread</code> is how many bytes of disk reads the pool did between when
syncing of the txg starts and when it finishes ('during txg sync').</li>
<li><code>nwritten</code> is how many bytes of disk writes the pool did during txg sync.</li>
<li><code>reads</code> is the number of disk read IOs the pool did during txg sync.</li>
<li><code>writes</code> is the number of disk write IOs the pool did during txg sync.<p>
I believe these IO numbers include at least any extra IO needed
to read in on-disk data structures to allocate free space and any
additional writes necessary. I also believe that they track actual
bytes written to your disks, so for example with two-way mirrors
they'll always be at least twice as big as the <code>ndirty</code> number
(in my example above, with four way mirrors, their base is four
times <code>ndirty</code>).<p>
As we can see it's not unusual for <code>nread</code> and <code>reads</code> to be zero.
However, I don't believe that the read IO numbers are restricted
to transaction group commit activities; if something is reading
from the pool for other reasons during the transaction group commit,
that will show up in <code>nread</code> and <code>reads</code>. They are thus a measure
of the amount of read IO going during the txg sync process, not
the amount of IO necessary for it.<p>
I don't know if ongoing write IO to the ZFS Intent Log can happen
during a txg sync. If it can, I would expect it to show up in the
<code>nwritten</code> and <code>writes</code> numbers. Unlike read IO, regular write
IO can only happen in the context of a transaction group and so
by definition any regular writes during a txg sync are part of
that txg and show up in <code>ndirty</code>.<p>
</li>
<li><code>otime</code> is how long the txg was open and accepting new write IO, in
nanoseconds. Often this will be around the default <a href="https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Module%20Parameters.html#zfs-txg-timeout">zfs_txg_timeout</a>
time, which is normally five seconds. However, under (write) IO
pressure this can be shorter or longer (if the current open transaction
group can't be closed because there's already a transaction group in
the process of trying to commit).<p>
</li>
<li><code>qtime</code> is how long the txg took to be quiesced, in nanoseconds; it's
usually small.</li>
<li><code>wtime</code> is how long the txg took to wait to start syncing, in nanoseconds;
it's usually pretty small, since all it involves is that the
separate syncing thread pick up the txg and start syncing it.<p>
</li>
<li><code>stime</code> is how long the txg took to actually sync and commit, again
in nanoseconds. It's often appreciable, since it's where the actual
disk write IO happens.</li>
</ol>
<p>In the example "txgs" I gave, we can see that despite the first
committed txg listed having more dirty data than the last committed
txg, its actual sync time was only about a fifth of the last txg's
sync time. This might cause you to look at underlying IO activity
patterns, latency patterns, and so on.</p>
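<p>Since the format is plain text, it's straightforward to pull
numbers like these out programmatically. Here is a minimal Go sketch
of doing so (my own illustration; the pool name 'tank' is a
placeholder) that prints the dirty bytes and sync time of each
committed txg:</p>
<blockquote><pre style="white-space: pre-wrap;">
package main

import (
    "bufio"
    "fmt"
    "os"
    "strconv"
    "strings"
)

func main() {
    // The pool name 'tank' is a placeholder.
    f, err := os.Open("/proc/spl/kstat/zfs/tank/txgs")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    defer f.Close()

    sc := bufio.NewScanner(f)
    for sc.Scan() {
        fields := strings.Fields(sc.Text())
        // Keep only committed txgs; this also skips the header.
        if len(fields) != 12 || fields[2] != "C" {
            continue
        }
        ndirty, _ := strconv.ParseFloat(fields[3], 64)
        stime, _ := strconv.ParseFloat(fields[11], 64)
        fmt.Printf("txg %s: %8.2f MB dirty, synced in %8.2f ms\n",
            fields[0], ndirty/(1024*1024), stime/1e6)
    }
}
</pre>
</blockquote>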
<p>As far as I know, there's no per-pool source of information about
the current amount of dirty data in the current open transaction
group (although once a txg has quiesced and is syncing, I believe
you do see a useful <code>ndirty</code> for it in the "txgs" file). A system
wide dirty data number can more or less be approximated from <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSOnLinuxARCMemoryReclaimStats">the
ARC memory reclaim statistics</a> in
the <code>anon_size</code> kstat plus the <code>arc_tempreserve</code> kstat, although
the latter seems to never get very big for us.</p>
<p>A new transaction group normally opens as the current transaction
group begins quiescing. We can verify this in the example output
by adding the birth time and the <code>otime</code> of txg 5846274, which add
up to exactly the birth time of txg 5846275, the current open txg.
If this sounds suspiciously exact down to the nanosecond, that's
because the code involved freezes the current time at one point and
uses it for both the end of the open time of the current open txg
and the birth time of the new txg.</p>
<h3>Sidebar: the progression through transaction group states</h3>
<p>Here is what I can deduce from reading through the OpenZFS kernel
code, and since I had to go through this I'm going to write it down.</p>
<p>First, although there is a txg 'birth' state, 'B' in the 'state'
column, you will never actually see it. Transaction groups are born
'open', per spa_txg_history_add() in <a href="https://github.com/openzfs/zfs/blob/master/module/zfs/spa_stats.c">spa_stats.c</a>.
Transaction groups move from 'O' open to 'Q' quiescing in
txg_quiesce() in <a href="https://github.com/openzfs/zfs/blob/master/module/zfs/txg.c">txg.c</a>, which
'blocks until all transactions in the group are committed' (which
I believe means they are finished fiddling around adding write IO).
This function is also where the txg finishes quiescing and moves
to 'W', waiting for sync. At this point the txg is handed off to
the 'sync thread', txg_sync_thread() (also in <a href="https://github.com/openzfs/zfs/blob/master/module/zfs/txg.c">txg.c</a>). When
the sync thread receives the txg, it will advance the txg to 'S',
syncing, call spa_sync(), and then mark everything as done,
finally moving the transaction group to 'C', committed.</p>
<p>(In the <a href="https://github.com/openzfs/zfs/blob/master/module/zfs/spa_stats.c">spa_stats.c</a> code, the txg state is advanced by a call
to spa_txg_history_set(), which will always be called with the
old state we are finishing. Txgs advance to syncing in
spa_txg_history_init_io(), and finish this state to move to
committed in spa_txg_history_fini_io(). The tracking of read
and write IO during the txg sync is done by saving a copy of
the top level vdev IO stats in spa_txg_history_init_io(),
getting a second copy in spa_txg_history_fini_io(), and then
computing the difference between the two.)</p>
<p>Why it might take some visible time to quiesce a transaction group
is more or less explained in the description of how ZFS's implementations
of virtual filesystem operations work, in the comment at the start
of <a href="https://github.com/openzfs/zfs/blob/master/module/os/linux/zfs/zfs_vnops_os.c">zfs_vnops_os.c</a>.
Roughly, each operation (such as creating or renaming a file) starts
by obtaining a transaction that will be part of the currently open
txg, then doing its work, and then committing the transaction. If
the transaction group starts quiescing while the operation is doing
its work, the quiescing can't finish until the work does and commits
the transaction for the rename, create, or whatever.</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSPoolTXGsInformation?showcomments#comments">2 comments</a>.) </div>What is in (Open)ZFS's per-pool "txgs" /proc file on Linux2024-02-26T21:43:53Z2024-02-14T03:26:14Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/AMDWithECCKernelMessagescks<div class="wikitext"><p>In general, <a href="https://utcc.utoronto.ca/~cks/space/blog/tech/UseECCIrritation">consumer x86 desktops have generally not supported
ECC memory</a>, at least not if you wanted
the 'ECC' bit to actually do anything. <a href="https://utcc.utoronto.ca/~cks/space/blog/tech/IntelCPUSegmentationIrritation">With Intel this seems to
have been an issue of market segmentation</a>, but things with AMD were
more confusing. The <a href="https://utcc.utoronto.ca/~cks/space/blog/tech/RyzenMemorySpeedAndECC">initial AMD Ryzen series seemed to generally
support ECC in the CPU</a>, but the
motherboard support was questionable, and even if your motherboard
accepted ECC DIMMs there was an open question of whether the ECC
was doing anything on any particular motherboard (<a href="https://utcc.utoronto.ca/~cks/space/blog/tech/ECCRAMSupportLevels">cf</a>). Later Ryzens have apparently had an
even more confusing ECC support story, but I'm out of touch on that.</p>
<p>When we put together <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/WorkMachine2017">my work desktop</a> we got ECC
DIMMs for it and I thought that theoretically the motherboard
supported ECC, but I've long wondered if it was actually doing
anything. Recently I was looking into this a bit <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/MyMachineDesires2024">for reasons</a> and ran across Rain's <a href="https://sunshowers.io/posts/am5-ryzen-7000-ecc-ram/">ECC RAM on AMD Ryzen
7000 desktop CPUs</a>,
which contained some extremely useful information about how to tell,
from your boot messages, whether ECC is active on AMD systems. I'm going to summarize this
and add some extra information I've dug out of things.</p>
<p>Modern desktop CPUs talk to memory themselves, but not quite directly
from the main CPU; instead, they have a separate on-die memory
controller. On AMD Zen series CPUs, this is the AMD <a href="https://github.com/oxidecomputer/illumos-gate/blob/5f01ecd8941eadb64bc15b1a02c468604c1a503e/usr/src/uts/intel/sys/amdzen/umc.h#L22">Unified Memory
Controller</a>,
and there are special interfaces to talk to it. As I understand
things, ECC is handled (or not) in the UMC, where it receives the
raw bits from your DIMMs (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/CheckingRAMDIMMInfo">if your DIMMs are wide enough, which
you may or may not be able to tell</a>). Therefore,
to have ECC support active, you need ECC DIMMs and for ECC to be
enabled in your UMC (which I believe is typically controlled by
the BIOS, assuming the UMC supports ECC, which depends on the CPU).</p>
<p>In Linux, reporting and managing ECC is handled through a general
subsystem called <a href="https://www.kernel.org/doc/html/latest/driver-api/edac.html">EDAC</a>, with
specific hardware drivers. The normal AMD EDAC driver is amd64_edac,
and <a href="https://sunshowers.io/posts/am5-ryzen-7000-ecc-ram/">as covered by Rain</a>, it registers
for memory channels only if the memory channel has ECC on in the
on-die UMC. When this happens, you will see a kernel message to the
effect of:</p>
<blockquote><pre style="white-space: pre-wrap;">
EDAC MC0: Giving out device to module amd64_edac controller F17h: DEV 0000:00:18.3 (INTERRUPT)
</pre>
</blockquote>
<p>It follows that if you do see this kernel message during boot, you
almost certainly have fully supported ECC on your system. It's very
likely that your DIMMs are ECC DIMMs, your motherboard supports ECC
in the hardware and in its BIOS (and has it enabled in the BIOS if
necessary and applicable), and your CPU is willing to do ECC with
all of this. Since the above kernel message comes from <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/WorkMachine2017">my office
desktop</a>, it seems almost certain that it does
indeed fully support ECC, although I don't think I've ever seen
any kernel messages about detecting and correcting ECC issues.</p>
<p>You can see more memory channels in larger systems and they're not
necessarily sequential; one of our large AMD machines has 'MC0' and
'MC2'. You may also see a message about 'EDAC PCI0: Giving out
device to [...]', which is about a different thing.</p>
<p>In the normal Linux kernel way, various EDAC memory controller
information can be found in sysfs under /sys/devices/system/edac/mc
(assuming that you have anything registered, which you may not on
a non-ECC system). This appears to include counts of corrected
errors and uncorrected errors both at the high level of an entire
memory controller and at the level of 'rows', 'ranks', and/or 'dimms'
depending on the system and the kernel version. You can also see
things like the memory EDAC mode, which could be 'SECDED' (what
<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/WorkMachine2017">my office desktop</a> reports) or 'S8ECD8ED' (what
a large AMD server reports).</p>
<p>(The 'MC<n>' number reported by the kernel at boot time doesn't
necessarily match the /sys/devices/system/edac/mc<n> number. We
have systems which report 'MC0' and 'MC2' at boot, but have 'mc0'
and 'mc1' in sysfs.)</p>
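<p>For example, here is a minimal Go sketch (my own illustration) of
reading these counts; it assumes the standard EDAC sysfs attribute
names <code>ce_count</code> and <code>ue_count</code>, which I believe
is also what the Prometheus host agent reads:</p>
<blockquote><pre style="white-space: pre-wrap;">
package main

import (
    "fmt"
    "os"
    "path/filepath"
    "strings"
)

func readCount(p string) string {
    b, err := os.ReadFile(p)
    if err != nil {
        return "?"
    }
    return strings.TrimSpace(string(b))
}

func main() {
    // Each memory controller shows up as an mcN directory; on a
    // system without ECC there may be none at all.
    mcs, _ := filepath.Glob("/sys/devices/system/edac/mc/mc*")
    for _, mc := range mcs {
        fmt.Printf("%s: corrected=%s uncorrected=%s\n",
            filepath.Base(mc),
            readCount(filepath.Join(mc, "ce_count")),
            readCount(filepath.Join(mc, "ue_count")))
    }
}
</pre>
</blockquote>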
<p>The <a href="https://github.com/prometheus/node_exporter">Prometheus host agent</a>
exposes this EDAC information as metrics, primarily in
node_edac_correctable_errors_total and
node_edac_uncorrectable_errors_total. We have seen a few corrected
errors over time on one particular system.</p>
<h3>Sidebar: EDAC on Intel hardware</h3>
<p>While there's an Intel memory controller EDAC driver, I don't know
if it can get registered even if you don't have ECC support. If
it is registered with identified memory controllers, and you can
see eg 'SECDED' as the EDAC mode in /sys/devices/system/edac/mc/mcN,
then I think you can be relatively confident that you have ECC
active on that system. On <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/HomeMachine2018">my home desktop</a>, which
definitely doesn't support ECC, what I see on boot for EDAC (with
Fedora 38's kernel 6.7.4) is:</p>
<blockquote><pre style="white-space: pre-wrap;">
EDAC MC: Ver: 3.0.0
EDAC ie31200: No ECC support
EDAC ie31200: No ECC support
</pre>
</blockquote>
<p>As expected there are no 'mcN' subdirectories in
/sys/devices/system/edac/mc.</p>
<p>Two Intel servers where I'm pretty certain we have ECC support report,
respectively:</p>
<blockquote><pre style="white-space: pre-wrap;">
EDAC MC0: Giving out device to module skx_edac controller Skylake Socket#0 IMC#0: DEV 0000:64:0a.0 (INTERRUPT)
</pre>
</blockquote>
<p>and</p>
<blockquote><pre style="white-space: pre-wrap;">
EDAC MC0: Giving out device to module ie31200_edac controller IE31200: DEV 0000:00:00.0 (POLLED)
</pre>
</blockquote>
<p>As we can see here, Intel CPUs have more than one EDAC driver, depending
on CPU generation and so on. The first EDAC message comes from a system
with a Xeon Silver 4108, the second from a system with a Xeon E3-1230 v5.</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/AMDWithECCKernelMessages?showcomments#comments">One comment</a>.) </div>Linux kernel boot messages and seeing if your AMD system has ECC2024-02-26T21:43:53Z2024-02-13T03:37:18Ztag:cspace@cks.mef.org,2009-03-24:/blog/programming/Go122TypesAliasAndCompatibilitycks<div class="wikitext"><p>Go famously promises <a href="https://go.dev/doc/go1compat">backward compatibility to the first release
of Go</a> and pretty much delivers on
that (although the tools used to build Go programs have changed).
Thus, one may be a bit surprised to read the following about
<a href="https://go.dev/pkg/go/types/">go/types</a> in the <a href="https://go.dev/doc/go1.22">Go 1.22 Release
Notes</a>:</p>
<blockquote><p>The new <a href="https://pkg.go.dev/go/types#Alias"><code>Alias</code></a> type represents
type aliases. Previously, type aliases were not represented
explicitly, so a reference to a type alias was equivalent to spelling
out the aliased type, and the name of the alias was lost. [...]</p>
<p><strong>Because Alias types may break existing type switches that do not
know to check for them</strong>, this functionality is controlled by a
GODEBUG field named <code>gotypesalias</code>. [...] <em>Clients of <a href="https://go.dev/pkg/go/types/">go/types</a>
are urged to adjust their code as soon as possible to work with
<code>gotypesalias=1</code> to eliminate problems early.</em></p>
</blockquote>
<p>(The <strong>bold</strong> emphasis is mine, while the <em>italics</em> are from the release
notes. The current default is <code>gotypesalias=0</code>.)</p>
<p>A variety of things in <a href="https://go.dev/pkg/go/types/">go/types</a> return a <a href="https://pkg.go.dev/go/types#Type"><code>Type</code></a>, which is an interface type that
'represents a type of Go'. Well, more specifically these things
return values of type <code>Type</code>, and these values have various underlying
concrete types. Some code using <a href="https://go.dev/pkg/go/types/">go/types</a> and dealing with <code>Type</code>
values can handle them purely as interfaces, but other code needs
to specifically handle all of the particular types (such as <a href="https://pkg.go.dev/go/types#Array"><code>Array</code></a> and so on). Since <code>Type</code> is an
interface, such code will use a <a href="https://go.dev/ref/spec#Switch_statements">type switch</a> that is supposed to be
exhaustive over all of the concrete types of <code>Type</code> interface values.</p>
<p>Now we can see the problem. When Go introduces a new concrete type
that can be returned as a <code>Type</code> value, those previously exhaustive
type switches stop being exhaustive; there's a new concrete type
that they're not prepared to handle. This could cause various
problems in actual code. And Go has no way of requiring type switches
to be exhaustive, so such code would still build fine but malfunction
at runtime.</p>
<p><a href="https://utcc.utoronto.ca/~cks/space/blog/programming/GoAPIStabilityAndAssumptions">Much like the last time we saw something like this</a>, this change is arguably not an API
break, at least in theory; Go never explicitly promised that there
was a specific and limited list of <a href="https://go.dev/pkg/go/types/">go/types</a> types that implemented
<code>Type</code>, and so in theory Go is free to expand the list. However,
as we can see from the release notes (and the current behavior of
not generating these new <code>Alias</code> types by default), the Go authors
recognize that this is in practice a compatibility break, one that
they're explicitly urging people to be prepared for.</p>
<p>What this shows is that <strong>true long term backward compatibility is very
hard</strong>, and it's especially hard in an area that is inherently evolving,
like exposing information about an evolving language. Getting complete
backward compatibility requires more or less everything about an exposed
API to be frozen, and that generally requires the area to be extremely
well understood (and often pushes towards exposing very minimal APIs,
which has its own problems).</p>
<p>As a side note, I think that Go is handling this change quite well.
They've added the type to <a href="https://go.dev/pkg/go/types/">go/types</a> so that people can add it
to their own code (which will make it require Go 1.22 or later),
and also provided a way that people can test their code (by running
it with the GODEBUG setting gotypesalias=1). At the same time no
actual '<code>Alias</code>' types
will appear (by default) until some time in the future; I'd guess
no earlier than Go 1.24, a year from now.</p>
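<p>(For example, you can try your code against the future default
ahead of time, without changing the code itself, with something
like:)</p>
<pre style="white-space: pre-wrap;">
GODEBUG=gotypesalias=1 go test ./...
</pre>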
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/programming/Go122TypesAliasAndCompatibility?showcomments#comments">4 comments</a>.) </div>Go 1.22's go/types Alias type shows the challenge of API compatibility2024-02-26T21:43:52Z2024-02-12T02:31:08Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/HomeBackupPlans2024cks<div class="wikitext"><p>In theory, what I should do to back up <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/HomeMachine2018">my home desktop</a>
is fairly straightforward. I should get one or two USB hard drives
of sufficient size, then periodically connect one and do a backup
to it (probably using tar, and potentially not compressing the tar
archives to make them more recoverable in the face of disk errors).
If I'm energetic, I'll have two USB hard drives and periodically
rotate one to the office as an offsite backup. <a href="https://utcc.utoronto.ca/~cks/space/blog/tech/SortingOutModernUSB">Modern USB</a> should be fast enough for this,
and hopefully using (fast) USB drives will no longer <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/USBDrivesKillMyPerformance">kill my
performance the way it used to</a>.
Large HDDs are reasonably affordable, especially if I decide
to live with 5400 RPM ones (which I hope run cooler), so I
could store multiple full system backups on a single HDD.</p>
<p>In practice this is a lot of things to remember to do on a regular
basis, and although I have some of the pieces (and have for years),
those pieces have dust on them from disuse. So this approach isn't
workable as a way to get routine backups; at best I might manage
to do it once every few months. So instead I long ago came up with
a plan that is not so much better as more likely to succeed. The
short version of the plan is that I will make backups to an additional
live HDD in my home desktop.</p>
<p>My home desktop's storage used to be a mirrored pair of SSDs and a
mirrored but mismatched pair of HDDs. Back in early 2023, <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SoftwareRaidSwitchingDisks">this
became all solid state</a>, with a pair
of NVMe drives and a pair of SSDs (not the same SSDs, the new pair
is much larger). This leaves me with an unused 4 TB HDD, which I
actually (still) have in the case. So I can reuse this 4 TB HDD as
an always-live backup drive, or what is really 'a second copy'
drive. Because the drive will always be there and live, I can
automate copies to it, run them from cron, and more or less forget
about it (once it's working).</p>
<p>The obvious and most readily automated way to make the backups is
to use ZFS snapshots. I'll make a new ZFS pool on the HDD, and then
use snapshots with 'zfs send' and 'zfs receive' to move them from
the solid state storage to the HDD pool. ZFS's read only snapshots
will ensure that I can't accidentally damage the backup copies, and
I can scrub the HDD's ZFS pool periodically as insurance against
disk corruption. My total space usage in both my current solid
state ZFS pools is still a bit under 2 TB, so I should have plenty
of space for both on a 4 TB HDD.</p>
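<p>(The core of this is a small amount of shell; here's a sketch with
invented pool names, where 'ssdpool' is the live pool and 'bkpool'
is the pool on the HDD, and with details like '-F' on incremental
receives elided:)</p>
<pre style="white-space: pre-wrap;">
# Take a recursive snapshot and copy it to the backup pool.
zfs snapshot -r ssdpool@backup-2024-02-10
zfs send -R ssdpool@backup-2024-02-10 | zfs receive -d bkpool
# Later runs would send only the changes between snapshots:
#   zfs send -R -i @backup-2024-02-10 ssdpool@backup-2024-02-17 | ...
</pre>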
<p>This is obviously imperfect, since various sorts of problems could
cost me both the live storage and the HDD, and I could have ZFS
problems too. But it's a lot better than nothing, and <a href="https://utcc.utoronto.ca/~cks/space/blog/tech/PerfectionTrap">sometimes
the perfect is the enemy of the good</a>.</p>
<p>(Having written this, perhaps I will actually implement it. The
current obstacle is that the old HDDs are still running my old LVM
setup, as backup for the ZFS pool I created on the new SSDs and
then theoretically moved all of the LVM's contents to. So I'd have
to hold my breath and tear down those filesystems and the LVM storage
first. Destroying even supposedly completely surplus data makes me
twitch just a bit, and so far it's been easier to do nothing.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/HomeBackupPlans2024?showcomments#comments">5 comments</a>.) </div>My plan for backups of my home machine (as of early 2024)2024-02-26T21:43:53Z2024-02-11T03:00:55Ztag:cspace@cks.mef.org,2009-03-24:/blog/sysadmin/CompatibilityLingersUnnoticedcks<div class="wikitext"><p>We have <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/OurPasswordPropagation">a system for propagating login and password information
around our fleet</a>. In this system, all
information about user logins flows out from our 'password master'
machine, and each other machine can filter and transform that global
login information as the machine merges it into the local /etc/passwd.
Normal machines use the login information more or less as-is, but
unusual ones can do things like set the shells of all non-staff
accounts to a program that just prints out 'only staff can log in
to this machine' and logs them out. All of this behavior is controlled
by a configuration file that tells the program what to do, by
matching characteristics of logins and then applying transformations
based on what matched. This system has existed for a very long time,
probably since we started significantly using Ubuntu sometime in
late 2006 or 2007.</p>
<p>Because this system is so old, it once existed in a world where we
had a bunch of Solaris servers that users logged in to and the
password master machine itself was a Solaris machine. These Solaris
machines had quite different paths both for some user shells, like
Bash, and 'administrative' shells like the program that told people
this was a staff machine or their account was suspended (<a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/UnixShellsNoMoreAccessControl">this was
back in the days when you could reasonably use shells for that sort
of thing</a>). When we propagated login
entries from these Solaris machines to our new Ubuntu machines, we
needed to change these Solaris paths to Ubuntu paths, and by 'we'
I mean that our password merging and mangling program did. For
reasons beyond the scope of this entry, these Solaris path rewritings
are specified as transformations in the configuration file, although
in practice we applied them all of the time.</p>
<p>We long ago stopped having Solaris login servers or using a Solaris
machine as the password master (that ended at the start of 2010,
which is later than I expected and had vaguely remembered; at that
point our Ubuntu environment was several years old). At the point
where our password master became an Ubuntu server, all of that
remapping of Solaris shell paths was unnecessary. However, our
configuration files for password mangling have faithfully preserved
those boilerplate directives for the Solaris shell path rewriting:
<blockquote><pre style="white-space: pre-wrap;">
@hdir: newhomedir /u fixlocalshell fixadmshell
@all: fixadmshell
</pre>
</blockquote>
<p>These 'fixlocalshell' and 'fixadmshell' directives are the lingering
remains of that Solaris compatibility. They've been unneeded for
more than a decade, but we never really noticed them and so they
stayed. They would still be an ignored layer of now-unneeded
compatibility if I hadn't wound up re-working some of the documentation
for the program today, and in the process realized that we could
and should take them out.</p>
<p>(We should remove them from the configuration file because they're
confusing noise, especially if you don't work with this program very
often and so you have to try to remember what all of the directives
do.)</p>
<p>Are there other places with lingering pieces of compatibility with
Solaris and other now-gone things in our environment? Probably. We
don't particularly look for these things, and often our eyes probably
just pass over them as a background thing that we're accustomed to.
It's how things are done, and we don't think too much about it on
a day to day basis (in other words, <a href="https://utcc.utoronto.ca/~cks/space/blog/programming/ProgrammingViaSuperstition">it's sort of a superstition</a>, and <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/SysadminsAndSuperstitions">also</a>).</p>
</div>
Compatibility lingers long after it's needed (until it gets noticed)2024-02-26T21:43:53Z2024-02-10T04:22:10Ztag:cspace@cks.mef.org,2009-03-24:/blog/unix/XOffscreenIconMistakecks<div class="wikitext"><p>One of the somewhat odd things about my old fashioned <a href="https://en.wikipedia.org/wiki/X_Window_System">X Window
System</a> environment
is that when I 'iconify' or 'minimize' a window, it (<a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/HowIUseFvwmIconMan">mostly</a>) winds up as an actual icon on my
root window (what in some environments would be called the desktop),
<a href="https://utcc.utoronto.ca/~cks/space/blog/unix/XIconificationManyWays">in contrast to the alternate approach where the minimized window
is represented in some sort of taskbar</a>.
I have strong opinions about where some of these icons should go,
and <a href="https://utcc.utoronto.ca/~cks/space/blog/programming/ProgrammerLaziness">some tools</a> to automatically
arrange this for various windows, including <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/EmailToolsAffectMyBehavior">the GNU Emacs windows
I (now) use for reading email</a>.</p>
<p>Recently I started a new MH-E GNU Emacs session from home, did some
stuff, iconified it, and GNU Emacs disappeared entirely. There were
no windows and no icons. I scratched my head, killed the process,
started it up again, and the same thing happened all over again.
Only when I was starting to go through the startup process a third
time did I realize what was going on and the mistake I'd made. You
see, <strong>I'd told <a href="https://fvwm.org/">my window manager</a> to put the
GNU Emacs icons off screen</strong> (to the right) and my window manager
had faithfully obliged me. Normally I could have recovered from
this by moving <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/MyVirtualScreenUsage">my virtual screen</a>
over to the right of where it had previously been, but I'd also
told my window manager to position the icons for GNU Emacs relative
to the current virtual screen, not the one it had been iconified
on.</p>
<p>(In <a href="https://fvwm.org/">fvwm</a> terms, I'd set GNU Emacs to have a
'sticky' icon, which normally means that it stays on your screen
as you <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/DualDisplayVsMultiDesktop">move around between virtual screens</a>.)</p>
<p>How I could do this starts with how I was setting the icon position
for the GNU Emacs session I was reading email in. Unlike (some) X programs,
GNU Emacs doesn't take icon positions as a command line argument
(as far as I know), but it does support <a href="https://www.gnu.org/software/emacs/manual/html_node/elisp/Position-Parameters.html">setting icon positions
through Lisp</a>.
However, I use GNU Emacs on one of our servers to read my email
(<a href="https://utcc.utoronto.ca/~cks/space/blog/tech/LatencyImpactMyXExperience">with X forwarding</a>) from
both my work desktop and my home desktop, and they have different
display configurations; work has two side by side 4K displays, with
the GNU Emacs icons on the right display, and at home I have a
single display (<a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/DualDisplayVsMultiDesktop">and I make more use of multiple virtual screens</a>). Since the icons are
positioned at different spots, I have two Lisp functions to set the
icon position ('home-icon-position' and 'work-icon-position', more
or less).</p>
<p>So that morning, what I did was I started GNU Emacs from my home
machine and ran the 'work-icon-position' function, which told my window
manager (via GNU Emacs) that I wanted the icon to have the left
position of '5070' pixels. Since I was using a single display that
is only 3840 pixels wide, fvwm dutifully carried out my exact
instructions and put the icon 1230 pixels or so off to the right
of my actual display.</p>
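<p>(Mechanically, such a function can be a small wrapper over X frame
parameters; this is a sketch, not my actual code, and the 'icon-top'
value is invented:)</p>
<pre style="white-space: pre-wrap;">
(defun work-icon-position ()
  "Ask the window manager to park this frame's icon at my work spot."
  (interactive)
  ;; 'icon-left and 'icon-top are standard X frame parameters.
  (set-frame-parameter nil 'icon-left 5070)
  (set-frame-parameter nil 'icon-top 0))
</pre>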
<p>(And then fvwm kept the icon 1230 pixels off the right side when I
switched virtual screens, because that's also what I'd told fvwm
to do.)</p>
<p>Icons are (little) windows and X is perfectly happy to let you
position windows off screen (in any direction, you can put windows
at negative coordinates if you want). As you'd expect, a window
that is positioned entirely off screen isn't visible. So the actual
mechanics of this icon position setting were no problem, and <a href="https://fvwm.org/">fvwm</a>
isn't the kind of program that second-guesses you when you position
an icon off screen. So when I positioned the GNU Emacs icons off
screen, fvwm put them off screen and they disappeared.</p>
<p>PS: I could have recovered the iconified Emacs in various ways, for
example by locating it through the window manager and having it deiconify, or
explicitly moving its icon back onto the screen. It was just simpler
and faster, in my state that morning, to terminate an Emacs I hadn't
done much with and try again.</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/unix/XOffscreenIconMistake?showcomments#comments">One comment</a>.) </div>Accidentally making windows vanish in my old-fashioned Unix X environment2024-02-26T21:43:52Z2024-02-09T04:13:17Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/MyMachineDesires2024cks<div class="wikitext"><p>My current <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/WorkMachine2017">work desktop</a> and <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/HomeMachine2018">home desktop</a> are getting somewhat long in the tooth, which has
caused me to periodically think about what I'd want in new hardware
for them. Sometimes I even look at potential hardware choices for
such a replacement desktop (<a href="https://utcc.utoronto.ca/~cks/space/blog/tech/CPUIGPCoolingAdvantage">which can lead to grumbling</a>). Today I want to write down my
ideal broad specifications for such a new desktop, what I'd get if
I could get it all in one spot for an affordable price.</p>
<p>In addition to all of the expected things (like onboard sound),
I'd like:</p>
<ul><li>64 GB of RAM instead of my current 32 GB. It would be nice if it
was ECC RAM in a system that genuinely supported it, and it would
also be nice if it was fast, but those two attributes are often in
opposition to each other.<p>
(Today I suspect this means choosing DDR5 over DDR4.)<p>
</li>
<li>Three motherboard M.2 NVMe drive slots. I'd like three because I
currently have a mirrored pair of NVMe drives, and having a third
slot would let me replace one of the live two without having to
pull it outright. Two motherboard M.2 NVMe slots (both operating
at PCIe x4) is probably my minimum these days, and I already have
a PCIe M.2 NVMe card for the current work desktop.<p>
My work desktop has 500 GB NVMe drives currently and I'd like to
get bigger ones. My home desktop is fine with its current drives.<p>
</li>
<li>At least four SATA ports and ideally more. My office desktop has
two SSDs and a SATA DVD-RW drive (because we still sometimes use
those), and I want to be able to run three SSDs at once while
replacing one of the two SSDs. Six SATA ports would be better,
so perhaps I should say I can live with four SATA ports but I'd
like six.<p>
(My home desktop will also need three SATA ports on a routine
basis with a fourth available for drive replacement, but that's
for another entry.)<p>
</li>
<li>At least three 1G Ethernet ports for my work desktop. Since I don't
think there are any reasonable desktop motherboards with this
many Ethernet ports, this needs at least a dual-port PCIe card
and perhaps a quad-port card, which I already have at work. It
also needs a suitable PCIe slot to be free and usable given any
other cards in the machine. My home desktop can get by with one
port but I'd probably like to have two or three there too.<p>
(I wouldn't need that many but <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/VirtManagerMySetupSoFar">Linux's native virtualization
works best if you give it its own network port</a>.)<p>
Although various desktop motherboards have started offering
speeds above 1G (although often not full 10G-T), our work
wiring situation is such that there's no real prospect of
taking advantage of that any time soon. But if a motherboard
comes with '2.5G' or '5G' networking with a chipset that's
decent and well supported by Linux, I wouldn't say no.<p>
</li>
<li>At least two DisplayPort and/or HDMI outputs that support at least
4K at 60 Hz, and I'd like more for future-proofing. I would prefer
two DisplayPort outputs to a DisplayPort + HDMI pairing; this is
readily available in GPU cards but not really in motherboards and
integrated graphics. At work I currently have two 27" HiDPI
displays and at home I currently have one; in both locations the
biggest constraint on larger displays or more of them is physical
space.<p>
(I'd love it if we were moving into a bright future of high
resolution, high DPI, high refresh rate displays, but I don't
think we are, so I don't really expect to want more than dual 4K
at 60Hz for the next half decade or more. It's possible this is
too pessimistic and there are viable 5K+ monitors that I might
want at home in place of my current 27" 4K HiDPI display.)<p>
</li>
<li>Open source friendly graphics, which in practice excludes Nvidia
GPUs (especially if I care about good Wayland support), and
possibly the discrete Intel GPU cards (I'm not sure of their
state). I think anything reasonably modern will support whatever
OpenGL features Wayland needs or is likely to need. The easy way
to get this might well be integrated graphics on a current
generation CPU, assuming I can get the output ports that I want.<p>
On the other hand, the Intel ARC A380 seems to be okay on Linux
(from some Internet searches), and while it has a fan it's alleged
to be able to operate very quietly. It would give me the multiple
DisplayPort outputs and high resolution, high refresh rate support.<p>
</li>
<li>A decent number of both USB-A and USB-C ports. I'd like a reasonable
number of USB-A ports because I still have a lot of USB-A things
and I'd like not to have a whole collection of USB-A hubs sitting
around on either my office or my home desk. But probably more
hubs (or larger ones) is in my future.</li>
</ul>
<p>I'd like it if the machine still supported old fashioned BIOS MBR
booting and didn't require (U)EFI booting (<a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/BIOSMBRBootingOverUEFI">I have my reasons</a>), although UEFI booting is
probably better on desktop motherboards <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/MBRToUEFIBootFailure">than it used to be</a>. The UEFI story for people who want booting
from mirrored pairs of drives may be better on Fedora than it used
to be, since <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/Ubuntu2204MultiDiskUEFI">Ubuntu 22.04 has some support for duplicate UEFI
boot partitions</a>.</p>
<p>(I'm absolutely not interested in trying to mirror the EFI System
Partition behind the back of the UEFI BIOS.)</p>
<p>It would be nice to get a good CPU performance increase from my
current desktops, but on the one hand I sort of assume that any
decent desktop CPU today is going to be visibly better than something
from more than five years ago, and on the other hand I'm not sure
how noticeable the performance improvement is these days, and on the
third hand <a href="https://utcc.utoronto.ca/~cks/space/blog/tech/ChangingComputerPerformance">I've been wrong before</a>.
If my current (five year old) desktops have reached the point where
CPU performance mostly doesn't matter to me, then I'd probably
prefer to get a midrange CPU with decent thermal performance and
perhaps no funny slow 'efficiency' cores that can give you and
the Linux kernel's CPU scheduling various sorts of heartburn. On the
other hand, my Firefox build times keep getting slower and slower,
so I suspect that the world of software just assumes current CPUs
and current good performance.</p>
<p>PS: I have no plans to do GPU computation on my desktops, for a
variety of reasons including that I don't want to deal with Nvidia
GPUs in my machines. If I need to do GPU stuff for work, <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/SlurmHowWeUseIt">our SLURM
cluster</a> has GPUs, and I don't have to
care how much power they use, how noisy they are, and how much heat
they put out because they're in the machine room (and I'm not).</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/MyMachineDesires2024?showcomments#comments">7 comments</a>.) </div>What I'd like in a hypothetical new desktop machine in 20242024-02-26T21:43:53Z2024-02-08T04:50:44Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/NFSv4MaxConnectEffectscks<div class="wikitext"><p>Suppose, not hypothetically, that you've converted your fleet from
using NFS v3 to using <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/NFSv4BasicsJustWork">basic Unix security NFS v4 mounts</a> when they mount their hordes of NFS filesystems
from <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSFileserverSetupIII">your NFS fileservers</a>. When your NFS
clients boot or at some other times, you notice that you're getting
a bunch of copies of a new kernel message:</p>
<blockquote><pre style="white-space: pre-wrap;">
SUNRPC: reached max allowed number (1) did not add transport to server: <IP address>
</pre>
</blockquote>
<p>Modern NFS uses TCP, which means that the NFS client needs to make
some number of TCP connections to each NFS server. In NFS v3, <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/NFSOneTCPConnectionToAServer">Linux
normally only makes one connection to each server</a>. The same is sort of true in NFS v4
as well, but NFS v4 is more complex about what is 'a server'. In
NFS v3, servers are identified by at least their IP address (and
perhaps their name; I'm not sure if two different names that map
to the same IP will share the same connection). In NFS v4.1+, servers
have some sort of intrinsic identity that is visible to clients
even if you're talking to them by multiple IP addresses.</p>
<p>This new 'reached max allowed number (<N>) did not add transport
to server' kernel message is reporting about this case. You (we)
have a single NFS server that for historical reasons has two different
IPs, one for most of its filesystems and one for <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/OurPasswordPropagation">our central
administrative filesystem</a>,
and now NFS v4 considers these the 'same' server and won't make an
extra connection to the second IP.</p>
<p>You might wonder if you can change this, and the answer is that you
can but it gets complex and I'm not quite sure how it all works to
distribute the actual NFS traffic. There appear to be two interlinked
things that you can control; how many connections a NFS v4 client
will make to a single NFS server, and how many different IPs of the
server that NFS v4 client will connect to. How many connections NFS
v4 will make to a single server is mostly controlled by <a href="https://man7.org/linux/man-pages/man5/nfs.5.html">nfs(5)</a>'s <code>nconnect</code>
setting, <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/NFSv3NConnectEffects">sort of like <code>nconnect</code>'s behavior with NFS v3</a>. How many connections NFS v4 will make to
separate server IPs is controlled by '<code>max_connect</code>'. Both of
these default to 1. However, how they interact is confusing and I'm
not sure I fully understand it.</p>
<p>The easy case is not setting nconnect and setting max_connect
to at least as many different IP aliases as you have for each
fileserver. In this case you'll get one TCP connection per server
IP (although don't ask me what traffic flows over what connection).
If you set nconnect without max_connect, you'll get however
many connections to the first IP address of each server (well, the
first IP address that the client finds), assuming that you mount
at least that many NFS filesystems from that server.</p>
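<p>(In /etc/fstab terms the easy case is just a mount option; the
server, filesystem, and mount point here are invented, and since we
have a server with two IPs the limit is 2:)</p>
<pre style="white-space: pre-wrap;">
fs1:/w/281  /w/281  nfs  vers=4.2,max_connect=2  0  0
</pre>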
<p>However, if you set both nconnect and max_connect, what seems
to happen (on Ubuntu 22.04) is that you get nconnect TCP
connections to each server's first (encountered) IP address, and
then one TCP connection to every other IP address (up to the
max_connect limit). This is why I described 'nconnect' as
controlling how many connections NFS v4 would make to a single
server, instead of a single server IP (or name). It would be a bit
more useful if you could set nconnect on a per-IP (or name) basis
in NFS v4, or otherwise make it so that the first IP didn't get all
of the connections.</p>
<p>(This is apparently called 'trunking' in NFS v4, per <a href="https://datatracker.ietf.org/doc/html/rfc5661#section-2.10.5">RFC 5661
section 2.10.5</a>
(<a href="https://www.truenas.com/community/threads/nfsv4-1-session-trunking-multipath-support-not-nconnect-or-pnfs.112215/">via</a>).)</p>
</div>
What the <code>max_connect</code> Linux NFS v4 mount parameter seems to do2024-02-26T21:43:53Z2024-02-07T03:49:05Ztag:cspace@cks.mef.org,2009-03-24:/blog/sysadmin/TrackingMachineImportancecks<div class="wikitext"><p>Today <a href="https://mastodon.social/@cks/111881322231361439">we had a significant machine room air conditioning failure
in our main machine room</a>,
one that certainly couldn't be fixed on the spot ('glycol all over
the roof' is not a phrase you really want to hear about your AC's
chiller). To keep the machine room's temperature down, we had to
power off as many machines as possible without too badly affecting
the services we offer to people here, <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/OurDifferentSysadminEnvironment">which are rather varied</a>. Some choices were obvious; all
of <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/SlurmHowWeUseIt">our SLURM nodes</a> that were in the main machine
room got turned off right away. But others weren't things we
necessarily remembered right away or we weren't clear if they were
safe to turn off and what effects it would have. In the end we went
through several rounds of turning servers off, looking at what was left,
spotting remaining machines, and turning more things off, and we're
probably not done yet.</p>
<p>(We have secondary machine room space and we're probably going to
have to evacuate servers into it, too.)</p>
<p>One thing we could do to avoid this flailing in the future is to
explicitly (try to) keep track of which machines are important and
which ones aren't, to pre-plan which machines we could shut down
if we had a limited amount of cooling or power. If we documented
this, we could avoid having to wrack our brains at the last minute
and worry about dependencies or uses that we'd forgotten. Of course
documentation isn't free; there's an ongoing amount of work to write
it and keep it up to date. But possibly we could do this work as
part of deploying machines or changing their configurations.</p>
<p>(This would also help identify machines that we didn't need any
more but hadn't gotten around to taking out of service, which we
found a couple of in this iteration.)</p>
<p>Writing all of this just in case of further AC failures is probably
not all that great a choice of where to spend our time. But writing
down this sort of thing can often help to clarify how your environment
is connected together in general, including things like what will
probably break or have problems if a specific machine (or service)
is out, and perhaps which people depend on what service. This can
be valuable information in general. The machine room archaeology
of 'what is this machine, why is it on, and who is using it' can
be fun occasionally, but you probably don't want to do it regularly.</p>
<p>(Will we actually do this? I suspect not. When we deploy and start
using a machine its purpose and so on feel obvious, because we have
all of the context.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/TrackingMachineImportance?showcomments#comments">9 comments</a>.) </div>We might want to regularly keep track of how important each server is2024-02-26T21:43:53Z2024-02-06T04:14:53Ztag:cspace@cks.mef.org,2009-03-24:/blog/python/DjangoExplicitImportsSwitchcks<div class="wikitext"><p>When I wrote <a href="https://utcc.utoronto.ca/~cks/space/blog/python/DjangoORMDesignPuzzleII">our Django application</a> it
was a long time ago, I didn't know Django, and I was sort of in a
hurry, so I used what I believe was the style at the time for Django
of often doing broad imports of things from both Django modules
and especially the application's other modules:</p>
<blockquote><pre style="white-space: pre-wrap;">
from django.conf.urls import *
</pre>
<pre style="white-space: pre-wrap;">
from accounts.models import *
</pre>
</blockquote>
<p>This wasn't universal; even at the time it was apparently partly
the style to import only specific things from Django modules, and
I followed that style in our code.</p>
<p>However, <a href="https://utcc.utoronto.ca/~cks/space/blog/python/DjangoAppNowPython3">when I moved the application to Python 3</a>
I also switched all of these over to specific imports. This wasn't
required by Django (or by Python 3); instead, I did it because it
made my editor complain less. Specifically it made Flycheck in GNU
Emacs complain less (in my setup). I decided to do this change
because I wanted to use Flycheck's list of issues to check for
other, more serious issues, and because Flycheck specifically listed
all of the missing or unknown imports. Because Flycheck listed them
for me, I could readily write down everything it was reporting and
see the errors vanish. When I had everything necessary imported,
Flycheck was nicely quiet (about that).</p>
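<p>(The resulting specific imports look like you'd expect; the model
and function names here are invented for illustration:)</p>
<pre style="white-space: pre-wrap;">
from django.urls import re_path

from accounts.models import Account, Sponsor, findperson
</pre>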
<p>Some of the import lines wound up being rather long (as you can
imagine, the application's views.py uses a lot of things from our
models.py). Even still, this is probably better for a future version
of me who has to look at this code later. Some of what comes from
the application models is obvious (like core object types), but not
all of it; I was using some imported functions as well, and now the
imports explicitly list where they come from. And for Django
modules, now I have a list of what I'm using from them (often not
much), so if things change in a future Django version (such as the
move from django.conf.urls to django.urls), I'll be better placed
to track down the new locations and names.</p>
<p>In theory I could have made this change at any time. In practice,
I only made it once I'd configured GNU Emacs for good Python editing
and learned about Flycheck's ability to show me the full error list.
Before then all of the pieces were too spread apart and too awkward
for me to reach for.</p>
<p>(Of course, this isn't the first time that my available tools have
influenced how I programmed in a way that I noticed.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/python/DjangoExplicitImportsSwitch?showcomments#comments">One comment</a>.) </div>I switched to explicit imports of things in our Django application2024-02-26T21:43:53Z2024-02-05T02:50:01Ztag:cspace@cks.mef.org,2009-03-24:/blog/python/DjangoSolvingProblemSidewayscks<div class="wikitext"><p>A few years ago I wrote about <a href="https://utcc.utoronto.ca/~cks/space/blog/python/DjangoErrorPropagationIssue">an issue with propagating some
errors in our Django application</a>. We
have two sources of truth for user authorization, one outside of
Django (in Unix group membership that was used by <a href="https://utcc.utoronto.ca/~cks/space/blog/web/ApacheBasicAuthWhy">Apache HTTP
Basic Authentication</a>), and one inside
Django in a 'users' table; these two can become desynchronized,
with someone in the Unix group but not in the application's users
table. The application's 'retrieve a user record' function either
returns the user record or raises an Http404 exception that Django
automatically handles, which means that someone who hasn't been
added to the user table will get 404 results for every URL, which
isn't very friendly. I wanted to handle this by finding a good way
to render a different error page in this case, either by customizing
what the 'Http404' error page contained or by raising a different
error.</p>
<p>All of this is solving the problem in the obvious way and also a
cool thing to (try to) do in Django. Who doesn't want to write
Python code that handles exceptional cases by, well, raising
exceptions and then having them magically caught and turn into
different rendered pages? But Django doesn't particularly support
this, although I might have been able to add something by writing
an application specific piece of Django middleware that worked by
catching our custom 'no such user' exception and rendering an
appropriate template as the response. However, this would have been
my first piece of middleware, so I held off trying anything here
until <a href="https://utcc.utoronto.ca/~cks/space/blog/python/DjangoAppNowPython3">we updated to a modern version of Django</a>
(partly in the hopes it might have a solution).</p>
<p>Then, recently a simpler but rather less cool option to deal with
this whole issue occurred to me. We have a Django management command
that checks our database for consistency in various ways (for
example, unused records of certain types, or people in the application's
users table who no longer exist), which we run every night (from
cron). Although it was a bit of a violation of 'separation of
concerns', I could have that command know about the Unix group(s)
that let people through Apache, and then have it check that all of
the group members were in the Django user table. If people were
omitted, we'd get a report. This is pretty brute force and there's
nothing that guarantees that the command's list of groups stays in
synchronization with our Apache configuration, but it works.</p>
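<p>(A sketch of the idea in Python; the group name, model, and field
here are all invented, and our real command checks a number of other
things too:)</p>
<pre style="white-space: pre-wrap;">
# Hypothetical consistency check: report Unix group members who are
# missing from the application's users table.
import grp

from django.core.management.base import BaseCommand

from accounts.models import AppUser  # invented model name

# This tuple has to be kept in sync with the Apache configuration by
# hand; nothing enforces it.
APACHE_GROUPS = ("appusers",)

class Command(BaseCommand):
    help = "Report people Apache lets in but the app doesn't know about"

    def handle(self, *args, **options):
        known = set(AppUser.objects.values_list("login", flat=True))
        for group in APACHE_GROUPS:
            # gr_mem only lists supplementary members, which is fine
            # for a group that exists purely for access control.
            for login in grp.getgrnam(group).gr_mem:
                if login not in known:
                    self.stdout.write(f"{login}: in Unix group {group} "
                                      "but not in the users table")
</pre>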
<p>It's also a better experience for people than the cool way I was
previously considering, because it lets us proactively fix the
problem before people encounter it, instead of only reactively
fixing it after someone runs into this and reports the issue to us.
Generally, we'll add someone to the Unix group, forget to add them
to Django, and then get email about it the next day before they'll
ever try to use the application, letting us transparently fix our
own mistake.</p>
<p>(This feels related to <a href="https://utcc.utoronto.ca/~cks/space/blog/python/DjangoAdminNote">something I realized very early about not
trying to do everything through Django's admin interface</a>.)</p>
</div>
Solving one of our Django problems in a sideways, brute force way2024-02-26T21:43:53Z2024-02-04T02:44:04Ztag:cspace@cks.mef.org,2009-03-24:/blog/programming/MHENarrowToPendingcks<div class="wikitext"><p>I recently switched from reading my email with <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/ToolsEmail">exmh</a>, a graphical X frontend to <a href="https://www.nongnu.org/nmh/">(N)MH</a>, to reading it with <a href="https://www.gnu.org/software/emacs/manual/html_mono/mh-e.html">MH-E</a> in
GNU Emacs, which is also a frontend to <a href="https://www.nongnu.org/nmh/">(N)MH</a>. I had a certain
amount of customizations to exmh, and for reasons beyond the scope
of this entry, <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/CustomizationSensibleLimits">I wound up with more for MH-E</a>. One of those customizations
is a new MH-E command (and keybinding for it), <code>mh-narrow-to-pending</code>.</p>
<p>Both exmh and MH-E process deleting messages and refiling them to
folders in two phases. In the first phase you read your email and
otherwise go over the current folder, marking messages to be deleted
and refiled; once you're satisfied, you tell them to actually execute
these pending actions. MH-E also has a general feature to <a href="https://www.gnu.org/software/emacs/manual/html_mono/mh-e.html#Limits">limit what
messages are listed in the current folder</a>. In
Emacs jargon this general idea is known as <em>narrowing</em>, and there's
various tools to 'narrow' the display of buffers to something of
current interest. My customization narrows to show only the messages
in the current folder that have pending actions on them; these are
the messages that will be affected if you execute your pending
actions.</p>
<p>So here's the code:</p>
<pre style="white-space: pre-wrap;">
(defun mh-narrow-to-pending ()
  "Narrow to any message with a pending refile or delete."
  (interactive)
  (if (not (or mh-delete-list mh-refile-list))
      (message "There are no pending deletes or refiles.")
    ;; Start from scratch, dropping any 'mh-e-pending sequence left
    ;; over from a previous narrowing.
    (when (assoc 'mh-e-pending mh-seq-list) (mh-delete-seq 'mh-e-pending))
    ;; Messages marked for deletion are a flat list of message numbers.
    (when mh-delete-list (mh-add-msgs-to-seq mh-delete-list 'mh-e-pending t t))
    ;; Messages marked for refiling are grouped by destination folder,
    ;; so collect the message numbers from every folder's entry.
    (when mh-refile-list
      (mh-add-msgs-to-seq
       (cl-loop for folder-msg-list in mh-refile-list
                append (cdr folder-msg-list))
       'mh-e-pending t t))
    ;; Then narrow the folder display to just that sequence.
    (mh-narrow-to-seq 'mh-e-pending)))
</pre>
<p>(This code could probably be improved, and reading it I've discovered
that I've already forgotten what parts of it do and the details of how
it works, although the broad strokes are obvious.)</p>
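<p>(Binding the new command to a key is then the usual sort of
one-liner; the key choice here is just an example, not necessarily
what I use:)</p>
<pre style="white-space: pre-wrap;">
(define-key mh-folder-mode-map (kbd "'") 'mh-narrow-to-pending)
</pre>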
<p>Writing this code required reading the existing MH-E code to find
out how it did narrowing and how it marked messages that were going
to be refiled or deleted. In the usual GNU Emacs way, this is not
a documented extension API for MH-E, although in practice it's
unlikely to change and break my code. To the best of my limited
understanding of making your own tweaks for GNU Emacs modes like
MH-E, this is basically standard practice; generally you grub around
in the mode's ELisp source, figure things out, and then do things on
top of it.</p>
<p>There are two reasons that I never tried to write something like
this for exmh. The first is that exmh doesn't do anywhere near as
much with the idea of 'narrowing' the current folder display. The
other is that <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/EmailToolsAffectMyBehavior">I wound up using the two differently</a>. In MH-E, it's become quite
common for me to pick through my inbox (or sometimes other folders)
for messages that I'm now done with, going far enough back in one
pass that I wind up with a sufficient patchwork that I want to
double check what exactly I'm going to be doing before I commit my
changes. Since I can easily narrow to messages in general, narrowing
to see these pending changes was a natural idea.</p>
<p>(Picking through the past week or more of email threads in my inbox
has become a regular Friday activity for me, especially given that
<a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/EmailToolsAffectMyBehavior">MH-E has a nice threaded view</a>.)</p>
<p>It's easy to fall into the idea that any readily extendable program
is kind of the same, because with some work you can write plugins,
extensions, or other hacks that make it dance to whatever tune you
want. What my experience with extending MH-E has rubbed my nose
into is that the surrounding context matters in practice, both in
how the system already works and in what features it offers that
are readily extended. 'Narrow to pending' is very much an MH-E hack.</p>
</div>
One of my MH-E customizations: 'narrow-to-pending' (refiles and deletes)2024-02-26T21:43:52Z2024-02-03T03:57:05Ztag:cspace@cks.mef.org,2009-03-24:/blog/python/DjangoAppNowPython3cks<div class="wikitext"><p>We have <a href="https://utcc.utoronto.ca/~cks/space/blog/python/DjangoORMDesignPuzzleII">a long standing Django web application</a>
to handle the process of people requesting Unix accounts <a href="https://support.cs.toronto.edu/">here</a> and having the official sponsor
of their account approve it. For a long time, this web app was stuck
on Python 2 and Django 1.10 after <a href="https://utcc.utoronto.ca/~cks/space/blog/python/Django111CSRFFailures">a failed attempt to upgrade to
Django 1.11 in 2019</a>. Our reliance on Python
2 was obviously a problem, and with the not so far off end of life
of Ubuntu 20.04 it was getting more acute (we use Apache's
mod_wsgi, and <a href="https://utcc.utoronto.ca/~cks/space/blog/python/Ubuntu2204PythonState">Ubuntu 22.04 and later don't have a Python 2
version of that for obvious reasons</a>).
Recently I decided I had to slog through the process of moving to
Python 3 and a modern Django (one that is actually supported) and
it was better to start early. To my pleasant surprise the process
of bringing it up under Python 3 and Django 4.2 was much less work
than I expected, and <a href="https://mastodon.social/@cks/111824260017176518">recently we migrated the production version</a>. At this point
it's been running long enough (and has done enough) that I'm calling
this upgrade a success.</p>
<p>There are a number of reasons for this smooth and rapid sailing.
For a start, <a href="https://utcc.utoronto.ca/~cks/space/blog/python/DjangoAppPython3Surprise">it turns out that my 2019 work to bring the app up
under Python 3</a> covered most of the work
necessary, although <a href="https://utcc.utoronto.ca/~cks/space/blog/python/DjangoPython3FieldEncodingGotcha">not all of it</a>.
Our previous problems with <a href="https://en.wikipedia.org/wiki/Cross-site_request_forgery">CSRF</a> and
Apache HTTP Basic Authentication have either been sidestepped by
Django changes since 1.11 or perhaps <a href="https://utcc.utoronto.ca/~cks/space/blog/python/DjangoCSRFAndSessions">mitigated by Django configuration
changes based on a greater understanding of this area</a> that I worked out two years ago. And <a href="https://utcc.utoronto.ca/~cks/space/blog/python/DjangoGoalsNotOurGoals">despite
some grumpy things I've said about Django in the past</a>, our application needed very few changes to
go from Django 1.10 to Django 4.2.</p>
<p>(Most of the Django changes seem to have been moving from 'load
staticfiles' to 'load static' in templates, and replacing use of
django.conf.urls.url() with django.urls.re_path(), although we
could probably do our URL mapping better if we wanted to. There are
other minor changes, like importing functions from different places,
changing request.POST.has_key(X) to X in request.POST, and
defining DEFAULT_AUTO_FIELD in our settings.)</p>
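<p>(As a concrete illustration of the URL mapping change, with an
invented pattern and view name:)</p>
<pre style="white-space: pre-wrap;">
# Django 1.10 style:
#   from django.conf.urls import url
#   urlpatterns = [url(r'^request/$', views.makerequest)]
# Django 4.2 style:
from django.urls import re_path

from . import views  # hypothetical views module

urlpatterns = [
    re_path(r'^request/$', views.makerequest),
]
</pre>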
<p>Having this migration done and working takes a real load off of my
mind for the obvious reasons; neither Python 2 nor Django 1.10 are
what we should really be using today, even if they work, and now
we're free to upgrade the server hosting this web application beyond
Ubuntu 20.04. I'm also glad that it took relatively little work now.</p>
<p>(Probably this will make me more willing to keep up to date with
Django versions in the future. We're not on Django 5.0 because it
requires a more recent version of Python 3 than Ubuntu 20.04 has,
but that will probably change this summer or fall as we start
upgrades to Ubuntu 24.04.)</p>
</div>
Our Django application is now using Python 3 and a modern Django2024-02-26T21:43:53Z2024-02-02T04:06:25Ztag:cspace@cks.mef.org,2009-03-24:/blog/tech/IPv6NowReliableForMecks<div class="wikitext"><p>I've had IPv6 at home for a long time, first in tunneled form and
<a href="https://utcc.utoronto.ca/~cks/space/blog/tech/IPv6ComplicationsAgain">later in native form</a>, and recently I
brought up <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/Ubuntu2204WireGuardIPv6Gateway">more or less native IPv6 for my work desktop</a>. When I first started
using IPv6 (at home) and for many years afterward, there were <a href="https://utcc.utoronto.ca/~cks/space/blog/tech/IPv6IsGoingToBeFun">all</a> <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/IPv6ConfigurationFun">sorts</a> of
complications and <a href="https://utcc.utoronto.ca/~cks/space/blog/tech/IPv6ComplicationsAgain">failures</a> that could
be attributed to IPv6 or that went away when I turned off IPv6. To
be honest, when I enabled IPv6 on my work desktop I expected to run
into a fun variety of problems due to this, since before then it
had been IPv4 only.</p>
<p>To my surprise, my work desktop has experienced no problems since
enabling IPv6 connectivity. I know I'm using some websites over
IPv6 and I can see IPv6 traffic happening, but at the personal
level, I haven't noticed anything different. When I realized that,
I thought back over my experiences at home and realized that it's
been quite a while since I had a problem that I could attribute to
IPv6. Quietly, while I wasn't particularly noticing, the general
Internet IPv6 environment seems to have reached a state where it
just works, at least for me.</p>
<p>Since <a href="https://utcc.utoronto.ca/~cks/space/blog/tech/IPv6IsTheFuture">IPv6 is everyone's future</a>, this is good
news. We've been collectively doing this for long enough and IPv6
usage has climbed enough that it should be as reliable as IPv4, and
hopefully people don't make <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/IPv6ConfigurationFun">common oversights</a> any more. Otherwise, we would
collectively have a real problem, because turning on IPv6 for more
and more people would be degrading the Internet experience of more
and more people. Fortunately that's (probably) not happening any
more.</p>
<p>I'm sure that there are still IPv6 specific issues and problems
that come up, and there will be more for a long time to come (until
perhaps they're overtaken by year 2038 problems). But then you can
have problems that are specific to anything, including IPv4 (and
people may already be having those).</p>
<p>(As more people add IPv6 to servers that are currently IPv4 only,
we may also see a temporary increase in IPv6 specific problems as
people go through 'learning experiences' of operating IPv6 environments.
I suspect that <a href="https://support.cs.toronto.edu/">my group</a> will
have some of those when we eventually start adding IPv6 to various
parts of our environment.)</p>
</div>
Using IPv6 has quietly become reliable (for me)2024-02-26T21:43:52Z2024-02-01T03:26:21Ztag:cspace@cks.mef.org,2009-03-24:/blog/python/VenvsAndEmbeddedPythoncks<div class="wikitext"><p>When I wrote about <a href="https://utcc.utoronto.ca/~cks/space/blog/python/PythonVenvAndLSPServer">getting the Python LSP server working with venvs
in a brute force way</a>, Ian Z aka nobrowser
commented (and I'm going to quote rather than paraphrase):</p>
<blockquote><p>I'd say that venvs themselves are "aesthetically displeasing". After
all, having a separate Python executable for every project differs
from having a separate LSP in degree only.</p>
</blockquote>
<p>On Unix, <a href="https://utcc.utoronto.ca/~cks/space/blog/python/VenvsAndPythonBinary">this separate executable is normally only a symbolic
link</a>, although other platforms may differ
and the venv normally will have its own copy of pip, setuptools,
and some other things, which can amount to 20+ Mbytes even on Linux.
However, when I thought about it, I don't think there's any good
option other than for the venv to have its own (nominal) copy of
Python. The core problem is that <a href="https://utcc.utoronto.ca/~cks/space/blog/python/VenvsAndSysPath">venvs are very convenient when
they're more or less transparently activated</a>.</p>
<p>A Python venv is marked by a special file in the root of the venv,
<a href="https://docs.python.org/3/library/venv.html">pyvenv.cfg</a>. There
are two ways that Python could plausibly decide when to automatically
activate a venv without you having to set any environment variables;
it can look around the environment of the Python executable you ran
for this marker (which is what it does today), or it could look
around the environment of your current directory, traversing up the
filesystem to see if it could find a pyvenv.cfg (in much the same
way that version control systems look for their special .git or .hg
directory to mark the repository root).</p>
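<p>(A rough Python sketch of the executable-based rule; CPython's real
startup logic is rather more involved than this:)</p>
<pre style="white-space: pre-wrap;">
# Rough sketch: find the pyvenv.cfg, if any, that governs this interpreter.
import os
import sys

def find_pyvenv_cfg(executable=sys.executable):
    # Deliberately don't resolve symlinks; a venv's bin/python3 is
    # normally a symlink to the real interpreter, and resolving it
    # would lose the venv's location.
    bindir = os.path.dirname(os.path.abspath(executable))
    cfg = os.path.join(os.path.dirname(bindir), "pyvenv.cfg")
    return cfg if os.path.exists(cfg) else None

print(find_pyvenv_cfg())
</pre>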
<p>The problem with automatically activating a venv based on what you
find in the current directory and its parents is that it makes
Python programs (and the Python interpreter) behave differently
depending on where you are when you run them, including random
system utilities that just happen to be written in Python. If the
program requires any packages beyond the standard library, it may
well fail outright because those packages aren't installed in the
venv, and if they are installed in the venv they may not be the
version the program needs or expects. This isn't a particularly
good experience and I'm pretty confident that people would be very
unhappy if this was what Python did with venvs.</p>
<p>The other option is to not automatically activate venvs at all and
always require you to set environment variables (or the local
equivalent). The problem for this is that it's a terrible experience
for actually using venvs to, for example, deploy programs as
encapsulated entities. You can't just ship the venv and have people
run programs that have been installed into its bin/ subdirectory;
now they need cover scripts to set the venv environment variables
(which might be automatically generated by pip or whatever, but
still).</p>
<p>So on the whole embedding the Python interpreter seems the best
choice to me. That creates a clear logic to which venv is automatically
activated, if any, that can be predicted by people; it's the venv
whose Python you're running. Of course I wish it didn't take all
of that disk space for extra copies of pip and setuptools, but you
can't have everything.</p>
</div>
Putting a Python executable in venvs is probably a necessary thing2024-02-26T21:43:53Z2024-01-31T02:28:13Ztag:cspace@cks.mef.org,2009-03-24:/blog/sysadmin/GrafanaLokiStartupWALReplayIssuecks<div class="wikitext"><p>We have now had two instances where restarting <a href="https://grafana.com/oss/loki/">Grafana Loki</a> caused it to stop working.
Specifically, shortly after restart, Loki began logging a flood of
mysterious error messages of the form:</p>
<blockquote><pre style="white-space: pre-wrap;">
level=warn ts=2024-01-29T19:01:30.[…]Z caller=logging.go:123 [...] msg="POST /loki/api/v1/push (500) 148.309µs Response: \"empty ring\\n\" [...] User-Agent: promtail/2.9.4; [...]"
</pre>
</blockquote>
<p>This is obviously coming from <a href="https://grafana.com/docs/loki/latest/send-data/promtail/">promtail</a> trying
to push logs into Loki, but I got 'empty ring' errors from trying
to query the logs too. With a flood of error messages and these
messages not stopping, both times I resorted to stopping Loki and
<a href="https://mastodon.social/@cks/110272386392078438">deleting and restarting</a>
its <a href="https://mastodon.social/@cks/111840849857344468">log database</a>
(which we've also had to do for <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/GrafanaLokiNoChunkCompaction">other reasons</a>).</p>
<p>As far as I can tell from Internet searches, what Loki's 'empty
ring' error message actually means here is that some component of
Loki has not (yet) started properly. Although we operate it in <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/GrafanaLokiSimpleNotRecommended">an
all-in-one configuration that I can't recommend</a>, Loki is 'supposed' to be operated
as a cooperative fleet of <a href="https://grafana.com/docs/loki/latest/get-started/components/">a whole horde of individual microservices</a>, or
<a href="https://grafana.com/blog/2023/12/28/the-concise-guide-to-loki-how-to-get-the-most-out-of-your-query-performance/">at least as three separate services ("three-target" mode)</a>.
To operate in these modes with possibly multiple instances of each
(micro)service, Loki uses <a href="https://grafana.com/docs/loki/latest/get-started/hash-rings/">hash rings</a> to
locate which instance of a particular component should be used.
When Loki reports an 'empty ring' error, what it means is that
there's nothing registered in the hash ring it attempted to use to
find an instance. Which hash ring? Loki doesn't tell you; you're
presumably expected to deduce it from context. Although we're
operating Loki in an all-in-one configuration, Loki apparently still
internally has hash rings (most of them probably with exactly one
thing registered) and those hash rings can be empty if there are
issues.</p>
<p>(As best I can tell from Loki metrics, our current configuration
has hash rings for the ingester, the (index) compactor, and the
(query?) scheduler.)</p>
<p>Since this error comes from promtail log pushing, the most likely
component to have not registered itself is the <a href="https://grafana.com/docs/loki/latest/get-started/components/#ingester">ingester</a>,
which receives and processes incoming log lines, eventually writing
them to your <em>chunk storage</em>, which in our case is the filesystem.
The ingester doesn't immediately write each new log line to storage;
instead it aggregates them into those chunks in memory and then
writes chunks out periodically (when they are big enough or old
enough). To avoid losing chunks that aren't yet full if Loki is
stopped for some reason, the ingester uses a <a href="https://grafana.com/docs/loki/latest/operations/storage/wal/">write ahead log (WAL)</a>. As
the Loki documentation says, when the ingester restarts (which in
our case means when Loki restarts), it must replay the WAL into
memory before 'registering itself as ready for subsequent writes'
to quote directly from the documentation. I have to assume that
what Loki really does is that the ingester replays the WAL before
adding itself to the ingester hash ring. So while WAL replay is
happening there is probably no ingesters registered in your ingester
hash ring and attempts to push logs will fail (well, be rejected)
with 'empty ring'.</p>
<p>Due to <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/GrafanaLokiCardinalityProblem">how Loki actually creates and stores chunks and doesn't
compact them</a> we try to have as few
chunks as possible, which means that we use a very long chunk
lifetime and a large maximum chunk size. This naturally leads to
a lot of in-progress chunks and log data sitting in the memory of
our Loki process, and (probably) a big WAL, although how big will depend partly on
timing. The ingester's WAL can be configured to be flushed on regular
shutdown (the <a href="https://grafana.com/docs/loki/latest/configure/#ingester">flush_on_shutdown wal option</a>) but we
have historically turned this off so that Loki restarts don't flush
out a bunch of small chunks (plus flushing a big WAL will take
time). So after our Loki has been running for long enough, when it
shuts down it will have a large WAL to replay on startup.</p>
<p>So what I believe happened is that our configuration wound up with
a very big ingester WAL, and when Loki started, the ingester just
sat there replaying the WAL (which is actually visible in the Loki
metrics like <code>loki_ingester_wal_replay_active</code> and
<code>loki_ingester_wal_recovered_bytes_total</code>). Since the ingester
was not 'ready', it did not register in the ingester hash ring, and
log pushing was rejected with 'empty ring'. Probably if I had left
Loki alone long enough (while it spewed messages into the log), it
would have finished replaying the WAL and all would have been fine.
There's some indication in historical logs that this has actually
happened in the past when we did things like reboot the Loki host
machine for kernel updates, although to a lesser extent than this
time. Deleting and restarting the database fixes the problem for
the obvious reason that with no database there's no WAL.</p>
<p>(This didn't happen on my Loki test machine because my test machine
has far fewer things logging to it, only a couple of other test
machines. And this also explains how <a href="https://mastodon.social/@cks/110272386392078438">the first time around</a>, reverting to our
previous Loki version didn't help. We'd have seen the same problem
if we'd restarted Loki without an upgrade, which is <a href="https://mastodon.social/@cks/111840885225899259">accidentally
what happened this time</a>.)</p>
<p>Probably the most important fix to this is to enable flushing the
WAL to chunk storage on shutdown (along with vastly lengthening
systemd's shutdown timeout for Loki, since this flushing may take
a while). In practice we restart Loki very infrequently, so this
won't add too many chunks (although it will make me more reluctant
to restart Loki), and when it works it will avoid having to replay
the WAL on startup. A related change is to raise the ingester wal
parameter <code>replay_memory_ceiling</code>, because otherwise we'll wind
up flushing a bunch of chunks on startup if we start with a big WAL
(for example, if the machine lost power or crashed). And the broad
fix is to not take 'empty ring' failures seriously unless they last
for quite a long time. How long is a long time? I don't know, but
probably at least ten minutes after startup.</p>
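<p>For illustration, here's roughly what that looks like in the
ingester section of a Loki configuration file. This is a sketch
rather than our actual configuration; the WAL directory and the
ceiling value here are made-up examples, and you should check the
option names against the documentation for your Loki version:</p>
<blockquote><pre style="white-space: pre-wrap;">
ingester:
  wal:
    enabled: true
    dir: /var/lib/loki/wal
    flush_on_shutdown: true
    # Raised from the default so that a big WAL can be replayed
    # into memory instead of being flushed out as many small chunks.
    replay_memory_ceiling: 8GB
</pre>
</blockquote>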
<p>(I believe that <a href="https://grafana.com/docs/loki/latest/send-data/promtail/">promtail</a> will keep retrying after receiving
this reply from Loki, and we have relatively long retry times
configured for promtail before it starts discarding logs. So if
this issue clears after ten or twenty minutes, the only large scale
harm is massive log spam.)</p>
<p>PS: Based on past experience, I won't know if I'm right for a fairly
long time, probably at least close to a year.</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/GrafanaLokiStartupWALReplayIssue?showcomments#comments">2 comments</a>.) </div>What I think goes wrong periodically with our Grafana Loki on restarts2024-02-26T21:43:53Z2024-01-30T02:13:47Ztag:cspace@cks.mef.org,2009-03-24:/blog/sysadmin/ServersDroppingSerialPortscks<div class="wikitext"><p>One of the things that <a href="https://support.cs.toronto.edu/">we</a> have
had for a long time is <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/ConsoleServerSetup">a serial console server</a>,
which is to say a server that collects and logs serial console
output from all of our regular servers (this is primarily <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ManyConsolesOfLinux">Linux
kernel messages</a>, as <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/DroppingSerialConsoles">we've moved
away from making the serial port the 'real' kernel console</a>). To date we've done this through actual
serial ports on the servers, which we set up as an additional
console. However, this has an obvious issue, which is that your
servers need actual serial ports.</p>
<p>For a long time this was no problem; even basic 1U servers came
with a serial port (often right beside the basic VGA video out;
servers may be the last refuge of that particular connector).
However, recently we discovered that we've received a mainstream
server that doesn't have such a serial port, and it's not a basic
1U server either; it's the 24-bay SuperMicro hardware for our new
(hardware) generation of fileservers. The only option for a serial
console on this hardware will be Serial-over-LAN as part of its
<a href="https://en.wikipedia.org/wiki/Intelligent_Platform_Management_Interface">IPMI</a>
implementation.</p>
<p>(We've poked a bit at Serial over LAN before, but there have been
a number of blockers, including that <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/KernelSerialConsoleOnlyOne">Linux only supports sending
kernel messages to a single serial console</a> so to date that's been the
physical serial port.)</p>
<p>This is a switch that I've been vaguely expecting to happen for a
while, but it's still sort of a surprise to see signs of it actually
happening. The physical hardware for a serial port does add some
cost and take up some space, and these days it can't be very heavily
used, and IPMI has supported Serial over LAN for some time now. In
a way it's surprising that serial ports on servers have lasted this
long.</p>
<p>(Or perhaps server serial ports are more popular and heavily used
in the industry than I expect. And certainly IPMI support has had
problems in the past, when vendors often didn't give the <a href="https://en.wikipedia.org/wiki/Intelligent_Platform_Management_Interface">BMC</a>
its own network port, which generally meant that you had to keep
it off the network entirely.)</p>
<p>PS: This shift is potentially a good thing for us, since the hardware
we use to implement our serial console server is increasingly old
and I don't think equivalents are being made today. It's been very
reliable so far, but sooner or later it's going to start failing.</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/ServersDroppingSerialPorts?showcomments#comments">2 comments</a>.) </div>Servers are (probably) starting to drop serial ports2024-02-26T21:43:53Z2024-01-29T05:43:35Ztag:cspace@cks.mef.org,2009-03-24:/blog/python/PythonVenvAndLSPServercks<div class="wikitext"><p>Recently I wound up doing some Django work using a Python <a href="https://docs.python.org/3/library/venv.html">venv</a>, since this is the
easiest way to get a self-contained Python environment that has
some version of Django (or other applications) installed. However,
one part of the experience was a little bit less than ideal. I
normally write Python using GNU Emacs and <a href="https://github.com/python-lsp/python-lsp-server">the Python LSP server</a>, and this environment
was complaining about being unable to find Django modules to do
code intelligence things with them. A little thought told me why;
GNU Emacs was running my regular LSP server, which is installed
through <a href="https://utcc.utoronto.ca/~cks/space/blog/python/PyPyAndPipx">pipx</a> into its own venv, and that venv didn't
have Django installed (of course).</p>
<p>As far as I know, the Python LSP server doesn't have any specific
support for recognizing and dealing with venvs, and in general this
is a bit difficult (the version of Python being used by a venv may
not even match the version that pylsp is running with; in fact this
was my case, since my installed pylsp was using <a href="https://www.pypy.org/">pypy</a> but the Django venv was using the system
Python 3). Rather than try to investigate deeply into this, I decided
to solve the problem with brute force, which is to say that I
installed the Python LSP server (<a href="https://utcc.utoronto.ca/~cks/space/blog/python/PylspBeSelectiveOnPlugins">with the right set of plugins</a>) into the venv, along with all of the
rest of things, and then ran that instance of GNU Emacs with its
$PATH set to use the venv's bin/ directory and pick up everything
there, including its Python 3 and python-lsp-server.</p>
<p>This is a little bit aesthetically displeasing for at least two
reasons. First, the Python LSP server and its plugins and their
dependencies aren't a small thing, and anyway they're not a runtime
dependency of the package; they're purely a development convenience. Second,
the usual style of using GNU Emacs is to start it once and then
reuse that single Emacs instance for everything, which naturally
gives that Emacs instance a single $PATH and makes it want to use a
single version of python-lsp-server. I'm okay with deviating from
this bit of Emacs practice, but other people may be less happy.</p>
<p>(A hack that deals with the second issue would be a 'pylsp' cover
script that hunts through the directory tree to see if you're running
it from inside a venv and if that venv has its own 'pylsp' binary;
if both are true, you run that pylsp instead of your regular
system-wide one. I may write this hack someday, partly so that I
can stop having to remember to add the venv to my $PATH any time I
want to fire up Emacs on the code I'm working on in the venv.)</p>
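<p>For concreteness, here's a minimal sketch of such a cover script
in Python (the fallback location is a made-up example, and a real
version would want some error handling):</p>
<blockquote><pre style="white-space: pre-wrap;">
#!/usr/bin/env python3
# A sketch of the cover script idea: walk up from the current
# directory looking for a venv (marked by pyvenv.cfg) with its own
# bin/pylsp, and run that if found; otherwise fall back to the
# regular pylsp, wherever yours lives.
import os
import sys

def find_venv_pylsp(start):
    d = os.path.abspath(start)
    while True:
        cand = os.path.join(d, "bin", "pylsp")
        if os.path.exists(os.path.join(d, "pyvenv.cfg")) and os.access(cand, os.X_OK):
            return cand
        parent = os.path.dirname(d)
        if parent == d:
            return None
        d = parent

pylsp = find_venv_pylsp(os.getcwd()) or os.path.expanduser("~/.local/bin/pylsp")
os.execv(pylsp, [pylsp] + sys.argv[1:])
</pre>
</blockquote>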
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/python/PythonVenvAndLSPServer?showcomments#comments">3 comments</a>.) </div>Getting the Python LSP server working with venvs the brute force way2024-02-26T21:43:53Z2024-01-28T02:50:05Ztag:cspace@cks.mef.org,2009-03-24:/blog/tech/HistogramsNeedTotalsToocks<div class="wikitext"><p>A true <a href="https://en.wikipedia.org/wiki/Histogram">histogram</a> is
generated from raw data. However, in things like metrics, we generally
don't have the luxury of keeping all of the raw data around; instead
we need to summarize it into histogram data. This is traditionally
done by having some number of buckets with either <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusHistogramsWantSums">independent or
cumulative values</a>. A lot
of systems stop there; for example <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSOnLinuxGettingPoolIostats">OpenZFS provides its histogram
data this way</a>. Unfortunately
by itself this information is incomplete in an annoying way.</p>
<p>If you're generating histogram data, you should go the extra distance
to also provide a true total of all of the raw data. The reason is
simple; only with a true total can one get a genuine and accurate
average value, or anything derived from that average. Importantly,
one thing you can potentially derive from the average value is an
indication of what I'll call skew in your buckets.</p>
<p>The standard assumption when dealing with histograms is that the
values in each bucket are randomly distributed through the range
of the bucket. If they truly are, then you can do things like get
a good estimate of the average value by just taking the midpoint
of each bucket, and so people will say that you don't really need
the true total. However, this is an assumption and it's not necessarily
correct, especially if the size of the buckets is large (as it can
be at the upper end of a 'powers of two' logarithmic bucket size
scheme, which is pretty common because it's convenient to generate).</p>
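<p>As a concrete illustration, here's a small Python sketch (with
entirely made-up buckets and values) of how the midpoint estimate can
drift from the genuine average when values cluster at one end of a
big bucket:</p>
<blockquote><pre style="white-space: pre-wrap;">
# Made-up buckets as (low, high, count), 'powers of two' style.
buckets = [(0, 1, 10), (1, 2, 5), (2, 4, 3), (4, 8, 2)]
count = sum(c for _, _, c in buckets)

# Midpoint estimate: assumes values are evenly spread inside each bucket.
est_avg = sum((lo + hi) / 2 * c for lo, hi, c in buckets) / count
print("estimated average:", est_avg)   # 1.675

# Now suppose the two values in the 4-8 bucket were actually both about
# 4.1 (and the others really were at their midpoints). With a true total
# of the raw data, we'd know the real average:
true_total = 10 * 0.5 + 5 * 1.5 + 3 * 3 + 2 * 4.1
print("true average:", true_total / count)  # 1.485
</pre>
</blockquote>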
<p>I've certainly looked at a number of such histograms where it's
clear (from various other information sources) that this assumption
of even distribution wasn't correct. How incorrect it was wasn't
all that clear, though, because the information necessary to have
a solid idea wasn't there.</p>
<p>Good histogram data takes more than counts in buckets. But including a
true total as an additional piece of data is at least a start, and it's
probably inexpensive (both to export and to accumulate).</p>
<p>(Someone has probably already written a 'best practices for gathering
and providing histogram data' article.)</p>
</div>
Histogram data is most useful when they also provide true totals2024-02-26T21:43:52Z2024-01-27T03:41:23Ztag:cspace@cks.mef.org,2009-03-24:/blog/programming/GoAvoidingAnyAsATypecks<div class="wikitext"><p>As modern Go programmers know, when Go introduced generics it also
introduced a new '<code>any</code>' type. This is <a href="https://go.dev/ref/spec#Interface_types">officially documented</a> as:</p>
<blockquote><p>For convenience, the predeclared type <code>any</code> is an alias for the empty
interface.</p>
</blockquote>
<p>The 'any' type (alias) exists because it's extremely common in code
that's specifying generic types to want to be able to say 'any
type', and the way this is done in generics is 'interface{}', the
empty interface. This makes generic code clearly easier to read and
follow. Consider these two versions of the signature of <a href="https://utcc.utoronto.ca/~cks/space/blog/programming/Go122ReflectTypeFor">reflect.TypeFor</a>:</p>
<blockquote><pre style="white-space: pre-wrap;">
func TypeFor[T any]() Type
func TypeFor[T interface{}]() Type
</pre>
</blockquote>
<p>These are semantically equivalent but the first is clearer, because
you don't have to remember this special case of what 'interface{}'
means. Instead, it's right in the name 'any' (and there's less
syntactic noise).</p>
<p>But after Go generics became a thing, there's been a trend of
using this new '<code>any</code>' alias outside of generic types, instead of
writing out 'interface{}'. I don't think this is a good idea.
To show why, consider the following two function signatures,
both of which use 'any':</p>
<blockquote><pre style="white-space: pre-wrap;">
func One[T any](v T) bool
</pre>
<pre style="white-space: pre-wrap;">
func Two(v any) bool
</pre>
</blockquote>
<p>These two function signatures look almost the same, but they have
wildly different meanings, even if (or when) they're invoked with
the same argument. The effects of '<code>One(10)</code>' are rather different
from '<code>Two(10)</code>', since 'One' is a generic function while 'Two' is
a regular one. Now consider them written this way:</p>
<blockquote><pre style="white-space: pre-wrap;">
func One[T any](v T) bool
</pre>
<pre style="white-space: pre-wrap;">
func Two(v interface{}) bool
</pre>
</blockquote>
<p>Now we see clearly what <code>Two()</code> is doing differently than <code>One()</code>;
it's obvious that it isn't taking 'any type' as such, but instead
it's taking the empty interface as the argument type. This makes
it obvious that a non-interface value will be converted to an
interface value (and will tell some people that <a href="https://utcc.utoronto.ca/~cks/space/blog/programming/GoNilIsTypedSortOf">an interface value
will lose its interface type</a>).</p>
<p>This increased immediate clarity, without needing to remember what
'<code>any</code>' means, is why I'm planning to use 'interface{}' in my code in
the future, and why I think you should too. Yes, '<code>any</code>' is shorter
and it has a well defined meaning in the specification and we can
probably remember the special meaning all of the time. But why give
ourselves that extra cognitive burden when we can be explicit?</p>
<p>(In generics, the argument goes the other way; 'any' really does
mean 'any type', and the 'any' name is clearer than writing
'interface{}' and then needing to remember that that's how generics
do it.)</p>
<p>In a sense the 'any' name is a misnomer when used as a type. It's true
that 'interface{}' will accept any type, but used as a type, it doesn't
mean 'any type'; it means specifically the type 'an empty interface',
which is to say an interface that has no methods, which implies
interface type conversion (unless you already have an 'interface{}'
value). Since 'any' does mean 'any type' in the context of generics,
I think it's better to use a different name for each thing, even if
Go formally makes the names equivalent. The names of things are
fundamentally for people.</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/programming/GoAvoidingAnyAsAType?showcomments#comments">6 comments</a>.) </div>In Go, I'm going to avoid using '<code>any</code>' as an actual type2024-02-26T21:43:52Z2024-01-26T04:03:50Ztag:cspace@cks.mef.org,2009-03-24:/blog/tech/CPUIGPCoolingAdvantagecks<div class="wikitext"><p>Once upon a time, you could readily get basic graphics cards,
generally passively cooled, and certainly single-width even when they
had to have a fan to give you dual output support; this is,
for example, more or less what I had in <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/HomeMachine2011">my 2011 era machines</a>. These days these cards are mostly
extinct, so when I put together <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/WorkMachine2017">my current office desktop</a> I wound up with a dual width, definitely
fan-equipped card that wasn't dirt cheap. For some time I've been
grumpy about this, and sort of wondering where they went.</p>
<p>The obvious answer for where these cards went is that CPUs got
integrated graphics (although not all CPUs, especially higher end
ones, so <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/WorkMachine2017">you could wind up using a CPU without an IGP and needing
a discrete GPU</a>). When thinking about why
integrated graphics displaced such basic cards, it recently struck
me that one practical advantage integrated graphics has is cooling.</p>
<p>The integrated graphics circuitry is part of the CPU, or at least
on the CPU die. General use CPUs have been actively cooled for well
over a decade now, and for a long time they've been the focus of
high performance cooling and sophisticated thermal management. The
CPU is probably the best cooled thing in a typical desktop (and it
needs to be). Cohabiting with this heat source constrains the <a href="https://en.wikipedia.org/wiki/Graphics_processing_unit#Integrated_graphics_processing_unit">IGP</a>,
but it also means that the IGP can take advantage of the CPU's
cooling to cool itself, and that cooling is generally quite good.</p>
<p>A discrete graphics card has no such advantage. It must arrange its
own cooling and its own thermal management, both of which cost money
and the first of which takes up space (either for fans or for passive
heatsinks). This need for its own cooling makes it less competitive
against integrated graphics, probably especially so if the card is
trying to be passively cooled. I wouldn't be surprised if the options
were a card that didn't even compare favorably to integrated graphics
or a too-expensive card for the performance you got. There's also
the question of whether the discrete GPU chipsets you can get are
even focused on low power usage or whether they're designed to
assume full cooling to allow performance that's clearly better than
integrated graphics.</p>
<p>(Another limit, now that I look, is the amount of power available
to a PCIe card, especially one that uses fewer than 16 PCIe lanes;
apparently an x4 or x8 card may be limited to 25W total (with an x16
going to 75W), <a href="https://en.wikipedia.org/wiki/PCI_Express#Power">per Wikipedia</a>. However, I don't
know how this compares to the amount of power an IGP is allowed to
draw, especially in CPUs with more modest overall power usage.)</p>
<p>The more I look at this, the more uncertainties I have about the
thermal and power constraints that may or may not face discrete GPU
cards that are aiming for low cost while still offering, say,
multi-monitor support. I imagine that the readily available and more
or less free cooling that integrated graphics gets doesn't help the
discrete GPUs, but I'm not sure how much of a difference it really
makes.</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/tech/CPUIGPCoolingAdvantage?showcomments#comments">5 comments</a>.) </div>The cooling advantage that CPU integrated graphics has2024-02-26T21:43:52Z2024-01-25T03:22:06Ztag:cspace@cks.mef.org,2009-03-24:/blog/web/CGIOneStepDeploymentcks<div class="wikitext"><p>When I wrote about <a href="https://utcc.utoronto.ca/~cks/space/blog/web/CGINotSlow">how CGI programs aren't particularly slow these
days</a>, one of the reactions I saw was to suggest that
one might as well use a <a href="https://en.wikipedia.org/wiki/FastCGI">FastCGI</a>
system to run your 'CGI' as a persistent daemon, saving you the
overhead of starting a CGI program on every request. One of the
practical answers is that FastCGI doesn't have as simple a deployment
model as CGIs generally offer, <a href="https://utcc.utoronto.ca/~cks/space/blog/web/CGIAttractions">which is part of their attractions</a>.</p>
<p>With many models of CGI usage and configuration, installing a CGI,
removing a CGI, or updating it is a single-step process; you copy
a program into a directory, remove it again, or update it. The web
server notices that the executable file exists (sometimes with a
specific extension or whatever) and runs it in response to requests.
This deployment model can certainly become more elaborate, with you
directing a whole tree of URLs to a CGI, but it doesn't have to be;
you can start very simple and scale up.</p>
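<p>To make that concrete, here's about the smallest drop-in CGI you
can write (a Python sketch; whether it needs a particular extension
or directory depends entirely on your web server's configuration).
Deploying it is copying this file, made executable, into the right
place; removing it is deleting the file:</p>
<blockquote><pre style="white-space: pre-wrap;">
#!/usr/bin/env python3
# A minimal CGI program. The web server runs it per request; what it
# prints (headers, a blank line, then the body) becomes the response.
import os

print("Content-Type: text/plain")
print()
print("Hello from", os.environ.get("SCRIPT_NAME", "somewhere"))
</pre>
</blockquote>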
<p>It's theoretically possible to make FastCGI deployment almost as
simple as the CGI model, but I don't know if any FastCGI servers
and web servers have good support for this. Instead, FastCGI and
in general all 'application server' models almost always require
at least a two step configuration, where you configure your
application in the application server and then configure the URL
for your application in your web server (so that it forwards to
your application server). In some cases, each application needs a
separate server (FastCGI or whatever other mechanism), which means
that you have to arrange to start and perhaps monitor a new server
every time you add an application.</p>
<p>(I'm going to assume that the FastCGI server supports reliable and
automatic hot reloading of your application when you deploy a change
to it. If it doesn't then that gets more complicated too.)</p>
<p>If you have a relatively static application landscape, this multi-step
deployment process is perfectly okay since you don't have to go
through it very often. But it is more involved and it often requires
some degree of centralization (for web server configuration updates,
for example), while it's possible to have a completely distributed
CGI deployment model where people can just drop suitably named
programs into directories that they own (and then have their CGI
run as themselves through, for example, Apache suexec). And, of
course, it's more things to learn.</p>
<p>(CGI is not the only thing in the web language landscape that has
this simple one step deployment model. PHP has traditionally had it
too, although my vague understanding is that people often use PHP
application servers these days.)</p>
<p>PS: At least on Apache, CGI also has a simple debugging story; the
web server will log any output your CGI sends to standard error in
the error log, including any output generated by a total failure
to run. This can be quite useful when inexperienced people are
trying to develop and run their first CGI. <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/LighttpdCGIStderr">Other web servers can
sometimes be less helpful</a>.</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/web/CGIOneStepDeployment?showcomments#comments">4 comments</a>.) </div>CGI programs have an attractive one step deployment model2024-02-26T21:43:53Z2024-01-24T03:55:08Ztag:cspace@cks.mef.org,2009-03-24:/blog/tech/MotherboardFeaturesPCIeCostscks<div class="wikitext"><p>My current <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/WorkMachine2017">office desktop</a> and <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/HomeMachine2018">home
desktop</a> are now more than five years old
(although they've had some storage tuneups since then), so I've
been looking at PC hardware off and on. As it happens, PC desktop
motherboards that have the features I'd like also not infrequently
include extra features that I don't need, such as built in wifi
connectivity. I'm somewhat of a hardware minimalist so in the past
I've reflexively attempted to avoid these features. The obvious
reason to do this is that they tend to increase the cost. But lately
it's struck me that there's another reason to want a desktop PC
motherboard without extra features, and that is <a href="https://utcc.utoronto.ca/~cks/space/blog/tech/PCIeAndModernCPUs">PCIe lanes</a>.</p>
<p>Processors (CPUs) and motherboard chipsets only have so many PCIe
lanes in total, partly because supporting more PCIe lanes is one
of those product features that both Intel and AMD use to segment
the market. This matters because these days, almost everything built
into a PC motherboard is actually implemented as a PCIe device,
which means that it normally consumes some number of those PCIe
lanes. The more built in devices your motherboard has, the more
PCIe lanes they consume out of the total ones available, which can
cut down on other built in devices and also on connectivity you
want, such as <a href="https://utcc.utoronto.ca/~cks/space/blog/tech/M2SSDsAndNVMe">NVMe drives</a> and physical PCIe card
slots. <a href="https://utcc.utoronto.ca/~cks/space/blog/tech/PCIeSlotsLimitations">Physical PCIe slots can already have peculiar limitations
on which ones can be used together</a>, which
has the effect of reducing the total PCIe lanes they consume, but
you generally can't play very many of these games with built in
hardware.</p>
<p>(You can play some games; on my <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/HomeMachine2018">home desktop</a>, the motherboard's
second NVMe slot shares two PCIe lanes with some of my SATA ports. If
I want to run the NVMe drive with x4 PCIe lanes instead of x2, I can
only have four SATA ports instead of six.)</p>
<p>Of course, all of this is academic if you can only find the motherboard
features you want on higher end motherboards that also include these
extra features. Provided that there aren't any surprise limitations
that affect things you're going to use right away, you (I) just get
to live with whatever limitations and constraints on PCIe lane usage
you get, or you have to drop some features you want. This is where
you have to read motherboard descriptions quite carefully, including
all of the footnotes, and perhaps even consult their manuals.</p>
<p>(What features I want is another question, and there are tradeoffs
I could make and may have to.)</p>
<p>Fortunately (given the growth of things like NVMe drives), the
number of PCIe lanes available from CPUs and chipsets has been going
up over time, as has their speed. However I suspect that we're
always going to see Intel and AMD differentiate their server
processors from their desktop processors partly by the number of
PCIe lanes available, with the 'desktop' processors having the
smaller number. My impression is that AMD desktop CPUs have more
CPU PCIe lanes than Intel desktop CPUs and also I believe more
chipset PCIe lanes, but Intel is potentially ahead on PCIe bandwidth
between the chipset and the CPU (and thus between chipset devices
and RAM, which has to go through the CPU). Whether you'll ever
stress the CPU to chipset bandwidth that hard is another question.</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/tech/MotherboardFeaturesPCIeCosts?showcomments#comments">One comment</a>.) </div>Desktop PC motherboards and the costs of extra features2024-02-26T21:43:52Z2024-01-23T04:29:05Ztag:cspace@cks.mef.org,2009-03-24:/blog/sysadmin/PrometheusRangeVectorGapSizecks<div class="wikitext"><p>In <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusDeltaExtrapolation">yesterday's example of <code>delta()</code> extrapolating to cover a full
time range</a>, we saw that an example
fifteen minute range vector for a metric actually covered a time
range less than fifteen minutes. In fact, it covered fifteen seconds
less than fifteen minutes, and the scrape interval for the metric
in question was fifteen seconds. In thinking about it, I've realized
that this isn't a coincidence and in fact I believe that nearly all
of the time, many range vectors for many time ranges will actually
cover that time range less one scrape interval for the metric in
question, whatever that scrape interval is. Specifically, any time
range that is a multiple of the scrape interval will likely behave
this way.</p>
<p><a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusScrapeIntervalBit">As we've seen before</a>, Prometheus
randomizes the scrape time for any particular target. If you scrape
a target every fifteen seconds, it will almost never be scraped at
x:00, x:15, x:30, and x:45; instead it will almost always be scraped
at some constant offset that varies from target to target. This is
sensible behavior to keep all of your scrape targets from being
hammered like clockwork at the start of every minute, but it also
means that a range query will almost never align exactly with the
scrape times for a particular target.</p>
<p>If the scrape times aren't aligned with the range query's start and
end time but the scrape interval evenly divides the range (for
example, a 15 minute range and a 15 second scrape interval), then
the first time series point will have some offset after the start of the
range and the last point will have some offset before its end. If the
scrape durations are consistent and low (as they often are), what
we have is a situation where the timeline of scrapes is offset from
the range's timeline by some amount. The first time series point is
'late' (from the range's perspective) by this amount, and then what
would be the first time series point after the range is also 'late' by
that same amount, which means that the last point within the range
is 'early' by the scrape interval minus the offset.</p>
<p>Let's make this concrete. Imagine a one minute range vector and a
15 second scrape interval where the scrapes happen at 0:07, 0:22,
0:37, 0:52, and 1:07 (relative to the range). The scrapes are 'late'
by seven seconds relative to the range vector, but the last time
series point at 1:07 is outside the one minute range vector, leaving
us with the last included point being 15 seconds before it and
(15-7) or 8 seconds before the end of the range vector.</p>
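<p>Here's the same arithmetic as a quick Python sketch, using the
example numbers from above:</p>
<blockquote><pre style="white-space: pre-wrap;">
range_secs = 60       # a '[1m]' range vector
scrape_interval = 15  # how often the target is scraped
offset = 7            # how 'late' scrapes are relative to the range start

# The scrape times that land inside the range: 7, 22, 37, 52.
points = list(range(offset, range_secs, scrape_interval))

print(points[-1] - points[0])   # 45; the points span the range less one interval
print(range_secs - points[-1])  # 8; the last point is (interval - offset) early
</pre>
</blockquote>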
<p>A more complex and less predictable situation happens if the range
is not a multiple of the scrape interval. This can happen either
because your scrape interval is not even and your ranges are, or
your scrape interval is even but your ranges vary widely, for example
if they're set by the step resolution of a Grafana graph panel
('<a href="https://grafana.com/docs/grafana/latest/dashboards/variables/add-template-variables/#__interval"><code>$__interval</code></a>'
in Grafana dashboard jargon, <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/GrafanaOurIntervalSettings">which is a sensible interval setting</a>). Locally, we have a lot of slower
scrape intervals that are prime numbers so that they deliberately
don't get into some fixed alignment with wall clock time; these
are fairly unlikely to line up with range intervals this way.</p>
<p>(Possibly this is obvious to everyone but me but it felt a little
bit surprising to me when it came up <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusDeltaExtrapolation">yesterday</a>, so I want to write it down so I
remember it in the future.)</p>
<p>PS: Where this may be an issue even for us is in alert rules, which
may well use fixed, nice-number range durations (like '1m' or '5m')
and draw from metrics sources, such as the host agent, that we
scrape every fifteen seconds.</p>
</div>
The expected size of a gap in a Prometheus range vector (sometimes)2024-02-26T21:43:53Z2024-01-22T04:24:27Ztag:cspace@cks.mef.org,2009-03-24:/blog/sysadmin/PrometheusDeltaExtrapolationcks<div class="wikitext"><p>Recently, someone came to the Prometheus mailing list with an
interesting issue they were having, where they were using Prometheus's
<a href="https://prometheus.io/docs/prometheus/latest/querying/functions/#delta"><code>delta()</code></a>
function to look at the amount of change over some time range, but
were getting results that they didn't expect. They had a relatively
slow changing metric where they could look at the value at the start
of a fifteen minute time interval, the value at the end of it, and
the delta() result for 'metric[15m]', but the delta() result didn't
equal the difference they saw; it was instead visibly higher. To
make things more confusing, this was a frequently scraped metric,
collected every fifteen seconds.</p>
<p>What is happening is explained by an innocent sounding sentence
in the documentation for <a href="https://prometheus.io/docs/prometheus/latest/querying/functions/#delta"><code>delta()</code></a>:</p>
<blockquote><p>The delta is extrapolated to cover the full time range as specified in
the range vector selector, so that it is possible to get a non-integer
result even if the sample values are all integers.</p>
</blockquote>
<p>(Similar wording is in the documentation for both <a href="https://prometheus.io/docs/prometheus/latest/querying/functions/#increase"><code>increase()</code></a>
and everyone's favorite, <a href="https://prometheus.io/docs/prometheus/latest/querying/functions/#rate"><code>rate()</code></a>.)</p>
<p><a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusGetRawMetrics">How you get raw time series data from Prometheus</a>, including its timestamps, is with an
<a href="https://prometheus.io/docs/prometheus/latest/querying/api/#instant-queries">instant query</a>
that gives you a <a href="https://prometheus.io/docs/prometheus/latest/querying/basics/#range-vector-selectors">range vector</a>
as the result, such as 'metric[15m]'. You can do this in the web
interface or via 'promtool query instant', and the person asking
for help shared the results of their query:</p>
<blockquote><pre style="white-space: pre-wrap;">
promtool query instant --time "$(date -d'2024-01-18 14:15:00 UTC' +%s)" $url 'metric[15m]'
9732212 @[1705586407.092]
[...]
9848219 @[1705587292.092]
</pre>
</blockquote>
<p>The true difference between the first and the last metric point is
116007, but delta() reported its result as '117973.22033898304':</p>
<blockquote><pre style="white-space: pre-wrap;">
promtool query instant --time "$(date -d'2024-01-18 14:15:00 UTC' +%s)" $url 'delta(metric [15m])'
{} => 117973.22033898304 @[1705587300]
</pre>
</blockquote>
<p>Surprisingly, this is what you would actually expect from the query
results, and we can work through this from the raw information we
have. The fifteen minute time range covers 14:00:00 UTC to 14:15:00
UTC, but the first actual time series point in it was from 14:00:07
UTC and the last was from 14:14:52. This means that delta() will
extrapolate out to cover 15 more seconds than the range vector
covers (7 seconds at the start, 8 seconds at the end), and the range
vector itself covers 15 minutes less 15 seconds, or 885 seconds.</p>
<p>(We can also get the coverage of the range vector from subtracting
the first timestamp from the last one; this also gives us 885
seconds.)</p>
<p>Turning to bc (or the calculator of your choice), we can calculate
first the scaling factor of the extrapolation and then the actual
numerical result:</p>
<blockquote><pre style="white-space: pre-wrap;">
$ bc -l
(15*60) / 885
1.01694915254237288135
(( 15 * 60 ) / 885 ) * 116007
117973.22033898305084676945
</pre>
</blockquote>
<p>It certainly feels weird that a mere fifteen second gap in a
(nominally) fifteen minute range can cause such a clear difference,
but that's how it works out. The absolute difference will be smaller
if the numbers involved are smaller, but for a given gap, the ratio
of the difference will always be the same.</p>
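<p>Putting this into a more general form, here's a small Python sketch
of the extrapolation arithmetic as I understand it (the real delta()
implementation has additional handling for edge cases, such as samples
sitting too far from the range boundaries):</p>
<blockquote><pre style="white-space: pre-wrap;">
def extrapolated_delta(first_ts, last_ts, first_val, last_val, range_secs):
    # Scale the observed difference up by how much of the nominal
    # range the samples actually cover.
    covered = last_ts - first_ts
    return (last_val - first_val) * (range_secs / covered)

# The numbers from the promtool output above; this prints roughly
# 117973.22, matching the delta() result.
print(extrapolated_delta(1705586407.092, 1705587292.092,
                         9732212, 9848219, 15 * 60))
</pre>
</blockquote>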
<p>(You might also feel that a fifteen second scrape interval should
be fast enough to avoid this sort of issue but again, it's clearly
not the case on a fifteen minute range, or even a smaller one. This
may especially be an issue if you're doing <code>rate()</code> on relatively
small time ranges as part of a Grafana graph.)</p>
</div>
An example of how Prometheus's delta() function will extrapolate time ranges2024-02-26T21:43:53Z2024-01-21T03:48:40Ztag:cspace@cks.mef.org,2009-03-24:/blog/python/DjangoPython3FieldEncodingGotchacks<div class="wikitext"><p>We have <a href="https://utcc.utoronto.ca/~cks/space/blog/python/DjangoORMDesignPuzzleII">a long standing Django web application</a>,
which has been static for a long time in Python 2. I've recently
re-started working on moving it to Python 3 (<a href="https://utcc.utoronto.ca/~cks/space/blog/python/DjangoAppPython3Surprise">an initial round of
work was done some time ago</a>), and in the
process of this I ran into a surprising issue involving text encoding
and database text fields (a '<a href="https://docs.djangoproject.com/en/5.0/ref/models/fields/#charfield">CharField</a>' in
Django terminology).</p>
<p>As part of one database model, we have <a href="https://utcc.utoronto.ca/~cks/space/blog/web/KeyPlusAuthenticator">a random key</a>, which for various reasons is represented
in the database model as <a href="https://docs.djangoproject.com/en/5.0/ref/models/fields/#charfield">a modest size string</a>:</p>
<blockquote><pre style="white-space: pre-wrap;">
class Request(models.Model):
[...]
access_hash = models.CharField(..., default=gen_random_hash)
</pre>
</blockquote>
<p>(We use SQLite as our database, which may be relevant here.)</p>
<p>The actual access hash is a 64-bit random value read from /dev/urandom.
We could represent this in a variety of ways; for instance, I could
have just treated it as a 64-bit unsigned decimal number in string
form, or a 64-bit (unsigned) hex number. But for no particularly
strong reason, long ago I decided to base64 encode the raw random
value. Omitting error checking, the existing version of this is:</p>
<blockquote><pre style="white-space: pre-wrap;">
def gen_random_hash():
fp = open("/dev/urandom", "rb")
c = fp.read(8)
fp.close()
# Trim off trailing awkward '=' character
return base64.urlsafe_b64encode(c)[:-1]
</pre>
</blockquote>
<p>(The use of "rb" as the open mode stems from <a href="https://utcc.utoronto.ca/~cks/space/blog/python/DjangoAppPython3Surprise">my first round of
updates for Python 3</a>.)</p>
<p>When I ran our web application under Python 3 in testing mode and
looked at the uses of this access hash, I discovered that the URLs
we were generating for it in email included a couple of '&#x27;' in
them. Inspection of the database table entry itself in the Django
admin interface showed that the actual value for access hash was,
for example (and this is literal):</p>
<blockquote><pre style="white-space: pre-wrap;">
b'm5AWGlUSR1c'
</pre>
</blockquote>
<p>(0x27 is ', so I was getting "b&#x27;m5AWGlUSR1c&#x27" in the URL
when written out for email in HTML. Once I looked I could see that
giveaway leading 'b' too.)</p>
<p>There are two things going on here. The first is that in Python
3, <a href="https://docs.python.org/3/library/base64.html#base64.urlsafe_b64encode">base64.urlsafe_b64encode</a>
operates on bytes (which we're giving it since we read /dev/urandom
in binary mode, making <code>c</code> a bytes object) and returns a bytes
object, not a string. The second thing is that when we ask Django to
store a bytes object in a CharField (possibly only as the result of
<a href="https://docs.djangoproject.com/en/5.0/ref/models/fields/#django.db.models.Field.default">a callable default value</a>),
Django string-izes it, yielding the b'...' form as the stored value.</p>
<p>At one level this is reasonable; Django is doing its best to store
some random Python object into a string field, probably by just
doing a str() on it. At another level, I wish Django would specifically
refuse to do this conversion for bytes objects, because one Python
3 issue is definitely bytes/str confusion and this specific
representation conversion is almost certainly a bad idea, unlike
str()'ing things in general. Raising an exception by default would
be much more useful.</p>
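<p>You can see the underlying behavior in an interactive Python 3
session (my demonstration, not Django's actual code path, but str()
on a bytes object is almost certainly what's at work):</p>
<blockquote><pre style="white-space: pre-wrap;">
>>> v = b'm5AWGlUSR1c'
>>> str(v)
"b'm5AWGlUSR1c'"
</pre>
</blockquote>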
<p>The solution is to explicitly convert to a Unicode str and specify
a suitable character set encoding, which for base64 can be 'ascii':</p>
<blockquote><pre style="white-space: pre-wrap;">
return str(base64.urlsafe_b64encode(c)[:-1], "ascii")
</pre>
</blockquote>
<p>This causes the CharField values to look like they should, which
means that URLs using the access hash no longer have '&#x27;' in
them.</p>
<p>Hopefully there aren't any other cases of this lurking in our Django
web application, but I suppose I should do some more testing and
examine the database for alarming characters (which is relatively
readily done with the management <a href="https://docs.djangoproject.com/en/5.0/ref/django-admin/#dumpdata">dumpdata</a>
command).</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/python/DjangoPython3FieldEncodingGotcha?showcomments#comments">2 comments</a>.) </div>A Django gotcha with Python 3 and the encoding of CharFields2024-02-26T21:43:53Z2024-01-20T03:36:51Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/PSIIRQNumbersAndMeaningscks<div class="wikitext"><p>For some time, the Linux kernel has had both general and per-cgroup
'<a href="https://www.kernel.org/doc/html/latest/accounting/psi.html">Pressure Stall Information</a>', which
is intended to tell you something about when things on your system
are stalling on various resources. The initial implementation
provided this information for cpu usage, obtaining memory, and
waiting on IO, as I wrote up in <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/PSINumbersAndMeanings">my notes on PSI</a>.
In kernel 6.1, an additional PSI file was added, 'irq' (if your
kernel is built with CONFIG_IRQ_TIME_ACCOUNTING, which current
Fedora kernels are).</p>
<p>One important reference for this is <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=52b1364ba0b105122d6de0e719b36db705011ac1">the kernel commit that added
this feature</a>.
Another is Eva Lacy's <a href="https://www.lacy.ie/technology/2023/10/22/pressure-stall-information.html">Pressure Stall Information in Linux</a>.
However, both of these can be a little opaque about what's actually
being calculated and reported in 'irq'.</p>
<p>The /proc/pressure/irq file will typically look like the other pressure
files, with the exception that it only has a 'full' line:</p>
<blockquote><pre style="white-space: pre-wrap;">
full avg10=0.00 avg60=0.00 avg300=0.00 total=3753500244
</pre>
</blockquote>
<p><a href="https://utcc.utoronto.ca/~cks/space/blog/linux/PSINumbersAndMeanings">As usual</a>, the 'total=' number is the
cumulative time in microseconds that tasks have been stalled on IRQ
or soft IRQs. What 'stalled' means here is that at the end of every
round of IRQ and softirq handling, the kernel works out the total
amount of time that it spent doing this (the 'delta time' in the
commit message), looks to see if there's a meaningful current task
(I believe 'on this CPU'), and if there is, the time is added to
'total'.</p>
<p>There is no 'some' line for the inverse reason of <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/PSICpuWhyNoFull">why there's
no 'full' line in the global 'cpu' pressure file</a>.
In the CPU case, there's always something running (globally), so
you can't have a complete stall on CPU the way you can have on
memory or IO, where all tasks could be waiting to get more memory
or have their IO complete. In the case of IRQ handling, either there
was no task running (on the CPU), in which case nothing is impeded
by the IRQ handling time, or there was a task running at the time
the IRQ handling happened, in which case it completely stalled for
the duration.</p>
<p>If I'm understanding all of this correctly, one corollary is that
'irq' pressure only happens to the extent that your system is busy.
Given a fixed amount of time spent handling IRQs and softirqs, the
amount of that time that shows up in /proc/pressure/irq depends on
how often it's interrupting a (running) task, which depends on how
many running tasks you have. On an idle system, the IRQ and softirq
time isn't preempting anything and it's 'free', at least from the
perspective of the PSI system.</p>
<p>Based on reading <a href="https://man7.org/linux/man-pages/man5/proc.5.html">proc(5)</a>, you can get
the total amount of time that the system has spent handling IRQs
and softirqs from the 6th and 7th numbers on the first 'cpu' line
in /proc/stat (the 6th number will be zero if IRQ time accounting
isn't enabled for your kernel). On most machines, this will be in
units of 100ths of a second. You can then cross-compare this to the
total in /proc/pressure/irq. On my home Fedora machine (the one the
sample line comes from), the irq pressure time is about 3% of the
total IRQ handling time; on my work desktop, it's currently about
6%.</p>
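<p>Here's a sketch of that cross-comparison in Python, assuming the
usual USER_HZ of 100 (so that /proc/stat's numbers are in 100ths of a
second) and a kernel with both IRQ time accounting and
/proc/pressure/irq:</p>
<blockquote><pre style="white-space: pre-wrap;">
# Compare cumulative IRQ/softirq handling time with the PSI 'irq' stall time.
with open("/proc/stat") as f:
    fields = f.readline().split()
# The 6th and 7th numbers after 'cpu' are irq and softirq time in
# ticks; with USER_HZ of 100, one tick is 10,000 microseconds.
handling_us = (int(fields[6]) + int(fields[7])) * 10_000

with open("/proc/pressure/irq") as f:
    psi_us = int(f.read().split("total=")[1].split()[0])

print("PSI irq stall is %.1f%% of IRQ handling time"
      % (100 * psi_us / handling_us))
</pre>
</blockquote>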
<p>(I suspect that all of this means that /proc/pressure/irq won't be
very interesting on many systems, which is good because tools like
<a href="https://github.com/prometheus/node_exporter">the Prometheus host agent</a>
may not have been updated to report it.)</p>
<p>PS: Ubuntu 22.04 kernels don't set CONFIG_IRQ_TIME_ACCOUNTING,
although they're too old to have /proc/pressure/irq. As far as I
can tell, this is still the case in the future 24.04 kernel (<a href="https://en.wikipedia.org/wiki/Ubuntu_version_history">'Noble
Numbat'</a>, and
thus 'noble' on places like <a href="https://packages.ubuntu.com/">packages.ubuntu.com</a>). This is potentially a little bit
unfortunate, but <a href="https://tanelpoder.com/posts/linux-hiding-interrupt-cpu-usage/">it's apparently been this way for some
time</a>.</p>
</div>
Notes on the Linux kernel's 'irq' pressure stall information and meaning2024-02-26T21:43:53Z2024-01-19T03:24:00Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/CgroupV2InterestingMetricscks<div class="wikitext"><p>In <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusExporters-2023">my roundup of what Prometheus exporters we use</a>, I mentioned that we didn't have
a way of generating resource usage metrics for systemd services,
which in practice means <a href="https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html">unified cgroups (cgroup v2)</a>. This
raises the good question of what resource usage and performance
metrics are available in cgroup v2 that one might be interested in
collecting for systemd services.</p>
<p>You might want to know about the resource usage of systemd services (or
more generally, systemd units) for a variety of reasons. Our reason
is generally to find out what specifically is using up some resource
on a server, and more broadly to have some information on how much
of an impact a service is having. I'm also going to assume that all
of the relevant cgroup resource controllers are enabled, which is
increasingly the case on systemd based systems.</p>
<p>In each cgroup, you get the following:</p>
<ul><li><a href="https://utcc.utoronto.ca/~cks/space/blog/linux/PSINumbersAndMeanings">pressure stall information</a> for CPU,
memory, IO, and these days IRQs. This should give you a good idea of
where contention is happening for these resources.<p>
</li>
<li><a href="https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#cpu-interface-files">CPU usage information</a>,
primarily the classical count of user, system, and total usage.<p>
</li>
<li><a href="https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#io-interface-files">IO statistics (if you have the right things enabled)</a>,
which are enabled on some but not all of our systems. For us, this
appears to have the drawback that it doesn't capture information
for NFS IO, only local disk IO, and it needs decoding to create
useful information (ie, information associated with a named device,
which you find out the mappings for from /proc/partitions and
/proc/self/mountinfo).<p>
(This might be more useful for virtual machine slices, where it
will probably give you an indication of how much IO the VM is doing.)<p>
</li>
<li><a href="https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#memory">memory usage information</a>,
giving both a simple amount assigned to that cgroup ('<code>memory.current</code>')
and a relatively detailed breakdown of how much of what sorts of memory
has been assigned to the cgroup ('<code>memory.stat</code>'). As I've found out
repeatedly, the simple number can be misleading depending on what you
want to really know, because it includes things like <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/CgroupsMemoryUsageAccounting">inactive file
cache</a> and <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/CgroupsMemoryUsageAccountingII">inactive, reclaimable kernel
slab memory</a>.<p>
(You also get swap usage, in '<code>memory.swap.current</code>', and there's
also '<code>memory.zswap.current</code>'.)<p>
In a Prometheus exporter, I might simply report all of the entries in
<code>memory.stat</code> and sort it out later. This would have the drawback of
creating a bunch of time series, but it's probably not an overwhelming
number of them.</li>
</ul>
<p>Although the cgroup doesn't directly tell you how many processes
and threads it contains, you can read '<code>cgroup.procs</code>' and
'<code>cgroup.threads</code>' to count how many entries they have. It's
probably worth reporting this information.</p>
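<p>As a sketch of how simple the raw collection is, here's some Python
that reads a few of these files for one service's cgroup (using the
usual systemd cgroup v2 paths; a real collector would want error
handling and to walk the hierarchy):</p>
<blockquote><pre style="white-space: pre-wrap;">
import os

# Path for a systemd service's cgroup under the usual v2 mount point.
cg = "/sys/fs/cgroup/system.slice/cron.service"

def read_kv(name):
    # Parse flat 'key value' files like memory.stat and cpu.stat.
    with open(os.path.join(cg, name)) as f:
        return {k: int(v) for k, v in (line.split() for line in f)}

cpu = read_kv("cpu.stat")     # usage_usec, user_usec, system_usec, ...
mem = read_kv("memory.stat")  # anon, file, slab_reclaimable, ...
with open(os.path.join(cg, "memory.current")) as f:
    mem_current = int(f.read())
with open(os.path.join(cg, "cgroup.procs")) as f:
    nprocs = sum(1 for _ in f)

print(cpu["usage_usec"], mem_current, mem.get("anon"), nprocs)
</pre>
</blockquote>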
<p>The root cgroup has some or many of these files, depending on your
setup. Interestingly, in Fedora and Ubuntu 22.04, it seems to have
an '<code>io.stat</code>' even when other cgroups don't have it, although I'm
not sure how useful this information is for the root cgroup.</p>
<p>Were I to write a systemd cgroup metric collector, I'd probably
only have it report on first level and second level units (so
'system.slice' and then 'cron.service' under system.slice). Going
deeper than that doesn't seem likely to be very useful in most cases
(and if you go into user.slice, you have cardinality issues). I
would probably skip '<code>io.stat</code>' for the first version and leave it
until later.</p>
<p>PS: I believe that some of this information can be visualized live
through <a href="https://www.freedesktop.org/software/systemd/man/systemd-cgtop.html">systemd-cgtop</a>.
This may be useful to see if your particular set of systemd services
and so on even have useful information here.</p>
</div>
Some interesting metrics you can get from cgroup V2 systems2024-02-26T21:43:53Z2024-01-18T03:40:46Ztag:cspace@cks.mef.org,2009-03-24:/blog/sysadmin/PrometheusExporters-2023cks<div class="wikitext"><p><a href="https://support.cs.toronto.edu/">We</a> have a fairly basic and
straightforward <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusGrafanaSetup-2019">Prometheus and Grafana setup</a>, but over time we've drifted into
using a number of <a href="https://prometheus.io/">Prometheus</a> <em>exporters</em>,
which is the Prometheus term for things that provide or generate
metrics. Today I feel like listing them off as a snapshot of our
current practices and what we've found useful in our particular and
somewhat peculiar environment.</p>
<ul><li><a href="https://github.com/prometheus/node_exporter">node_exporter</a> is the
standard Prometheus host agent. We run the latest binary release on our
Linux machines and the packaged OpenBSD version on our OpenBSD machines.<p>
We have a whole collection of scripts that collect various host
specific metrics and push them out through the node exporter's
'textfile' collector; an inventory of these doesn't fit within the
margins of this entry (though there's a small sketch of the general
idea after this list).<p>
</li>
<li>The <a href="https://github.com/prometheus/blackbox_exporter">Blackbox exporter</a>
is the standard solution for probing machines and services from
the outside, and for collecting TLS certificate information. We
use it for a variety of these checks across ICMP ping, port connections,
HTTP checks, and DNS lookups.<p>
</li>
<li><a href="https://github.com/prometheus/pushgateway">Pushgateway</a> is what we
use to publish assorted bits of information, some of it for historical
reasons.<p>
</li>
<li><a href="https://github.com/Lusitaniae/apache_exporter">apache_exporter</a> is
how we scrape basic statistics from our collection of Apache web
servers. We run it on our central metrics server rather than having
it running on each Apache web server for obscure reasons.<p>
</li>
<li><a href="https://github.com/ricoberger/script_exporter">script_exporter</a> is
how we use arbitrary scripts to generate metrics. We use these scripts
to perform more intricate service checks than Blackbox supports (letting
us check IMAP, Authenticated SMTP, Samba servers, and more) and to pull
more complicated information. <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusScriptExporterWhy">I prefer the script exporter to the other
options for this</a>.<p>
</li>
<li>We run a locally hacked version of <a href="https://github.com/mindprince/nvidia_gpu_prometheus_exporter">nvidia_exporter</a>
on our NVIDIA™ GPU <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/SlurmHowWeUseIt">SLURM nodes</a>. It's somewhat
handy for providing usage metrics to tell us how actively the GPUs
are being used, how much of their memory gets allocated, and so on.<p>
(We have one machine with AMD GPUs, for which we use a hacked up
version of <a href="https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/prometheus/files/usr/local/bin/prometheus-amd-rocm-stats.py">code I dug out of the depths of Wikipedia's metrics
systems</a>;
this code is a node exporter textfiles thing, not a separate exporter.)<p>
</li>
<li>A few machines run <a href="https://github.com/google/mtail">Google's mtail</a> to
extract structured information from various logs. These days it's only used
for Exim logs for mail metrics.<p>
(Grafana Loki's promtail component can generate (some) metrics
from logs, but <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/GrafanaLokiSimpleNotRecommended">I'm not really enthused about Loki these days</a> and anyway we were using mtail
before Loki existed.)<p>
</li>
<li>We use <a href="https://github.com/fffonion/tplink-plug-exporter">tplink-plug-exporter</a> to give us some
additional information from <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/OurWifiStatusMonitoring">the wifi-controlled smart plugs we use
to monitor our wifi</a>. In theory this gives
us information like reported 'RSSI' wifi signal strength, but in
practice we're mostly scraping this because it's there.<p>
</li>
<li>We run the Cloudflare <a href="https://github.com/cloudflare/ebpf_exporter">ebpf_exporter</a> on <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSFileserverSetupIII">our ZFS fileservers</a> and a few other machines to capture
detailed per-disk latency histograms, to help us diagnose potential
disk IO performance problems. We're using an old version of this for
various reasons; someday I need to update to the current version (which
changed how it builds and deploys the eBPF instrumentation) and look for
additional useful information it can collect.<p>
</li>
<li>We also run <a href="https://github.com/siebenmann/zfs_exporter">my fork of zfs_exporter</a> on <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSFileserverSetupIII">our ZFS fileservers</a>
to give us detailed ZFS performance information. Probably too detailed;
since we have a lot of pools and report metrics down to individual disks
(which are actually partitions on physical disks), we get a lot of time
series from this exporter.<p>
</li>
<li><a href="https://github.com/joe-elliott/cert-exporter">cert-exporter</a> is used
to collect TLS information for a few TLS certificates that we can most
conveniently access on disk, instead of through TLS services. These
include, for example, <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/OpenVPNTLSRootExpirySolution">our OpenVPN TLS certificates</a> (even though they won't expire for
some time, <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/TenYearsNotLongEnough">which is a good thing</a>).<p>
</li>
<li><a href="https://github.com/SuperQ/chrony_exporter">chrony_exporter</a> collects
information from the <a href="https://chrony-project.org/">Chrony NTP server</a>,
which we run on both <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/NTPStratumAlertNotUseful">our local NTP servers</a>
and <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSFileserverSetupIII">our ZFS fileservers</a>.<p>
</li>
<li><a href="https://github.com/prometheus-community/bind_exporter">bind_exporter</a>
runs on both our stealth master Bind server and our Bind based resolving
DNS servers (<a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/UsingBindNowForResolvers">after we switched to Bind</a>). This
gives us metrics about query volume, which is nice, but it especially
gives us all of the zone <a href="https://en.wikipedia.org/wiki/SOA_record">SOA</a>
serial numbers, which lets us raise alerts if things aren't all using the
same version of our DNS zones.</li>
</ul>
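<p>(To illustrate the script exporter approach mentioned above, here's
a minimal sketch of such a script; the host name and metric name are
invented, and it assumes a netcat with '-z'. All the script has to do
is print metrics in the Prometheus text format on standard output.)</p>
<blockquote><pre style="white-space: pre-wrap;">
#!/bin/sh
# Hypothetical check: does our IMAP server's port accept connections?
host="imap.example.org"
if nc -z -w 5 "$host" 143 >/dev/null 2>&1; then
    up=1
else
    up=0
fi
# Report the result in Prometheus text exposition format.
echo "# HELP imap_tcp_up Whether the IMAP port accepted a connection."
echo "# TYPE imap_tcp_up gauge"
echo "imap_tcp_up $up"
</pre>
</blockquote>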
<p>As you can see from this list, we (well, I) like running exporters
for things, although there are exporters we're not running for
various reasons. One current big gap in our observability is
per-service resource usage information on our Ubuntu servers. The
information is there in Linux cgroups (and systemd's use of them),
but I haven't found an available exporter that provides the information
I'd like in the form I'd like it.</p>
<p>(It may surprise people to hear that we're not using the SNMP exporter,
but we don't actually have anything we want to poll that's set up to
report stuff over SNMP. In particular, our core network switches aren't
set up for SNMP metrics collection, for historical reasons.)</p>
</div>
What Prometheus exporters we use (as of the end of 2023)2024-02-26T21:43:53Z2024-01-17T03:42:46Ztag:cspace@cks.mef.org,2009-03-24:/blog/sysadmin/OurWifiStatusMonitoringcks<div class="wikitext"><p><a href="https://support.cs.toronto.edu/">We</a> have a departmental wifi
network that exists in six buildings on campus (I think, perhaps
more) and uses access points that are operated by other people
(the university wifi people use enterprise access points that support
multiple SSIDs). This leaves us with a little bit of a wifi monitoring
problem, since all we see is basically the very top level of our wifi
network activity (as it's all brought into our wifi network
gateway). If there's some problem with our SSID on an access point
in some building, or the uplink to us from the building isn't working
for some reason, we've traditionally had to wait for people to
notice and report it to us. A while back, we decided that we would
like to do better, which meant actively monitoring the state of our
wifi network in various locations.</p>
<p>There are at least two broad ways to do this; you can put devices
on the wifi network and then verify that you can reach them over
it, or you can have dual-network devices accessible through another
network but also connecting to the wifi. The former requires one
less network connection but means you risk false positives (when
what's down is your monitoring device, not the wifi network).</p>
<p>If you're a certain sort of person, you're now enthused about the
idea of building your own little Linux-based wifi monitoring
computers, using one of a number of basic single-board computers
with built in wifi, no doubt put in a 3D printed case, and probably
powered over USB (from a cheap USB wall charger). We looked at this
briefly and concluded that we were not enthused. We didn't have
anyone who actively wanted to build such units and we suspected
that they'd be surprisingly costly and wouldn't necessarily be as
reliable and trustworthy as we wanted them to be. So we picked
another option, namely <strong>wifi-controlled power plugs</strong> that were
designed for home automation.</p>
<p>These ticked off more or less all of the boxes we were looking for.
Wifi controlled power plugs are obviously on your wifi network, as
that's how you're nominally supposed to control them, and they have
a simple story of how they're powered, since they just plug into
your wall socket (so you don't have to worry about a separate power
unit or batteries). They're not very big, they're relatively
unobtrusive, they're likely to reliably stay on the wifi, they're
probably not going to burst into flames if you buy a decent brand
(an important attribute for something we would be leaving plugged
into 120V power 24/7), and if you don't have enough outlets, in a
pinch you can plug something into them. And basic units are available
relatively inexpensively, for far less money than it would cost us
to build our own.</p>
<p>There are a variety of other wifi-controlled or wifi-accessible things
that you could also use for this and we looked at some of them. But
most of them have more complicated stories about (long term) power,
such as using batteries that would need changing every six months or
so, and many of them seemed likely to want to send potentially alarming
and intrusive information off to cloud servers (such as the temperature
of their surroundings). If you never actually control power with your
wifi-controlled power plugs, they can't tell a cloud service anything
very interesting.</p>
<p>The specific model we wound up using is the TP-Link HS103 'Kasa Smart
Wifi-Plug' (I believe in the 'Lite' version). These work about as
well for us as we could ask for something in their price range, which
is to say that almost all of the time that they stop responding to ICMP
pings, it's because something has happened to the wifi network in their
location. And for buildings that are a long way away from our offices,
they're cheap enough that you can get two and only be alarmed if both
drop off the network at once.</p>
<p>(Some of the time, what happened is that someone unplugged one. We
try to put enough labels on them (and locate them in secure spots)
so that people will neither pull them out nor try to use them to
power their devices, but it doesn't always work.)</p>
<p>Update: it turns out that I wrote about the early stages of this
back in late 2022 in <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/WirelessLivenessMonitoring">Monitoring if our wireless network is actually
working in locations</a>. The experimentation
from then has graduated to production status with a half dozen of these
deployed so far.</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/OurWifiStatusMonitoring?showcomments#comments">2 comments</a>.) </div>How we monitor that our wireless network is still there in places2024-02-26T21:43:53Z2024-01-16T03:19:50Ztag:cspace@cks.mef.org,2009-03-24:/blog/programming/GitBranchesSocialConstructscks<div class="wikitext"><p>Over on the Fediverse, <a href="https://mastodon.social/@cks/111449110724408233">I had a half-baked thesis</a>:</p>
<blockquote><p>A half-baked thesis: branches in Git are a social construct, somewhat
enabled by technical features. We talk about things having been done
on a branch or existing on a branch, or what branches are what on an
intertwined tree of them, even when this is not something you can find
in the Git repository.</p>
<p>(This is since commits aren't permanently associated with a branch;
they are merely currently reachable from one or more branches. What
branch a multi-head-reachable commit is on is up to us.)</p>
</blockquote>
<p>The background on this is more or less Julia Evans' <a href="https://jvns.ca/blog/2023/11/23/branches-intuition-reality/">git branches:
intuition & reality</a>, or
more exactly a Fediverse discussion between <a href="https://social.jvns.ca/@b0rk/111445767832607539">Julia Evans</a> and <a href="https://mathstodon.xyz/@mjd/111446192179078717">Mark Dominus</a> (and Mark Dominus's
<a href="https://blog.plover.com/prog/git/branches.html">I wish people would stop insisting that Git branches are nothing
but refs</a>).</p>
<p>This ties into my long standing view that <a href="https://utcc.utoronto.ca/~cks/space/blog/tech/VCSUINotMathematics">modern version control
systems are a user interface to their underlying mathematics</a>. Git has an internal, mathematical view
of what 'branches' are, but very few people actually use this
mathematical view; instead we use a variety of 'user interface'
views of what branches are and how we think of them. Git supports
these user views of branches with various 'porcelain' features.</p>
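<p>(You can poke at the technical side of this directly. As a small
sketch, assuming a repository where a 'feature' branch has been merged
into 'main', Git will cheerfully report that the feature commits are
now on both branches:)</p>
<blockquote><pre style="white-space: pre-wrap;">
# Every branch that can reach a given commit; after the merge,
# commits made 'on' feature are reported as being on main too.
git branch --contains somecommithash

# Commits reachable from feature but not from main; after the
# merge this is empty, and nothing durable in the repository
# itself says those commits were ever 'made on' feature.
git log --oneline main..feature
</pre>
</blockquote>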
<p>Some projects using Git actively work to create branches that have
a more concrete and durable existence. For instance, <a href="https://go.googlesource.com/go/+log/refs/heads/release-branch.go1.21">commits on
a Go release branch have the branch name in the commit's title</a>,
which is something the Go project does for a lot of branches that
are used in Go development and release. For development branches
specifically, this durably marks commits as having been done on the
branch even after the branch is merged to the 'main' development
branch.</p>
<p>Certainly, how I normally think of Git branches is different from
their technical existence, and it differs from branch to branch.
For example, in a typical repository I think of the 'main' branch
as running all the way back to the creation of the repository, but
other branches as only running back to where they split from 'main',
despite this not being technically correct.</p>
<p>(Another sign of Git branches as being a bit socially constructed
is <a href="https://utcc.utoronto.ca/~cks/space/blog/programming/GitMasterToMainWithLocalChanges">how you can rename them (per the comments)</a>.)</p>
<p>PS: There are other VCSes where branches have a more durable existence
in the VCS history. These VCSes are neither wrong nor right; my
view is that they've taken a different view of both the UI and the
mathematics of what 'branches' are in their mathematical version
of version control.</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/programming/GitBranchesSocialConstructs?showcomments#comments">One comment</a>.) </div>Git branches as a social construct2024-02-26T21:43:52Z2024-01-15T02:54:44Ztag:cspace@cks.mef.org,2009-03-24:/blog/tech/SelectiveRestoresAndIndexescks<div class="wikitext"><p>Recently we discovered first that <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/AmandaReadsTarRestoresToEnd">the Amanda backup system has
to read some tar archives all the way to the end when restoring a
few files from them</a> and
then <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/AmandaReadsTarRestoresToEndII">sometimes it can do quick restores from tar archives</a>. What is going on is
the general issue of <em>indexed (archive) formats</em>, and also the
potential complexities involved in them in a full system.</p>
<p>To simplify, tar archives are <a href="https://www.gnu.org/software/tar/manual/html_node/Standard.html">a series of entries for files and
directories</a>. Tar
archives contain no inherent index of their contents (unlike some
archive formats, such as <a href="https://en.wikipedia.org/wiki/ZIP_(file_format)">ZIP archives</a>), but you can
build an external index of where each file entry starts and what
it is. Given such an index and its archive file on a storage medium
that supports random access, you can jump to only the directory and
file entries you care about and extract only them. Because tar
archives don't have much special overall formatting, you can either do
this directly or read the data for each entry, concatenate it, and
feed it to '<code>tar</code>' to let tar do the extraction.</p>
<p>(The trick with clipping out the bits of a tar archive you cared
about and feeding them to tar as a fake tar archive hadn't occurred
to me until I saw what Amanda was doing.)</p>
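<p>(A minimal sketch of the trick, assuming an external index that
records where an entry starts and how long it is; the file name and
block numbers here are invented. Two zero blocks are a valid tar
end-of-archive marker, so tar is happy with the result.)</p>
<blockquote><pre style="white-space: pre-wrap;">
# Our hypothetical index says the entry we want starts at 512-byte
# block 2000 of the archive and is 150 blocks long. Clip out just
# those blocks, append an end-of-archive marker, and extract.
( dd if=backup.tar bs=512 skip=2000 count=150
  dd if=/dev/zero bs=512 count=2 ) | tar -xf -
</pre>
</blockquote>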
<p>If tar was a more complicated format, this would take more work and
more awareness of the tar format. For example, if tar archives had
an internal index, either you'd need to operate directly on the raw
archive or you would have to create your own version of the index
when you extracted all of the pieces from the full archive. Why
would you need to extract the pieces if there was an internal index?
Well, one reason is if the entire archive file was itself compressed,
and your external index told you where in the compressed version
you needed to start reading in order to get each file chunk.</p>
<p>The case of compressed archives shows that indexes need to somehow
be for how the archive is eventually stored. If you have an index
of the uncompressed version but you're storing the archive in
compressed form, the index is not necessarily of much use. Similarly,
it's necessary for the archive to be stored in such a way that you
can read only selected parts of it when retrieving it. These days
that's not a given, although I believe many remote object stores
support <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests">HTTP Range requests</a> at
least some of the time.</p>
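<p>(With an object store that supports them, the 'read only selected
parts' step can be an ordinary HTTP Range request; the URL and byte
offsets here are made up.)</p>
<blockquote><pre style="white-space: pre-wrap;">
# Fetch only the byte range where our hypothetical index says the
# file entry lives, instead of the entire stored archive.
curl -r 1024000-1100799 -o piece.tar https://backups.example.org/backup.tar
</pre>
</blockquote>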
<p>(Another case that may be a problem for backups specifically is
encrypted backups. Generally the most secure way to encrypt your
backups is to encrypt the entire archive as a single object, so
that you have to read it all to decrypt it and can't skip ahead
in it.)</p>
</div>
Indexed archive formats and selective restores2024-02-26T21:43:52Z2024-01-14T04:28:30Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/ZFSZEDOurZedletUsecks<div class="wikitext"><p>One of the components of <a href="https://openzfs.org/wiki/Main_Page">OpenZFS</a>
is the <a href="https://openzfs.github.io/openzfs-docs/man/master/8/zed.8.html">ZFS Event Daemon ('zed')</a>.
Old ZFS hands will understand me if I say that it's the OpenZFS
equivalent of the Solaris/Illumos fault management system as applied
to ZFS; for other people, it's best described as ZFS's system for
handling (kernel) ZFS events such as ZFS pools experiencing disk
errors. Although the manual page obfuscates this a bit, what ZED
does is run scripts (or programs in general) from a particular
directory, normally /etc/zfs/zed.d, choosing which scripts to run
for particular events based on their names. OpenZFS ships with a
number of <em>zedlets</em> ('zedlet' is the name for these scripts), and
you can add your own, which we do in <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSFileserverSetupIII">our ZFS fileserver environment</a>.</p>
<p>The standard ZED setup supports a number of relatively standard
notification methods, including email; we enable this in our
/etc/zfs/zed.d/zed.rc. The email you get through these standard
notifications is a bit generic but it's a useful starting point
and fallback. Beyond this, we have three additional zedlets we
add:</p>
<ul><li>one zedlet simply syslogs full details about almost all events by doing
almost literally the following:<p>
<blockquote><pre style="white-space: pre-wrap;">
printenv | fgrep 'ZEVENT_' | sort | fmt -999 |
logger -p daemon.info -t 'cslab-zevents'
</pre>
</blockquote>
<p>
ZED has an 'all-syslog.sh' zedlet that's normally enabled, but it
doesn't capture absolutely everything this way and it believes in
reformatting information a bit. We wanted to capture full event
information so we could do as complete a reconstruction of things
as possible later.<p>
</li>
<li>one zedlet syslogs when vdev state changes happen (and what they
are) and immediately triggers <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSOurSparesSystemV">our ZFS status reporting and spares
handling system</a>. Because ZED treats individual
disks as vdevs, this is triggered for things like loss of disks and
disk read, write, or checksum errors. Our own system for this will
then email us a report about issues and start any sparing that's
necessary (which will probably result in more email).<p>
</li>
<li>one zedlet syslogs when resilvers complete and triggers a run of
<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSOurSparesSystemV">our ZFS status reporting and spares handling system</a>. This will report to us when a pool becomes
healthy again and possibly start another round of sparing if we
were holding back to not have too many resilvers happening at once.</li>
</ul>
<p>Because ZED has a hard-coded ten second timeout on zedlets, we have to
run our status reporting and spares handling in the background of the
zedlet, which means we need to use some straightforward shell locking.</p>
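<p>(A minimal sketch of this pattern, using flock(1); the program and
lock file paths here are invented. The zedlet itself exits almost
immediately, well inside ZED's timeout, while the real work runs in
the background under the lock.)</p>
<blockquote><pre style="white-space: pre-wrap;">
#!/bin/sh
# Do the slow work in a backgrounded subshell; flock -n makes a
# second overlapping invocation quietly do nothing.
(
  flock -n 9 || exit 0
  /local/sbin/zfs-status-and-spares --from-zed
) 9>/run/lock/zfs-status.lock &
exit 0
</pre>
</blockquote>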
<p>The net effect of this setup is that we'll generally get at least
two emails if a disk has problems. One email will be generically
formatted and come from the standard ZED email notification generated
by the various '*-notify.sh' zedlets. The second email comes
from our own ZFS status reporting system, using our own tools to
report and summarize ZFS pool status with informative (for us) disk
names and so on.</p>
<h3>Sidebar: Why we have our own email reporting</h3>
<p>A typical status report can look something like this:</p>
<blockquote><pre style="white-space: pre-wrap;">
Subject: sanhealthmon: details of ZFS pool problems on sanshui
</pre>
<pre style="white-space: pre-wrap;">
Newly degraded pools:
fs16-matter-02 fs16-rahulgk-01 fs16-vision-02
[...]
pool: fs16-rahulgk-01
overall: problems
problems: disk(s) have repaired errors
config:
mirror ONLINE
disk01/0 ONLINE
disk09/0 REPAIRED (errors: 1 read/0 write/0 checksum)
[...]
</pre>
</blockquote>
<p>This is a lot more readable (for us) than decoding the equivalent
in the normal ZFS email, and it also often summarizes the state of
multiple pools if all of them have experienced errors simultaneously
(because, for example, they all use the same physical disk and that
physical disk has had a problem).</p>
</div>
What we use ZFS on Linux's ZED 'zedlets' for2024-02-26T21:43:53Z2024-01-13T03:46:04Ztag:cspace@cks.mef.org,2009-03-24:/blog/unix/InitOldSignalMistakecks<div class="wikitext"><p>Init is the traditional name for the program that is run to be
process ID 1, which is the ultimate ancestor of all Unix processes
and <a href="https://utcc.utoronto.ca/~cks/space/blog/unix/InitHistoricalRoles">historically in charge of managing the system</a>.
Process ID 1 is sufficiently crucial to the system that either it
can't be killed or the system will reboot if it exits (or both, and
<a href="https://utcc.utoronto.ca/~cks/space/blog/unix/InitDeathAndRebootsII">this reboot is a hack</a>). These days on
Linux, PID 1 often isn't literally a binary and process called
'init', but the *BSDs have stuck with an 'init' binary.</p>
<p>Historically there have been a number of reasons for the system
administrator to send signals to init, which you can still see
documented for modern Unixes in places like <a href="https://man.freebsd.org/cgi/man.cgi?init">the FreeBSD <code>init(8)</code>
manual page</a>. One of them
was to reread the list of serial ports to offer login prompts on
and often in the process to re-offer logins on any ports init had
given up on, for example because the serial <code>getty</code> on them was
starting and exiting too fast. Traditionally and even today, this
is done by sending <code>init</code> a SIGHUP signal.</p>
<p><a href="https://utcc.utoronto.ca/~cks/space/blog/unix/KillBySignalNameOrigin">The <code>kill</code> program has supported sending signals by name for a
long time</a>, but sysadmins are lazy and we
tend to have memorized that SIGHUP is signal 1 (and signal 9 is
SIGKILL). So it was not unusual to type this as '<code>kill -1 1</code>',
sending signal 1 (SIGHUP) to process ID 1 (init). However, this
version is a bit dangerous, because it's one extra repeated character
away from a version with much different effects:</p>
<blockquote><p><code>kill -1 <strong>-1</strong></code></p>
</blockquote>
<p>This is just one accidental, unthinking repetition of '-1' (instead
of typing '1') away from the version you want, and unfortunately the
change is very bad.</p>
<p>(My view is that using 'kill -HUP 1' makes this much less likely
because now you can't just repeat the '-1', although you can still
reflexively type a '-' in front of both arguments.)</p>
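<p>(If you have signal numbers memorized but want to double-check
yourself, 'kill -l' will translate a number into a name:)</p>
<blockquote><pre style="white-space: pre-wrap;">
kill -l 1    # prints HUP
kill -l 9    # prints KILL
</pre>
</blockquote>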
<p>The destination process ID '-1' is very special, especially if
you're root at the time. In both <a href="https://pubs.opengroup.org/onlinepubs/9699919799/utilities/kill.html"><code>kill(1)</code></a> and
the <a href="https://pubs.opengroup.org/onlinepubs/9699919799/functions/kill.html"><code>kill(2)</code></a>
system call, using -1 as root means '(almost) all processes on the
system'. So the addition of one extra character, a repeat of one
you were just using, has turned this from sending a SIGHUP signal
to <code>init</code> to sending a SIGHUP to pretty much every user and daemon
process that's currently running. Some of them will have harmless
reactions to this, like re-reading configuration files or re-executing
themselves, but many processes will exit abruptly, including some
number of daemon processes.</p>
<p>Back in the days when you were more likely to be SIGHUP'ing init
in the first place, doing this by accident was not infrequently a
good way to have to reboot your system. Even as recently as a decade
ago, <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/AccidentalServerReboot">doing a 'kill -1 -1' as root by accident (for another reason)
was a good way to have to reboot</a>.</p>
<p>(At this point I can't remember if I ever accidentally made this
mistake back in the old days, although <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/AccidentalServerReboot">I have typed 'kill -1 -1'
in the wrong context</a>.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/unix/InitOldSignalMistake?showcomments#comments">3 comments</a>.) </div>An old Unix mistake you could make when signaling <code>init</code> (PID 1)2024-02-26T21:43:52Z2024-01-12T04:06:11Ztag:cspace@cks.mef.org,2009-03-24:/blog/tech/MFAIsBothSimpleAndWorkcks<div class="wikitext"><p>Over on the Fediverse <a href="https://mastodon.social/@cks/111564970574073609">I said something a while back</a>:</p>
<blockquote><p>I have a rant bubbling around my head about 'why I'll never enable MFA
for Github'. The short version is that I publish code on GH because
it's an easy and low effort way to share things. So far, managing
MFA and MFA recovery is neither of those; there's a lot of hassles,
worries, work to do, and 'do I trust you to never abuse my phone
number in the future?' questions (spoiler, no).</p>
<p>I'll deal with MFA for work. I won't do it for things I'm only doing
for fun, because MFA makes it not-fun.</p>
</blockquote>
<p><a href="https://utcc.utoronto.ca/~cks/space/blog/tech/MFABasicOptionsIn2023">Basic MFA</a> is ostensibly pretty simple these
days. You get a trustworthy app for your smartphone (that's two strikes
right there), you scan the QR code you get when you enable MFA on your
account, and then afterward you use your phone to generate the MFA
<a href="https://en.wikipedia.org/wiki/Time-based_one-time_password">TOTP</a>
code that you type in when logging in along with your password. That's
a little bit more annoying than the plain password, but think of the
security, right?</p>
<p>But what if your phone is lost, damaged, or unusable because it has
a bulging battery and it's taking a week or two to get your carrier
to exchange it for a new one (which happened to us with our work
phones)? Generally you get some one time use special codes, but now
you have to store and manage them (obviously not on the phone). If
you're cautious about losing access to your phone, you may want to back
up the TOTP QR code and secret itself. Both the recovery codes and
the TOTP secret are effectively passwords and now you need to handle
them securely; if you use a password manager, it may or may not be
willing to store them securely for you. Perhaps you can look into <a href="https://age-encryption.org/">age</a>.</p>
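<p>(As a sketch of what handling them yourself can look like, assuming
you have age and oathtool (from the OATH Toolkit) available; the file
names here are invented. You keep the TOTP secret in a
passphrase-encrypted file, and can then generate codes from it even
with your phone gone.)</p>
<blockquote><pre style="white-space: pre-wrap;">
# Encrypt the base32 TOTP secret (and recovery codes) under a passphrase.
age -p -o totp-secret.age totp-secret.txt

# Later, decrypt it and generate the current TOTP code from it.
oathtool --totp -b "$(age -d totp-secret.age)"
</pre>
</blockquote>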
<p>(Printing out your recovery codes and storing the paper somewhere leaves
you exposed to issues like a home fire, which is an occasion where you
might also lose your phone.)</p>
<p>Broadly, <a href="https://utcc.utoronto.ca/~cks/space/blog/tech/MFAAccountRecoveryDistrust">no website has a good account recovery story for MFA</a>. Figuring out how to deal with this
is not trivial and is your problem, not theirs. And while TOTP is
no longer the fashionable MFA method these days, the story is in many ways worse
with physical hardware tokens, because you can't back them up at
all (unlike TOTP secrets). Some environments will back up software
<a href="https://en.wikipedia.org/wiki/WebAuthn">Passkeys</a>, but so far
only between the same type of thing and often at the price of
synchronizing things like all of your browser state.</p>
<p>However, all of this is basically invisible in the simple MFA story.
The simple MFA story is that everything magically just works and
that you can turn it on without problems or serious risks. Of course,
websites have a good reason for pushing this story; they want their
users to turn on MFA, for various reasons. My belief is that the
gap between the simple MFA story and the actual work of doing MFA
in a way that you can reliably maintain access to your account is
dangerous, and sooner or later this danger is going to become
painfully visible.</p>
<p>(Like many other versions of <a href="https://utcc.utoronto.ca/~cks/space/blog/tech/SecurityIsPeople">mathematical security</a>,
the simple MFA story invites blaming people (invariably called 'users'
when doing this) when something goes wrong. They should have carefully
saved their backup codes, not lost track of them; they should have
sync'd their phone's TOTP stores to the cloud, or done special export
and import steps when changing phones, or whatever else might have
prevented the issue. This is as wrong as it always is. <a href="https://utcc.utoronto.ca/~cks/space/blog/tech/SecurityIsPeople">Security is not
math, it is people</a>.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/tech/MFAIsBothSimpleAndWork?showcomments#comments">5 comments</a>.) </div>MFA today is both 'simple' and non-trivial work2024-02-26T21:43:52Z2024-01-11T03:49:15Ztag:cspace@cks.mef.org,2009-03-24:/blog/sysadmin/MetricsHowFarBackDependscks<div class="wikitext"><p>I mentioned recently in passing (<a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/TenYearsNotLongEnough">in this entry</a>)
that <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusGrafanaSetup-2019">our Prometheus metrics system</a>
is currently set to keep metrics for 'ten years' (3650 days, which
is not quite ten years given leap days) and we sort of intend to
keep them forever. That got me to thinking about how sensible this
is, and how much usage we have for metrics that go back that far
(we're already keeping over five years of metrics). The best answer
I can come up with is that it depends on what the metrics are for
and about.</p>
<p>The obvious problem with metrics about system performance is that
our systems change over time. As we turn over hardware for the same
hosts, their memory gets bigger, their CPUs get faster, their disks
can improve, their networking will get better, and so on. When we
move to <a href="https://utcc.utoronto.ca/~cks/space/blog/tech/FilesystemCacheAndNFSBandwidth">our new fileserver hardware with much more memory</a>, a lot of the performance
metrics for those machines will probably look different, and in any
case they're going to have a different disk topology. To some extent
this makes it dangerous to compare 'the server called X with how
it was two years ago', because two years ago it might have been on
different hardware (and sometimes it might have been doing somewhat
different things; we move services between servers every so often).</p>
<p>On a broader level, it feels not too useful to compare current servers
against their past selves unless we could plausibly return to their past
selves. For example, we can't return to the Ubuntu 18.04 version of any
of our servers, because 18.04 is out of support. If the 18.04 version
of server X performed much better than the 22.04 or 20.04 version, well,
we're still stuck with whatever we get on the current versions. However,
there's some use in knowing that performance has gone down, if we can
see that.</p>
<p>Some things we collect metrics for stay fixed for much longer,
though; a prime example is machine room temperature metrics (<a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/MachineRoomArchaeology">we've
been in our machine rooms for a long time</a>).
Having a long history of temperature metrics for a particular machine
room could be useful to put numbers (or at least visualizations)
on slowly worsening conditions that are only clear if we're comparing
across years, possibly many. Of course there are various possible
explanations for a long term worsening of temperatures, such as
there being more servers in a machine room now, but at least we can
start looking.</p>
<p>Certain sorts of usage and volume metrics are also potentially
useful over long time scales. I don't know if we'll ever want to
look at a half-decade or longer plot of email volume, but I can
imagine it coming up. This only works for some usage metrics, because
with others too many things are changing in the environment around
them. Will we ever have a use for a half-decade plot of VPN usage?
I suspect not because so much that could affect that has changed
over half a decade (and likely will change in the future).</p>
<p>(My current feeling is that a really long metrics history isn't
going to be all that useful for capacity planning for us, simply
because I don't think we have anything that has such a consistent
growth rate over half a decade or a decade. The past few years?
Sure. The past half decade? That's getting chancy because a lot has
changed in local usage patterns, never mind <a href="https://en.wikipedia.org/wiki/COVID-19_pandemic">world events</a>.)</p>
<p>All of this is irrelevant to us today, since <a href="https://prometheus.io/">Prometheus</a>'s current retention policies are all or
nothing. If we wanted to keep only some metrics for an extended
period of time, we'd have to somehow copy them off to elsewhere
(possibly downsampling them in the process). But by the time we
start running into limits on <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusGrafanaSetup-2019">our normal Prometheus server</a>, Prometheus may well have developed
some additional features here.</p>
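<p>(The retention setting itself is a single server-wide command line
flag, which is what makes it all or nothing; a sketch, with an assumed
configuration path:)</p>
<blockquote><pre style="white-space: pre-wrap;">
# Keep all metrics for 3650 days; there is no per-metric version.
prometheus --config.file=/etc/prometheus/prometheus.yml \
    --storage.tsdb.retention.time=3650d
</pre>
</blockquote>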
<p>PS: I suspect that we already have much longer Prometheus metrics
retention that is at all common. I suspect that someday this may get us
into trouble, as we're probably hitting code conditions that aren't well
tested.</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/MetricsHowFarBackDepends?showcomments#comments">2 comments</a>.) </div>How far back we want our metrics to go depends on what they're for2024-02-26T21:43:53Z2024-01-10T03:54:05Ztag:cspace@cks.mef.org,2009-03-24:/blog/web/WebPKIEvolutionVsWebServerscks<div class="wikitext"><p>It's recently struck me that one of the things limiting the evolution
of what is called Web <a href="https://en.wikipedia.org/wiki/Public_key_infrastructure">PKI</a>, the
general infrastructure of TLS on the web (<a href="https://utcc.utoronto.ca/~cks/space/blog/tech/TLSThreeWorlds">cf</a>),
is that it has turned out that in practice, almost anything that
requires (code) changes to web servers is a non-starter. This is
handily illustrated by the fate of <a href="https://en.wikipedia.org/wiki/OCSP_stapling">OCSP Stapling</a>.</p>
<p>One way to make Web PKI better is to make certificate revocation
work better, which is to say <a href="https://scotthelme.co.uk/revocation-checking-is-pointless/">more or less at all</a>. The
<a href="https://en.wikipedia.org/wiki/Online_Certificate_Status_Protocol">Online Certificate Status Protocol (OCSP)</a> would
allow browsers to immediately check if a certificate was revoked,
but there are a huge raft of problems with that. The only practical
way to deploy it is with <a href="https://en.wikipedia.org/wiki/OCSP_stapling">OCSP Stapling</a>, where web servers would
include a proof from the Certificate Authority that their TLS
certificate hadn't been revoked as of some recent time. However,
to deploy OCSP Stapling, web servers and the environment around
them needed to be updated to obtain OCSP responses from the CA and
then include these responses as additional elements in the TLS
handshake.</p>
<p>Before I started writing this entry I was going to say that OCSP
Stapling is notable by its absence, but this is not quite true.
Using the test on <a href="https://www.feistyduck.com/library/openssl-cookbook/online/testing-with-openssl/testing-ocsp-stapling.html">this OpenSSL cookbook page</a>
suggests that a collection of major websites include stapled OCSP
responses but also that at least as many major websites don't,
including high profile destinations that you've certainly heard of.
Such extremely partial adoption of OCSP Stapling makes it relatively
useless in practice, because it means that no web client or Certificate
Authority can feasibly require it (<a href="https://scotthelme.co.uk/ocsp-must-staple/">a CA can issue certificates that
require OCSP Stapling</a>).</p>
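<p>(You can test a particular server yourself with openssl, which is
essentially what that cookbook page's test does; the host here is a
placeholder.)</p>
<blockquote><pre style="white-space: pre-wrap;">
# Ask the server to staple an OCSP response during the handshake.
# A server that doesn't staple shows 'OCSP response: no response sent'.
openssl s_client -connect www.example.org:443 -status </dev/null 2>/dev/null |
    grep -A 3 'OCSP response'
</pre>
</blockquote>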
<p>There are perfectly good reasons for this inertia in web server
behavior. New code takes time to be written, released, become common
in deployed versions of web server software, fixed, improved,
released again, deployed again, and even then it often requires
being activated through configuration changes. At any given time,
most of the web servers in the world are running older code, sometimes
very older code. Most people don't change their web server configuration
(or their web server) unless they have to, and also they generally don't
immediately adopt new things that may not work.</p>
<p>(By contrast, browsers are much easier to change; there are only a
few sources of major browsers, and they can generally push out
changes instead of having to wait for people to pull them in. It's
relatively easy to get pretty high usage of some new thing in six
months or a year, or even sooner if a few major groups decide to push it.)</p>
<p>The practical result of this is that any improvement to Web PKI that
requires web server changes is relatively unlikely to happen, and
definitely isn't going to happen any time soon. The more you can hide
things behind TLS libraries, the better, because then hopefully only the
TLS libraries have to change (if they maintain API compatibility). But
even TLS libraries mostly get updated passively, when people update
operating system versions and the like.</p>
<p>(People can be partially persuaded to make some web server changes
because they're stylish or cool, such as HTTP/2 and HTTP/3 support.
But even then the code needs to get out into the world, and lots of
people won't make the changes immediately or even at all.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/web/WebPKIEvolutionVsWebServers?showcomments#comments">One comment</a>.) </div>One of the things limiting the evolution of WebPKI is web servers2024-02-26T21:43:53Z2024-01-09T02:40:31Ztag:cspace@cks.mef.org,2009-03-24:/blog/tech/TLSCertificateExpiryHackcks<div class="wikitext"><p>Famously, <a href="https://en.wikipedia.org/wiki/Transport_Layer_Security">TLS</a>
certificates expire, which even today can take websites offline because
they didn't renew their TLS certificate in time. This doesn't just affect
websites; people not infrequently create certificates that are supposed to
be long lived, except <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/TenYearsNotLongEnough">sometimes they make them last (only) ten years,
which isn't long enough</a>. When people
argue about this, let's be clear; TLS certificate expiry times, like most
forms of key expiry, are fundamentally a hack that exists to deal with the
imperfections of the world.</p>
<p>In a spherical and frictionless ideal world, TLS certificate keys would
never be compromised, TLS certificates would never be issued to anyone
other than the owner of something, TLS certificates could be effectively
invalidated through revocation, and there would be no need to have TLS
certificates ever expire. In this world, TLS certificates would be
perpetual, and when you were done with a website or some other use of a
TLS certificate, you would publish a revocation of it just to be sure.</p>
<p>We don't live in a spherical and frictionless world, so TLS
certificates expire in order to limit the damage of key compromise
and mis-issued certificates. That TLS certificates expire only
imperfectly limits the damage of key compromise, not only because
you have to wait for the certificate to expire (hence the move to
shorter and shorter lifetimes) but also because there's generally
nothing that stops you from re-using the same key for a whole series
of TLS certificates. Since we don't have effective certificate
revocation at scale, both mis-issued certificates and certificates
where you know the key is compromised can only really be handled
by letting them expire. If they didn't expire, they would be dangerous
forever.</p>
<p>(If you're a big place, the browsers will give you a hand by shipping
an update that invalidates the required certificates and keys, but this
isn't available to ordinary mortals.)</p>
<p>This isn't particularly specific to TLS; other protocols with public
keys often have the same issues and adopt the same solution of
expiry times (PGP is one example). There are protocols that use
keys without expiry times, such as <a href="https://en.wikipedia.org/wiki/DomainKeys_Identified_Mail">DKIM</a>. However,
DKIM has extremely effective key revocation; to revoke a key, you
remove the public part from your DNS, and then no one can validate
anything signed by that key (well, unless they have their own saved
copy of your old DNS). Other protocols punt and leave the whole
problem up to you, for example SSH keypairs.</p>
<p>(Some protocols have other reasons for limiting the lifetime of keys,
such as making encrypted messages 'expire' by default.)</p>
<p>The corollary of this is that if you're dealing with TLS certificates
(or keypairs in general) and these issues aren't a concern for you,
there's not much reason to limit your TLS certificate lifetimes.
<a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/TenYearsNotLongEnough">Just don't make their lifetimes be ten years</a>.</p>
<p>(My current personal view is that there are two reasonable choices
with TLS certificate lifetimes. Either you have automated issuance
and renewal, in which case you should have short lifetimes, or you
have manual issuance and rollover, in which case they should be as
long as you can get away with. TLS certificates that live for a
year or three and have to be manually rolled over are the worst of
both worlds; a key compromise or a mis-issuance is dangerous for a
comparatively long time, and the rollover period is long enough
that you'll have issues keeping track of it and doing it.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/tech/TLSCertificateExpiryHack?showcomments#comments">One comment</a>.) </div>TLS certificate expiry times are fundamentally a hack2024-02-26T21:43:52Z2024-01-07T23:10:40Ztag:cspace@cks.mef.org,2009-03-24:/blog/tech/FilesystemCacheAndNFSBandwidthcks<div class="wikitext"><p><a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSFileserverSetupIII">Our current ZFS fileserver hardware</a>
is getting long in the tooth, so we're working on moving to new
hardware (with the same software and operational setup, which we're
happy with). This new hardware has 512 GB of RAM instead of the 192
GB of RAM in our current fileservers, which means that we're going
to have a very big ZFS filesystem cache. Today, I was idly wondering
how long it would take to fill the cache to a reasonable level with
NFS (read) traffic, since <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusGrafanaSetup-2019">we have metrics</a> that include, among other
things, the typical bandwidth our current fileservers see (which
usually isn't all that high).</p>
<p>ZFS doesn't use all of the system's memory for its ARC cache, and
not all of the ARC cache is file data; some of it is filesystem
metadata like the contents of directories, the ZFS equivalent of
inodes, and so on. As a ballpark, I'll use 256 GBytes of file data
in the cache. A single server with a 1G connection can read over
NFS at about 110 Mbytes a second. This is a GByte read in just under
ten seconds, or about 6.4 GBytes a minute, and about 40 minutes
of continuous full-rate 1G NFS reads to fill a 256 GByte cache
(assuming that the ZFS fileserver puts everything read in the cache
and there are no re-reads, which are some big assumptions).</p>
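<p>(The ballpark arithmetic is easy to reproduce; shell arithmetic is
good enough here. These are the assumed rates from above, not
measurements.)</p>
<blockquote><pre style="white-space: pre-wrap;">
# 256 GBytes of file data at 110 MBytes/sec (1G Ethernet):
echo $(( (256 * 1024) / 110 ))   # 2383 seconds, about 40 minutes

# The same cache at 400 MBytes/sec:
echo $(( (256 * 1024) / 400 ))   # 655 seconds, about 11 minutes
</pre>
</blockquote>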
<p>Based on what I've seen on our dashboards, a reasonable high NFS
read rate from a fileserver is in the area of 300 to 400 Mbytes a
second. This is about 23.4 GBytes a minute (at 400 Mbytes/sec), and
would fill the ZFS fileserver cache from a cold start in about 11
minutes (again with the assumptions from above). <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/NetworkRelatedSpeeds2022">400 Mbytes/sec
is well within the capabilities of SSD-based fileservers</a>.</p>
<p>However, most of the time our fileservers are much less active than
that. Last Thursday, the average bandwidth over the workday was in
the area of 1 Mbyte/sec (yes, Mbyte not GByte). At this rate filling
a 256 GByte cache of file data would take three days. A 20 Mbyte/sec
sustained read rate fills the cache in only a few hours. At the low
end, relatively 'small' changes in absolute value clearly have an
outsized effect on the cache fill time.</p>
<p>In practice, this cache fill requires 256 GBytes of different data
that people want to read (possibly in a hurry). This is much more
likely to be the practical limit on filling our fileserver caches,
as we can see by the typical 1 Mbyte/sec data rate.</p>
<p>(All of this is actually faster than I expected before I started
writing this and ran the numbers.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/tech/FilesystemCacheAndNFSBandwidth?showcomments#comments">2 comments</a>.) </div>Some ballpark numbers for fun on filling filesystem cache with NFS traffic2024-02-26T21:43:52Z2024-01-07T02:31:39Ztag:cspace@cks.mef.org,2009-03-24:/blog/sysadmin/VMServerQuiteUsefulcks<div class="wikitext"><p>I've been using virtual machines for various sorts of testing and
scratch development for a long time. For many years, that was done
using VMWare Workstation on my work desktop machine, but in early
2022 <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/VirtualizationWithGUIWants">I reached a tipping point of unhappiness with it</a> and subsequently <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/LibvirtHasBeenOkay">switched
over to Linux's libvirt</a>. It didn't
take me long to realize that I could do this on one of our servers,
not just my desktop, and <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/LibvirtMovingSetup">I built out a VM host server</a>. Both the VM host server and my desktop have
<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/LibvirtMySetup2022">basically the same libvirt setup</a>
for (server) testing, where I have a bunch of scratch VMs that are
each set up with a starting install of various Ubuntu versions and
then snapshotted before we do any post-install customization. When
I want to test something on a scratch server, I pick a currently
unused VM, roll it back to the appropriate Ubuntu version snapshot,
and fire it up to finish installing it as whatever I want.</p>
<p>When I first built the VM host server, I wasn't sure how much I was
going to use it. After all, my work desktop was perfectly good at
running virtual machines under libvirt, and it had some conveniences
since the VMs were right there. But over the past year and a half,
I've switched to mostly using the VM host server's waiting VMs when
I want an Ubuntu test server, instead of the VMs on my desktop. To
the extent that I can sort out why I've drifted this way, I think
that having a separate VM host server has removed various quiet
bits of friction and cognitive dissonance.</p>
<p>The obvious advantage is that the VM server is a separate machine
from my desktop and it only runs VMs, so I don't have to worry about
the impact of anything else on any VMs or the impact of VMs on
anything else. If I fire up too many VMs and overload the machine,
the only thing it affects are scratch test VMs. Related to this,
the machine is more stable in practice than my Fedora desktop,
partly because Fedora has fairly frequent kernel updates. The VM
server also has the VM disk images on an ext4 filesystem rather
than a ZFS filesystem; I like ZFS, but it can have unpredictable
interactions with things like virtual machine images (for both IO
volume and memory usage).</p>
<p>As a system administrator, I'm quite aware of the standard jokes
about services (important or otherwise) running on someone's desktop.
The VM host server isn't my desktop; it's a real server, racked in
our machine room, and so it feels better to leave something running
on it than on my desktop. One result is that I'm more comfortable
running scratch services for an extended period of time on the VM
host server. This can be <a href="https://mastodon.social/@cks/111547061035518709">an actual service we're testing</a> or simply letting
a new version of <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/GrafanaLokiSimpleNotRecommended">some piece of software I don't fully trust</a> run for long enough to be a 'burn
in' test for problems.</p>
<p>The summary of all of this is that having a VM host server has made
using virtual machines on it into something that feels like the
easy way to get a real (Ubuntu) server set up temporarily. VMs on
my desktop were just that bit further away from that 'basically a
real server' feeling, and that bit makes a difference.</p>
<p>(I still do run Ubuntu VMs on my work desktop for some things, but
those VMs tend to be for entirely personal experimentation and
poking.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/VMServerQuiteUseful?showcomments#comments">2 comments</a>.) </div>Having a virtual machine host server has been quite useful2024-02-26T21:43:53Z2024-01-06T04:15:19Ztag:cspace@cks.mef.org,2009-03-24:/blog/programming/UnmaintainedCodeHugeValuecks<div class="wikitext"><p>I recently read Aaron Ballman's <a href="https://blog.aaronballman.com/2023/12/musings-on-the-c-charter/">Musings on the C charter</a>
(<a href="https://lobste.rs/s/ybk77k/musings_on_c_charter">via</a>). As part
of musing on backward compatibility in new versions of the C standard,
Ballman wrote:</p>
<blockquote><p>[...] I would love to see this principle updated to set a time limit,
along the lines of: existing code that has been maintained to not use
features marked deprecated, obsolescent, or removed in the past ten
years is important; unmaintained code and existing implementations
are not. If you cannot update your code to stop relying on deprecated
functionality, your code is not actually economically important —
people spend time and money maintaining things that are economically
important.</p>
</blockquote>
<p>To put it one way, I disagree strongly with the view that 'unmaintained'
code is not valuable or important. In the open source world, I
routinely use a significant number of programs and a large amount
of code that no longer sees meaningful changes and development.
This code may be maintained in the sense that there is someone who
will fix security issues and important bugs, and maybe make a few
changes here and there, but it is not 'maintained' in the sense
that I think Ballman means, where it undergoes enough development
that changing away from newly deprecated functionality (in C or any
other language) would be lost in the noise.</p>
<p>(This code has 'value' in the sense that there's a community of
people who are (still) using the software, often happily and by
choice. Often these are relatively small communities, although not
always. If there's no community still using the code, then it's
mostly unimportant.)</p>
<p>Some of this open source code is genuinely more or less finished;
its authors don't have any particular features they want to add to
it. Other parts of this open source code have fallen out of favour,
with no one left behind that is interested in active development
to move it forward, but potentially with plenty of people who derive
value from its current working state. You can probably name projects
for each camp.</p>
<p>(An extremely relevant example of the second case is <a href="https://utcc.utoronto.ca/~cks/space/blog/unix/WaylandNowTheFuture">X</a>. People are quite aggressively not
doing any further development of X, and they're happy to tell you
about it. At the same time, <a href="https://utcc.utoronto.ca/~cks/space/blog/unix/WaylandTechnicalMeritsIrrelevant">a great deal of things still run on
X</a>.)</p>
<p>Making a future version of C (as an example) unable to build those
code bases is effectively branching the language, in much the same
way (although probably to a lesser extent) than the Python 2 versus
Python 3 split. If C compilers fully support the old version of C,
everything is probably reasonably fine; even though the language
has branched, old projects can continue to use the old language
forever. If C compilers start deciding that they want to drop the
old version of C because it's been a while, we are not so fine.</p>
<p>PS: This has also come up in the case of Go. Old Go code itself
still compiles and works fine, but the Go build environment has
changed significantly enough that old code only builds through what
is basically a hack in the main Go toolchain. Based on personal
experience, I can tell you that there are a number of Go programs
out there in the world that have not had even the minimal update
to build using Go modules, but which are likely still used.</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/programming/UnmaintainedCodeHugeValue?showcomments#comments">One comment</a>.) </div>'Unmaintained' (open source) code represents a huge amount of value2024-02-26T21:43:52Z2024-01-05T04:33:19Ztag:cspace@cks.mef.org,2009-03-24:/blog/sysadmin/TenYearsNotLongEnoughcks<div class="wikitext"><p>The recent news of the time interval is that <a href="https://mstdn.social/@jschauma/111688103150154714">a ten year certificate
built into every Keybase client expired at the end of 2023</a>. I had <a href="https://mastodon.social/@cks/111689032364322410">a little
reaction to this</a>:</p>
<blockquote><p>So much quiet damage has been done to so many people's pinned cert or
private CA deployments by ten-year defaults. So much.</p>
<p>(It happened to us so now our private OpenVPN CA root is much, much
longer.)</p>
</blockquote>
<p>If you're making an internal thing good for ten years, don't (whether
it is a TLS certificate or, for example, setting a retention duration
for some database). Ten years is either not long enough or too long.
If you're writing an example and use a ten year validity period,
please don't (people are sure to copy it). Ten years can sound like
an implausibly long time when you're setting something up, but as
we all know there's nothing quite as permanent as a quick hack and
there you are ten years later with some problems. <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/FailingAtTLSRootRollover">This happened
to us with OpenVPN</a> and <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/OpenVPNTLSRootExpirySolution">we failed to
find a solution</a>, to our pain.</p>
<p>If you don't have a plan to either roll over and renew whatever it
is or definitely take it out of service, ten years is far too short.
If you want to keep something active until you change your mind,
set your not-after date, retention duration, or whatever to as high
as you can. Ten years is a terrible value for this; it looks and
feels long enough but it isn't anywhere near sufficient.</p>
<p>(In some cases it may be a little bit of a problem that we're now
only fourteen years away from <a href="https://en.wikipedia.org/wiki/Year_2038_problem">the year 2038 issue</a>. But if you have
a problem there, it's better to find out about it early.)</p>
<p>If you want to definitely take your thing out of service after a
certain period of time, ten years is far too long. If your thing
is still in use as ten years approaches, it's almost guaranteed to
have wormed its way into all sorts of places, so that taking it out
of service will cause explosions and the possibility of this will
get people to show up demanding that its lifetime be extended
(assuming that people even remember where it's used and what depends
on it).</p>
<p>If you have a plan to roll over, extend, or renew the duration, ten
years is also far too long. As everyone has found out through painful
experience with TLS certificates, doing something only once every
few years is a recipe for problems (and for forgetting that it needs
to be done at all). To make sure you keep in practice and the process
still works, you need to do this much more frequently than once
every ten years. At this point I'd probably try to do it twice a
year.</p>
<p>As it happens I'm not without sin, because the retention time for
<a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusGrafanaSetup-2019">our Prometheus database</a> is currently
'3650 days' from late November 2018. We may well run out of disk
space before then and when we get there we may decide we don't need
ten years (or more than ten years) after all, but I should probably
bump that up anyway since our intention is 'keep as much as we have
disk space for and maybe get more disk space if we're running short'.
I should definitely have some sort of calendar entry for it, as a
reminder just in case.</p>
<p>(We do now check all of our long-lived internal TLS certificates
and alert if their expiry time is getting close. It's probably
overkill but it doesn't hurt. And some of them are less than a
decade away at this point.)</p>
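<p>(The basic version of such a check is a one-liner with openssl's
'-checkend', which takes a window in seconds; the certificate path
here is a placeholder:)</p>
<blockquote><pre style="white-space: pre-wrap;">
# Exits 0 if the certificate is still valid 60 days from now,
# 1 if it will have expired by then.
openssl x509 -checkend $(( 60 * 86400 )) -noout -in /etc/openvpn/ca.crt ||
    echo "WARNING: certificate expires within 60 days"
</pre>
</blockquote>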
<p>(This is related to <a href="https://utcc.utoronto.ca/~cks/space/blog/web/NotForever">a much older discovery that '999 days' is not
forever</a>, and for that matter neither is 9999
days, although that's 27 years and change so it's closer. 99999
days is long enough for almost everyone to not worry about it,
though.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/TenYearsNotLongEnough?showcomments#comments">4 comments</a>.) </div>Ten years isn't long enough for maximum age settings2024-02-26T21:43:53Z2024-01-04T03:02:58Ztag:cspace@cks.mef.org,2009-03-24:/blog/unix/LseekWhyNamedThatcks<div class="wikitext"><p>Over on the Fediverse <a href="https://nondeterministic.computer/@mjg59">Matthew Garrett</a> said something which
sparked a question from <a href="https://social.treehouse.systems/@nicolas17">Nicolás Alvarez</a>:</p>
<blockquote><p><a href="https://nondeterministic.computer/@mjg59/111684516891797119">@mjg59</a>:
This has been bothering me for literally decades, but: why is the
naming for fstat/lstat not consistent with fseek/lseek</p>
<p><a href="https://social.treehouse.systems/@nicolas17/111684529206469241">@nicolas17</a>:
why is it even called lseek instead of seek?</p>
</blockquote>
<p>The most comprehensive answer to both questions came from <a href="https://hackers.town/@zwol/111684752785283333">Zack
Weinberg's post</a>,
with <a href="https://101010.pl/@nabijaczleweli/111684594178728866">a posting by наб</a> and also
<a href="https://mastodon.social/@cks/111684592619236825">some things from me</a>
adding additional historical information about <code>lseek()</code>. So today
I'm going to summarize the situation with some additional information
that's not completely obvious.</p>
<p>The first version of Unix (V1) had a '<code>seek()</code>' system call. Although
C did not yet exist, this system call took three of what would be
<code>int</code>s as arguments. Since Unix was being written on the <a href="https://en.wikipedia.org/wiki/PDP-11">Digital
PDP-11</a>, a '16-bit' computer,
these future ints were the natural register size of the PDP-11,
which is to say they were 16 bits. Even at the time this was
recognized as a problem; the <a href="https://www.tuhs.org/cgi-bin/utree.pl?file=V1/man/man2/seek.2">OCR'd V1 <code>seek()</code> manual page</a>
says (transformed from hard formatting, and <a href="https://www.tuhs.org/cgi-bin/utree.pl?file=V1">cf</a>):</p>
<blockquote><p>BUGS: A file can conceptually be as large as 2**20 bytes. Clearly
only 2**16 bytes can be addressed by seek. The problem is most
acute on the tape files and <a href="https://www.tuhs.org/cgi-bin/utree.pl?file=V1/man/man4/rk0.4">RK</a> and <a href="https://www.tuhs.org/cgi-bin/utree.pl?file=V1/man/man4/rfo.4">RF</a>. Something
is going to be done about this.</p>
</blockquote>
<p>V1 also had a closely related <a href="https://www.tuhs.org/cgi-bin/utree.pl?file=V1/man/man2/tell.2"><code>tell()</code></a>
system call, that gave you information about the current file offset.
The V1 <code>seek()</code> was system call 19, and <code>tell()</code> was system call
20. The <code>tell()</code> system call seems to disappear rapidly, but its
system call number remained reserved for some time. In the <a href="https://www.tuhs.org/cgi-bin/utree.pl?file=V4/nsys/ken/sysent.c">V4
sysent.c</a>
it's 'no system call', and then in the <a href="https://www.tuhs.org/cgi-bin/utree.pl?file=V5/usr/sys/ken/sysent.c">V5 sysent.c</a> system
call 20 is <code>getpid()</code>.</p>
<p>In V4 Unix, <code>seek()</code> still uses what are now C ints, but <a href="https://www.tuhs.org/cgi-bin/utree.pl?file=V4/man/man2/seek.2"><code>seek()</code>'s
manual page</a>
documents a very special hack to extend its range. If the third
parameter is 3, 4, or 5 instead of 0, 1, or 2, the seek offset is
multiplied by 512. At this point, C apparently didn't yet have a
<code>long</code> type that could be used to get 32-bit integers on the PDP-11,
so the actual kernel implementation of <code>seek()</code> used an array of
two <code>ints</code> (in <a href="https://www.tuhs.org/cgi-bin/utree.pl?file=V4/nsys/ken/sys2.c">ken/sys2.c</a>),
an implementation that stays more or less the same through V6's
kernel <code>seek()</code> (still in <a href="https://www.tuhs.org/cgi-bin/utree.pl?file=V6/usr/sys/ken/sys2.c">ken/sys2.c</a>).</p>
<p>(The V6 C compiler appears to have implemented support for a new
'<code>long</code>' C type modifier, but it doesn't seem to have been documented
in the C manual or used in, eg, the kernel's <code>seek()</code> implementation.
Interested parties can play around with it in places like <a href="https://research.swtch.com/v6/">this
online V6 emulator</a>.)</p>
<p>Then finally in V7, we have C <code>long</code>s and along with them a (renamed)
version of the <code>seek()</code> system call that finally fixes the limited
range issue by using <code>long</code>s instead of <code>int</code>s for the relevant
arguments (the off_t type would be many years in the future).
However, the <a href="https://www.tuhs.org/cgi-bin/utree.pl?file=V7/usr/man/man2/lseek.2">V7 <code>lseek()</code> system call</a>
thriftily reuses <code>seek()</code>'s system call number 19 (cf <a href="https://www.tuhs.org/cgi-bin/utree.pl?file=V7/usr/src/libc/sys/lseek.s">libc/sys/lseek.s</a>,
and you can compare this against <a href="https://www.tuhs.org/cgi-bin/utree.pl?file=V5/usr/source/s4/seek.s">the V5 lseek.s</a>).
It seems probable that this is why V7 renamed the system call from
<code>seek()</code> to <code>lseek()</code>, in order to force any old code using <code>seek()</code>
to fail to link. Since V7 C did not have function prototypes (they
too were years in the future), old code that called <code>seek()</code> with
<code>int</code> arguments would almost certainly have malfunctioned, passing
random things from the stack to the kernel as part of the system
call arguments.</p>
<p>(Old V6 binaries were on their own, but presumably this wasn't seen
as a problem in the early days of Unix.)</p>
<p>So the reason Unix uses '<code>lseek()</code>' instead of '<code>seek()</code>' is that
it once had a '<code>seek()</code>' system call that took ints as arguments
instead of longs, and when this system call changed to take longs
it was renamed to have an l in front to mark this, becoming
'<code>lseek()</code>'. The 'l' here is for 'long'. However, <a href="https://hackers.town/@zwol/111684752785283333">as covered by
Zack Weinberg</a>, this
is an odd use of 'l' in Unix system call names. In the stat() versus
lstat() case, the 'l' is for special treatment of symbolic links,
and both versions of the system call still exist.</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/unix/LseekWhyNamedThat?showcomments#comments">2 comments</a>.) </div>Why Unix's <code>lseek()</code> has that name instead of '<code>seek()</code>'2024-02-26T21:43:52Z2024-01-03T04:03:15Ztag:cspace@cks.mef.org,2009-03-24:/blog/sysadmin/NTPStratumAlertNotUsefulcks<div class="wikitext"><p>One of the concepts and jargon of <a href="https://en.wikipedia.org/wiki/Network_Time_Protocol">NTP (the Network Time Protocol)</a> is a NTP
server's <em>stratum</em>, which is roughly how far away (in NTP servers)
you are from an external source of time. External sources of time
are stratum 0, NTP servers that are directly connected to them are
stratum 1, servers that talk to those are stratum 2, and so on.
For load reasons, organizations with stratum 0 time sources often
put an extra level of NTP server in between the public and those
time sources; an internal NTP server talks directly to the clocks
(and is at stratum 1), while the organization's public NTP servers
that you can use are at stratum 2 (or higher, depending on the
architecture). <a href="https://nrc.canada.ca/en/certifications-evaluations-standards/canadas-official-time/network-time-protocol-ntp">The National Research Council Canada's public NTP
servers</a>
are at stratum 2, for example.</p>
<p><a href="https://support.cs.toronto.edu/">We</a> have long had a set of
internal NTP servers that are the local time source for our servers
(<a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/NTPDaemonWhen">for various reasons</a>). These synchronize to various
off-network NTP servers, both inside and outside the university.
As part of monitoring our environment, I wrote some tools to provide
<a href="https://prometheus.io/">Prometheus</a> metrics for NTP state and get
these into <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusGrafanaSetup-2019">our Prometheus environment</a>.
Once I had metrics, I wrote some alerts, including an alert for the
NTP stratum of our internal NTP servers being 'too high', which I
believe started out at stratum 6 (that's foreshadowing). For a long
time things were quiet, and then recently the alert started going
off every so often; one particular internal server was winding up
at various high stratum numbers. Every time I investigated, this
was at one level a legitimate occurrence; our NTP server (that we
were alerting about) was one stratum higher than the off-network
server it was synchronized to (as it's supposed to be), but that
off-network server was inexplicably at a relatively high stratum.</p>
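<p>(To give a concrete sketch, the alert was roughly of the following
shape in Prometheus's alerting rule format. The <code>ntp_stratum</code>
metric name here is invented for illustration, since our actual
metric names come from our local tools:)</p>
<blockquote><pre style="white-space: pre-wrap;">
- alert: NTPStratumTooHigh
  expr: ntp_stratum > 6
  for: 30m
  annotations:
    summary: "{{ $labels.instance }} is at NTP stratum {{ $value }}"
</pre>
</blockquote>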
<p>I tried adjusting the NTP stratum level that would trigger the alert
a few times, but by now I've come around to the idea that this alert
isn't useful in general and I'm probably going to remove it. We
already have other alerts that will trigger if our local NTP server
can't synchronize its time to anything or has time that's clearly
off from everything else, so the NTP stratum alert is really telling
us either that something weird is going on with our upstream NTP
servers or our NTP server daemon has a bug in its stratum calculation
(which isn't very likely).</p>
<p>Both of these are problems and maybe we should investigate the one
that we know is happening, but neither of them are really problems
that we can do anything about. If there's a bug in the NTP server
daemon we're using, our options are limited, and the off-network
servers that are behaving oddly aren't under our control at all so
our only option is to maybe stop using them. However, it's not clear
if we should do so. The NTP system has direct indications of the
quality of remote NTP servers and is carefully designed to reject
'false tickers', sources of bad time. The server's stratum is not
one of these markers of good or bad time; all it tells you is how
many hops away from a true clock the server thinks it is. A NTP
server can be a perfectly good source of time despite a high stratum,
and in our cases the affected off-network servers were; despite the
high stratum, our local NTP server was using the off-network server
with the high stratum as its time source (otherwise our local server
wouldn't have had its high stratum).</p>
<p>If we kept this alert I'd want to try to dig into the off-network
time servers that have this elevated stratum, because there's
something mysterious going on there and that means there are potential
problems. But there are mysteries and potential problems in many
places and I have to choose my quests. I can't try to chase down
every anomaly, even if there's a fixable problem at the root of
every one of them. I need to pick the ones that matter to us, and
someone else's NTP server having a high stratum is not one of those.</p>
</div>
Alerting on our NTP servers having a high NTP stratum hasn't been useful2024-02-26T21:43:53Z2024-01-02T03:41:37Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/SoftwareRaidSwitchingDiskscks<div class="wikitext"><p>Back at the start of this year I moved my (software RAID) root
filesystem on <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/HomeMachine2018">my home Fedora desktop</a> from a
mirrored pair of SATA SSDs to a pair of NVMe drives, and this time
I kept notes (although I didn't necessarily follow them). For my
future use, I'm going to write this up, complete with the steps
that I should have done but didn't.</p>
<p>(In this switch, my new disks are nvme0n1p3 and nvme1n1p3, my old
disks were sda3 and sdb3, and md10 was the official name of my root
filesystem's software RAID mirror.)</p>
<p>As is my custom with such disk switches, I first changed my root
filesystem software RAID to being a four way mirror, using both the
SATA SSDs and the NVMe drives. The process for this is to add the extra
devices and then increase the number of devices in the RAID:</p>
<blockquote><pre style="white-space: pre-wrap;">
mdadm -a /dev/md10 /dev/nvme0n1p3
mdadm -a /dev/md10 /dev/nvme1n1p3
mdadm -G -n 4 /dev/md10
</pre>
</blockquote>
<p>If you don't increase the number of devices, you've just added some
spares. This is definitely not what I want; when I do this, I want
the new drives to be in (full) use in parallel to the old ones, as
a burn-in test. (Often an extended one, as it was this time.)</p>
<p>(If you want you can add one device at a time then let your
system run that way for a bit, but I usually don't see any
reason to go through extra steps.)</p>
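<p>(To watch the resynchronization and the state of the array while
the new devices are catching up, the standard tools are all you
need, for example:)</p>
<blockquote><pre style="white-space: pre-wrap;">
cat /proc/mdstat
mdadm --detail /dev/md10
</pre>
</blockquote>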
<p>In the past you needed to update /etc/mdadm.conf to have the new
number of drives in your software RAID array and rebuild your
initramfs (to update its embedded copy of mdadm.conf) or you'd have
boot failures (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/RaidGrowthGotcha">cf</a>). Currently this isn't (or
wasn't) necessary on Fedora, as things appear to accept software
RAID arrays that have more member devices than mdadm.conf specifies,
as I found out when there was an unplanned machine freeze and reboot
before I did the initramfs update.</p>
<p>(<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SoftwareRaidDiskCountEffects">Alternately you should take the count of devices out entirely
from your mdadm.conf</a>. Your initramfs
will have to be rebuilt before this takes full effect, but you can
perhaps wait for this to happen as part of your distribution's next
kernel update.)</p>
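<p>(Either way, the mdadm.conf line involved is the array's ARRAY
line; as a sketch, with an invented UUID, ours would look something
like this with the device count present:)</p>
<blockquote><pre style="white-space: pre-wrap;">
ARRAY /dev/md10 num-devices=4 UUID=01234567:89abcdef:01234567:89abcdef
</pre>
</blockquote>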
<p>Once you've decided that your new drives are stable, you transition
away from the old devices by marking them failed and then removing
them:</p>
<blockquote><pre style="white-space: pre-wrap;">
mdadm --fail /dev/md10 /dev/sda3
mdadm --fail /dev/md10 /dev/sdb3
mdadm --remove /dev/md10 /dev/sda3
mdadm --remove /dev/md10 /dev/sdb3
</pre>
</blockquote>
<p>You must use '--remove', not '-r'. After doing this there are two
essential things you need to do, neither of which I actually did,
to my eventual sorrow. First, <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SoftwareRaidRemovingDiskGotcha"><strong>you have to zero the RAID superblocks
on the old devices</strong></a> (this has
been an issue for <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SoftwareRaidShiftingMirrorII">a long time</a>):</p>
<blockquote><pre style="white-space: pre-wrap;">
mdadm --zero-superblock /dev/sda3
mdadm --zero-superblock /dev/sdb3
</pre>
</blockquote>
<p>If you don't zero the old superblocks, <a href="https://mastodon.social/@cks/109739661990640799">your system may well reboot
with their old version of your root filesystem instead of the current
one</a>, and you'll
have to immediately halt the system and physically pull the old
drives (you might as well dust it out while you have it open, if
this is a desktop). If you had other stuff on the old drives in
addition to the old software RAID mirrors, well, you would be in
some trouble.</p>
<p>Once you've removed the old disks (and zeroed their superblocks),
you then need to shrink the number of devices in the software RAID
array back down to two devices (otherwise various things will
complain about missing devices):</p>
<blockquote><pre style="white-space: pre-wrap;">
mdadm -G -n 2 /dev/md10
</pre>
</blockquote>
<p>However, unlike the case of adding drives, <strong>after shrinking the
number of devices in the array you have to update /etc/mdadm.conf
to have the new device count and then rebuild your initramfs</strong> so
that it includes your new mdadm.conf; on Fedora this is done with
'<code>dracut --force</code>'. Fedora's Dracut initramfs environment will
accept a software RAID array with more devices than specified, but
(perhaps reasonably) it will refuse to accept one with fewer devices.
Alternately, <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SoftwareRaidDiskCountEffects">you can completely remove <code>num-devices=</code> from your
mdadm.conf</a>, although you'll still
need to rebuild your initramfs if you haven't done this already.</p>
<p>(If the initramfs does refuse the array, I believe you get dropped
into an emergency rescue shell and are left to fix things up yourself.
I didn't keep notes on this process; interested parties are encouraged
to experiment in a virtual machine.)</p>
<p>When I moved away from the old SATA SSDs, I forgot to zero the old
RAID superblocks and then (after fixing that) I discovered that I'd
incorrectly assumed that Fedora's initramfs didn't care about all
drive number changes. Hopefully I'll remember next time around, or
at least re-read this entry, which is (or was) current as of my
experiences in early to mid 2023 (things keep changing in this area
of Linux).</p>
<p>As advice for my future self, what I should have done is <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/AlwaysMakeAChecklist">written
out a full checklist in advance</a>
and then ticked things off as I went through them. This would have
made sure that I didn't forget important steps (like zeroing the
old RAID superblocks), or let them slide with the excuse that they'd
happen as a side effect of my next kernel update (because my system
can always reboot by surprise before then).</p>
<p>(I've written entries about this in the past, <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SoftwareRaidShiftingMirror">1</a>, <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SoftwareRaidShiftingMirrorII">2</a>,
<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SoftwareRaidRemovingDiskGotcha">3</a>, as well as <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ShrinkingSoftwareRAIDSwap">shrinking a
mirrored swap partition</a>.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SoftwareRaidSwitchingDisks?showcomments#comments">2 comments</a>.) </div>Switching Linux software RAID disks around in (early) 20232024-02-26T21:43:53Z2024-01-01T03:52:23Ztag:cspace@cks.mef.org,2009-03-24:/blog/tech/EmailAddressesBadPermanentIDscks<div class="wikitext"><p>Every so often someone needs to create a more or less permanent
internal identifier in their system for every person's account. Some
of the time they look at how <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/OIDCThreeEmailAddresses">authentication systems like OIDC
return email addresses among other data</a> and decide that since pretty
much everyone is giving them an email address, they'll use the email
address as the account's permanent internal identification.
As <a href="http://regex.info/blog/2006-09-15/247">the famous saying</a> goes,
now you have two problems.</p>
<p>The biggest problem with email addresses as 'permanent' identifiers
is that people's email addresses change even within a single
organization (for example, <a href="https://www.utoronto.ca/">a university</a>).
They change for the same collection of reasons that people's
<a href="https://utcc.utoronto.ca/~cks/space/blog/tech/FullLegalNamesProblems">commonly used names</a> and <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/LoginsDoChange">logins change</a>. An organization that refuses to change
or redo the email addresses it assigns to people is being unusually
cruel in ways that are probably not legally sustainable in any
number of places.</p>
<p>(Some of the time there will be some sort of access or forwarding
from the old email address to the new one, but even then the old
email address may no longer work for non-email purposes such as
OIDC authentication. And beyond that, the person won't want to keep
using their old and possibly uncomfortable email address with you,
they want to use their new current one.)</p>
<p>The lesser problem is that you have no particular guarantee that
an organization won't reuse email addresses, either in general or
for particularly desirable ones that get reused or reassigned as
an exception because someone powerful wants them. Sometimes you
sort of have no choice, because account recovery has to run through
the email address you have on file, but at other times (such as in
theory with <a href="https://en.wikipedia.org/wiki/OpenID#OpenID_Connect_(OIDC)">OIDC</a>), you
have some form of internal ID that is supposed to be unique and
permanent, which you should use.</p>
<p>Even if you have to remember an email address for account recovery,
<a href="https://utcc.utoronto.ca/~cks/space/blog/tech/PowerOfMeaninglessIDs">you want your internal identifier for accounts to be meaningless</a>. This will make your life much simpler in
the long run, even if this is never exposed to people.</p>
<p>(There are also security issues lurking in the underbrush of reading
too much into email addresses, <a href="https://trufflesecurity.com/blog/google-oauth-is-broken-sort-of/">cf</a> (<a href="https://news.ycombinator.com/item?id=38720544">via</a>).)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/tech/EmailAddressesBadPermanentIDs?showcomments#comments">2 comments</a>.) </div>Email addresses are not good 'permanent' identifiers for accounts2024-02-26T21:43:52Z2023-12-31T04:22:46Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/ZFSPanicsNotKernelPanicscks<div class="wikitext"><p>Suppose that you have <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSFileserverSetupIII">a ZFS based server</a> and
one day its kernel messages contain the following:</p>
<blockquote><pre style="white-space: pre-wrap;">
VERIFY3(sa.sa_magic == SA_MAGIC) failed (1446876386 == 3100762)
PANIC at zfs_quota.c:89:zpl_get_file_info()
Showing stack for process 6711
CPU: 13 PID: 6711 Comm: dp_sync_taskq Tainted: P O 5.15.0-88-generic #98-Ubuntu
Hardware name: Supermicro Super Server/X11SPH-nCTF, BIOS 2.0 11/29/2017
Call Trace:
<TASK>
show_stack+0x52/0x5c
dump_stack_lvl+0x4a/0x63
dump_stack+0x10/0x16
spl_dumpstack+0x29/0x2f [spl]
spl_panic+0xd1/0xe9 [spl]
? dbuf_rele_and_unlock+0x134/0x540 [zfs]
[...]
</pre>
</blockquote>
<p>Obviously you've hit a ZFS kernel panic, where ZFS handles internal
problems in <a href="https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSPanicOnCorruptionFlaw">its traditional way</a>,
which is to say by panicing and crashing your server. Except that
is almost certainly a lie.</p>
<p>Unless you've changed a non-obvious ZFS kernel parameter, your Linux
kernel has not actually paniced; ZFS is merely pretending that it
has. We can actually see this in the kernel stack trace being shown
here, which lists <a href="https://github.com/openzfs/zfs/blob/master/module/os/linux/spl/spl-err.c#L41">spl_dumpstack()</a>
and especially <a href="https://github.com/openzfs/zfs/blob/master/module/os/linux/spl/spl-err.c#L49">spl_panic()</a>.
There's also a comment about this in <a href="https://github.com/openzfs/zfs/blob/master/module/os/linux/spl/spl-err.c#L30">the source file</a>:</p>
<blockquote><p>It is often useful to actually have the panic crash the node so you
can then get notified of the event, get the crashdump for later
analysis and other such goodies. <br>
But we would still default to the current default of not to do that.</p>
</blockquote>
<p>Let me be clear: I think this is a terrible choice for almost
everyone except ZFS developers themselves. This looks like a kernel
panic to non-experts, in that it has 'PANIC' in the message, it
dumps very similar information to a <a href="https://en.wikipedia.org/wiki/Linux_kernel_oops">Linux kernel OOPS</a> or other 'panic',
and so on. However, because it's not an actual panic it won't trigger
<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/RebootOnPanicSettings">any kernel settings you've made to force reboots on panics</a>. Instead it will likely leave your ZFS
fileserver with a steadily increasing number of ZFS kernel threads
hung waiting for locks, and then force you to reboot things by hand
when the problems get really bad (probably uncleanly). This can
leave you rather puzzled about what's going on and cause unclear
system problems for the proverbial some time (we had one fileserver
last for over an hour in this state before it became non-functional
enough to trigger alerts).</p>
<p>To force ZFS on Linux to actually panic the kernel when ZFS hits
one of these internal 'panics', you need to set the SPL module
parameter <a href="https://openzfs.github.io/openzfs-docs/man/master/4/spl.4.html#spl_panic_halt">spl_panic_halt</a>
to 1. On a live system, this is done with:</p>
<blockquote><pre style="white-space: pre-wrap;">
echo 1 >/sys/module/spl/parameters/spl_panic_halt
</pre>
</blockquote>
<p>To make this permanent, you'll need to create a suitable .conf file
in /etc/modprobe.d, for example:</p>
<blockquote><pre style="white-space: pre-wrap;">
$ cat /etc/modprobe.d/spl.conf
options spl spl_panic_halt=1
</pre>
</blockquote>
<p>I recommend including some comments about why this is necessary, so
in the future you can understand why you have this mysterious setting.</p>
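<p>(For example, something like this, with the comment wording being
whatever will make sense to your future self:)</p>
<blockquote><pre style="white-space: pre-wrap;">
$ cat /etc/modprobe.d/spl.conf
# Make ZFS SPL 'PANIC's into real kernel panics so that our
# reboot-on-panic settings trigger instead of the machine wedging.
options spl spl_panic_halt=1
</pre>
</blockquote>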
<p>In an ideal world, the text 'PANIC' in these non-panics would be
replaced with something less misleading, like 'SPL-PANIC' or
'SPL-HALTING' (unless the system was actually panicing). That would
at least make it clear that this was not a regular kernel panic and
came from ZFS's SPL component, not the regular kernel. Better would
be to change the default of spl_panic_halt or to otherwise align
these SPL panics with normal Linux kernel bug handling.</p>
<p>PS: This doesn't happen in OpenZFS on FreeBSD, where <a href="https://github.com/openzfs/zfs/blob/master/module/os/freebsd/spl/spl_misc.c#L95">the FreeBSD
version of spl_panic()</a>
simply calls <a href="https://man.freebsd.org/cgi/man.cgi?query=vpanic&apropos=0&sektion=9&manpath=FreeBSD+11-current&format=html">vpanic(9)</a>
and so triggers FreeBSD's normal kernel panic behavior and
infrastructure.</p>
</div>
Your kernel panics in ZFS on Linux probably aren't actual kernel panics2024-02-26T21:43:53Z2023-12-30T02:46:26Ztag:cspace@cks.mef.org,2009-03-24:/blog/sysadmin/PrometheusBlackboxHTTPDurationscks<div class="wikitext"><p><a href="https://github.com/prometheus/blackbox_exporter">Prometheus's Blackbox exporter (<em>Blackbox</em>)</a> is the <a href="https://prometheus.io/">Prometheus</a> component that you usually use to make
external checks on services, such as whether a HTTP URL is responding
the way it should. <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusBlackboxNotes">Blackbox's various ways of checking things are
called <em>probes</em> (or probers)</a>, and they
can report various sorts of metrics; for instance, any Blackbox
check involving TLS will provide you with <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusBlackboxTLSExpiry">some TLS expiry metrics</a>. One of the HTTP metrics that Blackbox
provides is probe_http_duration_seconds, which reports on how
long various <em>phases</em> of an HTTP (or HTTPS) check took to complete.
However, Blackbox doesn't currently document what those phases are
or what they cover, and for reasons beyond the scope of this entry
I recently looked them up in <a href="https://github.com/prometheus/blackbox_exporter/blob/master/prober/http.go">the source code</a>,
cross referenced to the underlying tracing code in <a href="https://pkg.go.dev/net/http/httptrace">net/http/httptrace</a>.</p>
<p>At the moment, these phases are:</p>
<dl><dt><code>resolve</code></dt>
<dd>Performing DNS resolution; this isn't necessarily present if
your check uses an IP address instead of a name.<p>
</dd>
<dt><code>connect</code></dt>
<dd>How long it took to make the connection. I'm not sure how
this interacts with any connection pooling that Blackbox may be doing.</dd>
<dt><code>tls</code></dt>
<dd>How long it took to perform the TLS handshake, if this is a HTTPS
request. It's not present on HTTP requests.<p>
</dd>
<dt><code>processing</code></dt>
<dd>How long it took between completing the connection and
receiving the first byte of response. This appears to include the time
to transmit the request and any headers included.<p>
</dd>
<dt><code>transfer</code></dt>
<dd>How long it took between receiving the first byte and the
last byte of the connection.</dd>
</dl>
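<p>(For concreteness, the phase shows up as a label on the metric,
so a HTTPS check exposes something like the following; the numbers
here are invented for illustration:)</p>
<blockquote><pre style="white-space: pre-wrap;">
probe_http_duration_seconds{phase="resolve"} 0.004
probe_http_duration_seconds{phase="connect"} 0.001
probe_http_duration_seconds{phase="tls"} 0.016
probe_http_duration_seconds{phase="processing"} 0.087
probe_http_duration_seconds{phase="transfer"} 0.002
</pre>
</blockquote>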
<p>If your request followed a chain of redirections, I believe that
these numbers are summed across all of the requests for each phase.
You can partially detect redirections by looking for a non-zero
probe_http_redirects metric (sometimes in combination with
probe_http_status_code not being a 30x status code, because if
you have a check that specifically looks for a redirection but
doesn't follow it, probe_http_redirects will still be 1).
<p>If we assume that the time to transmit the request is negligible
and the server generated the entire response before it began
transmitting any bytes (which isn't uncommon in various situations),
then the server's internal response time is in the '<code>processing</code>'
phase. If the server streamed out the response as it was being
generated, the server's internal response time is some combination
of '<code>processing</code>' and part or almost all of '<code>transfer</code>'.</p>
<p>Slowness in '<code>connect</code>' and '<code>tls</code>' can be because of network issues
or because the host is heavily loaded. The TCP 'connect' phase is
normally mostly or entirely handled in the kernel, but TLS requires
various user level code to run and consume CPU for signing things,
so an overloaded web server might show load spikes here. I believe
that a very slow '<code>connect</code>' is likely a sign that the web server
is sufficiently overloaded that its (Unix) kernel has run out of
its <a href="https://utcc.utoronto.ca/~cks/space/blog/unix/ListenBacklogMeaning">'listen backlog'</a> and is now
waiting for the web server to accept some more connections (<a href="https://utcc.utoronto.ca/~cks/space/blog/unix/ConcurrentConnectionLimits">see
also</a>).</p>
<p>Blackbox also has a probe_duration_seconds general metric that
gives you the total time of the probe, covering everything done in
it. For HTTP probes, it's not necessarily the case that the time
for the '<code>processing</code>' phase is most of this total duration and
dominates the other phases; it depends on how fast the web server
is (and how little work you're asking it to do), in addition to
whether or not it's streaming a response as it generates it (which
is clearly the case for one server we check, based on the 'transfer'
times). Also, the sum of the time taken for all phases doesn't
necessarily match up to the probe_duration_seconds total; for
our checks the sum can be only around 90% of the total, although
the absolute differences are small.</p>
<p>My overall conclusion is that if you want to track how long processing
requests takes on your web servers, you probably don't want to rely
on Blackbox checks to give you a clear answer. You're better off
instrumenting things in the web server or your application; fortunately
Apache makes it relatively easy to log server side numbers if you want.</p>
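<p>(For example, one minimal Apache approach is to add '<code>%D</code>',
the time taken to serve the request in microseconds, to a log format;
a sketch, with the log file location being up to you:)</p>
<blockquote><pre style="white-space: pre-wrap;">
LogFormat "%h %l %u %t \"%r\" %>s %b %D" timed
CustomLog /var/log/apache2/timed-access.log timed
</pre>
</blockquote>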
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusBlackboxHTTPDurations?showcomments#comments">One comment</a>.) </div>The various phases of Prometheus Blackbox's HTTP probe2024-02-26T21:43:53Z2023-12-29T03:03:16Ztag:cspace@cks.mef.org,2009-03-24:/blog/web/CGINotSlowcks<div class="wikitext"><p>I recently read <a href="https://rednafi.com/go/reminiscing_cgi_scripts/">Reminiscing CGI scripts</a> (<a href="https://lobste.rs/s/jcm2am/reminiscing_cgi_scripts">via</a>), which talked
about <a href="https://en.wikipedia.org/wiki/Common_Gateway_Interface">CGI scripts</a> and in
passing mentioned that they fell out of favour for, well, let me
quote:</p>
<blockquote><p>CGI scripts have fallen out of favor primarily due to concerns related
to performance and security. [...]</p>
</blockquote>
<p>This is in one sense true. Back in the era when CGIs were pushed
aside by PHP and other more complex deployment environments like
Apache's <a href="https://perl.apache.org/">mod_perl</a> and <a href="https://modwsgi.readthedocs.io/">mod_wsgi</a>, their performance was an issue,
especially under what was then significant load. But this isn't
because CGI programs are intrinsically slow in an absolute sense;
it was because computers in the early 00s were not very powerful
and might even be heavily (over-)shared in virtual hosting environments.
When the computers acting as web servers couldn't do very much in
general, anything you could avoid making them do could make a
visible difference, including not starting a separate program or
two for each request.</p>
<p>Modern computers are much faster and more powerful than the early
00s servers where PHP shoved CGIs aside; even a low end VPS is
probably as good or better, with more memory, more CPU, and almost
always a much faster disk. And unsurprisingly, CGIs have gotten a
lot faster and a lot better at handling load in absolute terms.</p>
<p>To illustrate this, I put together a very basic CGI in Python and
Go, stuck them in <a href="https://www.cs.toronto.edu/~cks/">my area on our general purpose web server</a>, and tested how fast they would
run. On our run of the mill Ubuntu web server, the Python version
took around 17 milliseconds to run and the Go version around four
milliseconds (in both cases when they'd been run somewhat recently).
Because the CGIs are more or less doing nothing in both cases, this
is pretty much measuring the execution overhead of running a CGI.
A real Python CGI would take longer to start because it has more
things to import, but even then it's not necessarily terribly slow.</p>
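<p>(A 'do nothing' CGI of the sort I mean is genuinely tiny. As a
sketch, not the literal script I timed, the Python version amounts
to printing the mandatory header, a blank line, and a token body:)</p>
<blockquote><pre style="white-space: pre-wrap;">
#!/usr/bin/python3
# Emit the required CGI header, the blank line that ends the
# headers, and a minimal body.
print("Content-Type: text/plain")
print("")
print("hello world")
</pre>
</blockquote>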
<p>(As another data point, I have ongoing numbers for the response
time of <a href="https://utcc.utoronto.ca/~cks/space/blog/">Wandering Thoughts</a>, which is <a href="https://utcc.utoronto.ca/~cks/space/blog/web/DynamicNeedNotBeSlow">a rather complex
piece of Python normally running as a CGI</a>.
On fairly basic (virtual) hardware, it seems to average about 0.17
seconds for the front page (including things like TLS overhead),
which is down noticeably from a decade ago.)</p>
<p>Given that <a href="https://utcc.utoronto.ca/~cks/space/blog/web/CGIAttractions">CGI scripts have their attractions for modest scale
pages and sites</a>, it's useful to know that CGI
programs are not as terrible as they're often made out to be in old
sources (or people working from old sources). Using a CGI program
is a perfectly good deployment strategy for many web applications
(and <a href="https://utcc.utoronto.ca/~cks/space/blog/web/ApacheCGIsAndLocationACLs">you can take advantage of other general web server features</a>).</p>
<p>(Yes, your CGI may slow down if you're getting a hundred hits a
second. How likely is that to happen, and if it does, how critical
is the slowdown? There are some environments where you absolutely
want and need to plan for this, but also quite a number where you
don't.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/web/CGINotSlow?showcomments#comments">2 comments</a>.) </div>Web CGI programs aren't particularly slow these days2024-02-26T21:43:53Z2023-12-28T04:01:48Ztag:cspace@cks.mef.org,2009-03-24:/blog/sysadmin/EnvironmentsAndImmitationscks<div class="wikitext"><p>I've been using <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/ToolsEmail">exmh</a> to read and handle my email for
what is now a very long time, and I'm completely used to its specific
features and behaviors. Or rather that was true until very recently,
when for <a href="https://utcc.utoronto.ca/~cks/space/blog/tech/LatencyImpactMyXExperience">reasons beyond the scope of this entry</a> I switched more or less completely
away from exmh to <a href="https://www.gnu.org/software/emacs/manual/html_mono/mh-e.html">MH-E in GNU Emacs</a>.
Me being me and GNU Emacs being GNU Emacs, this immediately set me
off on <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/CustomizationSensibleLimits">an extended process of customizing MH-E to work better for
me</a>.</p>
<p>When I started on this journey of customization, my more or less
implicit goal was to reproduce as much of my <a href="http://www.beedub.com/exmh/">exmh environment</a> as possible in MH-E. It was the
productive environment I was used to and had a lot of experience
with (and had customized significantly). If I had to give up exmh
for <a href="https://utcc.utoronto.ca/~cks/space/blog/tech/LatencyImpactMyXExperience">latency reasons</a>, I at
least wanted something that was as much like it as possible, including
several idiosyncratic exmh features. This was a somewhat awkward
process, since GNU Emacs doesn't work like exmh, or even like
<a href="https://www.tcl.tk/">Tcl/Tk</a>; there are various things that are
easy and natural in Tcl/Tk that are rather awkward in GNU Emacs,
like buttons and <a href="https://utcc.utoronto.ca/~cks/space/blog/programming/EmacsDynamicMenubarMenus">menus</a>.
I also mostly set MH-E's customizations to work like exmh, for
example preferring the plain text version of email over HTML.</p>
<p>A certain amount of this was, in retrospect, somewhat of a mistake.
I've written some code to imitate exmh behavior as much as possible
that I've never actually used outside of testing, because the
imitation isn't all that good. Some of my substitutes for exmh
features are perhaps unnecessary clutter, even if they work and I
use them sometimes. And <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/EmailToolsAffectMyBehavior">MH-E has its own strengths that have
changed how I read email</a>, which has
led to me modifying how various pieces of my NMH infrastructure
work to improve my experience in MH-E. I've also more or less
adjusted to MH-E's key bindings instead of exmh's. And after I
stopped trying so hard to make MH-E like exmh, some of what I did
was to actively improve MH-E features in a GNU Emacs way (such as
better folder name completion for my tastes, or augmenting existing
MH-E features); I might have been further ahead if I'd started with
them.</p>
<p>(I'm also glad that I didn't try to go further down a rabbit hole
of trying to make Tcl/Tk like buttons and so on in GNU Emacs, which
I was tempted by at one point. Honestly, that was partly decided
by seeing how comparatively unresponsive many GNU Emacs UI elements
already are. Basic GNU Emacs text rendering may be faster than exmh
even with <a href="https://utcc.utoronto.ca/~cks/space/blog/tech/LatencyImpactMyXExperience">latency issues</a>,
but some other things are clearly worse, possibly for internal Emacs
structural reasons.)</p>
<p>On the whole, I think I would have been better off if I had started
out trying MH-E on its own merits and its own terms, rather than
attempting to turn it into an imitation of exmh. There's certainly
things I've added that are useful and that are similar to exmh (and
ideas I've taken from exmh), but there are also things in MH-E that
I passed on for too long because they didn't match my exmh expectations.
For example, MH-E has significantly better rendering of HTML email,
so I no longer set MH-E to prefer plain text parts.</p>
<p>This is probably not the first time I've tried to make a new
environment imitate an old environment, and <a href="https://utcc.utoronto.ca/~cks/space/blog/unix/WaylandMyView2021">it's basically
certain not to be the last time</a>.
Perhaps next time around I can try out the lesson I sort of
learned this time around, and start out by letting the new
environment be itself.</p>
</div>
Maybe learning to let new environments be themselves (more or less)2024-02-26T21:43:53Z2023-12-27T04:01:04Ztag:cspace@cks.mef.org,2009-03-24:/blog/tech/StandardsAndBadContentcks<div class="wikitext"><p>In a comment on <a href="https://utcc.utoronto.ca/~cks/space/blog/spam/SMTPSmugglingConsequences">my entry on what I think SMTP Smuggling enables</a>, <a href="https://leahneukirchen.org/">Leah Neukirchen</a> noted something important, which is
that <a href="https://en.wikipedia.org/wiki/Simple_Mail_Transfer_Protocol">SMTP</a>
messages that contain a CR or a LF by itself aren't legal:</p>
<blockquote><p>I disagree. The first mail server is also accepting a message
with a non-CRLF LF, which violates RFC 5322 <a href="https://datatracker.ietf.org/doc/html/rfc5322#section-2.3">section 2.3</a></p>
<blockquote><p>CR and LF <strong>MUST</strong> only occur together as CRLF; they <strong>MUST NOT</strong>
appear independently in the body.</p>
</blockquote>
</blockquote>
<p>The capitalization in the RFC quote is original, the emphasis is
mine, and <a href="https://datatracker.ietf.org/doc/html/rfc2119">the meaning of these terms is covered in RFC 2119</a>. What it adds up to
is unambiguous at one level; a SMTP message that contains a bare CR
or LF isn't an RFC 2119 compliant message, much like a C program with
undefined behavior isn't a valid ANSI C program.</p>
<p>But just like the ANSI C standard doesn't (as far as I know) put
any requirements on how a C compiler handles a non-ANSI-C program,
RFC 2119 provides no requirements or guidance on what you should
or must do with a non-compliant message. This is quite common in
standards; standards often spell out only what is within their scope
and what must be done with those things. They've historically been
silent about non-standard things, leaving it entirely to the
implementer. When it comes to protocol elements, this generally
means rejecting them (you don't try to guess what unknown SMTP
commands are), but when it comes to things you don't act on like
email message content, things are much fuzzier.</p>
<p>At this point two things often intervene. The first is <a href="https://en.wikipedia.org/wiki/Robustness_principle">Postel's
Law</a>, which
suggests people accept things outside the standard. The second is
that strong standards compliance is often actively inconvenient or
problematic for people using the software. I've lived life behind
a SMTP mailer that had strong feelings about RFC compliance (at
least in some areas), and by and large we didn't like it. Strict
software is often unpopular software, which pushes people writing
software to appeal to Postel's Law in the absence of anything else.
If you don't even have an RFC to point to that says 'you SHOULD
reject this' (or 'you MUST reject this') and you have people banging
on your door wanting you to be liberal, often the squeaky wheel
gets the grease (or has gotten until recently; these days people
are somewhat less enamored of <a href="https://en.wikipedia.org/wiki/Robustness_principle">Postel's Law</a>, for various reasons
including security issues).</p>
<p>(C compilers and their reaction to undefined behavior is a complex
subject, but I don't know of any mainstream compiler that will
actually reject code that has known undefined behavior.)</p>
<p>At this point there's not much we can do here. It's obviously much
too late for existing RFCs and standards that don't have any
requirements or guidance on what you should do about bad contents,
and I'm not sure that people would agree on adding it anyway. People
can attempt to be strict and hope that not much will be affected,
or they can try to write rules about error recovery (which HTML
eventually did in HTML5) to encourage software to all do the same,
agreed-on thing. But these will probably mostly be reactive things,
not proactive ones (so we're probably about to see a wave of SMTP
mailers getting strict in the wake of <a href="https://utcc.utoronto.ca/~cks/space/blog/spam/SMTPSmugglingBackground">SMTP Smuggling</a>).</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/tech/StandardsAndBadContent?showcomments#comments">One comment</a>.) </div>Standards often provide little guidance for handling 'bad' content2024-02-26T21:43:52Z2023-12-26T02:44:15Ztag:cspace@cks.mef.org,2009-03-24:/blog/sysadmin/AreNegativeAccessRulesNeededcks<div class="wikitext"><p>Every so often, someone has the great idea to simplify how to specify
access controls; they will make their overall policy to deny access
by default, and then have all of their access control rules be
positive ones. My reaction to date to these systems is that they've
just made my life harder, and I was all set to write an entry about
it except that as I started writing out the details I realized that
I wasn't sure I had a convincing case when negative access rules
made things much less convoluted.</p>
<p>In general purpose firewalls, there are definitely cases where you
need negative access rules in order to not make things extremely
convoluted. If you want to allow all of the Internet access to port
80 and 443 on your web server, except for a collection of IP addresses
that you've determined are abusing it, you can in theory write this
as a set of positive only rules, but you will be quite annoyed at
having to break up the entire IP address space into a set of subnets
that amount to 'everything but these IP addresses'.</p>
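<p>(As an illustration of the version with a negative rule, in
something like nftables you might write the following inside an
appropriate chain, assuming an 'abusers' set that holds the blocked
IP addresses; this is a sketch, not a full ruleset:)</p>
<blockquote><pre style="white-space: pre-wrap;">
ip saddr @abusers tcp dport { 80, 443 } drop
tcp dport { 80, 443 } accept
</pre>
</blockquote>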
<p>In a more restricted setting, such as <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/VPNMeshAppeal">a mesh-capable VPN</a> where you're trying to write out who will have
access to what, I'm now not so sure. It feels like negative entries
would make some sorts of rules easier, so you could easily say
things like 'everyone has full access to their own devices and
system administrators have full access to everything except people's
own devices', but you can solve this by enumerating what 'everything'
is here for system administrators, and arguably you should do so
because being explicit is better (and you might discover cases where
you don't want sysadmins to have full access after all, which you
wouldn't have spotted if you'd been able to sweep it under the rug).
To some extent this also depends on how much the ACL system allows
you to group things, which in a peculiar way is somewhat like the
case of blocking IP addresses on firewalls.</p>
<p>Of course, I would prefer to have more power in access control rules
than less power, so in practice I always want both positive and
negative rules and positive and negative matches. And I suspect
that you can always mechanically translate a set of positive and
negative rules into a set of positive only rules, although possibly
a quite verbose one. I also believe that negative rules let people
more directly express what they want; every 'except' you write into
a high level description of what you want is a negative rule wanting
to be written.</p>
<p>(This is one of those entries that wound up going in a completely
different direction than I expected when I started writing it. Possibly
I will wind up finding or being told about a counterexample, too,
showing that we really do need negative rules even in relatively
restricted contexts.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/AreNegativeAccessRulesNeeded?showcomments#comments">3 comments</a>.) </div>Do we actually need negative access control rules (in general)?2024-02-26T21:43:53Z2023-12-25T04:27:06Ztag:cspace@cks.mef.org,2009-03-24:/blog/spam/DKIMAloneMeansLittlecks<div class="wikitext"><p>In yesterday's entry on <a href="https://utcc.utoronto.ca/~cks/space/blog/spam/SMTPSmugglingConsequences">what I think the SMTP Smuggling attack
enables</a>, I casually said that you were
safe if you ignored <a href="https://en.wikipedia.org/wiki/Sender_Policy_Framework">SPF</a> results and
only paid attention to <a href="https://en.wikipedia.org/wiki/DomainKeys_Identified_Mail">DKIM</a>. As
sometimes happens, this was my thoughts eliding some important
qualifications that I just take as given when talking about DKIM,
but that I should spell out. The most important qualification is
that <strong>a (valid) DKIM signature by itself means almost nothing</strong>,
which is a bit unlike how SPF works.</p>
<p>First off, <a href="https://utcc.utoronto.ca/~cks/space/blog/spam/DKIMWithDMARCNotes">anyone can DKIM sign a message</a>,
provided that they control a bit of DNS (you could probably even
do it in a mail client). Quite a lot of people, including spammers,
can even DKIM sign email that is <a href="https://utcc.utoronto.ca/~cks/space/blog/spam/DMARCAndDKIMAlignment">'aligned'</a>
with the 'From:' header, which means that the DKIM signature is
from the From: domain, not just from some random domain. A valid
DKIM signature does provide <a href="https://utcc.utoronto.ca/~cks/space/blog/spam/DKIMProvidesAttribution">definite attribution</a>, and if it's for the From: domain, it
more or less identifies who authorized the mail. Also, in practice
<a href="https://utcc.utoronto.ca/~cks/space/blog/spam/DKIMAsSignal">lack of a DKIM signature is itself a signal</a>, because
<a href="https://utcc.utoronto.ca/~cks/space/blog/spam/DKIMSigningMostlyMandatory">an increasing number of places more or less require a DKIM signature</a>, sometimes one that is from the From:
domain.</p>
<p>(However, some people only have SPF records and <a href="https://utcc.utoronto.ca/~cks/space/blog/spam/BlockingAutomaticForwarding">this can be
deliberately used to create email that can't be easily forwarded</a>.)</p>
<p>A valid DKIM signature for the From: domain is at least as strong
a sign as an SPF pass result. However, this doesn't mean that the
email is any good, any more than an SPF pass does; spammers can and
do pass both checks. Similarly, lack of a valid DKIM signature for
the From: domain doesn't mean that it's not from that domain. To
have some idea of that you need to check the domain's <a href="https://dmarc.org/">DMARC</a> policy. In effect, the equivalent of SPF is
the combination of DKIM and DMARC (or something like it).</p>
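<p>(For illustration, a domain's DMARC policy is published as a DNS
TXT record; a minimal sketch, with example.org standing in for a
real domain:)</p>
<blockquote><pre style="white-space: pre-wrap;">
_dmarc.example.org.  IN  TXT  "v=DMARC1; p=quarantine"
</pre>
</blockquote>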
<p>So when I casually wrote about (only) paying attention to DKIM, I
was implicitly thinking of using DKIM along with something else to
tell you when DKIM results matter. This might be specific knowledge
of which important domains you deal with DKIM sign their email
(including your own domain), or it might mean checking DMARC, or
both. And of course you can ignore both SPF and DKIM signatures,
apart perhaps from logging DKIM results.</p>
<p>(<a href="https://support.cs.toronto.edu/">We</a> don't explicitly use DKIM
signatures and DMARC in our Exim configuration, but these days we
use <a href="https://rspamd.com/">rspamd</a> for spam scoring and I think it
makes some use of DKIM and perhaps DMARC.)</p>
</div>
A DKIM signature on email by itself means very little2024-02-26T21:43:53Z2023-12-24T03:49:08Ztag:cspace@cks.mef.org,2009-03-24:/blog/spam/SMTPSmugglingConsequencescks<div class="wikitext"><p>The very brief summary of <a href="https://utcc.utoronto.ca/~cks/space/blog/spam/SMTPSmugglingBackground">SEC Consult's "SMTP Smuggling" attack</a> is that under the right circumstances,
it allows you (the attacker) to cause one mail server to 'submit'
an email with contents and SMTP envelope information that you provide
to a second mail server. To the second email server, this smuggled
email will appear to have come from the first mail server (because
it did), and can inherit some of the authentication the first mail
server has.</p>
<p>(It's important to understand that the actual vulnerability is in
the second mail server, not the first one; the first one can and
often must be completely RFC compliant in its behavior.)</p>
<p>The obvious authentication that the smuggled email inherits is <a href="https://en.wikipedia.org/wiki/Sender_Policy_Framework">SPF</a>, because
that's based on the combination of the sending IP (the first mail
server) and the SMTP envelope sender (and possibly message From:),
which is under your control. So you can put in a SMTP envelope
sender (and a From:) that claims to be 'from' the first mail server,
and the second mail server will accept it as authentic.</p>
<p>(An almost as obvious thing is that the smuggled email gets to share
in whatever good reputation the sending email server has with the
receiver. This is most useful if you can get a big, high reputation
mail system to be the first server, <a href="https://sec-consult.com/blog/detail/smtp-smuggling-spoofing-e-mails-worldwide/">which is possible</a>
(or perhaps 'was' by the time you're reading this).)</p>
<p>If you forge email as being from something that has a <a href="https://en.wikipedia.org/wiki/DMARC">DMARC</a> policy that passes the policy
if SPF passes, you can also get your forged email to pass DMARC
checks. The same is true if the second email server happens to be
something that imposes its own implicit DMARC-like policy that
accepts email if SPF passes (and possibly requires that SPF is <a href="https://utcc.utoronto.ca/~cks/space/blog/spam/DMARCAndDKIMAlignment">'aligned'</a> with the From: message address).</p>
<p>What you can't fully do is inherit <a href="https://en.wikipedia.org/wiki/DomainKeys_Identified_Mail">DKIM</a>
authentication. You can add your own valid DKIM headers to your
smuggled email, but you can only do this for domains with DNS under
your control (or domains where you've managed to obtain the DKIM
signing keys). This probably doesn't include the first email server
and its domain, and because the first email server doesn't recognize
your smuggled email as an actual email message, it won't DKIM sign
the email for you. The only way you can get the domain of the first
email server to DKIM sign your smuggled email for you is if the second
email server is also an internal one belonging to the same domain
and it will DKIM sign outgoing messages. This general configuration
is reasonably common (incoming and outgoing email servers are often
different), but usually they run the same mail software and so they
won't have the different interpretations of the email message(s)
that <a href="https://sec-consult.com/blog/detail/smtp-smuggling-spoofing-e-mails-worldwide/">SMTP Smuggling</a>
needs.</p>
<p>The result of this is that if the second (receiving) email server
doesn't check SPF results and only pays attention to DKIM (<a href="https://utcc.utoronto.ca/~cks/space/blog/spam/DKIMSigningMostlyMandatory">which
is increasingly mandatory in practice</a>),
it's almost completely safe from SMTP Smuggling even if it accepts
things other than 'CR LF . CR LF' as the email message terminator.
Since <a href="https://utcc.utoronto.ca/~cks/space/blog/spam/BlockingAutomaticForwarding">SPF breaks things</a> (<a href="https://utcc.utoronto.ca/~cks/space/blog/spam/AnInternetRule">also</a>), this is what I feel you should already be doing.</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/spam/SMTPSmugglingConsequences?showcomments#comments">3 comments</a>.) </div>What I think the 'SMTP Smuggling' attack enables2024-02-26T21:43:53Z2023-12-23T02:49:58Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/SystemdStallAfterTooFastRestartscks<div class="wikitext"><p>Over on the Fediverse, <a href="https://mastodon.social/@cks/111619851045624217">I said something</a>:</p>
<blockquote><p>Recently I learned that if you manually restart a systemd service too
often (with 'systemctl restart ...'), systemd will by default stop
starting it:</p>
<pre style="white-space: pre-wrap;">
<x>.service: Start request repeated too quickly.
<x>.service: Failed with result 'start-limit-hit'.
Failed to start <x>.service - Whatever it is.
</pre>
<p>Why would you do that, you ask? Well, consider scripts that update
some data file and do a 'systemctl restart ...' to make the daemon
notice it. Now try to do a bunch of updates all at once.</p>
</blockquote>
<p>The traditional way to have systemd stop starting a service is <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SystemdRestartUseDelay">for
it to have a 'Restart=' setting with no restart delay, and then to
fail on startup</a>. Sometimes it's failing on
start because your machine is out of memory; sometimes it's because
you've made an error in its configuration files.
However, if you read the actual documentation for <a href="https://www.freedesktop.org/software/systemd/man/latest/systemd.unit.html#StartLimitIntervalSec=interval">StartLimitIntervalSec
and StartLimitBurst</a>,
they don't say they're limited to the 'Restart=' case. Here's what they
say, emphasis mine:</p>
<blockquote><p>Configure unit start rate limiting. <strong>Units which are started more
than <em>burst</em> times within an <em>interval</em> time span are not permitted to
start any more.</strong> [...]</p>
<p>These configuration options are particularly useful in conjunction
with the service setting <code>Restart=</code> (see <a href="https://www.freedesktop.org/software/systemd/man/latest/systemd.service.html">systemd.service(5)</a>);
however, <strong>they apply to all kinds of starts (including manual)</strong>, not
just those triggered by the Restart= logic.</p>
</blockquote>
<p>The way you clear this condition is also sort of mentioned in that
section of the manual page; '<code>systemctl reset-failed</code>' will reset
this counter and allow you to immediately (re)start the unit again.
If you want, you can restrict the resetting to just your particular
unit.</p>
<p>The default limits for this rate limiting are likely visible in the
commented out default values in <a href="https://www.freedesktop.org/software/systemd/man/latest/systemd-system.conf.html">/etc/systemd/system.conf</a>.
The normal standard values are five restarts in ten seconds (<a href="https://www.freedesktop.org/software/systemd/man/latest/systemd-system.conf.html#DefaultStartLimitIntervalSec=">cf</a>)
and it appears that neither Fedora nor Ubuntu change these defaults,
so that's probably what you'll see.</p>
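<p>(You can change these for a specific service with StartLimitIntervalSec=
and StartLimitBurst= in its [Unit] section, for example in a drop-in
file; per the documentation, setting StartLimitIntervalSec= to 0
disables the rate limiting entirely. A sketch:)</p>
<blockquote><pre style="white-space: pre-wrap;">
# /etc/systemd/system/<x>.service.d/startlimit.conf
[Unit]
StartLimitIntervalSec=60
StartLimitBurst=20
</pre>
</blockquote>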
<p>You might wonder how you get yourself into this situation in the
first place. Suppose that <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/SystemEvolution">you have a script to add an entry to a
DHCP configuration file</a>, which as
part of activating the entry has to restart the DHCP server (because
it doesn't support on the fly configuration reloading). Now suppose
you have a bunch of entries to add; you might write a script (or a
for loop) to effectively bulk add them as fast as the commands can
run. When you run that script, you'll be restarting the DHCP server
repeatedly, as fast as possible, and it won't take too long before
you trigger systemd's default limit (since all you need with the
default limits is to go through the whole thing in less than two
seconds per invocation).</p>
<p>If you're doing this in a script, the two solutions I see are to
always make the script sleep for three seconds or so after a restart,
or to run 'systemctl reset-failed <service>' either at the end of
the script or before you start doing any 'systemctl restart's.</p>
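<p>(In script form, the second option might look something like this
sketch, where 'add-dhcp-entry' and the unit name are stand-ins for
whatever your local environment actually uses; here the counter is
reset just before each restart, which also works:)</p>
<blockquote><pre style="white-space: pre-wrap;">
#!/bin/sh
# Bulk-add DHCP entries, restarting the server after each one.
for host in "$@"; do
    add-dhcp-entry "$host"
    # Clear systemd's start counter so this restart is allowed.
    systemctl reset-failed dhcpd.service
    systemctl restart dhcpd.service
done
</pre>
</blockquote>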
<p>(I'm not sure which of these we'll adopt.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SystemdStallAfterTooFastRestarts?showcomments#comments">3 comments</a>.) </div>Systemd will block a service's start if you manually restart it too fast2024-02-26T21:43:53Z2023-12-22T03:44:58Ztag:cspace@cks.mef.org,2009-03-24:/blog/spam/SMTPSmugglingBackgroundcks<div class="wikitext"><p>The recent email news is SEC Consult's <a href="https://sec-consult.com/blog/detail/smtp-smuggling-spoofing-e-mails-worldwide/">SMTP Smuggling - Spoofing
E-Mails Worldwide</a>
(<a href="https://mastodon.social/@campuscodi/111607963579342899">via</a>),
which <a href="https://mastodon.social/@cks/111608436688958543">I had a reaction to</a>. I found the
article's explanation of SMTP Smuggling a little hard to follow,
so for reasons that don't fit within the scope of today's entry,
I'm going to re-explain the central issue in my own way.</p>
<p><a href="https://en.wikipedia.org/wiki/Simple_Mail_Transfer_Protocol">SMTP</a>
is a very old Internet protocol, and like a variety of old Internet
protocols it has what is now an odd and unusual core model. Without
extensions, everything in SMTP is line based, with the sender and
receiver exchanging a series of 7-bit ASCII lines for commands,
command responses, and the actual email messages (which are sent
as a block of text in the 'DATA' phase, ie after the sender has
sent a 'DATA' SMTP command and the receiver has accepted it). Since
SMTP is line based, email messages are also considered to be a
series of lines, although the contents of those lines are (mostly)
not interpreted. SMTP needs to signal the end of the email text
being transmitted, and as a line based protocol it does this by a
special marker line; a '.' on a line by itself marks the end of the
message.</p>
<p>(In theory there's a defined quoting and de-quoting process if an
actual line of the message starts with a '.'; see <a href="https://datatracker.ietf.org/doc/html/rfc821">RFC 821</a> section 4.5.2, which
is still there basically intact in <a href="https://datatracker.ietf.org/doc/html/rfc5321#section-4.5.2">RFC 5321 section 4.5.2</a>. In
practice, actual mailer behavior has historically varied.)</p>
<p>When you have a line based protocol you must decide how the end of
lines are marked (the <em>line terminator</em>). In SMTP, the official
line terminator is the two byte (two octet) sequence 'CR LF', because
this was the fashion at the time. This includes the lines that are
part of the email message that is sent in the DATA phase, and so
the last five octets sent at the end of a standards-compliant SMTP
message are 'CR LF . CR LF'. The first 'CR LF' is the end of the
last line of the actual message, and then '. CR LF' makes up the
'.' on a line by itself.</p>
<p>(This means that all lines of the message itself are supposed to
be terminated with 'CR LF', regardless of whatever the native line
terminator is for the systems involved. If you're doing SMTP properly,
you can't just blast out or read in the raw bytes of the message,
even apart from <a href="https://datatracker.ietf.org/doc/html/rfc5321#section-4.5.2">RFC 5321 section 4.5.2</a> concerns. There are
various ESMTP extensions that can change this.)</p>
<p>Unfortunately, SMTP's definition makes life quite inconvenient for
systems that don't use CR LF as their native line ending, such as
Unix (which uses just LF, \n). Because SMTP considers the email
message itself to be a sequence of lines (and there's a line length
limit), a Unix SMTP mailer has to keep translating all of the lines
in every email message it sends or receives back and forth between
lines ending in \n (the native format) and \r\n (the SMTP wire
format). Doing this translation raises various questions about what
you should send if you encounter a \r (or a \r\n) in a message as
you send it, or encounter a bare \n (or \r) in a message as you
receive it. It also invites shortcuts, such as turning \r\n into
\n as you read data and then dealing with everything as Unix lines.</p>
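<p>(As an illustration, here is a minimal Go sketch of the sending
side of this translation, with dot-quoting included; it's not any
real mailer's code and it ignores issues like embedded bare \r and
the line length limit:)</p>
<blockquote><pre style="white-space: pre-wrap;">
package main

import (
	"fmt"
	"strings"
)

// toWire converts one Unix-format line (with or without its
// trailing \n) into SMTP wire format, applying the RFC 5321
// section 4.5.2 dot-quoting for lines that start with '.'.
func toWire(line string) string {
	line = strings.TrimRight(line, "\r\n")
	if strings.HasPrefix(line, ".") {
		line = "." + line
	}
	return line + "\r\n"
}

func main() {
	fmt.Printf("%q\n", toWire(".hidden\n")) // "..hidden\r\n"
}
</pre>
</blockquote>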
<p>Partly for this reason and partly because CR LF line endings make
various people grumpy, there has been somewhat of a tradition of
mailers accepting other things as line endings in SMTP, not just CR
LF. Historically a variety of Unix mailers accepted just LF, and I
believe that some mailers have accepted just CR. Even today, finding
SMTP listeners that absolutely require 'CR LF' as the line ending on
SMTP commands isn't entirely common (GMail's SMTP listener doesn't,
for example, although possibly this will cause it to be unhappy with
your email, and I haven't tested its behavior for message bodies). As a
result, such mailers can accept things other than 'CR LF . CR LF' as the
SMTP DATA phase message terminator. Exactly what a mailer accepts can
vary depending on how it implemented things.</p>
<p>(For instance, a mailer might turn '\r\n' into '\n' and accept '\n'
as a line terminator, but only after checking for a line that was
an explicit '. CR LF'. Then you could end messages with 'LF . CR
LF', without the initial 'CR'; the bare LF would be taken as the
line terminator for the last data line, then you have the '. CR LF'
of the official terminator sequence. But if you sent 'LF . LF',
that wouldn't be recognized as the message terminator.)</p>
<p>This leads to the core of SMTP Smuggling, which is embedding an
improper SMTP message termination in an email message (for example,
'LF . LF'), then after it adding SMTP commands and message data to
submit another message (the <em>smuggled message</em>). To make this do
anything useful we need to find an SMTP server that will accept our
message with the embedded improper terminator, then send the whole
thing to another mail server that will treat the improper terminator
as a real terminator, splitting what was one message into two, sent
one after the other. The second mail server will see the additional
mail message as coming from the first mail server, although it
really came from us, and this may allow us to forge message data
that we couldn't otherwise.</p>
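<p>(Schematically, the data the attacker submits as a single message
looks something like the following, where '<LF>' stands for a bare
newline byte and all of the addresses are invented:)</p>
<blockquote><pre style="white-space: pre-wrap;">
[... MAIL FROM, RCPT TO, and DATA for the outer message ...]
Subject: an innocent looking message

body of the outer message
<LF>.<LF>
MAIL FROM:<forged@trusted.example>
RCPT TO:<target@victim.example>
DATA
[... headers and body of the smuggled message ...]
.
</pre>
</blockquote>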
<p>(There are various requirements to make this work; for example, the
second mail server has to accept being handed a whole block of SMTP
commands all at once. These days this is a fairly common thing due
to an ESMTP extension for 'pipelining', and also because SMTP
receivers have to do extra work to detect and reject getting handed
a block of stuff like this. See <a href="https://sec-consult.com/blog/detail/smtp-smuggling-spoofing-e-mails-worldwide/">the original article</a>
for the gory details and an extended discussion.)</p>
<p>What you can do with SMTP Smuggling in practice has some limitations
and qualifications, but that's for another entry.</p>
</div>
The (historical) background of 'SMTP Smuggling'2024-02-26T21:43:53Z2023-12-21T03:55:44Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/GrubUnknownFilesystemWhycks<div class="wikitext"><p>Over on the Fediverse, <a href="https://mastodon.social/@cks/111604177856526444">I said something</a>:</p>
<blockquote><p>I hope that the Grub developers will someday fix grub-install so that
the "unknown filesystem" error is replaced with a better one, like
"Grub doesn't have the driver(s) necessary to use your / (or /boot)
filesystem" or even "Grub doesn't currently support some filesystem
features that are enabled on your / (or /boot) filesystem". Ideally
with the right filesystem name.</p>
<p>This has certainly been coming up and getting forum/etc answers for
long enough. But alas.</p>
</blockquote>
<p>Actually fixing the message to be accurate is difficult because of
how Grub's code is structured. The simplest improvement is to change
the text of the message to "unknown filesystem or filesystem with
unsupported features", which at least hints at the potential issue
(although the message would have to be re-translated into various
languages and so on, so perhaps the Grub developers would be
unenthused).</p>
<p>This message can be produced either by grub-install, running on a
booted system, or by the Grub bootloader code itself, as you boot
the system. Normally it's seen when you run grub-install, which is
somewhat puzzling; how is the filesystem unknown when the kernel
is using it? And why does grub-install care?</p>
<p>When Grub is booting your system, it doesn't (and can't) use the
Linux kernel's filesystem code and device drivers (or any Unix
kernel's code; Grub runs in non-Linux environments as well). At the
same time, Grub wants to read various things from your filesystems,
such as its menu file or your kernel (and on Linux, initramfs). To
do this, Grub has its own collection of <a href="https://git.savannah.gnu.org/gitweb/?p=grub.git;a=tree;f=grub-core/fs">filesystem code</a> and
<a href="https://git.savannah.gnu.org/gitweb/?p=grub.git;a=tree;f=grub-core/disk">software disk drivers</a>,
generally in a collection of loadable (Grub) modules. When grub-install
runs, one of its jobs is to prepare the set of filesystem and disk
driver modules Grub will need at boot time. Its report of "unknown
filesystem" means that it can't find a filesystem module that will
accept the filesystem that you have things on (generally either the
root filesystem or your /boot filesystem, depending on whether /boot
is on its own filesystem).</p>
<p>The specific message is generated in grub_fs_probe() in <a href="https://git.savannah.gnu.org/gitweb/?p=grub.git;a=blob;f=grub-core/kern/fs.c">kern/fs.c</a>.
This function is handed a 'grub device' and runs through grub's
list of known filesystem modules, asking each one of them in turn
if they can handle the filesystem on the 'grub device'. Currently,
filesystem modules return the same error code if the device isn't
their type of filesystem or if it's their type of filesystem but
it has filesystem features that the Grub module doesn't (yet)
support. The filesystem module can set a specific error message
here (in addition to its error code), but grub_fs_probe() doesn't
normally report the per filesystem error messages unless (the right
sort of) debugging is turned on (this can be done in grub-install
with '-vv', although that enables all debugging messages and produces
a lot of output). Instead, if all filesystem modules say they
can't handle the filesystem, grub_fs_probe() reports a generic
"unknown filesystem" error. One level up, <a href="https://git.savannah.gnu.org/gitweb/?p=grub.git;a=blob;f=util/grub-install.c">grub-install.c</a>
calls grub_fs_probe() (in a couple of different places) and
then reports the error message that it's produced (if it failed).</p>
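<p>(If you want to see the per filesystem error messages yourself,
the invocation looks something like the following; the target disk
is an example, and you'll have to fish the relevant lines out of a
lot of other debugging output:)</p>
<blockquote><pre style="white-space: pre-wrap;">
# grub-install -vv /dev/sda 2>&1 | less
</pre>
</blockquote>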
<p>Fixing this to return an exact error message about what's wrong is
at least a little bit tricky and would make the code more complicated.
It also touches a relatively critical piece of Grub, since this
code is also run during boot (and must properly accept the filesystem
then). So I suspect the most that Grub developers would do is change
the message to a longer version that mentions the possibility of
feature flag mismatches.</p>
</div>
Why grub-install can give you an "unknown filesystem" error2024-02-26T21:43:53Z2023-12-20T03:44:41Ztag:cspace@cks.mef.org,2009-03-24:/blog/programming/GoKeepsConstantVariablescks<div class="wikitext"><p>Recently I wrote about <a href="https://utcc.utoronto.ca/~cks/space/blog/programming/GoPartialIfdefWithConsts">partially emulating #ifdef with build tags
and consts</a>, exploiting Go's support for
<a href="https://en.wikipedia.org/wiki/Dead-code_elimination">dead code elimination</a>, and I said
that this technique didn't work with variables. That's actually a
somewhat interesting result. To see why, let's start with a
simple Go program, where the following code is the entire program:</p>
<blockquote><pre style="white-space: pre-wrap;">
package main

import "fmt"

var doThing bool

func main() {
	fmt.Println("We may or may not do the thing.")
	if doThing {
		fmt.Println("We did the thing.")
	}
}
</pre>
</blockquote>
<p>Here, '<code>doThing</code>' is a boolean variable that is left at a zero value
(false), and isn't exported on top of being in the 'main' package.
There's nothing in the Go specification that allows the false value
of '<code>doThing</code>' to ever change. Despite this, if you inspect the
resulting code the <code>if</code> and its call to '<code>fmt.Println()</code>' are still
present. If you go in with <a href="https://github.com/go-delve/delve">a debugger</a>
and manually set <code>doThing</code> to true, this code will run.</p>
<p>If you feed a modern C compiler a similar program, with '<code>doThing</code>'
declared as a static int, what you get back is code that has optimized
out the code guarded by '<code>doThing</code>'. The C compiler knows that the
rules of <a href="https://utcc.utoronto.ca/~cks/space/blog/programming/CAsAbstractMachine">the C abstract machine</a> don't permit
'<code>doThing</code>' to change, so it has optimized accordingly. Functionally
your '<code>static int doThing;</code>' is now a constant, so the C compiler
has then proceeded to do <a href="https://en.wikipedia.org/wiki/Dead-code_elimination">dead code elimination</a>. The C compiler
doesn't care that you could, for example, go in with a debugger and
want to change the value of '<code>doThing</code>', because the existence of
debuggers is not included in the C abstract machine.</p>
<p>(This focus of C optimization on the C abstract machine and nothing
beyond it is somewhat controversial, to put it one way.)</p>
<p>Go could have chosen to optimize this case in the same way as C
compilers do, but for whatever reasons the Go developers didn't
choose to do so. One possible motivation to not do this is the
case of debuggers, where you can manually switch '<code>doThing</code>' on at
runtime. Another possible motivation is simply to speed up compiling
Go code and to keep the compiler simpler. A C compiler needs a
certain amount of infrastructure so that it knows that the static int
'<code>doThing</code>' never has its value changed, and then to propagate that
knowledge through code generation; Go doesn't.</p>
<p>Well actually that's a bit of a white lie. The normal Go toolchain
doesn't do all of this with these constant variables, but there's
also <a href="https://go.dev/doc/install/gccgo">gccgo</a>, a Go implementation
that's a frontend for GCC (alongside C, C++, and some others).
Since gccgo is built on top of GCC, it can inherit all of GCC's C
focused optimizations, such as recognizing constant variables, and
if you invoke gccgo with the optimization level high enough, <a href="https://godbolt.org/z/zY4dYn7sf">it
will optimize the '<code>doThing</code>' guarded expression out just like C</a> (this omits the first call to
fmt.Println to make the generated code slightly clearer).</p>
<p>(There have been some efforts to build a Go toolchain based on
<a href="https://llvm.org/">LLVM</a>, and I'd expect such a toolchain to
also optimize this Go code the way gccgo does.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/programming/GoKeepsConstantVariables?showcomments#comments">One comment</a>.) </div>In Go, constant variables are not used for optimization2024-02-26T21:43:52Z2023-12-19T02:09:55Ztag:cspace@cks.mef.org,2009-03-24:/blog/sysadmin/PrometheusGroupLeftAndRightNotescks<div class="wikitext"><p>I'll start with the motivating story. Suppose, <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/UsingBindNowForResolvers">not hypothetically</a>, that you have some Bind nameservers and
<a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusGrafanaSetup-2019">a Prometheus environment</a>, so you're
monitoring those nameservers with <a href="https://github.com/prometheus-community/bind_exporter">the Bind exporter</a>. One thing
the Bind exporter does is provide the DNS SOA serial number for
every zone Bind is configured to be a primary or a secondary for.
If you have a primary and some internal secondaries (as we do),
you'd like to be sure that your secondaries have the same DNS SOA
serial numbers as your primary does. Writing an alert expression
for this requires using one of <a href="https://prometheus.io/docs/prometheus/latest/querying/basics/">PromQL</a>'s
matching operators for <a href="https://prometheus.io/docs/prometheus/latest/querying/operators/#many-to-one-and-one-to-many-vector-matches">many-to-one matching</a>,
since you have more than one secondary and one primary. However,
speaking from recent personal experience, it's surprisingly easy
to gloss over the details of the expression you want, especially
if you start out with only one secondary. Since I've now stubbed
my toes on this repeatedly, I'm going to write down in one spot the
matrix of possibilities.</p>
<p>To save my future self some reading, here is the actual matrix
that's explained in the rest of this entry, with the note that
labels normally come from the 'many' side, whichever that is.</p>
<table class="wikitable" border="1" cellpadding="4"><tr><td valign="top">extra labels?</td>
<td valign="top">'many' on the left side</td>
<td valign="top">'many' on the right side</td>
</tr>
<tr><td valign="top">none from 'one' side</td>
<td valign="top">group_left(notpresent)</td>
<td valign="top">group_right(notpresent)</td>
</tr>
<tr><td valign="top">some from 'one' side</td>
<td valign="top">group_left(label1, …)</td>
<td valign="top">group_right(label1, …)</td>
</tr>
</table>
<p>The 'notpresent' can be any label name that's not actually present;
<a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusGoodDownExporterAlert">I use 'notpresent' for clarity</a>.
When adding extra labels, you don't include (and can't include) any
that you've used in an '<code>on()</code>', since those are not extra; they're
already known to be the same between both sides.</p>
<p>First off, your choice between <code>group_left()</code> and <code>group_right()</code>
is determined by which side is the 'many' side. If the left side
is the many side, you use <code>group_left()</code>; if the right side
is the many side, you use <code>group_right()</code>. Often the choice of
which side is which will be determined by which side's value you
want to use, because the value is the one thing you can't take from
the other side. If you get the side wrong (or you could say the
direction of the match wrong), you get the classical error:</p>
<blockquote><p>Error executing query: found duplicate series for the match group
[...] on the left hand-side of the operation: [...] many-to-many
matching not allowed: matching labels must be unique on one side</p>
</blockquote>
<p>(This is an error from when I used group_right() and should
have used group_left(). It will say 'right hand-side' if it's
the other way around. If you start out with what's currently a one
to one match (because you only have one DNS resolver running Bind
so far), you can have this error lurk unnoticed for a while.)</p>
<p>In my DNS SOA serial number alert, the 'many' side is the left hand
side because I want the alert to include the incorrect SOA serial
that the DNS secondary has. In <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusStaleMetricsOverTime">a much earlier alert on disk space
that used group_right()</a>, the
many side was the right hand side, because I wanted the alerts about
low space on filesystems to mention the filesystem's current space
(a 'one' metric) instead of the alert level for who was getting
alerted (a 'many' metric when <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusGroupLeftHack">joined against the filesystem</a>).</p>
<p>The second choice is whether you want any extra labels from the
'one' side. With group_left() this is the right side, and with
group_right() it's the left side. In theory this sounds symmetric,
but in practice it's not, because if you're forced to use
group_right(), by itself <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusAlertsWhereLabels">your alert labels won't come from
the metric whose value generated the alert</a>.
The value comes from the left side metric, but by default all the
labels will come from the right side metric and you'll have to
explicitly pull in all of the left side labels you may care about
for generating alert messages.</p>
<p>(If you're <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusPullingInLabels">using group_*() in order to pull in extra labels
from the right hand side of a one to one match</a>,
this is why you want to use group_left() instead of group_right();
it automatically preserves all of the labels of your left side metric.)</p>
<p>Pulling in labels from the 'one' metric provides an opportunity to
make an interesting mistake, which I've done in our Bind DNS SOA
serial alert. Suppose that you start off with the wrong group_*()
operator, but it works because you currently only have one metric
set from your one DNS resolver running Bind. In this case, the
labels will be wrong, so you'll stick them in from the other side:</p>
<blockquote><pre>
bind_zone_serial{..} != on (zone_name) \
group_right(host, instance, ...) \
bind_zone_serial{ host="primary", view="internal" }
</pre>
</blockquote>
<p>When you bring up your second DNS resolver running Bind, this will give
you the error from above, and you may react by switching to the other
group_*() operator. This will give you a different error:</p>
<blockquote><p>Error executing query: multiple matches for labels: grouping
labels must ensure unique matches.</p>
</blockquote>
<p>This error is happening because you overwrote the unique labels
from the 'many' side with labels from the 'one' side, which after
many to one matching aren't necessarily unique any more. If both
of your DNS resolvers have the wrong SOA for some zone (or you
flipped the '!=' to '==' to test the alert), this gives you non-unique
labels in the time series generated. This error took me some time to
understand when I made it.</p>
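<p>(For the record, the corrected version of the example above takes
roughly this shape; it's a sketch with the same invented labels, and
since the secondaries are now the 'many' left side, their labels are
preserved without having to list them:)</p>
<blockquote><pre>
bind_zone_serial{..} != on (zone_name) \
    group_left(notpresent) \
    bind_zone_serial{ host="primary", view="internal" }
</pre>
</blockquote>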
<p>This is also why the labels come from the 'many' side, instead of
always coming from the left side, like the value. Only the 'many'
side is guaranteed to produce unique labels across all of the series
produced.</p>
</div>
Prometheus's <code>group_left()</code> and <code>group_right()</code> operators2024-02-26T21:43:53Z2023-12-18T03:50:21Ztag:cspace@cks.mef.org,2009-03-24:/blog/sysadmin/OIDCThreeEmailAddressescks<div class="wikitext"><p>One of the popular forms of web Single Sign On (SSO) systems is
<a href="https://en.wikipedia.org/wiki/OpenID#OpenID_Connect_(OIDC)">OpenID Connect (OIDC)</a>. OIDC
has multiple components and is normally used with email addresses,
or at least things that look like them, in the form of '<user>@<domain>'.
Since there are multiple components, it's possible for components
to not agree on these 'email addresses'. If you set up a proper
<a href="https://utcc.utoronto.ca/~cks/space/blog/web/MappingOutSSOAuthentication">OIDC Identity Provider</a> using
your proper email addresses, you probably won't have to worry about
this, because everything will be handled by your proper software and
likely fit together nicely. If you're quickly assembling <a href="https://mastodon.social/@cks/111525389654999012">a discount
OIDC environment</a>,
you may wind up stumbling over this.</p>
<p>The first OIDC email address is the address that people will put
into the 'your identity' box in whatever website wants to use OIDC
authentication against your OIDC IdP. In normal OIDC usage, the
domain part of this address will be used to do a <a href="https://en.wikipedia.org/wiki/WebFinger">WebFinger</a> query, which will be
expected to return information for <a href="https://www.rfc-editor.org/rfc/rfc7033#section-3.1">OIDC Identity Provider
discovery</a>.
Many OIDC applications probably also expect to be able to send
email to these email addresses.</p>
<p>This means that if the email address you're using for OIDC is
'fred@test.example.org', there must be a 'test.example.org' HTTPS
web server that answers WebFinger requests. If you want to use
different OIDC IdPs with different web applications for some reason,
you're going to need a different (sub)domain and a different virtual
web server for each of them. Conversely, if you want to use
'fred@example.org' in an OIDC test, you need the 'example.org' web
server to answer WebFinger requests for your test users.</p>
<p>(This need is one factor that may push you to using 'test.example.org'
or things like it in your discount OIDC setup.)</p>
<p>The second OIDC email address is in the JSON that WebFinger is
expected to return:</p>
<blockquote><pre style="white-space: pre-wrap;">
{
"subject" : "acct:fred@test.example.org",
[...]
}
</pre>
</blockquote>
<p>Some or many OIDC applications will expect the 'subject' field of
this JSON to match the email address that the person entered, so
they can be sure they're getting accurate OIDC IdP information for
this account. If you accidentally make your discount WebFinger
implementation return some other email address, you lose. This means
you can't just return a single static WebFinger result for all
(valid) users, since the OIDC 'email address' must be correct and
it will be different for each person.</p>
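<p>(A minimal sketch of a full WebFinger reply for OIDC IdP discovery
might look like the following; the IdP URL is invented, while the
'rel' value is the one OIDC discovery specifies:)</p>
<blockquote><pre style="white-space: pre-wrap;">
{
  "subject" : "acct:fred@test.example.org",
  "links" : [
    {
      "rel" : "http://openid.net/specs/connect/1.0/issuer",
      "href" : "https://idp.test.example.org"
    }
  ]
}
</pre>
</blockquote>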
<p>The third OIDC email address is in the information that your OIDC
IdP will probably return (<a href="https://connect2id.com/learn/openid-connect">cf</a>). How your OIDC IdP
gets this information itself can vary, but if you're using <a href="https://en.wikipedia.org/wiki/Lightweight_Directory_Access_Protocol">LDAP</a>, your
IdP will probably try to get people's email addresses from that.
If you've set up your IdP with its own LDAP server, you can give
people whatever email domain you want them to have. However, if
you're reusing your organization's normal LDAP server, it will
probably tell you that people have their regular organizational
email address, which is perhaps not 'fred@test.example.org'. Your
OIDC IdP may or may not provide you a way to remap people's email
addresses or manually set and override them.</p>
<p>(OIDC applications pretty much have to verify that the email address
they got back from your OIDC IdP matches what you put in, because
this is their protection against you claiming to be one person but
authenticating to your OIDC IdP as another one.)</p>
<p>I'm writing this down because when I set up <a href="https://mastodon.social/@cks/111525389654999012">my discount OIDC
environment</a> I
stubbed my toes on each of these additional two places. I first
got the email addresses wrong in WebFinger results, and then in
the OIDC IdP results (which were initially just taking the email
address from our regular LDAP servers).</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/OIDCThreeEmailAddresses?showcomments#comments">2 comments</a>.) </div>The three email addresses of OpenID Connect (OIDC) in practice2024-02-26T21:43:53Z2023-12-17T03:12:10Ztag:cspace@cks.mef.org,2009-03-24:/blog/web/WellKnownQueriesAgainstUscks<div class="wikitext"><p><a href="https://en.wikipedia.org/wiki/WebFinger">WebFinger</a> is a general
web protocol for obtaining various sorts of information about
'people' and things, including <a href="https://www.rfc-editor.org/rfc/rfc7033#section-3.1">someone's OpenID Connect (OIDC)
identity provider</a>.
For example, if you want to find things out about 'brad@example.org',
you can make a HTTPS query to example.org for
/.well-known/webfinger?resource=acct%3Abrad%40example.org and see
what you get back. WebFinger is on my mind lately as part of <a href="https://utcc.utoronto.ca/~cks/space/blog/web/MappingOutSSOAuthentication">me
dealing with OIDC and other web SSO stuff</a>,
so I became curious to see if people out there (ie, spammers) were
trying to use it to extract information from <a href="https://support.cs.toronto.edu/">us</a>.</p>
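<p>(Turned into a command line, such a query looks like this:)</p>
<blockquote><pre style="white-space: pre-wrap;">
$ curl -s 'https://example.org/.well-known/webfinger?resource=acct%3Abrad%40example.org'
</pre>
</blockquote>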
<p>As we can see, WebFinger is just one of a number of things that use
'/.well-known/<something>'; another famous one is <a href="https://letsencrypt.org/">Let's Encrypt</a>'s HTTP based challenge (<a href="https://letsencrypt.org/docs/challenge-types/">HTTP-01</a>), which looks for
/.well-known/acme-challenge/<TOKEN> (over HTTP, not HTTPS, although
I believe it accepts HTTP to HTTPS redirects). So I decided to look
for general use of /.well-known/ to see what came up, and to my
surprise there was rather more than I expected.</p>
<p>The official registry for this is <a href="https://www.iana.org/assignments/well-known-uris/well-known-uris.xhtml">Well-Known URIs</a> at
<a href="https://www.iana.org/">IANA</a>. On the web server for our normal
email domain (which is not <a href="https://www.cs.toronto.edu/">our web server</a>),
by far the most common query was for '/.well-known/carddav', documented
in <a href="https://www.rfc-editor.org/rfc/rfc6764.html">RFC 6764</a>. After
that I saw some requests for '/.well-known/openpgpkey/policy', which
is covered <a href="https://wiki.gnupg.org/WKDHosting">here</a> and less
clearly <a href="https://datatracker.ietf.org/doc/html/draft-koch-openpgp-webkey-service">here</a>,
but which isn't an officially registered thing yet. Then there were
a number of requests for '/.well-known/traffic-advice' from "Chrome
Privacy Preserving Prefetch Proxy". This too isn't officially
registered and is sort of documented <a href="https://github.com/buettner/private-prefetch-proxy/blob/main/traffic-advice.md">here</a>
(and <a href="https://buettner.github.io/private-prefetch-proxy/traffic-advice.html">here</a>),
<a href="https://webmasters.stackexchange.com/questions/138033/what-is-well-known-traffic-advice-directory">in this question and answers</a>,
and in <a href="https://guillermodlpa.com/blog/traffic-advice-well-known-file-stop-404-errors-nextjs-app">this blog entry</a>.
Apparently this is a pretty recent thing, probably dating from
August 2023. Somewhat to my surprise, I couldn't see any use of
WebFinger across the past week or so.</p>
<p>On <a href="https://www.cs.toronto.edu/">our actual web server</a>, the picture
is a bit different. The dominant query is for '/.well-known/traffic-advice',
and then after that we get what look like security probes for several URLs:</p>
<blockquote><pre style="white-space: pre-wrap;">
/.well-known/class.api.php
/.well-known/pki-validation/class.api.php
/.well-known/pki-validation/cloud.php
/.well-known/pki-validation/
/.well-known/acme-challenge/class.api.php
/.well-known/acme-challenge/atomlib.php
/.well-known/acme-challenge/cloud.php
/.well-known/acme-challenge/
/.well-known/
</pre>
</blockquote>
<p>(Although '/.well-known/pki-validation' is a registered Well-Known
URI, I believe this use of it is as much of a security probe as the
pokes at acme-challenge are.)</p>
<p>There was a bit of use of <a href="https://github.com/google/digitalassetlinks/blob/master/well-known/specification.md">'/.well-known/assetlinks.json'</a>
and <a href="https://www.rfc-editor.org/rfc/rfc9116.html">'/.well-known/security.txt'</a>, and a long tail of
other things, only a few of them registered (and some of them possibly
less obviously malicious than people looking for '.php' URLs).</p>
<p>(We did see some requests for <a href="https://wiki.mozilla.org/Thunderbird:Autoconfiguration">Thunderbird's
'/.well-known/autoconfig/mail/config-v1.1.xml'</a>, which
perhaps we should support, although writing and validating a
configuration file looks somewhat complicated.)</p>
<p>There weren't that many requests overall, which isn't really surprising
given that we <a href="https://en.wikipedia.org/wiki/HTTP_404">HTTP 404'd</a>
all of them. What's left is likely to be the residual automation that
blindly tries no matter what and some degree of automated probes from
attackers. I admit I'm a bit sad not to have found any for <a href="https://en.wikipedia.org/wiki/WebFinger">WebFinger</a>
itself, because it would be a bit nifty if attackers were trying to mine
that (or we had people probing for OIDC IdPs, or some other WebFinger use).</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/web/WellKnownQueriesAgainstUs?showcomments#comments">One comment</a>.) </div>What /.well-known/ URL queries people make against our web servers2024-02-26T21:43:53Z2023-12-16T04:05:46Ztag:cspace@cks.mef.org,2009-03-24:/blog/programming/GoPartialIfdefWithConstscks<div class="wikitext"><p>Recently on the Fediverse, <a href="https://mastodon.social/@timbray@cosocial.ca/111570362762318590">Tim Bray wished for #ifdef in Go code</a>:</p>
<blockquote><p>I would really REALLY like to have #ifdef in this Go code I’m
working on - there’s this fairly heavyweight debugging stuff that
I regularly switch in to chase a particular class of problems, but
don’t want active in production code. #ifdef would have exactly the
right semantics. Yeah, I know about tags.</p>
</blockquote>
<p>Thanks to modern compiler technology in the Go toolchain, we can
sort of emulate #ifdef through the use of build tags combined with
some other tricks. How well the emulation works depends on what you
want to do; for some things it's almost perfect and for other things
it's going to be at best awkward.</p>
<p>The basic idea is to take advantage of build tags combined with
<a href="https://en.wikipedia.org/wiki/Dead-code_elimination">dead code elimination (DCE)</a>. We'll use
tagged files to define a constant, say <code>doMyDebug</code>, to either <code>true</code>
or <code>false</code>:</p>
<blockquote><pre style="white-space: pre-wrap;">
$ cat ifdef-debug.go
//go:build !myrelease

package ...

const doMyDebug = true

$ cat ifdef-release.go
//go:build myrelease

package ...

const doMyDebug = false
</pre>
</blockquote>
<p>Now you can use '<code>if doMyDebug { .... }</code>' in your regular code as
a version of #ifdef. The magic of dead code elimination in Go will
remove all of your conditional debugging code if you build with a
'myrelease' tag and so define '<code>doMyDebug</code>' as false. Go's dead
code elimination is smart enough to eliminate not just the code
itself but also any data (such as strings) that's only used by that
code, any functions called only by that code (directly or indirectly),
any data used only by those functions, and so on (although none of
this can be exported from your package).</p>
<p>This works fine for one #ifdef equivalent. It works less fine if
you want a number of them, all controlled independently, because
then you need two little files per flag, which makes the clutter
add up fast. You can confine the mess by creating an internal package
to hold all of them, say 'internal/ifdef', and then importing it
in the rest of your code and using '<code>if ifdef.DoMyDebug { ... }</code>'
(the name has to be capitalized since now it has to be exported).</p>
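<p>(A minimal sketch of that internal package, with invented file
names and following the same build tag scheme as before:)</p>
<blockquote><pre style="white-space: pre-wrap;">
$ cat internal/ifdef/debug.go
//go:build !myrelease

package ifdef

const DoMyDebug = true

$ cat internal/ifdef/release.go
//go:build myrelease

package ifdef

const DoMyDebug = false
</pre>
</blockquote>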
<p>Where this starts to not work so well is if you want to do more
than put debugging code into functions. A semi-okay case is if you
want to keep some completely separate additional data (in separate
data structures) when debugging is turned on. Here your best bet
is probably to put all of the definitions and functions in your
conditionally built '<code>ifdef-debug.go</code>', with function stubs in
<code>ifdef-release.go</code>, and call the necessary functions from your
regular code. You don't need to make these calls conditional; Go
is smart enough to inline and then erase function calls to empty
functions (or functions that in non-debugging mode return a constant
'all is okay' result, and then it will DCE the never-taken code
branch). This requires you to keep the stub versions of the functions
in sync with the real versions; the more such functions you have
the worse things are probably going to be.</p>
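<p>(A sketch of what such a real function and stub pair might look
like; 'debugNote' and 'debugEvents' are invented names:)</p>
<blockquote><pre style="white-space: pre-wrap;">
$ cat ifdef-debug.go (excerpt)
var debugEvents = map[string]int{}

func debugNote(what string) {
	debugEvents[what]++
}

$ cat ifdef-release.go (excerpt)
// The empty stub; calls to it inline to nothing and are erased.
func debugNote(what string) {}
</pre>
</blockquote>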
<p>Probably the worst case is if you want to conditionally augment
some of your data structures with extra fields that are only present
(and used) when debugging is defined on. If you can have the fields
always present but only touched when debugging is defined on, this
is relatively straightforward (since we can protect all the code
with '<code>if ...</code>' and then let DCE eliminate it all when debugging
is off). However, if you want the fields to be completely gone with
debugging off (so that they don't take up any memory and so on),
then life is at best rather awkward. In the straightforward version
you need to duplicate the definition of these structures in equivalents
of <code>ifdef-debug.go</code> and <code>ifdef-release.go</code>, and only access the
extra debugging fields through functions that are also in
<code>ifdef-debug.go</code> (and stubbed out in <code>ifdef-release.go</code>). This will
probably significantly distort your code structure and make things
harder to follow and more error prone.</p>
<p>A less aesthetic version of adding extra data to data structures
only when debugging is on is to put all of the debugging data fields
into a separate struct type, and then put an instance of the entire
struct type in your main data structure. For example:</p>
<blockquote><pre style="white-space: pre-wrap;">
type myCoreType struct {
	[...]
	dd extraDebugData
	[...]
}
</pre>
</blockquote>
<p>The real definition of <code>extraDebugData</code> and its fields is in your
<code>ifdef-debug.go</code> file, along with the functions that manipulate it.
Your <code>ifdef-release.go</code> stub file has an empty '<code>struct {}</code>'
definition of extraDebugData (and stub versions of all its functions).
Note that you don't want to put this extra data at the end of your
core struct, because <a href="https://i.hsfzxjy.site/zst-at-the-rear-of-go-struct/">a zero-sized field at the end of a struct
has a non-zero size</a>.
It may also be more difficult to get a minimally-sized myCoreType
structure that doesn't have alignment holes with debugging on,
depending on what debugging fields you're adding. This still has
the disadvantage that you can't manipulate these extra debugging
fields in line with the rest of your code; you have to call out to
separate functions that can be stubbed out.</p>
<p>(The reason for this is that even though the code may never be
executed, Go still requires it to be valid and to not do things
like access struct fields that don't exist.)</p>
<p>A variation of this with extra memory overhead that allows for
inline code is to always define the real <code>extraDebugData</code> struct
but use a pointer to it in <code>myCoreType</code>. Then you can set the pointer
and manipulate its fields through regular code guarded by '<code>if
doMyDebug</code>' (or perhaps '<code>if doMyDebug && obj.dd != nil</code>'), and
have it all eliminated when <code>doMyDebug</code> is constant false. This
creates a separate additional allocation for the <code>extraDebugData</code>
structure in debug mode and means your release builds have an extra
pointer field in <code>myCoreType</code> that's always nil.</p>
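<p>(A sketch of this pointer variation, with the struct fields and
the method invented for illustration:)</p>
<blockquote><pre style="white-space: pre-wrap;">
type extraDebugData struct {
	events int
}

type myCoreType struct {
	// [...]
	dd *extraDebugData // always present, but nil in release builds
}

// With doMyDebug a constant false, this entire body is
// dead code eliminated.
func (c *myCoreType) noteEvent() {
	if doMyDebug && c.dd != nil {
		c.dd.events++
	}
}
</pre>
</blockquote>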
<p>All of this only works with constants for the <code>doMyDebug</code> names,
not with variables, which means you can't inject the 'ifdef' values
on the command line through <a href="https://pkg.go.dev/cmd/link">the Go linker</a>'s
support for setting string variable values with -X. You have to use
build tags and constants in order to get the dead code elimination
that makes this more or less zero cost when you're building a release
version.</p>
<p>(I suggest that you make the default, no tags version of your code
be the one with everything enabled and then specifically set build
tags to remove things. I feel that this is more likely to play well
with various code analysis tools and editors, because by default
(with no tags set) they'll see full information about fields, types,
functions, and so on.)</p>
<p>PS: There are probably other clever ways to do this.</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/programming/GoPartialIfdefWithConsts?showcomments#comments">One comment</a>.) </div>Partially emulating #ifdef in Go with build tags and consts2024-02-26T21:43:52Z2023-12-15T04:20:12Ztag:cspace@cks.mef.org,2009-03-24:/blog/linux/SystemdResolvedSingleNamesDNScks<div class="wikitext"><p>Suppose, not hypothetically, that you use <a href="https://www.freedesktop.org/software/systemd/man/systemd-resolved.service.html">systemd-resolved</a>
and you have a long standing practice of specific DNS search path
so that people can use short domain names. In this environment <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SystemdResolvedNotFor">you
probably need to use systemd-resolved purely through /etc/resolv.conf</a>, and if you do this you may experience an
oddity:</p>
<blockquote><pre style="white-space: pre-wrap;">
$ ping nosuchname
ping: nosuchname: Temporary failure in name resolution
</pre>
</blockquote>
<p>If you try '<code>resolvectl query nosuchname</code>' it will tell you that the name
is not found, but if you directly query the systemd-resolved DNS server at
127.0.0.53 you will see that you get a DNS SERVFAIL response for the bare
name:</p>
<blockquote><pre style="white-space: pre-wrap;">
$ dig a nosuchname. @127.0.0.53
[...]
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 52471
[...]
</pre>
</blockquote>
<p>(You will wind up querying for the bare name when you've exhausted
all of the domains in your DNS search path.)</p>
<p>This is not what a normal DNS server like Unbound will return for
the same query; Unbound will return NXDOMAIN for this query, which
will cause programs like ping to tell you 'Name or service not
known', which is probably what you want. If you know what's going
on you can mentally translate, but you shouldn't have to wonder
whether something is really going wrong.</p>
<p>What is going on here is systemd-resolved's interpretation of how
to behave for DNS queries if <a href="https://www.freedesktop.org/software/systemd/man/latest/resolved.conf.html#ResolveUnicastSingleLabel="><code>ResolveUnicastSingleLabel</code></a>
is unset in your <a href="https://www.freedesktop.org/software/systemd/man/latest/resolved.conf.html">resolved.conf</a>.
How the documentation describes it is:</p>
<blockquote><p>Takes a boolean argument. When false (the default), systemd-resolved
will not resolve A and AAAA queries for single-label names over
classic DNS. [...]</p>
</blockquote>
<p>Since ping's attempts to find the IP address of 'nosuchname'
eventually wind up making a single-label name query to systemd-resolved,
with this setting in its default state systemd-resolved will not
try to resolve this query by sending it to an upstream DNS resolver
(where it would fail). When queried as a DNS server, resolved's
interpretation of 'will not (try to) resolve' is to return SERVFAIL
instead of NXDOMAIN. This is in some sense technically correct, but
it's usually not as useful as returning NXDOMAIN would be (and it's
not how Unbound or Bind behave).</p>
<p>If you have local DNS resolvers that systemd-resolved on your systems
is pointing to, you can safely set <code>ResolveUnicastSingleLabel=yes</code>
to work around this. Systemd-resolved will dutifully send these
queries to your local DNS resolvers, your local DNS resolvers will
NXDOMAIN them, and systemd-resolved will pass this NXDOMAIN back
to you so that ping tells you there's no such host. I'm probably
going to do this on my desktops (and any of <a href="https://support.cs.toronto.edu/">our</a> machines that wind up using
systemd-resolved).</p>
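<p>(The setting goes in the [Resolve] section of resolved.conf or a
drop-in under resolved.conf.d:)</p>
<blockquote><pre style="white-space: pre-wrap;">
[Resolve]
ResolveUnicastSingleLabel=yes
</pre>
</blockquote>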
<p>(A lot of my understanding of this comes from finding and reading
<a href="https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/2024320">Ubuntu systemd bug #2024320</a>
and <a href="https://github.com/systemd/systemd/issues/28310">systemd issue #28310</a>.)</p>
<h3>Sidebar: Some thoughts on SERVFAIL versus NXDOMAIN here</h3>
<p>If you have upstream DNS servers that will actually return something
for A and AAAA queries for single-label names for some (local)
reason, systemd-resolved returning SERVFAIL and ping reporting it
as a 'temporary' failure in name resolution is probably doing you
a favour because it's signalling that something weird is going on
in your (DNS) name resolution. Systemd-resolved returning NXDOMAIN
might lead you to suspect that your upstream DNS servers didn't
have the data you expected them to.</p>
<p>However, this is a rare case. A much more usual case is going to
be what we saw here; you have a DNS search path, you type a name
that you implicitly expect to be in your local domain or not present
at all, and instead of 'name or service not known' because it's not
in your local domain you get some odd 'temporary failure' (one that
doesn't show up if you try to double-check directly with resolvectl).</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SystemdResolvedSingleNamesDNS?showcomments#comments">2 comments</a>.) </div>Why systemd-resolved can give weird results for nonexistent bare hostnames2024-02-26T21:43:53Z2023-12-14T03:37:59Ztag:cspace@cks.mef.org,2009-03-24:/blog/sysadmin/UsingBindNowForResolverscks<div class="wikitext"><p>As part of <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/CSLabNetworkLayout">our local network environment</a>,
we have some local DNS resolvers that people here use (or at least
are supposed to use). These resolvers handle multiple jobs; they
resolve our own normal DNS names (<a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/BinatAndSplitHorizonDNS">or some of them</a>), our internal only DNS names, and handle
all of the recursion for lookups for external names. Originally we
ran these resolvers using Bind on OpenBSD. When OpenBSD stopped
supporting Bind, <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/UnboundZoneRefreshProblem">we switched to a setup using Unbound and NSD</a>. We needed NSD as well as Unbound because
we wanted our resolvers to have a full copy of our local zones, so
they wouldn't need our master DNS server to be up to answer those
names. The local NSD was the authoritative secondary for our DNS
zones, and the local Unbound knew to query it for them.</p>
<p>Unfortunately, we've recently had a variety of problems with this
OpenBSD Unbound configuration that resulted in a series of serious
DNS resolution failures. We tried some configuration shuffles like
<a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/SplittingDNSResolvers">splitting out critical machines to a dedicated DNS resolver</a> and <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/DNSResolversAndIPRatelimits">uncertain ratelimit tuning</a>, but we weren't happy with them and
didn't really have confidence that they'd solve our problems. So
to deal with these issues in a way we were more confident with, we
switched over to using Bind on Ubuntu.</p>
<p>(We switched what is usually the less used DNS resolver over to Bind a
week ago, and the more used one today.)</p>
<p>I'm not going to claim that Bind is the right answer for everyone;
in general Unbound is a perfectly fine recursive resolver and I run
it on my own machines (usually without problems). The advantage of
Bind in our environment is that Bind has solid support for combining
recursive DNS resolution with being an authoritative secondary for
some zones, and we know how to configure this so that it works (and
interacts smoothly with our Bind-based stealth master DNS server).
The one area that Bind falls short in is ratelimits that are focused
on recursive resolvers instead of authoritative servers, but we put
a test Bind install through load tests and it held up fine (under
conditions that had generally caused our Unbound servers to stop
responding).</p>
<p>(Unbound originally had no support for acting as a secondary this
way. Current versions of Unbound appear to have support for some
form of it, but every time I read the "Authority Zone Options"
section of <a href="https://nlnetlabs.nl/documentation/unbound/unbound.conf/">unbound.conf(5)</a> my head
hurts and I'm left uncertain about what settings we'd actually want
to set. We know exactly how to set up Bind to do what we want.)</p>
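<p>(Schematically, the Bind configuration for this combination looks
something like the following; the ACL, zone name, IP address, and
file path are all invented for illustration:)</p>
<blockquote><pre style="white-space: pre-wrap;">
options {
	recursion yes;
	allow-recursion { 192.0.2.0/24; };
};

zone "example.org" {
	type slave;
	masters { 192.0.2.1; };
	file "slaves/example.org";
};
</pre>
</blockquote>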
<p>Switching to Ubuntu also has some pragmatic advantages, since we
already run a lot of Ubuntu machines and have a lot of tools for
dealing with them, including monitoring and metrics through <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusGrafanaSetup-2019">our
Prometheus environment</a>. OpenBSD has
only limited support for the Prometheus host agent, never mind
other agents we might want to run. And Ubuntu LTS releases have
longer support periods than OpenBSD nominally does, although
<a href="https://utcc.utoronto.ca/~cks/space/blog/unix/OpenBSDSupportPolicyResults">OpenBSD's short support periods mostly don't matter to us</a>.</p>
<p>(Even if we switch back to Unbound someday, I suspect that we might
well run Unbound on Ubuntu instead of returning to OpenBSD. Our
usage of OpenBSD is slowly but steadily shrinking down to mostly
firewalls, where PF is still by far our favorite firewall system.)</p>
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/UsingBindNowForResolvers?showcomments#comments">4 comments</a>.) </div>We've switched (back) to using Bind for our local DNS resolvers2024-02-26T21:43:53Z2023-12-13T03:54:20Ztag:cspace@cks.mef.org,2009-03-24:/blog/web/WebProbeSpeedNewTLSCertificatecks<div class="wikitext"><p>For <a href="https://utcc.utoronto.ca/~cks/space/blog/web/MappingOutSSOAuthentication">reasons outside the scope of this entry</a>
I spent some time today setting up a new Apache-based web server.
More specifically, I spent some time setting up a new virtual host
on a web server I'd set up <a href="https://mastodon.social/@cks/111547061035518709">last Friday</a>. Of course this
virtual host had a TLS certificate, or at least was going to once
I had Let's Encrypt issue me one. Some of the time I'm a little
ad-hoc with the process of setting up a HTTPS site; I'll start out
by writing the HTTP site configuration, get a TLS certificate issued,
edit the configuration to add in the HTTPS version, and so on. This
can make it take a visible amount of time between the TLS certificate
being issued, and thus appearing in <a href="https://en.wikipedia.org/wiki/Certificate_Transparency">Certificate Transparency logs</a>, and there
being any HTTPS website that will respond if you ask for it.</p>
<p>This time around I decided to follow a new approach and pre-write
the HTTPS configuration, guarding it behind an Apache <IfFile> check
for the TLS certificate private key. This meant that I could activate
the HTTPS site pretty much moments after Let's Encrypt issued my
TLS certificate. I also gave this new virtual host its own set of
logs, in fact two sets, one for the HTTP version and one for the
HTTPS version. Part of why I did this is because I was curious how
long after I got a TLS certificate it would be before people showed
up to probe my new HTTPS site.</p>
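<p>(Schematically, the pre-written HTTPS configuration looks something
like this, assuming an Apache recent enough to have <IfFile> and with
the names and paths invented for illustration:)</p>
<blockquote><pre style="white-space: pre-wrap;">
<IfFile /etc/letsencrypt/live/www.example.org/privkey.pem>
<VirtualHost *:443>
    ServerName www.example.org
    SSLEngine on
    SSLCertificateFile /etc/letsencrypt/live/www.example.org/fullchain.pem
    SSLCertificateKeyFile /etc/letsencrypt/live/www.example.org/privkey.pem
    [...]
</VirtualHost>
</IfFile>
</pre>
</blockquote>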
<p>(It's well known by now that all sorts of people monitor Certificate
Transparency logs for new names to probe. These days CT logs also
make new entries visible quite fast; it's easily possible to monitor
the logs in near real time. My own monitoring, which is nowhere
near state of the art, was mailing me less than five minutes after
the certificate was issued.)</p>
<p>If you've ever looked at this yourself, you probably know the answer.
It took roughly a minute before the first outside probes showed up
(from a 'leakix.org' IP address). Interestingly, this also provoked
some re-scans of the machine's first HTTPS website, which had been
set up Friday (and whose name was visible in, for example, the IP
address's reverse mapping). These scans were actually more thorough
than the scans against the new HTTPS virtual host. The HTTP versions
of both the base name and the new virtual host were also scanned at
the same time (again, the base version more thoroughly than the new
virtual host).</p>
<p>Our firewall logs suggest that the machine was getting hit with a
higher rate of random connections than before the TLS certificate
was issued, along with at least one clear port scan against assorted
TCP ports. This clear port scan took a while to show up, only
starting about twenty minutes after the TLS certificate was issued
(an eternity if you're trying to be the one who compromises a newly
exposed machine before it's fixed up).</p>
<p>At one level none of this is really surprising to me; I knew this
sort of stuff happened and I knew it could happen rapidly. At another
level there's a difference between knowing it and watching your
logs as it happens live in front of you.</p>
</div>
Seeing how fast people will probe you after you get a new TLS certificate2024-02-26T21:43:53Z2023-12-12T03:17:58Ztag:cspace@cks.mef.org,2009-03-24:/blog/sysadmin/GrafanaLokiLogcliNotescks<div class="wikitext"><p>One of the pieces of <a href="https://grafana.com/oss/loki/">Grafana Loki</a>,
sometimes <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/GrafanaLokiSimpleNotRecommended">misleadingly described as 'Prometheus for logs'</a>, is <a href="https://grafana.com/docs/loki/latest/query/logcli/"><code>logcli</code></a>, an all purpose
command line program for querying Loki in various ways. Some of
what it can do is mostly of interest to Loki administrators, but
it has two major sub-commands for making <a href="https://grafana.com/docs/loki/latest/query/">LogQL</a> queries for either
<a href="https://grafana.com/docs/loki/latest/query/log_queries/">logs</a>
or <a href="https://grafana.com/docs/loki/latest/query/metric_queries/">metrics</a>.
I recently <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/GrafanaLokiFastFlexibleSearching">wrote a script that dealt with logcli</a> and in the process I learned
some things I want to write down for future use, although by the
time I use them Loki may have changed some of them.</p>
<p>Logcli has two sub-commands for log queries, '<code>logcli query</code>' for
queries over time and '<code>logcli instant-query</code>' for instant queries.
Although it is technically possible to make <a href="https://grafana.com/docs/loki/latest/query/metric_queries/">metrics</a> queries
with '<code>logcli query</code>', you will normally use '<code>logcli instant-query</code>'
for this. Despite what logcli's help will tell you, instant queries
will only output some form of JSON; you can't get their results in
tabular form for text presentation, and you'll need to use, for
example, <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/JqFormattingTextNotes">jq's options for text output</a>.
Instant queries are made at some instant in time (the '--now'
argument) and the metrics query itself will normally use some
'<code>*_over_time</code>' operator with a duration. If you start out with
a start time and an end time in a script, deriving the duration may
involve <a href="https://mastodon.social/@cks/111489168616847343">GNU Date crimes</a>.</p>
<p>To get log lines themselves, you start with '<code>logcli query</code>'. For
time ranges, you can give it either --from and --to together or
'--since DUR', which is implicitly relative to 'now'. The largest
time duration modifier LogQL and logcli accept is hours; if you
want to query over days or weeks, you get to convert that into hours
yourself. Because a <a href="https://grafana.com/docs/loki/latest/query/">LogQL</a> query searches for a single thing in
a single set of logs, if you want to get multiple sorts of logs,
for example both SSH logins and IMAP logins, you'll need to run
'logcli' twice, each time with a separate query. If you want to put
things in time order regardless of what query the log line came
from, '<code>sort -V</code>' is your brute force tool (in combination with
some option to force the log lines to be presented as a single line
with the timestamp first).</p>
<p>(Also, 'logcli query' defaults to printing log lines in reverse
time order, so you probably want 'logcli query --forward ...',
unless you're already using 'sort -V' and don't care.)</p>
<p>By default, 'logcli query' (silently) limits the output to 30 log
entries. If you use '--limit 0', logcli issues multiple requests
to Loki, each one asking for '--batch' log entries (1000 by default)
and working out the time range that needs to be covered by the
query. You can see this if you look carefully at the queries that
logcli reports, and it's sort of covered in the logcli documentation
for <a href="https://grafana.com/docs/loki/latest/query/logcli/#batched-queries">batched queries</a>.
However, even with '--limit 0' logcli (and perhaps Loki) will have
problems reporting all of the log lines over a long enough time
interval. To get around this you seem to need to use 'logcli query'
parallelization, which is currently documented only in 'logcli help
query' and then only vaguely (this is a Loki tradition). The
simplest way to use query parallelization is to use 'logcli query
--limit 0 --parallel-max-workers N' where N is some reasonable
number like the number of CPUs you have. Apparently this can make
the logs be out of order, which is another reason to put them back
in the right order with 'sort -V'.</p>
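<p>(Putting those pieces together, a full query for a week of logs
might look like the following; the label selector and dates are
invented for illustration:)</p>
<blockquote><pre style="white-space: pre-wrap;">
$ logcli query --forward --limit 0 --parallel-max-workers 4 \
      --from="2023-12-01T00:00:00Z" --to="2023-12-08T00:00:00Z" \
      '{syslog_identifier="sshd"} |= "Accepted"' | sort -V
</pre>
</blockquote>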
<p>(In my traditional Loki experience, I don't really understand what's
going on here and I couldn't find the answers when I poked at the
documentation.)</p>
<p>In theory a LogQL metrics query ought to be more efficient and more
reliable than dumping out the necessary information from the log
lines and then generating the metric yourself. In practice, my
metric queries started failing once the duration got long enough,
so I abandoned doing them in favour of printing the necessary
information from each log line and feeding it through, for example,
'sort | uniq -c | sort -nr'. This also got me out of the business
of reformatting 'logcli instant-query' JSON into something textual.</p>
<p>(Because I was getting my metrics with '<code>sum( count_over_time(...)
) by (...)</code>', it's possible that the inner count_over_time()
had a high label cardinality (although most of them were ignored
by the sum()) and that's what blew up Loki. I don't know, all I
know is that now that I'm working out the metrics myself outside
of Loki, it works.)</p>
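<p>(For concreteness, the failing queries were shaped roughly like the
sketch below, with hypothetical labels substituted in. If a per-IP
'rip' label was the problem, every distinct remote IP becomes a separate
label set inside the count_over_time().)</p>
<pre>
# A sketch of the sort of metrics query that started failing for me
# once the [NNNh] range got long enough; labels are hypothetical.
logcli instant-query --now="2024-02-15T00:00:00Z" \
    'sum( count_over_time(
            {syslog_identifier="dovecot"} |= "imap-login"
              | pattern `&lt;_&gt; rip=&lt;rip&gt;, &lt;_&gt;` [336h]
          ) ) by (rip)'
</pre>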
</div>
<div> (<a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/GrafanaLokiLogcliNotes?showcomments#comments">One comment</a>.) </div>Some notes on using the <code>logcli</code> program to query Grafana Loki2024-02-26T21:43:53Z2023-12-11T04:18:17Ztag:cspace@cks.mef.org,2009-03-24:/blog/sysadmin/GrafanaLokiFastFlexibleSearchingcks<div class="wikitext"><p>One of the ways <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/OurDifferentSysadminEnvironment">our environment is different from usual ones</a> is that we have a bunch of different
systems and services that lots of people log in to. We have a long
standing <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/CentralizeSyslog">central syslog server</a> that collects
syslog logs from all of our Linux servers, and one of the things
we've long used it for is to search for all of the recent logins
across our environment for a particular person. We don't do this
all that often but we do it often enough that we have a script for
it (which basically boils down to grep with the right patterns).</p>
<p>We also have a <a href="https://grafana.com/oss/loki/">Grafana Loki</a> server.
For all that <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/GrafanaLokiSimpleNotRecommended">I'm not entirely happy with Loki and can't recommend
it at our small scale</a>, I do like
using Loki <a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/GrafanaLokiWhatILikeItFor">for some things</a>. One of
the things that Loki is especially good at is narrow log searches,
where you want to look at some specific logs in some specific time
period. Recently, I decided to take our central syslog 'find all
logins for a person' script and re-do a version of it that used
Loki and was hopefully both easier to restrict to a narrow time
range and perhaps faster.</p>
<p>(When the dust settled, supporting narrow time ranges required <a href="https://mastodon.social/@cks/111489168616847343">GNU
Date crimes</a>.)</p>
<p>On the one hand, this wasn't as straightforward as I was hoping it
would be, mostly because of peculiar limitations of how <a href="https://grafana.com/docs/loki/latest/query/logcli/">logcli</a> behaves (it's
the Loki command line tool for making log queries, so any script
like this is going to reach for it as the first option). And <a href="https://grafana.com/docs/loki/latest/query/">LogQL</a> limitations forced
me to make multiple queries to Loki instead of rolling everything
into a single one, which made me do more work in the script to
present log lines in time order across the different services.</p>
<p>On the other hand, the result works, and because I was working with
<a href="https://grafana.com/docs/loki/latest/query/">LogQL</a> it was straightforward to reformat some of the information
into more useful forms by default (for example, defaulting to
summarizing sources of IMAP logins rather than reporting each login).
This reformatting was made easier by LogQL limitations forcing me
into those separate queries; since I was only getting one sort of
information from each query, it was easy to have LogQL's straightforward
pattern matching pull out just the information I was looking for
(usually the remote IP address) and report it.</p>
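<p>To illustrate what I mean, a single-purpose query for SSH logins can
shape its pattern match to exactly that kind of log line. The sketch
below assumes the usual OpenSSH 'Accepted' message and made-up labels;
depending on how your logs get into Loki, there may be a syslog prefix
in front of 'Accepted' that the pattern would have to account for.</p>
<pre>
# Pull just the auth method, user, and remote IP out of SSH logins.
logcli query --forward --limit 0 --since 168h -o raw \
    '{syslog_identifier="sshd"}
       |= "Accepted"
       | pattern `Accepted &lt;method&gt; for &lt;user&gt; from &lt;ip&gt; port &lt;_&gt;`
       | line_format "{{.user}} logged in from {{.ip}}"'
</pre>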
<p>(Recasting the syslog script (which was at its heart a giant 'grep'
with a big set of patterns) into a script that made separate queries
for each sort of information also made it easy to be selective about
what information it was reporting. If we only want SSH logins, well,
now that's easy.)</p>
<p>I haven't timed the Loki based script against our original version,
but in practice it's basically guaranteed to be faster in many
cases simply because it's easier to use a shorter time range in the
new script, or only look for certain sorts of logins instead of all
of them. Our syslog script uses a large time range by default, which
was right for some uses but not for many others, and it was
sufficiently painful and obscure to change that we mostly didn't.
The Loki script accepts easy to use time arguments and defaults to
a much smaller (and more accurate) time range.</p>
<p>(In theory the Loki based script should be faster because even if
Loki's decompression and searching isn't as fast as gzip and grep,
it's searching far fewer logs since I'm being narrowly selective
in log labels. But I haven't tried to specifically time it, and it
also does somewhat more than the syslog script because it has access
to some non-syslog log data. In practice the Loki based script runs
fast enough to be convenient.)</p>
<p>Overall I'm quite glad I got around to writing the Loki version.
I expect to use it periodically and be glad that I have it, and
I learned a certain amount about <a href="https://grafana.com/docs/loki/latest/query/logcli/">logcli</a> that will be useful
for the next time.</p>
<p>(Out of curiosity I just did a timing comparison, and for basically
the same time duration the syslog version took three minutes and
the Loki version two minutes. Shorter duration queries in Loki can
be much faster, although there may be caching effects at work.
Still, caching effects are useful if we're asking about several
different logins, as we sometimes are.)</p>
</div>
I recently used Grafana Loki for fast, flexible log searching2024-02-26T21:43:53Z2023-12-10T04:14:35Z