A problem I'm having with my HiDPI display, remote X, and (X) cursors
When I set up my HiDPI display on my home Linux machine, I had to do some wrestling with general DPI and scaling settings but after that most everything just worked and I didn't think about it. Due to world and local events, I spent a chunk of today getting set up for an extended period of working from home, including getting my work exmh configured to display properly over remote X on my HiDPI home display.
(Exmh is one of the things that I really miss when I don't have X across the network, and my current DSL link is actually fast enough to make it useful for reading email. If I'm going to be working from home for an extended period of time, I need a good email environment so it was worth the effort to see if exmh could run decently over my DSL link.)
This worked in general (with a few mistakes along the way), but
after using exmh for a while I realized that the (X) mouse cursors
that I was seeing when my mouse was over the exmh windows were
unusually and suspiciously small, as if they hadn't been scaled up
to HiDPI levels. At first I thought that this was a TCL/TK issue, but when I looked
at the mouse cursors I was seeing in other programs run over the
remote X connection (such as
xterm), I saw the same issue. My local
xterm windows have
a mouse cursor that's the right size (roughly the size of a capital
letter in the
xterm), but an
xterm on our Ubuntu machines run
over remote X has one that's half the correct size. The same is
true of the cursors in exmh, GNU Emacs, and sam.
(In the process of writing this entry, I checked my office Fedora machine and to my surprise, these programs all work correctly there over a remote X connection.)
X mouse cursors are a very old thing and in the way of X they've gone through a number of evolutions over the years (and then things like GUI toolkits and theming added extra layers of fun). The result is relatively opaque and underdocumented, especially if what you care about is basic X stuff like xterm and TCL/TK (for natural reasons, most people focus on writing about full scale desktops like GNOME and KDE). I found a variety of things on the Internet, some of which didn't work for me and some of which aren't feasible because the remote machines are multi-user ones and not everyone doing remote X to them has a HiDPI display (I won't when we go back to work, for example).
These days, there are apparently cursor themes, as discussed a bit
in the Gentoo wiki
and this article
(and see also).
Some basic X programs in some environments pay attention to this,
through both X resources settings and environment variables (per
the Arch wiki), but
on our Ubuntu machines the various X programs seem to ignore the
environment variables (although this stackoverflow answer talks
about using them). On Fedora, the
$XCURSOR_SIZE environment variable and so on does work.
Our Ubuntu machines have the libxcursor shared library installed
(as 'libXcursor.so') and a running
xterm uses it, but they don't
seem to have any X cursor files installed (we don't have the
xcursor-themes package present, for example). This may mean that
our Ubuntu machines are forced to fall back to some very old X
protocol thing and that X protocol thing only has one size, that
being the tiny non-HiDPI cursors. My Fedora machines do appear to
have cursor themes installed in stuff under /usr/share/icons, and
it looks like if I copy the right one ('Adwaita') to our Ubuntu
machine and set
$XCURSOR_THEME, my exmh,
xterm, and so on work right.
(I think that setting these environment variables in general is harmless for non-HiDPI sessions, because I believe that the X cursor library magically picks the right size based on your display DPI.)
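If you want to check whether a machine has any X cursor themes installed at all (as our Ubuntu machines apparently didn't), a quick sketch like the following can scan for them. The search directories here are just the conventional defaults; the real Xcursor library consults $XCURSOR_PATH and has its own lookup logic, so this is an illustration, not the library's behavior:

```python
import os

# Conventional default locations for cursor themes; the Xcursor
# library's actual search path may differ ($XCURSOR_PATH).
DEFAULT_DIRS = [os.path.expanduser("~/.icons"), "/usr/share/icons"]

def find_cursor_themes(search_dirs=DEFAULT_DIRS):
    """Return names of theme directories that contain a 'cursors' subdir,
    which is what makes a theme usable as an X cursor theme."""
    themes = []
    for d in search_dirs:
        if not os.path.isdir(d):
            continue
        for name in sorted(os.listdir(d)):
            if os.path.isdir(os.path.join(d, name, "cursors")):
                themes.append(name)
    return themes

if __name__ == "__main__":
    # Any name this prints is a candidate value for $XCURSOR_THEME.
    print(find_cursor_themes())
```

If this prints nothing, you are probably in the same situation our Ubuntu machines were in, with no cursor theme files for programs to fall back on.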
I suspect that this is a sign that our Ubuntu machines don't really have all of the X related packages that they should have in order to make modern X programs happy in modern X environments (which definitely include HiDPI screens). I'm not sure what additional packages we need, though, which means that I have a new project. In the mean time, writing this entry has gotten me to do enough research to find a workaround for now.
What makes our Ubuntu updates driver program complicated
In response to yesterday's entry on how we sort of automate Ubuntu package updates, which involves a complicated driver program (written in Python) to control a bunch of ssh's to our machines, a commentator asked the perfectly sensible and obvious question:
Is there a reason this couldn’t be a bash script that invokes pdsh?
Ultimately the complexity of our driver program is caused by how the Ubuntu package update process is flawed. We might still have a Python program instead of a shell script if the process worked better, but it would at least be a simpler Python program.
There are a number of complicated things that our driver program does (and my list here is somewhat different than my list in my reply comment). The lesser one is that it parses the output of apt-get to determine what packages would be updated or nominally did get updated on machines during an update run. This parsing could theoretically be done in an awk script, but in Python we can take advantage of better data structures to make it clearer and gather more complex data. The obvious thing we do with this complex data is aggregate it by groups of machines that will all apply the same set of package updates; usually this drastically reduces the output down to something that's much easier to follow.
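The aggregation step is straightforward once the parsed data is in a decent structure. As an illustration (this is a sketch, not our driver's actual code), grouping machines by their exact set of pending updates can be done with a dictionary keyed on the sorted package tuple:

```python
def group_by_updates(pending):
    """pending: dict mapping machine name -> iterable of package names
    that would be updated on that machine.
    Returns a dict mapping each distinct update set (as a sorted tuple
    of package names) to the sorted list of machines that would apply
    exactly that set of updates."""
    groups = {}
    for machine, pkgs in pending.items():
        key = tuple(sorted(pkgs))
        groups.setdefault(key, []).append(machine)
    for machines in groups.values():
        machines.sort()
    return groups
```

Printing one line per group instead of one line per machine is what collapses the output down to something easy to scan.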
(One of the other things we do with this complex data is look for signs of mis-configurations in what Ubuntu packages are held, because sometimes either something goes wrong or a machine was not quite set up correctly. If we spot things like a Samba server package update that would be applied, we print a big warning. This has saved us from awkward problems several times. After the driver's initial scan has finished, we can exclude machines from updates, or we can bail out and hold the packages properly on the machine, then restart the whole process.)
After the initial scan for updates is done, the update driver enters a command loop where it asks what to do next. Typically we tell it to apply updates to everything, but you can also tell it to do a specific machine first, or exclude some machines from what will be updated, and a number of other things. Or you can quit out immediately if you don't actually want to apply updates (perhaps you were just checking what updates were pending). The command loop ends when the update driver thinks it has nothing left to do because all still-eligible machines have had updates applied; at this point the updates driver writes out its final summary and so on.
The most complicated portion of the program and the process is
actually applying the updates on each system. When we were basically
doing 'ssh host apt-get -y upgrade' in an earlier version of our
update automation, we found that it would periodically stall on
some host and then we would have a problem; sometimes apt-get wanted
to ask us a question, and sometimes it just ran into issues. So our
current approach is to run the updates in what '
ssh -t' and
apt-get think is an interactive environment, capture all of their
output without spewing it over our terminal, and then if things
seem to go wrong allow us to step into the session to answer
questions, sort things out, or just see where things stalled.
Mechanically we use the third party Python pexpect module, which I had
some learning experiences with (although
I see that the module has been updated since then).
(The driver's current way of detecting problems is if an update produces no output for a sufficiently long time. We can also immediately step in if we want to.)
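The real driver uses pexpect for this, but the core no-output-for-too-long stall detection can be sketched with just the standard library. This illustrates the idea only (our driver additionally runs things under 'ssh -t' and lets a human step into the live session):

```python
import select
import subprocess

def run_with_stall_detection(argv, stall_timeout):
    """Run argv, capturing its combined output. Returns (output, stalled)
    where stalled is True if the command ever went stall_timeout seconds
    without producing output before it exited."""
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    output = b""
    stalled = False
    while True:
        # Wait up to stall_timeout seconds for more output.
        ready, _, _ = select.select([proc.stdout], [], [], stall_timeout)
        if not ready:
            stalled = True
            # The real driver would offer to hand the session over to a
            # human at this point; this sketch just keeps waiting.
            continue
        chunk = proc.stdout.read1(4096)  # read what's available, don't block
        if not chunk:
            break  # EOF: the command has exited
        output += chunk
    proc.wait()
    return output, stalled
```

The same loop shape works with a pty instead of a pipe when the remote command insists on an interactive terminal.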
In theory apt-get and dpkg have settings that should let the update process automatically pick the default answer for any question a package update wants to ask us. In practice, we don't trust the default answer to always be sensible on package upgrades, although we do try to tell dpkg to always pick our own local version of configuration files to cut down on the questions we get asked.
Because Ubuntu package updates and apt-get operations are slow, we want to be able to run package updates in parallel, although we don't always do so. This adds extra complications to stepping into apt-get sessions, as you might expect, and there's a certain amount of code to coordinate all of this. Also, if one session has to be stepped into, we don't want to automatically continue on to do other (serial) updates, in case this is a systemic issue with this set of updates that we want to deal with before we proceed. Similarly if one update session fails outright (with ssh returning an error code), the driver pauses and waits for further directions.
(The entire reason the driver exists is so that we don't have to do updates one by one with manual attention. If a particular package update turns out to require manual attention, we will often either hold the package to block the update until we can figure things out, or directly update the affected machines by hand. If we have to interact with an 'apt-get upgrade', running it directly on the machine instead of through the driver is better.)
The updates driver also has a second mode that is used to update
held packages. In this mode, we run '
apt-get install <...>' for
the specific packages we want to update, instead of the usual
'apt-get upgrade', and the update driver's command loop now has
commands for selecting what package or packages should be updated
(we don't necessarily want to update all held packages on a machine).
This is typically used for things like kernel updates, where we
want to mass update all of our machines. Updates of per-machine
held packages (like the Samba server) are often done by just logging
in to the machine and doing the process by hand (we often want to
monitor daemon logs and so on anyway).
(There are also some ancillary modes of operation, like a dry run mode and a mode to just report on what held packages have pending updates. Additional features let us control which machines it operates on, including trying to update machines that aren't in our normal list of machines to update.)
PS: Probably the updates driver has too many features. Certainly it has features that we don't really use, and some that I'd forgotten about until I re-read its full help text. It's one of those programs where my enthusiasm may have gotten away from me when I wrote it.
How we sort of automate updating system packages across our Ubuntu machines
Every place with more than a handful of Unix systems has to figure out a system for keeping them up to date, because doing it entirely by hand is too time consuming and error prone. We're no exception, so we've wound up steadily evolving our processes into a decently functional but somewhat complicated setup for doing this to our Ubuntu machines.
The first piece is a cron job that uses apt-show-versions and a state file to detect new updates for a machine and send email listing them off to us. In practice we don't actually read these email messages; instead, we use the presence of them in the morning as a sign that we should go do updates. This cron job is automatically set up on all of our machines by our standard Ubuntu install.
(Things are not quite to the point where Ubuntu has updates every day, and anyway it's useful to have a little reminder to push us to do updates.)
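Conceptually, the cron job's check is just a diff between the versions available now and the versions the state file says we last reported. A hypothetical sketch of that comparison (not our actual cron job, which wraps apt-show-versions):

```python
def new_updates(current, seen):
    """current: dict of package -> version available now
    seen: dict of package -> version we last emailed about (the state file)
    Returns the packages (with versions) that haven't been reported yet,
    i.e. what should go into today's email."""
    return {pkg: ver for pkg, ver in current.items()
            if seen.get(pkg) != ver}
```

If the result is empty, no email goes out; otherwise the state file gets rewritten with what was reported.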
The second piece is that we have a central list of our current Ubuntu systems. To make sure that the list doesn't miss any active machines, our daily update check cron job also looks to see if the system it's running on is in the list; if it's not, it emails us a warning about that (in addition to any email it may send about the system having updates). The warning is important because this central list is used to determine what Ubuntu machines we'll try to apply updates on.
Finally, we have the setup for actually applying the updates on
demand, which started out as a relatively simple Python program
that automated some
ssh commands and then grew much more complicated
as we ran into issues and problems. Its basic operation is to
ssh off to all of the machines on that central list, get a list
of the pending updates through
apt-get, then let you choose to
go ahead with updating some or all of the machines (which is done
with another round of
ssh sessions that run
apt-get). The output
from all of the update sessions is captured and logged to a file,
and at the end we get a compact summary of what groups of packages
got updated on what groups of machines.
I call our system sort of automated because it's not completely hands off. Human action is required to run the central update program at all and then actively tell it to go ahead with whatever it's detected. If we're not around or if we forget, no updates get applied. However, we don't need to do anything on a per-machine basis, and unless something goes wrong the interaction we need to do with the program takes only a few seconds of time at the start.
(We strongly prefer not applying updates truly automatically; we like to supervise the process and make final decisions, just in case.)
Not all packages are updated through this system, at least routinely. A few need special manual procedures, and a number of core packages that could theoretically be updated automatically are normally 'held' (in dpkg and apt terminology) so they'll be skipped by normal package updates. We don't apply kernel updates until shortly before we're about to reboot the machine, for example, for various reasons.
Our central update driver is unfortunately a complicated program. Apt, dpkg, and the Debian package format don't make it easy to do a good job of automatically applying updates, especially in unusual situations, and so the update driver has grown more and more features and warts to try to deal with all of that. Sadly, this means that creating your own equivalent version isn't a simple or short job (and ours is quite specific to our environment).
Linux's iowait statistic and multi-CPU machines
Yesterday I wrote about how multi-CPU machines quietly complicate
the standard definition of iowait,
because you can have some but not all CPUs idle while you have
processes waiting on IO. The system is not totally idle, which is
what the normal Linux definition of iowait is about,
but some CPUs are idle and implicitly waiting for IO to finish.
Linux complicates its life because iowait is considered to be a
per-CPU statistic, like user, nice, system, idle, irq, softirq,
and the other per-CPU times reported in /proc/stat.
As it turns out, this per-CPU iowait figure is genuine, in one
sense; it is computed separately for each CPU and CPUs may report
significantly different numbers for it. How modern versions of the
Linux kernel keep track of iowait involves something between brute
force and hand-waving. Each task (a process or thread) is associated
with a CPU while it is running. When a task goes to sleep to wait
for IO, it increases a count of how many tasks are waiting for IO
'on' that CPU, called
nr_iowait. Then if
nr_iowait is greater
than zero and the CPU is idle, the idle time is charged to iowait
for that CPU instead of to 'idle'.
(You can see this in the kernel's scheduler accounting code.)
The problem with this is that a task waiting on IO is not really attached to any particular CPU. When it wakes up, the kernel will try to run it on its 'current' CPU (ie the last CPU it ran on, the CPU who's run queue it's in), but if that CPU is busy and another CPU is free, the now-awake task will be scheduled on that CPU. There is nothing that particularly guarantees that tasks waiting for IO are evenly distributed across all CPUs, or are parked on idle CPUs; as far as I know, you might have five tasks all waiting for IO on one CPU that's also busy running a sixth task, while five other CPUs are all idle. In this situation, the Linux kernel will happily say that one CPU is 100% user and five CPUs are 100% idle and there's no iowait going on at all.
(As far as I can see, the per-CPU number of tasks waiting for IO
is not reported at all. A global number of tasks in iowait is
reported in /proc/stat, but that doesn't
tell you how they're distributed across your CPUs. Also, it's
an instantaneous number instead of some sort of accounting of
this over time.)
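Both numbers are visible in /proc/stat: each cpuN line carries a per-CPU iowait tick count (the fifth time field, after user, nice, system, and idle), and procs_blocked is the instantaneous global count of tasks blocked waiting for IO. A minimal parser, as a sketch:

```python
def parse_proc_stat(text):
    """Extract per-CPU iowait ticks and the instantaneous count of
    blocked tasks from the contents of /proc/stat."""
    iowait = {}
    procs_blocked = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith("cpu") and fields[0] != "cpu":
            # per-CPU line: cpuN user nice system idle iowait irq softirq ...
            iowait[fields[0]] = int(fields[5])
        elif fields[0] == "procs_blocked":
            procs_blocked = int(fields[1])
    return iowait, procs_blocked
```

Sampling this twice and differencing the tick counts is how tools turn the raw counters into per-CPU iowait percentages.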
There's a nice big comment about this in kernel/sched/core.c (above
nr_iowait(), if you have to find it because the
source has shifted). The comment summarizes the situation this way:
This means, that when looking globally, the current IO-wait accounting on SMP is a lower bound, by reason of under accounting.
(It also says in somewhat more words that looking at the iowait for individual CPUs is nonsensical.)
Programs that report per-CPU iowait numbers on Linux are in some sense not incorrect; they're faithfully reporting what the kernel is telling them. The information they present is misleading, though, and in an ideal world their documentation would tell you that per-CPU iowait is not meaningful and should be ignored unless you know what you're doing.
PS: It's possible that the kernel's pressure stall information could
provide useful information here, if you have a sufficiently modern
kernel. Unfortunately the normal Ubuntu 18.04 server kernel is not
modern enough.
The basics of /etc/mailcap on Ubuntu (and Debian)
One of the things that is an issue for any GUI desktop and for many
general programs is keeping track of what program should be used
to view or otherwise handle a particular sort of file, like JPEGs
or .docx files. On Ubuntu and Debian systems, this is handled
in part through the magic file
/etc/mailcap, which contains a bunch
of mappings from MIME types to what should handle them, with various
trimmings. You can also have a personal version of this file in your
home directory as ~/.mailcap.
In the old days when we didn't know any better, installing and
removing programs probably edited
/etc/mailcap directly. These
days the file is automatically generated from various sources,
including from individual snippet files that are stored in
/usr/lib/mime/packages. Various programs drop files in this
directory during package installation, and then update-mime is
magically run to rebuild
/etc/mailcap. One should not confuse /usr/lib/mime/packages with
/usr/share/mime/packages; the latter
has XML files that are used by the separate XDG MIME system.
(As the update-mime manpage covers, it also uses the
information found in
.desktop files in /usr/share/applications.)
As far as I know, the update-mime manpage is
the sole good source for information about the format of these
little snippets in /usr/lib/mime/packages and the eventual format
of mailcap entries. The format is arcane, with many options and
quite a lot of complex handling, and there is no central software
package for querying the mailcap data; for historical reasons,
everyone rolls their own, with things like the Python mailcap
module, and other parts of this come from the mime-support package.
(For fun, there are multiple generations of mailcap standards. We
start with RFC 1524
from 1993, and then extend from there. On Ubuntu systems, the
mailcap(5) manpage doesn't document all of the directives that
update-mime does, for example.)
A single MIME type may have multiple mailcap entries once all of
the dust settles (plus the possibility of wildcard entries as well
as specific ones, for example a 'text/*' entry as well as a
'text/x-tex' one). For example, on our Ubuntu login servers, there
are no less than 7
/etc/mailcap entries for text/x-tex, and 13
for image/png and image/jpeg. In theory people using /etc/mailcap
are supposed to narrow down these entries based on whether or not
they can be used in your current environment (some only work in an
X session, for example) and their listed priorities. In practice the
mailcap parsing code you're using probably doesn't support the full
range of complexity on a current Ubuntu or Debian system, partly
because features have been added to the format over time, and it
may simply pick either the first or the last mailcap entry that matches.
The Freedesktop aka XDG specifications have their own set of MIME association tracking standards, in the Shared MIME database specification and the MIME applications associations specification. These are used by, among other things, the xdg-utils collection of programs, which is how at least some GUI programs decide to handle files. I believe that these tools don't look at /etc/mailcap at all, but they do use MIME type information from .desktop files in /usr/share/applications and the XML files in /usr/share/mime/packages. They might even interpret it in the same way that update-mime does. The XDG tools and MIME associations all assume that you're using a GUI; they have no support for separate handling of a text mode environment.
Any particular GUI program might rely on the XDG tools, use mailcap, or perhaps both, trying XDG and then falling back on mailcap (parsed with either its own code or some library). A text mode program must use mailcap. I'm not sure how self-contained environments like Firefox and Thunderbird work, much less Chrome.
(See also the Arch Wiki page on default applications.)
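To make the mailcap format a bit more concrete, here is roughly what parsing a single entry involves in the simple case. This sketch deliberately ignores backslash-escaped semicolons and the many other corner cases that real parsers have to cope with:

```python
def parse_mailcap_line(line):
    """Parse one mailcap entry into (mime_type, command, extras).
    Entries are semicolon-separated: the MIME type, then the view
    command, then optional fields that are either flags (like
    'needsterminal') or key=value pairs (like 'test=...').
    This ignores backslash escaping and other corner cases."""
    parts = [p.strip() for p in line.split(";")]
    mime_type, command = parts[0], parts[1]
    extras = {}
    for field in parts[2:]:
        if "=" in field:
            k, v = field.split("=", 1)
            extras[k.strip()] = v.strip()
        else:
            extras[field] = True  # bare flag
    return mime_type, command, extras
```

The 'test' field here is what lets an entry say "only use me when $DISPLAY is set", which is how X-only viewers are supposed to be filtered out of text-mode sessions.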
An appreciation for Cinnamon's workspace flipping keyboard shortcuts
When I first started using Cinnamon (which was back here for serious use), I thought of it just as the closest good thing I could get to the old Gnome 2 environment (as Gnome 3 is not for me). Over time, I've come to appreciate Cinnamon for itself, and even propagate aspects of Cinnamon back into my desktop fvwm setup (such as keyboard control over window position and size). One of the little Cinnamon aspects I now appreciate is its slick and convenient keyboard handling of what I would call virtual screens and Cinnamon calls workspaces.
As basic table stakes, Cinnamon organizes workspaces sequentially and lets you move left and right through them with Ctrl + Alt + Right (or Left) Arrow. By default it has four workspaces, which is enough for most sensible people (I'm not sensible on my desktop). Where Cinnamon gets slick is that it has an additional set of keyboard shortcuts for moving to another workspace with the current window, Ctrl + Alt + Shift + Left (or Right). It turns out that it's extremely common for me to open a new window on my laptop's relatively constrained screen, then decide that things are now too cramped and busy in this workspace and I want to move. The Cinnamon keyboard shortcuts make that a rapid and fluid operation, and I can keep moving the window along to further workspaces by just hitting Right (or Left) again while still holding down the other keys.
(As I've experienced many times before, having this as an easy and rapid operation encourages me to use it; I shuffle windows around this way on my laptop much more than I do on my desktops, where moving windows between virtual screens is a more involved process that generally requires multiple steps.)
Every so often I've thought about trying to create a version of this keyboard shortcut in fvwm, but so far I haven't seen a good way to do it. Although fvwm has functions and supports some logic operations, this feature runs into a variety of challenges in fvwm's model of windows and what the current window can be. I'm pretty sure that if I looked at the actual Cinnamon code for this, it would turn out to be much more complicated than you'd expect from such a simple-sounding thing.
(I already have a keyboard shortcut for just moving to a different virtual screen; the tricky bit in fvwm is taking the current window along with me when (and only when) it's appropriate to do so given the state of the current window. I suppose the easy way to implement this is to assume that if I hit the 'take the window with me' shortcut, I've already determined that what fvwm considers the current window should be moved to the target virtual screen and my fvwm function can just ignore all of the possible weird cases.)
The uncertainty of an elevated load average on our Linux IMAP server
We have an IMAP server, using Dovecot on Ubuntu 18.04 and with all of its mail storage on our NFS fileservers. Because of historical decisions (cf), we've periodically had real performance issues with it; these issues have been mitigated partly through various hacks and partly through migrating the IMAP server and our NFS fileservers from 1G Ethernet to 10G (our IMAP server routinely reads very large mailboxes, and the faster that happens the better). However, the whole experience has left me with a twitch about problem indicators for our IMAP server, especially now that we have a Prometheus metrics system that can feed me lots of graphs to worry about.
For a while after we fixed up most everything (and with our old
OmniOS fileservers), the IMAP
server was routinely running at a load average of under 1. Since
then its routine workday load average has drifted upward, so that
a load average of 2 is not unusual and it's routine for it to be
over 1. However, there are no obvious problems the way there used
to be; '
top' doesn't show constantly busy IMAP processes, for
example; indicators such as the percentage of time the system spends
in iowait (which on Linux includes waiting for NFS IO) are consistently low; and our IMAP stats
monitoring doesn't show any clear slow commands the way it used to.
To the extent that I have IMAP performance monitoring, it only shows
slow performance for looking at our test account's INBOX, not really anywhere else.
(All user INBOXes are in our NFS
/var/mail filesystem and some
of them are very large, so it's a really hot spot and is kind of
expected to be slower than other filesystems; there's only really
so much we can do about it. Unfortunately we don't currently
have Prometheus metrics from our NFS fileservers, so I can't easily tell if there's some
obvious performance hotspot on that fileserver.)
All of this leaves me with two closely related mysteries. First, does this elevated load average actually matter? This might be the sign of some real IMAP performance problem that we should be trying to deal with, or it could be essentially harmless. Second, what is causing the load average to be high? Maybe we frequently have blocked processes that are waiting on IO or something else, or that are running in micro-bursts of CPU usage.
(eBPF based tracing might be able to tell us something about all of this, but eBPF tools are not really usable on Ubuntu 18.04 out of the box.)
Probably I should invest in developing some more IMAP performance measurements and also consider doing some measurements of the underlying NFS client disk IO, at least for simple operations like reading a file from a filesystem. We might not wind up with any more useful information than we already have, but at least I'd feel like I was doing something.
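As an illustration of the sort of simple NFS client measurement I mean (this is a hypothetical probe; a single sample and client-side caching can both mislead you, so a real version would repeat it and look at the distribution):

```python
import time

def time_file_read(path, size=1 << 20):
    """Time how long it takes to read up to size bytes from path.
    A crude probe for NFS client read latency; note that the NFS
    client cache means repeated reads of the same file may be
    much faster than the first one."""
    start = time.monotonic()
    with open(path, "rb") as f:
        data = f.read(size)
    elapsed = time.monotonic() - start
    return len(data), elapsed
```

Running this periodically against a file on each fileserver-backed filesystem and feeding the timings into Prometheus would give us a baseline to compare against when the load average looks suspicious.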
The case of mysterious load average spikes on our Linux login server
We have a Linux login server that is our primary server basically by default; it's the first one in numbering and the server a convenient alias is pointed to, so most people wind up using it. Naturally we monitor its OS level metrics as part of our Prometheus setup, and as part of that a graph of its load average (along with all our other interesting servers) appears on our overview Grafana dashboard. For basically as long as we've been doing this, we've noticed that this server experiences periodic and fairly drastic short term load average spikes for no clear reason.
A typical spike will take the 1-minute load average from 0.26 or
so (the typical load average for it) up to 6.5 or 7 in a matter of
seconds, and then immediately start dropping back down. There seems
to often be some correlation with other metrics, such as user and
system CPU time usage, but not necessarily a high one. We capture
top output periodically for reasons beyond the scope of
this entry, and these captures have never shown anything in particular
even when they capture the high load average itself. The spikes
happen at all times, day or night and weekday or weekend, and don't
seem to come in any regular pattern (such as every five minutes).
The obvious theory for what is going on is that there are a bunch
of processes that have some sort of periodic wakeup where they do
a very brief amount of work, and they've wound up more or less in
sync with each other. When the periodic wakeup triggers, a whole
bunch of processes become ready to run and so spike the load average
up, but once they do run they don't do very much so the log-jam
clears almost immediately (and the load average immediately drops).
Since it seems to be correlated with the number of logins, this may
be something in systemd's per-login process infrastructure. Since
all of these logins happen over SSH, it could also partly be because
we've set a
ClientAliveInterval in our sshd_config so sshd
likely wakes up periodically for some connections; however, I'm not
clear how that would wind up in sync for a significant number of connections.
I don't know how we'd go about tracking down the source of this without a lot of work, and I'm not sure there's any point in doing that work. The load spikes don't seem to be doing any harm, and I suspect there's nothing we could really do about the causes even if we identified them. I rather expect that having a lot of logins on a single Linux machine is now not a case that people care about very much.
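If we ever did decide to put some work in, one low-effort starting point would be to sample the 1-minute load average much more often than our normal metrics collection does, just to capture how sharp and frequent the spikes really are. A sketch (the threshold and intervals here are arbitrary choices, not anything we actually run):

```python
import time

def watch_loadavg(duration, interval=0.5, threshold=2.0,
                  path="/proc/loadavg"):
    """Sample the 1-minute load average every `interval` seconds for
    `duration` seconds, recording (timestamp, value) for every sample
    at or above `threshold`. High-frequency sampling can catch spikes
    that are too brief for normal metrics scraping to see."""
    spikes = []
    end = time.monotonic() + duration
    while time.monotonic() < end:
        with open(path) as f:
            one_min = float(f.read().split()[0])
        if one_min >= threshold:
            spikes.append((time.time(), one_min))
        time.sleep(interval)
    return spikes
```

Correlating the spike timestamps against process accounting or scheduler tracing would be the next (and much more expensive) step.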
I'm likely giving up on trying to read Fedora package update information
Perhaps unlike most people, I apply updates to my Fedora machines
through the command line, first with
yum and now with dnf. As
part of that, I have for a long time made a habit of trying to read
the information that Fedora theoretically publishes about every
package update with '
dnf updateinfo info', just in case there was
a surprise lurking in there for some particular package (this has
sometimes exposed issues, such as when I discovered that Fedora
maintains separate package databases for each user).
Sadly, I'm sort of in the process of giving up on doing that.
The overall cause is that it's clear that Fedora does not really
care about this update information being accurate, usable, and
accessible. This relative indifference has led to a number of
specific issues with both the average contents of update information
and to the process of reading it that make the whole experience
both annoying and not very useful. In practice, running 'dnf
updateinfo info' may not tell me about some of the actual updates
that are pending, always dumps out information about updates that
aren't pending for me (sometimes covering ones that have already
been applied, for example for some kernel updates), and part of
the time the update information itself isn't very useful and has
'fill this in' notes and so on. The result is verbose but lacking
in useful information and frustrating to pick through.
The result is that '
dnf updateinfo info' has been getting less
and less readable and less useful for some time. These days I skim
it at best, instead of trying to read it thoroughly, and anyway
there isn't much that I can do if I see something that makes me
wonder. I can get most of the value from just looking at the package
list in '
dnf check-update', and if I really care about update
information for a specific package I see there I'm probably better
off doing '
dnf updateinfo info <package>'. But still, it's hard
to let go of this; part of me feels that reading update information
is part of being a responsible sysadmin (for my own personal machines).
Some of these issues are long standing ones. It's pretty clear that
the updateinfo (sub)command is not a high priority in DNF as far
as bug fixes and improvements go, for example. I also suspect that
some of the extra packages I see listed in '
dnf updateinfo info'
are due to DNF modularity, where
I'm seeing updateinfo for (potential) updates from modules that
either I don't have enabled or that '
dnf update' and friends are
silently choosing to not use for whatever reasons. Alternately they
are base updates that are overridden by DNF modules I have enabled;
it's not clear.
(Now that I look at '
dnf module list --enabled', it seems that I
have several modules enabled that are relevant to packages that
updateinfo always natters about. One update that updateinfo talks
about is for a different stream (libgit2 0.28, while I have the
libgit2 0.27 module enabled), but others appear to be for versions
that I should be updating to if things were working properly.
Unfortunately I don't know how to coax DNF to show me what module
streams installed packages come from, or what it's ignoring in the
main Fedora updates repo because it's preferring a module version instead.)
A network interface losing and regaining signal can have additional effects (in Linux)
My office at work features a dearth of electrical sockets and as a result a profusion of power bars and other means of powering a whole bunch of things from one socket. The other day I needed to reorganize some of the mess, and as part of that I wound up briefly unplugging the power supply for my 8-port Ethernet switch that my office workstation is plugged into. Naturally this meant that the network interface lost signal for a bit (twice, because I wound up shuffling the power connection twice). Nothing on my desktop really noticed, including all of the remote X stuff I do, so I didn't think more about it. However, when I got home, parts of my Wireguard tunnel didn't work. I eventually fixed the problem by restarting the work end of my Wireguard setup, which does a number of things that include turning on IP(v4) forwarding on my workstation's main network interface.
I already knew that deleting and then recreating an interface entirely can have various additional effects (as happens periodically when my PPPoE DSL connection goes away and comes back). However this is a useful reminder to me that simply unplugging a machine from the network and then plugging it in can have some effects too. Unfortunately I'm not sure what the complete list of effects is, which is somewhat of a problem. Clearly it includes resetting IP forwarding, but there may be other things.
(All of this also depends on your system's networking setup. For instance, NetworkManager will deconfigure an interface that goes down, while I believe that without it, the interface's IP address remains set and so on.)
I'm not sure if there's any good way to fix this so that these settings are automatically re-applied when an interface comes up again. Based on this Stackexchange question and answer, the kernel doesn't emit a udev event on a change in network link status (it does emit a netlink event, which is probably how NetworkManager notices these things). Nor is there any sign in the networkd documentation that it supports doing something on link status changes.
(Possibly I need to set '
IgnoreCarrierLoss=true' in my networkd
settings for this interface.)
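As an illustration of the netlink side of this (a sketch, not a complete tool): you can listen for link state changes with a NETLINK_ROUTE socket subscribed to the link multicast group. A real version would parse each message's ifinfomsg payload to see which interface changed and whether carrier came back, and then re-apply settings such as forwarding:

```python
import socket
import struct

RTMGRP_LINK = 1     # rtnetlink multicast group for link up/down events
RTM_NEWLINK = 16    # message type for link state changes

def open_link_monitor():
    """Open a netlink route socket subscribed to link state changes.
    Reading from it yields rtnetlink messages whenever an interface
    changes state; this is (probably) how NetworkManager notices
    carrier loss and gain."""
    s = socket.socket(socket.AF_NETLINK, socket.SOCK_RAW,
                      socket.NETLINK_ROUTE)
    s.bind((0, RTMGRP_LINK))
    return s

def message_types(data):
    """Extract the message types from a buffer of netlink messages.
    Each message starts with a 16-byte nlmsghdr: length, type, flags,
    sequence, pid; messages are padded to 4-byte alignment."""
    types = []
    off = 0
    while off + 16 <= len(data):
        length, mtype, flags, seq, pid = struct.unpack_from("=IHHII",
                                                            data, off)
        if length < 16:
            break
        types.append(mtype)
        off += (length + 3) & ~3
    return types
```

A small daemon built on this could watch for RTM_NEWLINK messages and re-run whatever sysctl and routing setup the interface needs, which would be less drastic than rebooting.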
My unfortunate conclusion here is that if you have a complex networking setup and you lose link carrier on one interface, the simplest way to restore everything may be to reboot the machine. If this is not a good option, you probably should experiment in advance to figure out what you need to do and perhaps how to automate it.
(Another option is to work out what things are cleared or changed in your environment when a network interface loses carrier and then avoid using them. If I turned on IP forwarding globally and then relied on a firewall to block undesired forwarding, my life would probably be simpler.)