2023-12-06
Understanding another piece of per-cgroup memory usage accounting
A while back I wrote a program I call 'memdu' to report a du-like
hierarchical summary of how much memory is being used by each logged
in user and each system service, based on systemd's MemoryAccounting
setting and the general Linux cgroup (v2) memory accounting.
Cgroups expose a number of pieces of information about this, starting with memory.current, the current amount of memory 'being used by' the cgroup and its descendants. What 'being used by' means here is that the kernel has attributed this memory to the cgroup, and it counts all memory usage attributed to the cgroup, both user level and in the kernel. As I very soon found out, this number can be misleading if what you're really interested in is how much user level memory the cgroup is actively using.
My first encounter with this was for a bunch of memory used by the kernel filesystem cache, which was attributed first to a running virtual machine and then to the general 'machine.slice' cgroup when the virtual machine was shut down and its cgroup went away. (Well, it was always attributed to machine.slice as well as the individual virtual machine, but when the virtual machine existed you could see that a lot of machine.slice's memory usage was from the child VM.)
As I recently discovered, another source of this is reclaimable (kernel) slab memory. It's possible to have an essentially inactive user cgroup with small process memory usage but gigabytes of memory attributed to it from memory.stat's 'slab_reclaimable'. At some point this slab memory was actively used, but it's now not, and presumably it lingers around mostly because the overall system hasn't been under enough memory pressure to trigger reclaiming it. Having my memdu program report the memory usage of the cgroup including this memory is in one sense honest, but it's not usually useful and it can be alarming.
(According to the documentation, you can manually trigger a kernel reclaim against the cgroup by writing an amount to 'memory.reclaim'. But if there's no general memory pressure, I think the only reason to do this is aesthetics.)
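For illustration, asking the kernel to reclaim looks something like this; the cgroup path is a made-up example and the amount is up to you:

# Ask the kernel to try to reclaim up to 1 GiB of memory charged to
# this (hypothetical) cgroup and its descendants.
echo 1G >/sys/fs/cgroup/system.slice/someservice.service/memory.reclaim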
If I knew enough about the kernel memory systems in practice, I could probably read through the documentation about the cgroup memory.stat file and work out what things I wanted to remove from memory.current to get more or less 'current directly and indirectly used user memory'. As it is, I don't have that knowledge so I suspect that I'm going to find more cases like this over time.
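As a concrete illustration of the kind of adjustment I mean, here is a shell sketch that takes reclaimable slab out of memory.current for one cgroup; the cgroup path is a made-up example, and exactly which memory.stat fields ought to be subtracted is the part I don't really know:

# Hypothetical cgroup; report memory.current with slab_reclaimable removed.
cg=/sys/fs/cgroup/user.slice/user-1000.slice
cur=$(cat "$cg/memory.current")
slab=$(awk '$1 == "slab_reclaimable" {print $2}' "$cg/memory.stat")
echo "$(( (cur - slab) / (1024 * 1024) )) MiB"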
(How I find these is that someday I run my memdu program and it reports an absurd looking number for some cgroup, so I investigate and then fix it up with more heuristics. These days the program is in Python so it's pretty easy to add another case.)
I suspect that one of the general issues I'm running into is that what I want from my 'memdu' program isn't well specified and may not be something that the kernel can really give me. The question of how much memory a cgroup is using depends on what I mean by 'using' and what sort of memory I care about. The kernel is only really set up to tell me how much memory has been attributed to a cgroup, and where it is in potentially overlapping categories in memory.stat.
(I assume that memory.stat is comprehensive, so all memory in memory.current is accounted for somewhere in memory.stat, but I'm not sure of that.)
2023-12-04
Getting some information about the Linux kernel dentry cache (dcache)
The Linux kernel's dcache subsystem is its implementation of a name cache of directory entries; it holds dentries. As a (kernel) cache, it would be nice to know some information about it, such as how effective it's being for your workload. Unfortunately the current pickings appear to be slim.
Basic information about the size of the dcache is exposed in /proc/sys/fs/dentry-state. This reports the total number of dentries, how many are 'unused', and how many are negative entries for files that don't exist (along with some other numbers). There's no information on either the lookup rate or the hit rate, and I believe that the kernel doesn't track this information at all (it sizes the dcache based on other things).
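If you just want those numbers with labels, something like this will do; the field order here is from the kernel's sysctl documentation for fs, and the negative dentry count being the fifth field is only true on reasonably recent kernels:

# dentry-state fields: nr_dentry nr_unused age_limit want_pages nr_negative dummy
awk '{printf "dentries %d  unused %d  negative %d\n", $1, $2, $5}' /proc/sys/fs/dentry-state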
The BCC tools include a (BCC) program called dcstat. As covered in its documentation, this tool will print running dcache stats (provided that it works right on your kernel). The Storage and Filesystem Tools section of the BCC tools listings has additional tools that may be of interest in this general area. Although bpftrace has bpftrace-based versions of a lot of the BCC tools (see its tools/ subdirectory), it doesn't seem to have done a bpftrace version of dcstat.
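For what it's worth, running it looks roughly like this; the name and location vary by distribution (Ubuntu's bpfcc-tools package renames the tools), so treat both of these as assumptions to check locally:

# Print dcache lookup and miss statistics every five seconds (needs root).
sudo dcstat-bpfcc 5                   # Ubuntu's bpfcc-tools naming
sudo /usr/share/bcc/tools/dcstat 5    # upstream BCC's usual install location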
(The other caution about dcstat is that based on comments in the dcstat source code I'm not sure that it's still right for current kernels. I think the overall usage rate is probably correct, but I'm not sure about the 'miss' numbers. I'd have to read fs/namei.c and fs/dcache.c very carefully to have much confidence.)
As far as I can see, /proc/sys/fs/dentry-state is not exposed by the Prometheus host agent. It might be exposed by the host agents for other metrics systems, or they might have left it out because there's not much you can do about the dcache anyway. If you wanted to export dcache hit and miss information, you could use the Cloudflare eBPF exporter and write an appropriate eBPF program for it, based on dcstat.
Now that I've looked at this, I suspect that while using dcstat may be interesting if you're curious about how many file lookups various operations do, it's probably not all that useful to monitor on an ongoing basis.
(In its current state, dcstat won't tell you how many hits were for negative dentries, which might be interesting to know so you can see how many futile lookups are happening on the system.)
2023-11-24
A peculiarity of the GNU Coreutils version of 'test' and '['
Famously, '[' is a program, not a piece of shell syntax, and it's also known as 'test' (which was the original name for it). On many systems, this was and is implemented by '[' being a hardlink to 'test' (generally 'test' was the primary name for various reasons). However, today I found out that GNU Coreutils is an exception. Although the two names are built from the same source code (src/test.c), they're different binaries and the '[' binary is larger than the 'test' binary. What is ultimately going on here is a piece of 'test' behavior that I had forgotten about, that of the meaning of running 'test' with a single argument.
The POSIX specification for test is straightforward. A single argument is taken as a string, and the behavior is the same as for -n, although POSIX phrases it differently:
- string: True if the string string is not the null string; otherwise, false.
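A quick illustration of this single-argument rule; note that even an argument that looks like an operator counts as a non-empty string:

$ test ""; echo $?
1
$ test abc; echo $?
0
$ test -f; echo $?
0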
The problem for GNU Coreutils is that GNU programs like to support options like --help and --version. Support for these is specifically disallowed for 'test', where 'test --help' and 'test --version' must both be silently true. However, this is not disallowed by POSIX for '[' if '[' is invoked without the closing ']':
$ [ --version
[ (GNU coreutils) 9.1
[...]
$ [ foo
[: missing ‘]’
$ [ --version ] && echo true
true
As we can see here, invoking 'test' as '[' without the closing ']' as an argument is an error, and GNU Coreutils is thus allowed to interpret the results of your error however it likes, including making '[ --version' and so on work.
(There's a comment about it in test.c.)
The binary size difference is presumably because the 'test' binary omits the version and help text, along with the code to display it. But if you look at the Coreutils test.c code, the relevant code isn't disabled with an #ifdef. Instead, LBRACKET is #defined to 0 when compiling the 'test' binary. So it seems that modern C compilers are doing dead code elimination on the 'if (LBRACKET) { ... }' section, which is a well established optimization, and then going on to notice that the called functions like 'usage()' are never invoked and dropping them from the binary. Possibly this is helped along by some link time magic flags.
PS: This handling of a single argument for test goes all the way back to V7, where test was actually pretty smart. If I'm reading the V7 test(1) manual page correctly, this behavior was also documented.
PPS: In theory GNU Coreutils is portable and you might find it on any Unix. In practice I believe it's only really used on Linux.
2023-11-22
Understanding and sorting out ZFS pool features
Pretty much every filesystem that wants to be around for a long time needs some way to evolve its format, adding new things (and stopping using old ones); ZFS is no exception. In the beginning, the format of ZFS pools (and filesystems) was set by a version number, but this stopped working very well once Sun were no longer the only people evolving ZFS. To handle the situation with multiple people developing different changes to ZFS, ZFS created a system of what are called 'features', where each feature is more or less some change to how ZFS pools work. Most features are officially independent of each other (although they may not be tested independently in practice). All of this is documented today in the zpool-features(7) manual page, which discusses the general system in detail and then lists all of the current features.
(Your local copy of zpool-features(7) may well list fewer features than the latest upstream development version does. For instance, there's a feature for RAID-Z expansion, which only just landed in the development version.)
Each release or version of ZFS supports some set of features, increasing over time. The Ubuntu 22.04 version of ZFS supports more ZFS features than the Ubuntu 18.04 version did, for example. Moving to a new version of ZFS (for example by upgrading your fileservers from Ubuntu 18.04 to 22.04) deliberately doesn't change the features your current ZFS pools have. Only manual action such as 'zpool upgrade -a' will update them to use new features, and you may well hold off on this even though you've updated ZFS versions.
(One reason to hold off is that perhaps you're worried about reverting to your pre-upgrade state. Another reason is just that you haven't gotten around to it. In the old Solaris 10 days, a 'zpool upgrade' of a pool would cause some degree of service interruption, although I don't think that's supposed to happen today.)
In the very old days, 'zpool status -x' would consider available pool format updates to be an 'error' that made a pool worthy of including in its output, which was kind of infuriating. Later, 'zpool status' downgraded this to merely nagging you all the time. Finally, ZFS introduced a pool property where you could specify what features you wanted your pools to have, via compatibility feature sets and setting the 'compatibility' property to a suitable value. If you set the pool's compatibility property to, say, 'openzfs-2.1-linux', and your pool has all of those features, 'zpool status' now won't claim that it's out of date. Unfortunately, 'zpool upgrade' will still report features that it claims can be upgraded to, although any actual upgrade is supposed to be limited to the compatibility features.
As part of these compatibility sets, there are files that list all of the features in each named set, normally found under /usr/share/zfs/compatibility.d. The format of these files is straightforward and can be used with diff to see that, for example, the features that were added between OpenZFS 2.1 for Linux and OpenZFS 2.2 were blake3, block_cloning, head_errlog, vdev_zaps_v2, and zilsaxattr (all of which you can read about in zpool-features(7)). Often there are convenient symbolic links, so you can see the difference in features that were present on Ubuntu 18.04 (where most of our current ZFS pools were created) and that are now available on Ubuntu 22.04 (which we're now running, so we could update pools to have the new features like zstd compression).
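For example, this is the sort of comparison I mean; the file names here are the ones I'd expect, but your installed set of compatibility files may differ:

# What pool features were added between OpenZFS 2.1 and 2.2 on Linux?
cd /usr/share/zfs/compatibility.d
diff openzfs-2.1-linux openzfs-2.2-linux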
Basic information on what features each of your pools doesn't have enabled yet can be seen with 'zpool upgrade'. Unfortunately there's no convenient way to get this information for a single pool, because 'zpool upgrade POOL' upgrades the pool rather than listing the not yet enabled features for just that pool. Also, 'zpool upgrade' will list all features, ignoring the constraints of any 'compatibility' property you may have set on the pool. You can use 'zpool status POOL' to see if a specific pool is fully up to date with its compatibility property (if any), but that's all it can tell you; if it says that the pool hasn't enabled all supported features, there's nothing that will readily tell you which compatible features aren't yet enabled while excluding features you've said are incompatible.
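You can approximate this yourself by intersecting the pool's disabled features with the contents of the relevant compatibility file; here's a rough sketch, where the pool name and the compatibility file are stand-ins for your own:

# Features in the openzfs-2.1-linux compatibility set that 'tank' has not
# yet enabled (pool name and file name are examples).
comm -12 \
  <(zpool get all tank |
      awk '$2 ~ /^feature@/ && $3 == "disabled" {sub("feature@", "", $2); print $2}' |
      sort) \
  <(grep -v -e '^#' -e '^$' /usr/share/zfs/compatibility.d/openzfs-2.1-linux | sort)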
(As far as I can see from the code, upgrading a pool's features through 'zpool upgrade' does respect its 'compatibility' setting, as documented. The current 'zpool upgrade' code to list features that aren't enabled doesn't have any code to cross-check them against your 'compatibility' setting, although I think it would be simple to add.)
Pool features are exposed as 'feature@<name>' ZFS pool properties, so you can see a complete list of the features your version of ZFS supports and their state for any particular pool with 'zpool get all POOL' (this comes for free with all other pool properties, so if you want just the features you'll have to throw in a '| grep feature@'). This is the detailed state, so a feature can be 'disabled', 'enabled', or 'active'; however, whether or not the feature is read-only compatible isn't listed. You can check a specific feature's state with, for example, 'zpool get feature@block_cloning', which can be reassuring if there are reports that a particular feature might cause ZFS pool corruption, prompting a new OpenZFS release with the feature disabled in the kernel code.
(The OpenZFS 2.2.1 release prompted my sudden interest in this area, since I run the ZFS development versions, and caused me to realize that I had once again forgotten how to get a full list of pool features and their state. Maybe I'll remember 'zpool get all POOL' this time around.)
PS: ZFS pool features and pool upgrades are a different thing from ZFS filesystem (format) upgrades. Filesystem format upgrades are still version number based, and I believe the last one was done back when Sun was still a going concern.
Sidebar: Some code trivia
Although ZFS features are represented in the pool by name, the current OpenZFS code has a big numbered list of all of the features it knows about, in include/zfeature_common.h. These are the features that, for example, 'zpool upgrade' will tell you that your pool doesn't have enabled. At the moment it appears that there are 41 of them (cf).
According to comments in module/zfs/zfeature.c, enabling a feature shouldn't have any effect, unlike what happened to us with pool version upgrades back in the Solaris days. This should mean that upgrading a pool is a low-impact operation, since unless you have a very old pool all it's doing is enabling a number of features (many of which may not even become active any time soon, such as zstd compression).
2023-11-21
Modern proxy (IPv4) ARP and proxy IPv6 NDP on Linux
Suppose, not hypothetically, that you have a remote system (on the other side of some tunnel or other connection) that wants to pretend to be on the local network, for either or both of IPv4 and IPv6. To make this work smoothly, this remote system's gateway (on the local network) needs to answer ARP requests for this remote system's IPv4 address and/or NDP requests for the remote system's IPv6 address. This is called 'proxy ARP' or 'proxy NDP', because the gateway is acting as an ARP or NDP proxy for the remote system.
At this point my memories are vague, but I think that in the old days, configuring proxy ARP on Linux was somewhat challenging and obscure, requiring you to add various magic settings in various places. These days it has gotten much easier and more uniform, and there are at least two approaches, the by hand one and the systemd one, although it turns out I don't know how to make systemd work for the IPv4 proxy ARP case.
The by hand approach is with the ip neighbour (sub)command. This can be used to add IPv4 or IPv6 proxy announcements to some network, which is normally the network the remote machine is pretending to be on:
ip neigh add proxy 128.X.Y.Z dev em0
ip neigh add proxy 2606:fa00:.... dev em0
# apparently necessary
echo 1 >/proc/sys/net/ipv6/conf/em0/proxy_ndp
Here em0 is the interface that the 128.X.Y.0/24 and 2606:fa00:.../64 networks are on, where we want other machines to see 128.X.Y.Z (and its IPv6 version) as being on the network.
You can see these proxies (if any) with 'ip neigh show proxy'. To actually be useful, the system doing proxy ARP also generally needs to have IP forwarding turned on and to have appropriate routes or other ways to get packets to the IP it's proxying for.
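Putting that together, the extra pieces on the gateway look roughly like this; 'tun0' here is a stand-in for whatever interface actually reaches the remote system:

# Forwarding for both address families, plus routes toward the remote system.
sysctl -w net.ipv4.ip_forward=1
sysctl -w net.ipv6.conf.all.forwarding=1
ip route add 128.X.Y.Z/32 dev tun0
ip -6 route add 2606:fa00:..../128 dev tun0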
Although there is a /proc/sys/net/ipv4/conf/*/proxy_arp setting (cf), it appears to be unimportant in today's modern 'ip neighbour' based setup. One of my machines is happily doing proxy ARP with this at the default of '0' on all interfaces. IPv6 has a similar ipv6/conf/*/proxy_ndp, but unlike with IPv4, the setting here appears to matter and you have to turn it on on the relevant interface; it's on for the relevant interface on my IPv6 gateway and turning it off makes external pings stop working.
(It's possible that other settings are affecting my lack of need for proxy_arp in my IPv4 case.)
The systemd way is to set up a systemd-networkd .network file that has the relevant settings. You set this on the interface where you want the proxy ARP or NDP to be on, not on the tunnel interface to the remote machine (as I found out). For IPv6, you want to set IPv6ProxyNDP= and at least one IPv6ProxyNDPAddress=, although it's not strictly necessary to explicitly set IPv6ProxyNDP (I'd do it for clarity). I was going to write something about how to do this for IPv4, but I can't actually work out how to do the equivalent of 'ip neigh add proxy ...' in systemd .network files; all they appear to do is support turning on proxy ARP in general, and I'm not sure what this does these days.
(If it's like eg this old discussion, then it may cause Linux to do proxy ARP for anything that it has routes for. There's also this Debian Wiki page suggesting the same thing.)
I don't know if NetworkManager has much support for proxy ARP or proxy NDP, since both seem somewhat out of scope for it.
PS: The systemd-networkd approach for IPv6 proxy NDP definitely results in an appropriate entry in 'ip -6 neigh show proxy', so it's not just turning on some form of general proxy NDP and calling it a day. That's certainly what I'd expect given that you list one or more proxy NDP addresses, but I like to verify these things.
2023-11-16
Setting up an IPv6 gateway on an Ubuntu 22.04 server with WireGuard
Recently we enabled IPv6 on one of our networks here (for initial testing purposes), but not the network that my office workstation is on. Naturally I decided that I wanted my office workstation to have IPv6 anyway, by using WireGuard to tunnel IPv6 to it from an IPv6 enabled Ubuntu 22.04 server on that network. For my sins, I also decided to do this the more or less proper way, which is to say through systemd-networkd, instead of through hand-rolled scripts.
(The absolutely proper way would be through Canonical's netplan, but netplan doesn't currently support WireGuard or some of the other features that I need, so I have to use systemd-networkd directly.)
The idea of the configuration is straightforward. My office workstation has an IPv6-only WireGuard connection to the Ubuntu server, a static IPv6 address in the subnet's regular /64 that's on the WireGuard interface, and a default IPv6 route through the WireGuard interface. The server does proxy NDP for my office workstation's static IPv6 address and then forwards traffic back and forth as applicable.
On the server, we have three pieces of configuration. First, we need to configure the WireGuard interface itself, in a networkd .netdev file:
[NetDev]
Name=ipv6-wg0
Kind=wireguard

[WireGuard]
PrivateKey=[... no ...]
ListenPort=51821

[WireGuardPeer]
PublicKey=[... also no ...]
AllowedIPs=<workstation IPv6>/128,fe80::/64
Endpoint=<workstation IPv4>:51821
We have to allow fe80::/64 as well as the global IPv6 address because in the end I decided to give this interface some IPv6 link local IPs.
The second thing we need is a networkd .network file to configure the server's side of the WireGuard interface. This must both set our local parameters and configure a route to the global IPv6 address of my workstation:
[Match]
Name=ipv6-wg0

[Network]
# Or make up a random 64 bit address
Address=fe80::1/64
IPForward=yes
# Disable things we don't want
# Some of this may be unnecessary.
DHCP=no
IPv6AcceptRouterAdvertisements=no
LLMNR=false

[Route]
Destination=<workstation IPv6>/128

[Link]
# Not sure of this value, safety precaution
MTUBytes=1359
RequiredForOnline=no
(If I was doing this for multiple machines, I think I would need one [Route] section per machine.)
The one thing left to do is make the server do proxy NDP, which has to be set on the Ethernet interface, not the WireGuard interface. In Ubuntu 22.04, server Ethernet interfaces are managed through netplan, but netplan has no support for setting up proxy NDP, although networkd .network files support this. So we must go behind netplan's back. In Ubuntu 22.04, netplan on servers creates systemd-networkd control files in /run/systemd/network, and these files have standard names; for example, if your server's active network interface is 'eno1', netplan will write a '10-netplan-eno1.network' file. Armed with this we can create a networkd dropin file in /etc/systemd/network/10-netplan-eno1.network.d that sets up proxy NDP, which we can call, say, 'ipv6-wg0-proxyndp.conf':
[Network]
IPForward=yes
IPv6ProxyNDP=yes
IPv6ProxyNDPAddress=<workstation IPv6>
With all of this set up (and appropriate configuration on my office workstation), everything appears to work fine.
(On my office workstation, the WireGuard interface is configured with both the workstation's link local IPv6 address, with a peer address of the server's link-local address, and its global IPv6 address.)
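For completeness, here is a minimal sketch of what the workstation side's .network file might look like under these assumptions; the fe80::2 link-local address is made up and this isn't my exact configuration:

[Match]
Name=ipv6-wg0

[Network]
# Made-up link-local address; the server's side is fe80::1.
Address=fe80::2/64
Address=<workstation IPv6>/64

[Route]
# Default IPv6 route through the tunnel, via the server's link-local address.
Destination=::/0
Gateway=fe80::1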
All of this is pretty simple once I write it out here, but getting to this simple version took a surprising amount of experimentation and a number of attempts. Although it didn't help that I decided to switch to link local addresses after I'd already gotten a version without them working.
2023-11-08
Holding packages in Debian (and Ubuntu) has gotten easier over the years
In Debian (and thus Ubuntu), apt-get itself has no support for selectively upgrading packages, unlike DNF based distributions. In DNF, you can say 'dnf update package' or 'dnf update --exclude package' (with wildcards) to only update the package or to temporarily exclude package(s) from being updated. In apt-get, 'apt-get upgrade' upgrades everything. In order to selectively upgrade packages in modern apt-get, you can do 'apt-get install --only-upgrade package' (although I believe this marks the package as manually installed). In order to selectively exclude packages from upgrades, you need to hold them.
When we started using Ubuntu, holding and un-holding packages was an awkward process that involved piping things into 'dpkg --set-selections' and filtering the output of 'dpkg --get-selections'. Modern versions of Debian's apt suite have improved this drastically with the addition of the apt-mark command. Apt-mark provides straightforward sub-commands to hold and unhold packages and to list held packages: 'apt-mark hold package' (or a list of packages), 'apt-mark unhold package', and 'apt-mark showhold'. For extra convenience, the package names can include wildcards and apt-mark will do the right thing, or more or less the right thing depending on your tastes:
apt-mark hold amanda-*
Holding a package name with a wild card will hold everything that the wildcard matches, whether or not it's installed on your system. The wildcard above will match and hold the amanda-server package, which we don't have installed very many places, along with the amanda-common and amanda-client packages. This is what you want in some cases, but may be at least unaesthetic since you wind up holding packages you don't have installed.
If you want to only hold packages you actually have installed you need a dab of awk and probably you want to use 'dpkg --set-selections' directly. What we use is:
dpkg-query -W 'amanda-*' | awk 'NF == 2 {print $1, "hold"}' | dpkg --set-selections
(You can contrive a version that uses apt-mark but since apt-mark wants the packages to hold on the command line it feels like more work. Also, as an important safety tip, don't accidentally write this with 'dpkg' instead of 'dpkg-query' and then quietly overlook or throw away the resulting error message.)
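For the record, the contrived apt-mark version is something like this; the same caveat about dpkg-query versus dpkg applies:

# Hold only the matching packages that are actually installed.
apt-mark hold $(dpkg-query -W 'amanda-*' | awk 'NF == 2 {print $1}')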
Holding Debian packages is roughly equivalent to but generally better than DNF's version-lock plugin. It's explicitly specified as holding things regardless of version and will hold even uninstalled packages if you want that, which is potentially useful to stop things from getting dragged in. I have some things version-locked in DNF on my Fedora machines and I always feel a bit nervous about it; we feel no similar concerns on our Ubuntu machines, which routinely have various packages held.
If you normally have various sensitive packages held to stop surprise upgrades, the one thing to remember is that pretty much anything you do to manually upgrade them is going to require you to re-hold them again. If you want to use 'apt-get upgrade', you need to un-hold them explicitly; if you 'apt-get install' them to override the hold, the hold is removed. After one too many accidents, we wound up automating having some standard holds applied to things like kernels.
(Apt-mark can also be used to inspect and change the 'manually installed' status of packages, in case you want to fix this status for something you ran 'apt-get install' on to force an upgrade.)
2023-10-30
Finding which NFSv4 client owns a lock on a Linux NFS(v4) server
A while back I wrote an entry about finding which NFS client owns a lock on a Linux NFS server, which turned out to be specific to NFS v3 (which I really should have seen coming, since it involved NLM and lockd). Finding the NFS v4 client that owns a lock is, depending on your perspective, either simpler or more complex. The simpler bit is that I believe you can do it all in user space; the more complex is that as far as I've been able to dig, you have to.
Our first stop for NFS v4 locks is the NFS v4 information in /proc/locks. When you hold a (POSIX, NFS) lock, you will see an entry that looks like this:
46: POSIX ADVISORY READ 527122 00:36:3211286 0 EOF
This may be READ or WRITE, and it might have a byte range instead of being 0 to EOF. The '00:36:3211286' is the filesystem identifier (the '00:36' part, which is in hex) and then the inode number (in decimal, '3211286'). The other number, 527122, is the process ID of what is holding the lock. For a NFS v4 lock, this will always be some nfsd process, where /proc/<pid>/comm will be 'nfsd'. You'll have a number of nfsd processes (threads), and I don't know if it's always the same PID in /proc/locks.
(In addition, read locks can sometimes appear only as DELEG READ entries in /proc/locks, so they look exactly like simple client opens. It's possible to see multiple DELEG entries for the same file, if multiple NFS v4 clients have it open for reading and/or shared locking. If some NFS v4 client then attempts to get an exclusive lock to the file, the /proc/locks entry can change to a POSIX READ lock.)
To find the client (or clients) with the lock, our starting point is /proc/fs/nfsd/clients, which contains one subdirectory for each client. In these subdirectories, the file 'info' tells you what the client's IP is (and the name it gave the server), and 'states' tells you about what things the particular NFS client is accessing in various ways, including locking. Each entry in 'states' has a type, and this type can include 'lock', and in an ideal world all NFS v4 locks would show up as a states entry of this type. Life is not so nice for us, because the state entry for held locks can also be 'type: deleg', and not all 'type: deleg' entries represent held locks, even for a file that is locked.
A typical states entry for a NFS v4 client may look like this:
- 0x...: { type: lock, superblock: "00:36:3211286", filename: "locktest/fred", owner: "lock id:\x..." }
A 'type: lock' entry can appear for either a shared lock or an exclusive one. Alternately a states entry can look like this:
- 0x...: { type: deleg, access: r, superblock: "00:36:3211286", filename: "locktest/fred" }
It's also possible to see both a 'type: deleg' and a 'type: lock' states entries for a file that has been opened and locked only once from a single client.
In all cases, the important thing is the 'superblock:' field, because this is the same value that appears in /proc/locks.
So as far as I can currently tell, the procedure to find the probable owners of NFS v4 locks is that first you go through /proc/locks and accumulate all of the POSIX locks that are owned by a nfsd process, remembering especially their combined filesystem and inode identification. Then you go through the /proc/fs/nfsd/clients states files for all clients, looking for any matching superblock: values for 'type: lock' or 'type: deleg' entries. If you find a 'type: lock' entry, that client definitely has the file locked. If you find a 'type: deleg' entry, the client might have the file locked, especially if it's a shared POSIX READ lock instead of an exclusive WRITE lock; however, the client might merely have the file open.
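Here's a rough shell sketch of that forward procedure (my actual tool is going to be in Python); the field positions in /proc/locks and the 'address:' and 'name:' lines in each client's info file are how things look on the kernels I've poked at, so treat them as assumptions:

# For each POSIX lock held by an nfsd thread, list NFS v4 clients whose
# 'states' file mentions the same superblock:inode identifier.
awk '$2 == "POSIX" {print $5, $6}' /proc/locks | while read pid sbino; do
    [ "$(cat /proc/"$pid"/comm 2>/dev/null)" = "nfsd" ] || continue
    echo "lock on $sbino possibly held by:"
    grep -l "superblock: \"$sbino\"" /proc/fs/nfsd/clients/*/states 2>/dev/null |
        while read states; do
            grep -E '^(address|name):' "$(dirname "$states")/info" | sed 's/^/    /'
        done
done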
If you want to see what a given NFS v4 client (might) have locked, you can do this process backward. Read the client's /proc/fs/nfsd/clients states file, record all superblock: values for 'type: lock' or 'type: deleg' entries, and then see if they show up as POSIX locks in /proc/locks. This won't necessarily get all shared locks (which may show up as merely delegations in both the client's states file and in /proc/locks).
(Presumably the information necessary to locate the locking client or clients with more certainty is somewhere in the kernel data structures. However, so far I've been unable to figure it out in the way that I was able to pull out the NFS v3 lock owner information.)
PS: I'm going to be writing a Python tool for our use based on this digging, so I may get to report back later with corrections to this entry. For our purposes we care more about exclusive locks than shared locks, which makes this somewhat easier.
Sidebar: /proc/locks filesystem identifiers
The '00:36' subfield in /proc/locks that identifies the filesystem is the major and minor device numbers of the stat(2) st_dev field for files, directories, and so on on the filesystem. To determine these without stat()'ing something on every filesystem, you can look at the third field of every line in /proc/self/mountinfo, with the proviso that /proc/self/mountinfo's field values are in decimal and /proc/locks has them in hex.
(Unfortunately stat(1) doesn't provide the major and minor numbers separately, and its unified reporting doesn't match /proc/locks if the 'minor' number gets large enough.)
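As a small worked example, turning the '00:36' from /proc/locks into a mount point goes something like this:

# /proc/locks has the device id in hex; mountinfo's third field is
# 'major:minor' in decimal.
id="$((0x00)):$((0x36))"
awk -v id="$id" '$3 == id {print $5}' /proc/self/mountinfo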
2023-10-12
Getting the active network interface(s) in a script on Ubuntu 22.04
Suppose, not entirely hypothetically, that we want to start using systemd-resolved on our Ubuntu 22.04 machines. One of the challenges of this is that the whole networking environment is configured through netplan, and in order for systemd-resolved to work well this means that your netplan configuration must have your full list of DNS resolvers and DNS search domains. We don't normally set these in netplan, because it's kind of a pain; instead we copy in an /etc/resolv.conf afterward.
It is possible to make automated changes to your netplan setup through netplan set. However, this needs to know the name of your specific Ethernet device, which varies from system to system in these modern days. This opens up the question of how do you get this name, and how do you get the right name on multi-homed machines (you want the Ethernet device that already has a 'nameservers:' line).
Netplan has netplan get but by itself it's singularly unhelpful. There are probably clever ways to get a list of fully qualified YAML keys, so you could grep for 'ethernets.<name>.nameservers' and fish out the necessary name there. Since netplan in our Ubuntu 22.04 server setup is relying on systemd-networkd, we could ask it for information through networkctl, but there's no straightforward way to get the necessary information.
(Networkctl does have a JSON output for 'networkctl list', but it's both too much and too little information. The 'networkctl status' output is sort of what you want but it's clearly intended for human consumption, not scripts.)
In practice our best bet is probably to look at where the default route points, which we can find with 'ip route show default':
; ip route show default
default via 128.100.X.Y dev enp68s0f0 proto static
Alternately, we could ask for the route to one of our resolvers, especially if they're all on the same network:
; ip route get 128.100.X.M
128.100.X.M dev enp68s0f0 src 128.100.X.Q uid ...
    cache
In both cases we can pluck the 'dev <what>' out with something (for example awk, or 'egrep -o' if you feel conservative). This will give us the device name and we can then 'netplan set ethernets.<name>...' as appropriate.
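For example, with awk, which doesn't care where in the line the 'dev' pair falls:

ip route show default | awk '{for (i = 1; i < NF; i++) if ($i == "dev") print $(i+1)}'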
If you have JSON-processing tools handy, modern versions of ip support JSON output via '-json'. This reduces things to:
; ip -json route show default | jq -r .[0].dev
enp68s0f0
; ip -json route get 128.100.X.M | jq -r .[0].dev
enp68s0f0
These days, I think it's increasingly safe to assume you have jq or some equivalent installed, and this illustrates why.
In the world of systemd-resolved, we probably want Netplan's 'nameservers:' section attached to the Ethernet interface that we use to talk to the DNS resolvers even if our default route goes elsewhere. Fortunately in our environment it generally doesn't matter because our Ubuntu servers almost never have more than one active network interface.
(The physical servers generally come with at least two, but most machines only use one.)
If we want all interfaces, we can reach for either 'ip -br addr' or 'ip -br link', although in both cases we'll need to screen out DOWN links and 'lo', the loopback interface. If we know that all interesting interfaces have an IPv4 (or IPv6) address, we can use this to automatically exclude down interfaces:
; ip -4 -br addr
lo               UNKNOWN        127.0.0.1/8
enp68s0f0        UP             128.100.X.Q/24
(For IPv6, use -6.)
On some machines this may include a 'virbr1' interface that exists due to (local) virtual machines.
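If you don't want to rely on interfaces having addresses, a rough way to screen 'ip -br link' output is the following, although it will still include things like virbr1, which you may have to filter out by name:

# Interface names that are up, minus the loopback.
ip -br link | awk '$1 != "lo" && $2 == "UP" {print $1}'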
(In some environments the answer is 'your servers all get this information through DHCP'. In our environment all servers have static IPs and static network configurations, partly because that way they don't need a DHCP server to boot and get on the network.)
Sidebar: the weird option of looking at the networkd configuration
Netplan writes its systemd-networkd configuration to /run/systemd/network in files that in Ubuntu 22.04 are called '10-netplan-<device>.network'. Generally, even on a multi-interface machine exactly one of those files will have a 'Gateway=' line and some 'DNS=' and 'Domains=' lines. This file's name has the network device you want to 'netplan set'.
Actually relying on this file naming pattern is probably a bad idea. On the other hand, you could find this file and extract the interface name from it (it appears as 'Name=' in the '[Match]' section, due to how Netplan sets up basic fixed networking).
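If you did want to go this route, a sketch of extracting the interface name would be something like this, relying on the file layout netplan currently generates (which is an assumption that could break on newer releases):

# Find the netplan-generated .network file with the default gateway and
# pull the interface name out of its [Match] section.
grep -l '^Gateway=' /run/systemd/network/10-netplan-*.network |
    head -1 | xargs grep -m1 '^Name=' | sed 's/^Name=//'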
2023-10-01
Brief notes on doing TOTP MFA with oathtool
Time-Based One-time Passwords (TOTP) are one of the most common ways of doing multi-factor authentication today and are, roughly speaking, the only one you can use if the machine you're authenticating on is a Linux machine. Especially, I believe they're the only one you can use if you want a command-line way of generating your MFA authentication codes. While there are a number of programs to generate TOTP codes, perhaps the most widely available one is oathtool, part of OATH Toolkit.
There are a variety of tutorials on using oathtool to generate TOTP codes on the Internet, but the ones I read generally slid into gpg, and gpg is about where I nope out in any instructions. So here is the simple version:
oathtool -b --totp @private/asite/totp-seed
(If you want more familiar syntax, oathtool accepts '-' to mean to read from standard input, so you can redirect into it or use cat.)
Most websites give you the text form of their TOTP seed in base32, so we need to tell oathtool that. The totp-seed file should be unreadable by anyone but you, of course.
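Setting that seed file up might look like this; the path is the one from the example above and the seed value is obviously a placeholder:

# Create the seed file so that only you can read it.
umask 077
mkdir -p private/asite
printf '%s\n' 'YOURBASE32SEED' >private/asite/totp-seed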
If we want somewhat more security we can encrypt the TOTP seed at rest and pipe it to oathtool:
magic-decrypt private/asite/totp-seed | oathtool -b --totp -
The 'magic-decrypt' bit is where common instructions drag in gpg and I tune out. If I had to do this today, I would use age, which can encrypt (and decrypt) using a symmetric key with no fuss or muss.
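With age, the whole thing might look like this, using passphrase-based encryption:

# Encrypt the seed at rest, then decrypt it on demand to generate a code.
age -p -o private/asite/totp-seed.age private/asite/totp-seed
age -d private/asite/totp-seed.age | oathtool -b --totp -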
Some TOTP clients have a 'follow' mode where they will print out a new TOTP code when the clock advances enough to require it. I don't think oathtool can do this, but it can print out extra TOTP codes after the current one (with '--window').
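You can fake a 'follow' mode with a small loop that wakes up at the start of each 30-second TOTP period:

# Print a fresh code at each 30-second boundary.
while :; do
    oathtool -b --totp @private/asite/totp-seed
    sleep $((30 - $(date +%s) % 30))
done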
And as a little side note, the oathtool in Ubuntu 20.04 appears to be non-functional for generating TOTP codes from base32 input, for at least the one website I tried. The version on Ubuntu 22.04 works. I don't know if this is a bug or some feature that the 20.04 oathtool doesn't have.
PS: Possibly there is a better command line tool for this that's packaged in Debian and Ubuntu, but oathtool is what I found in casual Internet searches. There are definitely other command line tools, eg totp-cli and totp.