Wandering Thoughts

2023-05-18

What a desktop environment is on modern Linux

Recently I read KDE Plasma is NOT a Desktop Environment (via), which maintains that it's more like an environment construction kit, out of which one could build multiple environments. I have some reactions to this, and also I have some opinions on what a desktop environment even is on a modern Linux system (opinions which may count as a bit heretical).

The classical Unix vision of a desktop environment is that it's basically a window manager and a suite of graphical applications built around a common look and feel, usually using a common GUI library/toolkit. These GUI applications will usually include a file manager and often include various other productivity applications. Although you sort of have this in GNOME and KDE, this is not really what a desktop environment needs to do today on Linux.

On modern Linux, a usable graphical experience has a lot of moving parts, many of which the person using it expects to manage through a GUI. It needs things like an audio system; a system to handle removable media; widgets to log out, lock the screen, and reboot the system; integration with network management; a central preferences management system that applies to all of 'its' applications and really wants to ripple through to applications using other toolkits; and the ability to handle things like additional screens showing up or people wanting to change the screen resolution (which you need to auto-detect). As it happens, there are relatively well defined systems to handle many of these jobs (and more), and often relatively well defined means of talking to them through D-Bus.

(For instance, the modern Linux audio experience is mostly based on PipeWire, at least at the moment.)
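
As a small illustration of that D-Bus side of things, here's a sketch (not taken from any particular desktop environment, and assuming systemd's busctl and NetworkManager are both present) of asking the network management daemon for its overall state, which is the sort of query a desktop's network widget is making behind the scenes:

import subprocess

# Ask NetworkManager for its overall connectivity state over the
# system D-Bus; a reply of 70 is NM_STATE_CONNECTED_GLOBAL, ie
# 'fully online'.
out = subprocess.run(
    ["busctl", "--system", "get-property",
     "org.freedesktop.NetworkManager", "/org/freedesktop/NetworkManager",
     "org.freedesktop.NetworkManager", "State"],
    capture_output=True, text=True, check=True)
print(out.stdout.strip())   # eg 'u 70'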

A modern desktop environment is something that supplies and integrates all of those pieces and moving parts to provide an experience where everything 'just works', where audio plays when you want it to and you have an on-screen volume slider, where you can click on a widget to control your VPN (or get the ability to configure a new one), and so on. It probably comes with some applications of its own to, for example, handle its preferences system and things like window management keyboard shortcuts, but many applications that would previously have been considered part of the desktop environment are outsourced now. Almost everyone is going to use LibreOffice and either Firefox or Chrome, for example, and there is broadly no need to reimplement things like a terminal emulator (although a desktop can if it wants to).

You can of course build such a desktop environment yourself, with sufficient work. There are window managers, taskbars, status bars, applets, launchers, things to parse .desktop files to create nice launcher menus, and so on and so forth, and you can assemble them into a working configuration. But there is an exhaustingly large amount of work (and it keeps churning), so at a certain point most people give up doing it themselves, as I did when I started using Cinnamon on my laptop. A modern Linux desktop environment is a system integrator; it collects all of the pieces and connects them up so that you don't have to learn how to do it yourself (and then find or write programs that do the work).

For historical reasons, the two largest such integrators (GNOME and KDE) come with their own GUI look and feel, implemented by their own toolkits, and a variety of core and third party applications that use their toolkits and thus their look and feel. But this is not essential. Cinnamon reuses a lot of GNOME pieces, and while XFCE has its own toolkit and a relatively modest set of applications, I don't think its toolkit is widely used by third party programs. But XFCE is still a full scale modern desktop environment, because it does all of that hard integration work for you, and you can just use it.

(As far as I know no one has attempted to write down in one place (or maintain a set of links to) everything that you need to support, connect together, run as part of your session, send D-Bus messages to, listen to D-Bus messages from, and so on. Even if someone managed that heroic feat, keeping it up to date would be an ongoing job, never mind trying to suggest programs and configurations to implement it all.)

ModernDesktopEnvironments written at 22:44:18

2023-05-15

The time our Linux systems spend on integer to text and back conversions

Over on the Fediverse, I said something recently:

I sometimes think about all the CPU cycles that are used on Linux machines to have the kernel convert integers to text for /proc and /sys files and then your metrics system convert the text back to integers. (And then sometimes convert the integers back to text when it sends them to the metrics server, which is at least a different machine using CPU cycles to turn text back into integers (or floats).)

It's accidents of history all the way down.

We run the Prometheus host agent on all of our Linux machines. Every fifteen seconds our Prometheus server pulls metrics from all the host agents, which causes the host agent to read a bunch of /proc files (for things like memory and CPU state information) and /sys files (for things like hwmon information). These status files are text, but they contain a lot of numbers, which means that the kernel converted those integers into text for us. The host agent then converts that text back into numbers internally (I believe a mixture of 64-bit integers and 64-bit floats), only to turn around and send them to the Prometheus server as text again (see Exposition Formats, also). On the Prometheus server these text numbers will be turned back into floats. All of this takes CPU cycles, although perhaps not many CPU cycles on modern machines.
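
To make the round trip concrete, here's a minimal sketch of the middle leg (roughly what a host agent does when parsing /proc/meminfo; the metric name at the end is just illustrative of the text-based exposition format):

# The kernel formatted this number as text for /proc/meminfo
# ('MemFree:  1234567 kB'), and we immediately turn it back into
# an integer...
with open("/proc/meminfo") as f:
    for line in f:
        if line.startswith("MemFree:"):
            memfree_kb = int(line.split()[1])
            break

# ...only to format it as text again to send to the metrics server.
print("node_memory_MemFree_bytes %d" % (memfree_kb * 1024))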

(The host agent gets some information from the Linux kernel through methods like netlink, which I believe transfers numbers in non-text form.)

All of the steps of this dance are rational ones. Things in /proc and /sys use text instead of some binary encoding because text is a universal solvent on Unix systems, and that way no one had to define a binary file format (or worse, try to get agreement on a general binary kernel-to-userspace API for system stats). Text formats are usually easily augmented, upgraded, inspected, and so on, and they are easy to provide (the kernel actually has a lot of infrastructure for easily providing text in /proc files; we saw some of it in action recently).

(These factors are especially visible in the case of some of the statistics that OpenZFS on Linux exposes. ZFS comes from Solaris, which has a native binary 'kstat' system. ZoL exposes all of these kstats in /proc/spl/kstat/zfs as text, rather than try to get Linux people to somehow get them as binary kstats. Other ZFS IO statistics are exposed in an entirely different and more binary form.)

Changing the situation would require a lot of work by a lot of people spread across a lot of projects, so it's unlikely to be done. If it is ever done, it will probably be done piecemeal, maybe through more and more kernel subsystems exposing information through netlink as well as /proc (perhaps exposing new metrics only through netlink, with their /proc information frozen). But even netlink is probably more work for kernel developers than putting things in /proc, so I suspect that a lot of things will keep being in /proc.

(In addition, lots of things in /proc aren't just pairs of names and numbers, although that's the common case. Consider /proc/locks.)

KernelIntegersToTextThought written at 22:19:01

2023-05-08

When to use drgn instead of eBPF tools like bpftrace, and vice versa

I talked recently about drgn and using it to poke around in the kernel, and yesterday I followed that up with an example of finding out which NFS client owns a file lock that used bpftrace (and also I discussed using drgn for this). As an outsider, you might reasonably wonder when you'd use one and when you'd use the other on the kernel. I won't claim that I have a complete answer, but here's what I know so far.

(Both bpftrace and drgn can do things with user programs too, but I haven't tried either for this.)

The simple version is that bpftrace is for doing things when events happen in the kernel and drgn is for pulling information out of kernel variables and data structures. Bpftrace has a crossover ability to pull some information out of some data structures (that's part of what makes it so useful), but often it's much more limited than drgn.

Bpftrace will let you 'trace' kernel events, including events like function calls, and do various things when they happen, such as extracting information from arguments to the events (including function arguments, as we saw with the NFS locks example). However, bpftrace has only limited support for pretty-printing things, limited access to kernel global variables (today it appears unable to access many module globals), and can't do much with kernel data structures like linked lists or per-cpu variables. Bpftrace will work out of the box on almost any modern Linux kernel in its stock setup; at most you'll need the kernel headers.

One painful example of a bpftrace limitation: many interesting kernel data structures contain a 'struct path' that can be used to give you the full path to the object involved, such as a file that's locked, a file being accessed over NFS, or a NFS mount point. Bpftrace generally has very limited ability to traverse these path data structures to turn them into the actual path, while drgn has a simple helper for it.

(One reason for this limitation is that the kernel won't allow eBPF bytecode to have unpredictable, potentially unbounded runtime.)
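
To make the contrast concrete, here's roughly what the drgn side looks like (a sketch; 'fl' is assumed to be a 'struct file_lock *' Object that you obtained some other way):

from drgn.helpers.linux.fs import d_path

# Turn the lock's file's 'struct path' into an actual filesystem
# path; bpftrace has no real equivalent of this helper.
print(d_path(fl.fl_file.f_path).decode())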

So, for a non-hypothetical example, if you want to get a top-like view of NFS server activity broken down by user or client, you need bpftrace (see the very impressive nfsdtop), even though some aspects are rather awkward, because you need to 'trace' NFS requests.

Drgn is great for pretty-printing kernel data structures and extracting relatively arbitrary information from them, both for interactive exploration and to be automated in programs. However, the data you're interested in mostly needs to be reachable from some kernel global variable, and figuring out how to get from some global variable to the data you want can be an adventure. In addition, drgn requires per-kernel setup on any machine you want to use it on, because it requires kernel debugging information that most distributions don't install by default.

If both bpftrace and drgn can reach the kernel data structures you're interested in, drgn in interactive mode is generally going to be much more convenient for exploring them. It has much better pretty-printing support, it will readily tell you about all of the types involved, and its interactive mode is much faster than repeatedly modifying and re-starting bpftrace programs to print a few more things.

However, if you want to inspect short-lived objects, for example ones that are only passed around as function arguments and are deallocated when the operation is over, you need bpftrace. A short lived, dynamically allocated object is beyond drgn's feasible reach. As an example, if you want to snoop into the data structures that NFS servers use to represent requests from NFS clients while the requests are being processed, you're going to need bpftrace.

(If you have a hybrid situation where there is a long lived data structure that isn't reachable from global variables, I suppose you could get bpftrace to print its address as exposed during a function call, then immediately turn to drgn to start dumping memory.)
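
As a sketch of that hybrid approach (the address here is a made-up placeholder standing in for whatever bpftrace printed for you):

from drgn import Object

# Materialize the bpftrace-reported address as a typed drgn object
# and let drgn pretty-print the whole structure.
nf = Object(prog, "struct nlm_file *", value=0xffff888012345678)
print(nf)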

DrgnVersusEBPFTools written at 23:14:43

2023-05-06

Finding which NFS client owns a lock on a NFS server via Linux kernel delving

Suppose that you have some Linux NFS servers, which have some NFS locks, and you'd like to know which NFS client owns which lock. Since the NFS server can drop a client's locks when it reboots, this information is in the kernel data structures, but it's not exposed through public interfaces like /proc/locks. As I mentioned yesterday while talking about drgn, I've worked out how to do this, so in case someone's looking for this information, here are the details. This is as of Ubuntu 22.04, but I believe this code is relatively stable (although where things are in the header files has changed since 22.04's kernel).

In the rest of this I'll be making lots of references to kernel data structures implemented as C structs in include/linux/fs.h, include/linux/lockd/lockd.h, and include/linux/filelock.h. To start with, I'll introduce our cast of characters, which is to say various sorts of kernel structures.

  • 'struct nlm_host' represents a NFS client (on an NFS server), or more generally a NLM peer. It contains the identifying information we want in various fields, and so our ultimate goal is to associate (NFS) file locks with nlm_hosts. I believe that a given nlm_host can be connected to multiple locks, since a NFS client can have many locks on the server.
  • 'struct nlm_lockowner' seems to represent the 'owner' of a lock. It's only interesting to us because it contains a reference to the nlm_host associated with the lock, in '.host'.

  • 'struct lock_manager_operations' is a set of function pointers for lock manager operations. There is a specific instance of this, 'nlmsvc_lock_operations', which is used for all lockd/NLM locks.

  • 'struct file_lock' represents a generic "file lock", POSIX or otherwise. It contains a '.fl_lmops' field that points to a lock_manager_operations, a '.fl_pid' field of the nominal PID that owns the lock, a '.fl_file' that points to the 'struct file' that this lock is for, and a special '.fl_owner' field that holds a 'void *' pointer to lock manager specific data. For lockd/NLM locks, this is a pointer to the associated 'struct nlm_lockowner' for the lock, from which we can get the nlm_host and the information we want.

    All lockd/NLM locks will have a '.fl_lmops' field that points to 'nlmsvc_lock_operations' and a '.fl_pid' that has lockd's PID.

    (The POSIX versus flock versus whatever type of a lock is not in '.fl_type' but is instead encoded as set bits in '.fl_flags'. Conveniently, all NFS client locks are POSIX locks so we don't have to care about this.)

  • 'struct inode' represents a generic, in-kernel inode. It contains an '.i_sb' pointer to its 'superblock' (really its mount), its '.i_ino' inode number, and '.i_flctx', which is a pointer to 'struct file_lock_context', which holds context for all of the locks associated with this inode; '.i_flctx->flc_posix' is the list of POSIX locks associated with this inode (there's also eg '.flc_flock' for flock locks).
  • 'struct file' represents an open file in the kernel, including files 'opened' by lockd/NLM in order to get locks on them for NFS clients. It contains a '.f_inode' that points to the file's associated 'struct inode', among other fields. If you want filename information about a struct file, you also want to look at '.f_path', which points to the file's 'struct path'; see include/linux/path.h and drgn's 'd_path()' helper.

  • 'struct nlm_file' is the lockd/NLM representation of a file held open by lockd/NLM in order to get a lock on it, and for obvious reasons has a pointer to the corresponding 'struct file'. For reasons I don't understand, this is actually stored in a two-element array, '.f_file[2]'; which element is used depends on whether the file was 'opened' for reading or writing.

There are two paths into determining what NFS client holds what (NFS) lock, the simple and the more involved. In the simple path, we can start by traversing all generic kernel locks somehow, which is to say we start with 'struct file_lock'. For each one, we check that '.fl_lmops' is 'nlmsvc_lock_operations' or that '.fl_pid' is lockd's PID, then cast '.fl_owner' to a 'struct nlm_lockowner *', dereference it and use its '.host' to reach the 'struct nlm_host'.

One way to do this is to use bpftrace to hook into 'lock_get_status()' in fs/locks.c, which is called repeatedly to print each line of /proc/locks and is passed a 'struct file_lock *' as its second argument (this also conveniently iterates all current file locks for you). We also have the struct file and thus the struct inode, which will give us identifying information about the file (the major and minor device numbers and its inode, which is the same information in /proc/locks). The 'struct nlm_host' has several fields of interest, including what seems to be the pre-formatted IP address in .h_addrbuf and the client's name for itself in .h_name.

So here's some bpftrace (not fully tested and you'll need to provide the lockd PID yourself, and also maybe include some header files):

kprobe:lock_get_status
/((struct file_lock *)arg1)->fl_pid == <your lockd PID>/
{
   $fl = (struct file_lock *)arg1;
   $nlo = (struct nlm_lockowner *)$fl->fl_owner;
   $ino = $fl->fl_file->f_inode;
   $dev = $ino->i_sb->s_dev;
   printf("%d: %02x:%02x:%ld inode %ld owned by %s ('%s')\n",
          (int64)arg2,
          $dev >> 20, $dev & 0xfffff, $ino->i_ino,
          $ino->i_ino,
          str($nlo->host->h_addrbuf),
          str($nlo->host->h_name));
}

(Now that I look at this a second time, you also want to look at the fifth argument, arg4 (an int32), because if it's non-zero I believe this is a pending lock, not a granted one. You may want to either skip them or print them differently.)

This will print the same indexes and (I believe) the same major:minor:inode information as /proc/locks, but add the NFS client information. To trigger it you must read /proc/locks, either directly or by using lslocks.

Another way is to use drgn to go through the global list of file locks, which is a per-cpu kernel hlist under the general name 'file_lock_list'. In interactive drgn, it appears that you traverse these lists as follows:

for i in for_each_present_cpu(prog):
  fll_cpu = per_cpu(prog['file_lock_list'], i)
  for flock in hlist_for_each_entry('struct file_lock', fll_cpu.hlist, 'fl_link'):
    [do whatever you want with flock]

I'm not quite sure if you want present CPUs, online CPUs, or possible CPUs. Probably you don't have locks for CPUs that aren't online.
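
Putting the pieces together, an untested sketch of the whole check in drgn (using the struct fields described above) might look like this:

from drgn import cast
from drgn.helpers.linux.cpumask import for_each_present_cpu
from drgn.helpers.linux.list import hlist_for_each_entry
from drgn.helpers.linux.percpu import per_cpu

nlm_ops = prog['nlmsvc_lock_operations'].address_of_()
for i in for_each_present_cpu(prog):
    fll_cpu = per_cpu(prog['file_lock_list'], i)
    for flock in hlist_for_each_entry('struct file_lock',
                                      fll_cpu.hlist.address_of_(), 'fl_link'):
        if flock.fl_lmops != nlm_ops:
            continue        # not a lockd/NLM lock
        nlo = cast('struct nlm_lockowner *', flock.fl_owner)
        ino = flock.fl_file.f_inode
        print(ino.i_ino.value_(),
              nlo.host.h_addrbuf.string_().decode(),
              nlo.host.h_name.string_().decode())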

The second path in is that the NFS NLM code maintains a global data structure of all 'struct nlm_file' objects, in 'nlm_files', which is an array of hlists, per fs/lockd/svcsubs.c. Starting with these 'nlm_file' structs, we can reach the generic file structs, then each file's inode, then the inode's lock context, and finally the POSIX locks in that lock context (since we know that all NFS locks are POSIX locks). This gives us a series of 'file_lock' structs, which puts us at the starting point above.

(The lock context '.flc_posix' is a plain list, not a hlist, and they're chained together with the '.fl_list' field in file_lock. Probably most inodes with NFS locks will have only a single POSIX lock on them.)

So we have more or less:

walk nlm_files to get a series of struct nlm_file → get one .f_file
.f_inode.i_flctx → walk .flc_posix to get a series of struct file_lock (probably you usually get only one)
→ check that .fl_lmops is nlmsvc_lock_operations to know you have an NFS lock, and then follow .fl_owner casting it as a struct nlm_lockowner *
→ .host → { .h_addrbuf, .h_name, and anything else you want from struct nlm_host }

If this doesn't make sense, sorry. I don't know a better way to represent data structure traversal in something like plain text.
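
With that disclaimer, here's an untested drgn sketch of the same traversal (field names are taken from the structures above; remember the caveat below about possibly visiting an inode more than once):

from drgn import cast
from drgn.helpers.linux.list import hlist_for_each_entry, list_for_each_entry

nlm_ops = prog['nlmsvc_lock_operations'].address_of_()
for head in prog['nlm_files']:
    for nf in hlist_for_each_entry('struct nlm_file', head.address_of_(), 'f_list'):
        for f in nf.f_file:         # the two-element .f_file[2] array
            if not f:
                continue            # this 'open' slot is unused
            flctx = f.f_inode.i_flctx
            if not flctx:
                continue
            for fl in list_for_each_entry('struct file_lock',
                                          flctx.flc_posix.address_of_(),
                                          'fl_list'):
                if fl.fl_lmops != nlm_ops:
                    continue        # not an NFS/NLM lock
                nlo = cast('struct nlm_lockowner *', fl.fl_owner)
                print(f.f_inode.i_ino.value_(),
                      nlo.host.h_name.string_().decode())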

(Also, having written this I've realized that you might need to make sure you visit each given inode only once. In theory multiple generic file objects can all point to the same inode, and so repeatedly visit its list of locks. I'm not sure this can happen with NFS locks; the lockd/NLM system may reuse nlm_file entries across multiple clients getting shared locks on the same file.)

Since starting from nlm_files requires several walks of list-like structures that will generate multiple entries and starting from a struct file_lock doesn't, you can see why I called the latter the simpler case. Now that I've found the 'file_lock_list' global and learned how to traverse it in drgn in the course of writing this entry, I don't think I'll use the 'nlm_files' approach in the future; it's strictly a historical curiosity of how I did it the first time around. And starting from the global file lock list guarantees you're reporting on each file lock only once.

(I was hoping to be able to spot a more direct path through the fs/lockd code, but the path I outlined above really seems to be how lockd does it. See, for example, 'nlm_traverse_locks()' in fs/lockd/svcsubs.c, which starts with a 'struct nlm_file *' and does the process I outlined above.)

NFSServerLockClients written at 22:29:58

2023-05-05

Some early praise for using drgn for poking into Linux kernel internals

I've been keeping my eyes on drgn (repository, 2019 LWN article) for some time, because it held promise for being a better way to poke around your Linux kernel than the venerable crash(8) program (which I've actually used in anger, and it was a lot of work). Today, for the first time, I got around to using drgn and the experience was broadly positive.

I used drgn on an Ubuntu 22.04 test NFS server, by creating a Python 3 venv, installing drgn into the venv, and then running it from there (after installing the necessary kernel debugging information from Ubuntu); this worked fine and 'drgn' gave me a nice interactive Python environment where with minimal knowledge of drgn itself I could poke around the kernel. Specifically, I could poke into the various data structures maintained by the kernel NFS NLM system, with the goal of being able to see which NFS client owned each NFS lock on the server (or in this case, a lock, since it was a test server and I established only a single lock to it for simplicity).

Drgn in interactive mode works quite well for this sort of exploration for a number of reasons. To start with it does a remarkably good job of pretty-printing structures (and arrays) with type and content information of all of the fields. Simply being able to see the contents of various things (and type information for pointers) led me to make some useful discoveries. However, sometimes you'll be confronted with things like this:

>>> prog['nlm_files']
(struct hlist_head [128]){
[...]
  {
    .first = (struct hlist_node *)0xffff8974099ae600,
  },

This is a message from drgn to you that you're going to be reading some kernel source code and kernel headers in order to figure out your next step. The good news is that drgn supports all of the kernel's normal ways of traversing these sorts of data structures, in a way that's very similar to the kernel's own code for it, to the point where an outsider like me can translate back and forth. For instance, if you have kernel code that looks like:

hlist_for_each_entry_safe(file, next, &nlm_files[i], f_list) {

Then the drgn equivalent you want is (hard-coding the index by experimentation, because this is exploration):

>>> r = list( hlist_for_each_entry('struct nlm_file', prog['nlm_files'][6].address_of_(), 'f_list') )
>>> r
[Object(prog, 'struct nlm_file *', value=0xffff8974099ae600)]

(We use list() for the usual Python reason that drgn's helper function returns a Python generator, and we want to poke at the actual results in a simple way. Also, technically these are in drgn.helpers.linux, which you may want to import explicitly so that you can read its help text. Or see the user guide and the section on helpers.)

You'll also need to read kernel source code and kernel headers in order to dig your way through the kernel data structures to what you want. Drgn won't (and can't) tell you how NLM data structures are linked together and how you can go from, for example, the global 'nlm_files' to the 'struct nlm_host' that tells you the NFS client that got a particular lock. The path can be quite convoluted (cf).

The good news is that if the kernel can do it, drgn probably can do it too, although it may take you quite a bit of digging and persistence to get there. The further good news is that if you can do it in drgn's interactive mode, even painfully and with many mis-steps, you can probably turn your worked out process into Python code that uses drgn. Although I (temporarily) turned to other tools for now, being able to explore and test ideas with drgn was essential to getting there. Now that I've used drgn for this, I'll likely be turning to it for similar explorations and information extraction in the future.

In addition to needing to know Python and be able to read kernel code and headers, drgn's other drawback is that you need kernel debugging information, and on most Linuxes these days that's not installed by default. Installing it may be a bit annoying and it's generally rather big; drgn's documentation has a guide. This means that drgn doesn't work out of the box the way tools like bpftrace do.

(It would be great if drgn could use the kernel's BPF Type Format (BTF) information, which bpftrace and other eBPF tools already use, but apparently there are various obstacles. I believe that drgn is tracking this in DWARFless Debugging #176.)

DrgnKernelPokingPraise written at 23:31:20

2023-05-04

Flock() and fcntl() file locks and Linux NFS (v3)

Unix broadly and Linux specifically have long had three functions that can do file locks: flock(), fcntl(), and lockf(). The latter two are collectively known as 'POSIX' file locks because they appear in the POSIX specification (and on Linux lockf() is just a layer over fcntl()), while flock() is a separate thing with somewhat different semantics (cf), as it originated in BSD Unix. In /proc/locks, flock() locks are type 'FLOCK' and fcntl()/lockf() locks are type 'POSIX', and you can see both on a local system.

(In one of those amusing things, in Ubuntu 22.04 crond takes a flock() lock on /run/crond.pid while atd takes a POSIX lock on /run/atd.pid.)

Because they're different types of locks, you can normally obtain both an exclusive flock() lock and an exclusive fcntl() POSIX lock on the same file. As a result of this, some programs adopted the habit of normally obtaining both sorts of locks, just to cover their bases for interacting with other unknown programs who might lock the file.

In the beginning on Linux (before 2005), flock() locks didn't work at all over NFS (on Linux); they were strictly local to the current machine, so two programs on two different machines could obtain 'exclusive' flock locks on the same file. Then 2.6.12's NFS client code was modified to accept flock() locks and silently change them into POSIX locks (that did work over NFS, in NFS v3 through the NLM protocol). This caused heartburn for programs and setups that were obtaining both sorts of (exclusive) locks on the same file, because obviously two POSIX locks conflict with each other and your NFS server will not let you have conflicting locks like that. This change is effectively invisible to the NFS client's kernel, so flock() locks on a NFS mounted filesystem will show up in the client's /proc/locks (and lslocks) as type 'FLOCK'. However, on your NFS server all locks from NFS clients are listed as type 'POSIX' in /proc/locks (and these days they're all 'owned' by lockd), because that is what they are.

(One reason for this is that the NFS v3 NLM protocol doesn't have an idea of different types of locks, apart from exclusive or non-exclusive.)

Unfortunately, this change creates another surprising situation, which is that the NFS server and a NFS client can both obtain an exclusive flock() lock on the same file. Two NFS clients trying to exclusively flock() the same file will conflict with each other and only one will succeed, but the NFS server and an NFS client won't, and both will 'win' the lock (and everyone loses). This is the inevitable but surprising consequence of client side flock() locks being changed to POSIX locks on the NFS server, and POSIX locks not conflicting with flock() locks. From the NFS server's perspective, it's not two flock() exclusive locks on a file; it's one exclusive POSIX lock (from a NFS client) and one exclusive local flock() lock, and that's nominally fine.

In my opinion, this makes using flock() locking dangerous in general, which is unfortunate since the flock command uses flock() and it's pretty much your best bet for locking in shell scripts (see also flock(1)). Flock() is only safe as a potentially cross-machine locking mechanism if you can be confident that your NFS server will never be doing anything except serving files via NFS. If things may be running locally on the NFS server, for example because you moved a very active NFS filesystem to the primary machine that uses it, then flock() becomes dangerous.

It also means that if you have a lock testing program, as I do, you should make it default to either fcntl() or lockf() locks, whichever you find easier, rather than flock() locks. Flock() has the easiest API out of the three locking functions, but it may give you results that are between misleading and wrong if you're trying to use it in a situation where you want to check locking behavior between a NFS server and a NFS client, as I did recently.
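
For illustration, here's a minimal sketch of such a lock testing program using Python's standard fcntl module (defaulting to a POSIX lockf() lock, with flock() only if you explicitly ask for it):

import fcntl
import sys
import time

path = sys.argv[1]
use_flock = len(sys.argv) > 2 and sys.argv[2] == "flock"

f = open(path, "a")        # lockf() needs the file open for writing
if use_flock:
    fcntl.flock(f, fcntl.LOCK_EX)    # BSD-style flock() lock
else:
    fcntl.lockf(f, fcntl.LOCK_EX)    # POSIX lock (what you usually want over NFS)
print("got %s lock on %s" % ("flock" if use_flock else "lockf", path))
time.sleep(600)            # hold the lock so you can test conflicts elsewhere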

(Per nfs(5), you can use the local_lock mount option to make flock() locks purely local again on NFS v3 clients, but this doesn't exactly solve the problem.)

PS: Given the server flock() issue, I kind of wish there was a generic mount option to change flock() locks to POSIX locks, so that you could force this to happen to NFS exported filesystems even on your NFS fileserver. That would at least make the behavior the same on clients and the server.

(This elaborates on a learning experience I mentioned on the Fediverse.)

FlockFcntlAndNFS written at 23:13:16

2023-05-03

Forcefully breaking NFS locks on Linux NFS servers as of Ubuntu 22.04

As I discovered when I first explored /proc/locks, the Linux NFS server supports two special files in /proc/fs/nfsd that will get it to break some of the locks it holds, 'unlock_ip' and 'unlock_filesystem' (at least in theory). These files aren't currently documented in nfsd(7); the references for them are this 2016 linux-nfs message and thread and this Red Hat document on them. These appear to have originally been intended for failover situations, and one sign of this is that their names in fs/nfsd/nfsctl.c are 'NFSD_FO_UnlockIP' and 'NFSD_FO_UnlockFS'.

Each file is used by writing something to it. For 'unlock_filesystem' this is straightforward:

# echo /h/281 >unlock_filesystem

When you do this, all of the NFS locks on that filesystem are immediately dropped by the NFS server. Any NFS clients who think they held locks aren't told about this; as far as they know they have the lock. NFS clients that were waiting to get a lock (because the file was already locked) seem to eventually get given their lock. Because existing lock holders get no notification, this is only a safe operation to do if you're confident that there are no real locks on the filesystem on any NFS clients, and any locks you see on the NFS server are stuck NFS locks, where the NFS server thinks some NFS client has the file locked, but the NFS client disagrees.

We've tested doing this on our Ubuntu 22.04 fileservers (both in production and in a testing environment) and it appears to work and not have any unexpected side effects. It turns out that contrary to what I thought in my /proc/locks update for 22.04, the Ubuntu 22.04 lslocks can still mis-parse /proc/locks under some circumstances; this is util-linux issue #1633, which will only be fixed in v2.39 when it gets released. Until then, build from source or bug your distribution to pull in a fix.

(I had forgotten I'd filed issue #1633 last year and it had gotten fixed back then and only re-discovered it while writing this entry.)

I was going to write a number of things about 'unlock_ip', but it turns out that all I can write about this file is that I can't get it to do anything. The kernel source code is in conflict over whether the IP address you write is supposed to be a client IP address (comments in fs/nfsd/nfsctl.c) or the server's IP address as seen by clients (comments in fs/lockd/svcsubs.c); the Red Hat document talks about failover in a way that suggests it was originally intended for failover and to be given a (failover) server IP address. And in practice on our testing Ubuntu 22.04 NFS fileserver, writing either IP address to 'unlock_ip' makes no difference in what /proc/locks says about locks (and in how other NFS clients waiting for locks react).

If 'unlock_ip' worked, it would behave much the same for releasing locks as rebooting the NFS client, but without the whole 'reboot' business. Obviously you'd need to be very sure that the NFS client didn't actually think it had any NFS locks on the particular NFS server. Unfortunately Linux has no easy way to send an artificial 'I have rebooted' notification to a particular NFS server; however, you can use sm-notify(8) on an NFS client to tell all of the NFS servers that the client talks to that the client has 'rebooted', which will cause all of them to release their locks.

(Temporarily shutting down everything on a NFS client that might try to get a NFS lock may be easier than rebooting it entirely. Also, with enough contortions you could probably make sm-notify(8) send notifications to only a single fileserver, but it's clearly not how sm-notify is intended to be used.)

NFSServerBreakingLocks written at 23:12:24

2023-04-28

More notes on Linux's /proc/locks and NFS as of Ubuntu 22.04

About a year ago, when we were still running our NFS fileservers on Ubuntu 18.04, I investigated /proc/locks a bit (it's documented in the proc(5) manual page). Since then we've upgraded our fileservers to Ubuntu 22.04 (which uses Ubuntu's '5.15.0' kernel), and there's some things that are a bit different now, especially on NFS servers.

(Update: oops, I forgot to link to the first entry on /proc/locks.)

On our Ubuntu 22.04 NFS servers, two things are different from how they were in 18.04. First, /proc/locks appears to be complete now, in that it shows all current locks held by NFS clients on NFS exported filesystems. Along with this, the process ID in /proc/locks for such NFS client locks is now consistently the PID of the kernel 'lockd' thread. This gives you a /proc/locks that looks like this:

1: POSIX  ADVISORY  WRITE 13602 00:4f:2237553 0 EOF
2: POSIX  ADVISORY  WRITE 13602 00:2e:486322 0 EOF
3: POSIX  ADVISORY  WRITE 13602 00:2e:485496 0 EOF
4: POSIX  ADVISORY  WRITE 13602 00:2e:486562 0 EOF
5: POSIX  ADVISORY  WRITE 13602 00:2e:486315 0 EOF
6: POSIX  ADVISORY  WRITE 13602 00:2e:541938 0 EOF
7: POSIX  ADVISORY  WRITE 13602 00:4a:2602201 0 EOF
8: POSIX  ADVISORY  WRITE 13602 00:2b:7233288 0 EOF
9: POSIX  ADVISORY  WRITE 13602 00:4a:877382 0 EOF
10: POSIX  ADVISORY  WRITE 13602 00:4a:877913 0 EOF
11: FLOCK  ADVISORY  WRITE 9990 00:19:4993 0 EOF
[...]

All of those locks except the last one are NFS locks 'held' by the lockd thread. If you use lslocks(8) it shows 'lockd' (and the PID), making it easy to scan for NFS locks. Lslocks is no more able to find out the actual name of the file than it was before, because the kernel 'lockd' thread doesn't have them open and so lslocks can't do its trick of looking in /proc/<pid>/fd for them.

(Your /proc/locks on a 22.04 NFS server is likely to be bigger than it was on 18.04, possibly a lot bigger.)

The Ubuntu 22.04 version of lslocks is not modern enough to list the inode of these locks (which is available in /proc/locks). However, a more recent version of util-linux does; support for listing the inode number was added in util-linux 2.38, and it's not that difficult to build your own copy of lslocks on 22.04. The version I built is willing to use the shared libraries from the Ubuntu util-linux package, so you can just pull the built binary out.

(Locally I wrote a cover script that runs our specially built modern lslocks with '-u -o COMMAND,TYPE,MODE,INODE,PATH', because if we're looking into NFS locks on a fileserver the other information usually isn't too useful.)

These two changes make it much easier to diagnose or rule out 'stuck' NFS locks, because now you can reliably see all of the locks that the NFS server does or doesn't hold, and verify if one of them is for the file that just can't be successfully locked on your NFS clients. If you have access to all of the NFS clients that mount a particular filesystem, you can also check to be sure that none of them have a file locked that the server lists as locked by lockd.
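
If you don't have a modern lslocks handy, here's a rough Python sketch of the same scan (it finds lockd's PID by looking at kernel thread names, which I'm assuming show up in /proc/<pid>/comm):

import glob

def lockd_pids():
    # Find the PID(s) of the kernel 'lockd' thread.
    pids = set()
    for comm in glob.glob("/proc/[0-9]*/comm"):
        try:
            with open(comm) as f:
                if f.read().strip() == "lockd":
                    pids.add(comm.split("/")[2])
        except OSError:
            pass            # the process exited while we were scanning
    return pids

pids = lockd_pids()
with open("/proc/locks") as f:
    for line in f:
        fields = line.split()
        if fields[1] == "->":
            fields = [fields[0]] + fields[2:]   # a blocked waiter line
        # fields: index, type, mode, access, pid, maj:min:inode, start, end
        if fields[4] in pids:
            print(line.rstrip())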

(Actually dealing with such a stuck lock is beyond the scope of this entry. There is a traditional brute force option and some other approaches.)

ProcLocksNotesII written at 21:51:33

2023-04-20

Setting the ARC target size in ZFS on Linux (as of ZoL 2.1)

In the past I've grumbled about wanting a way to explicitly set the (target) ARC size. After all of my recent investigation into how the ARC grows and shrinks, I now believe that this can be safely done, at least some of the time. However, growing (or in general resizing) the ZFS ARC comes with a number of caveats, because it's only going to be effective some of the time.

The simple and brute force way to grow the ARC target size to a given number is to briefly and temporarily raise zfs_arc_min to your desired value, which can be done through /sys/module/zfs/parameters. After having spent some time going through the ARC code, I'm relatively convinced that this is safe and won't trigger immediate consequences. You can similarly reduce the ARC target size by (temporarily) lowering zfs_arc_max. In both cases this has an immediate effect on 'c', the ARC target size; when you set either the maximum or the minimum, the ZFS code immediately sets 'c' if it's necessary.

However, raising the ARC target size will only have a meaningful effect if ZFS can actually use more memory. If the free memory situation is bad enough that memory_available_bytes is negative, your newly set ARC target size will pretty much immediately start shrinking, possibly significantly, and the ARC will have no chance to actually use much more extra memory. If available memory is positive but not very large, it may turn negative once the ARC's actual size grows a bit more and then ZFS will shrink your recently-raised ARC target size back down, along with probably shrinking the ARC's actual memory use.
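
As a sketch of the brute force approach with that caveat checked first (untested; the 8 GiB target is purely illustrative, and you need to be root):

ARCSTATS = "/proc/spl/kstat/zfs/arcstats"
ARC_MIN = "/sys/module/zfs/parameters/zfs_arc_min"

def arcstat(name):
    # arcstats lines are 'name  type  data' after two header lines.
    with open(ARCSTATS) as f:
        for line in f:
            fields = line.split()
            if fields and fields[0] == name:
                return int(fields[2])

if arcstat("memory_available_bytes") > 0:
    with open(ARC_MIN) as f:
        old_min = f.read().strip()
    with open(ARC_MIN, "w") as f:
        f.write(str(8 * 1024**3))   # briefly force the target up to 8 GiB
    with open(ARC_MIN, "w") as f:
        f.write(old_min)            # then immediately restore the old minimum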

Given all of this, there seem to be two good cases to deliberately raise the ARC target size. The first case is if you've seen an odd collapse in the ARC target size and you have a lot of free memory. Here the ARC target size will probably grow on its own, eventually, but it will likely do that in relatively small increments (such as 128 KiB at a time), while you can yank it right up now. The second case is if the ARC target size is already quite big but arc_no_grow is stuck at '1' because ZFS wants an extra 1/32nd of your large target size to be available; this is probably more likely to be an issue if you've raised zfs_arc_max (as we have on our fileservers).

(As far as I can tell from looking at the code, arc_no_grow being 1 doesn't prevent the ARC from allocating extra memory to grow up to the ARC target size; it just prevents the ARC target size from growing further.)

In theory you can lock the ARC target size at a specific value by boxing it in by setting zfs_arc_min to sufficiently close to zfs_arc_max. While this will keep ZFS from lowering the target size, it won't keep either ZFS or the general kernel 'shrinker' memory management feature from frantically trying to reclaim memory from the ARC if actual available memory isn't big enough. Fighting the kernel is probably not going to give you great results.

ZFSOnLinuxSettingARCSize written at 22:58:04

2023-04-18

When and how ZFS on Linux changes the ARC target size (as of ZoL 2.1)

Previously I discussed the various sizes of the ARC, some important ARC memory stats, and ARC memory reclaim stats. Today I can finally talk about how the ZFS ARC target size shrinks, and a bit about how it grows, which is a subject of significant interest and some frustration. I will be citing ZoL function names because tools like bpftrace mean you can hook into them to monitor ARC target size changes.

(Changes in the actual size of the ARC are less interesting than changes in the ARC target size. Generally the actual size promptly fills up to the target size if you're doing enough IO, although metadata versus data balancing can throw a wrench in the works.)

The ARC target size is shrunk by arc_reduce_target_size() (in arc.c), which takes as its argument the size (in bytes) to reduce arc_c by and almost always does so (unless you've hit the minimum size). There are two paths to calling it, through reaping, where ZFS periodically checks to see if it thinks there's not enough memory available, and shrinking, where the Linux kernel memory management system asks ZFS to shrink its memory use.

Reaping is a general ZFS facility where a dedicated kernel thread wakes up at least once every second to check if memory_available_bytes is negative. If it is, ZFS sets arc_no_grow, kicks off reclaiming memory, waits about a second, and then potentially shrinks the ARC target size by:

( (arc_c - arc_c_min) / 128 ) - memory_available_bytes

(The divisor will be different if you've tuned zfs_arc_shrink_shift. This is done in arc_reap_cb(), and see also arc_reap_cb_check().)

Because reaping waits a second after starting the reclaim, this number may not be positive (because the reclaim raised the amount of available bytes enough); if this has happened, arc_c is left unchanged. This reaping thread ticks once a second and may also be immediately woken up by arc_adapt(), which is called when ZFS is reading a new disk block into memory and which will check to see if memory_available_bytes is below zero.
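
To put toy numbers on that formula (the values are invented; note how a negative memory_available_bytes makes the reduction bigger):

GiB = 1024 ** 3
MiB = 1024 ** 2

arc_c = 16 * GiB                    # current ARC target size
arc_c_min = 2 * GiB
memory_available_bytes = -64 * MiB  # ZFS thinks we're 64 MiB short

reduction = (arc_c - arc_c_min) // 128 - memory_available_bytes
print(reduction / MiB)              # 176.0 (MiB)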

My bpftrace-based measurements so far suggest that when reaping triggers, it normally makes relatively large adjustments in the ARC target size; I routinely see 300 and 400 MiB reductions even on my desktops. Since the ARC target size reduction starts out at 1/128th of the difference between the current ARC target size and the minimum size, a system with a lot of memory and a large ARC size may experience very abrupt drops through reaping, especially if you've raised the maximum ARC size and left the minimum size alone.

The shrinking path is invoked through the Linux kernel's general memory management feature of kernel subsystems having 'shrinkers' that kernel memory management can invoke to reduce the subsystem's memory usage (this came up in memory reclaim stats). When the kernel's memory management decides that it wants subsystems to shrink, it will first call arc_shrinker_count() to see how much memory the ARC can return and then maybe call arc_shrinker_scan() to actually do the shrinking. The amount of memory the ARC will claim it can return is calculated in a complex way (see yesterday's discussion) and is capped at zfs_arc_shrinker_limit pages (normally 4 KiBytes each). All of this is in arc_os.c. Shrinking, unlike reaping, always immediately reduces arc_c by however much the kernel wound up asking it to shrink by.

Although you might expect otherwise, the kernel's memory subsystem can invoke the ARC shrinker even without any particular sign of memory pressure, and when it does so it often only asks the ARC to drop 128 pages (512 KiB) of data instead of the full amount that the ARC offers. It can also do this in rapid bursts, which obviously adds up to much more than just 512 KiB of total ARC target size reduction.

Every time shrinking happens, one or the other of memory_indirect_count and memory_direct_count are increased. No statistic is increased if reaping happens, or if reaping leads to the ARC target size being reduced (which it doesn't always). If you need that information, you'll have to instrument things with something like the EBPF exporter. Writing the relevant BCC or bpftrace programs is up to you.

How and when the ARC target size is increased again is harder to observe, although it's more centralized. The ARC target size is grown in arc_adapt(), but unfortunately not all of the time; it's only grown if the current ARC size is within 32 MiBytes of the target ARC size (and the ARC can grow at all, ie arc_no_grow is zero and there's no reclaim needed). As of ZoL 2.1, the ARC target size is grown by however many bytes were being read from disk, which may be as small as 4 KiB; in the current development version, that's changed to a minimum of 128 KiB. As mentioned before, arc_adapt() seems to be called only when ZFS wants to read new things from disk (with a minor exception for some L2ARC in-RAM structures).

(That the growth decision is buried away inside the depths of arc_adapt() makes it hard to monitor even with bpftrace, especially since arc_c itself isn't accessible to bpftrace.)

One consequence of this is that even if the ARC target size can grow, it only grows on ARC misses that trigger disk IO. If all of your requests are being served from the current ARC, ZFS won't bother growing the target size. This makes sense, but is potentially frustrating and I believe it can cause the ARC target size to 'stick' at alarmingly low levels for a while on a system that still has high ARC hit rates even on a reduced-size ARC, or low IO levels.

Sidebar: the shrinker call stack bpftrace has observed

I had bpftrace print call stacks for arc_shrinker_scan(), and what I got in my testing was:

arc_shrinker_scan+1
do_shrink_slab+318
shrink_slab+170
shrink_node+572
balance_pgdat+792
kswapd+496
[...]

I lack the energy to try to decode why the kernel would go down this particular path and what kernel memory metrics one would look at to predict it.

ZFSOnLinuxARCTargetSizeChanges written at 22:48:56
