Wandering Thoughts


A gotcha with Fedora 30's switch of Grub to BootLoaderSpec based configuration

I upgraded my office workstation from Fedora 29 to Fedora 30 yesterday. In the past, such upgrades have been problem free, but this time around things went fairly badly, with the first and largest problem being that after the upgrade, booting any kernel gave me a brief burst of kernel messages, then a blank screen and after a few minutes a return to the BIOS and Grub main menu. To get my desktop to boot at all, I had to add 'nomodeset' to the kernel command line; among other consequences, this made my desktop a single display machine instead of a dual display one.

(It was remarkably disorienting to have my screen mirrored across both displays. I kept trying to change to the 'other' display and having things not work.)

The short version of the root cause is that my grub.cfg was rebuilt using outdated kernel command line arguments that came from /etc/default/grub, instead of the current command line arguments that had previously been used in my original grub.cfg. Because of how the Fedora 30 grub.cfg is implemented, these wrong command line arguments were then remarkably sticky and it wasn't clear how to change them.

In Fedora 29 and earlier, your grub.cfg is probably being maintained through grubby, Fedora's program for this. When grubby adds a menu entry for a new kernel, it more or less copies the kernel command line arguments from your current one. While there is a GRUB_CMDLINE_LINUX setting in /etc/default/grub, its contents are ignored until and unless you rebuild your grub.cfg from scratch, and there's nothing that tries to update it from what your current kernels in your current grub.cfg are actually using. This means that your /etc/default/grub version can wind up being very different from what you're currently using and actually need to make your kernels work.

One of the things that usually happens by default when you upgrade to Fedora 30 is that Fedora switches how grub.cfg is created and updated from the old way of doing it itself via grubby to using a Boot Loader Specification (BLS) based scheme; you can read about this switch in the Fedora wiki. This switch regenerates your grub.cfg using a shell script called (in Fedora) grub2-switch-to-blscfg, and this shell script of course uses /etc/default/grub's GRUB_CMDLINE_LINUX as the source of the kernel arguments.

(This is controlled by whether GRUB_ENABLE_BLSCFG is set to true or false in your /etc/default/grub. If it's not set at all, grub2-switch-to-blscfg adds a 'GRUB_ENABLE_BLSCFG=true' setting to /etc/default/grub for you, and of course goes on to regenerate your grub.cfg. grub2-switch-to-blscfg itself is run from the Fedora 30 grub2-tools RPM posttrans scriptlet if GRUB_ENABLE_BLSCFG is not already set to something in your /etc/default/grub.)

A regenerated grub.cfg has a default_kernelopts setting, and that looks like it should be what you want to change. However, it is not. The real kernel command line for normal BLS entries is actually in the Grub2 $kernelopts environment variable, which is loaded from the grubenv file, normally /boot/grub2/grubenv (which may be a symlink to /boot/efi/EFI/fedora/grubenv, even if you're not actually using EFI boot). The best way to change this is to use 'grub2-editenv - list' and 'grub2-editenv - set kernelopts="..."'. I assume that default_kernelopts is magically used by the blscfg Grub2 module if $kernelopts is unset, and possibly gets written back to grubenv by Grub2 in that case.
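As a rough sketch of the inspection step, the following pulls the kernelopts value out of 'grub2-editenv - list' style output. The function name and the sample listing are made up for illustration, and the commands that would touch a real grubenv are left as comments.

```shell
# Print the value of kernelopts from 'grub2-editenv - list' style
# output on standard input (lines of the form name=value).
show_kernelopts() {
    sed -n 's/^kernelopts=//p'
}

# Made-up sample of what the listing might contain.
sample='saved_entry=0123456789abcdef-5.0.17-300.fc30.x86_64
kernelopts=root=/dev/mapper/fedora-root ro nomodeset
boot_success=0'

current=$(printf '%s\n' "$sample" | show_kernelopts)
echo "current kernelopts: $current"

# On a real system (as root) this would be roughly:
#   grub2-editenv - list | show_kernelopts
#   grub2-editenv - set kernelopts="..."
```

Looking before setting seems prudent here, since the set operation replaces the whole kernelopts value rather than editing one argument.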

(You can check that your kernels are using $kernelopts by inspecting an entry in /boot/loader/entries and seeing that it has 'options $kernelopts' instead of anything else. You can manually change that for a specific entry if you want to.)
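A minimal sketch of that check, assuming entries are '*.conf' files and that a normal entry has exactly the line 'options $kernelopts' (anything with extra text on that line gets reported too). The function name and the demo entries are invented.

```shell
# List BLS entry files in a directory whose 'options' line is
# anything other than the literal text 'options $kernelopts'.
entries_not_using_kernelopts() {
    dir=$1
    for f in "$dir"/*.conf; do
        [ -f "$f" ] || continue
        # -x matches the whole line, -F treats the pattern literally.
        if ! grep -qxF 'options $kernelopts' "$f"; then
            echo "$f"
        fi
    done
}

# Demo against a throwaway directory with two made-up entries.
demo=$(mktemp -d)
printf 'title Fedora (demo A)\noptions $kernelopts\n' >"$demo/a.conf"
printf 'title Fedora (demo B)\noptions root=/dev/sda1 ro\n' >"$demo/b.conf"
odd=$(entries_not_using_kernelopts "$demo")
echo "entries with their own options: $odd"
rm -rf "$demo"

# On a real system: entries_not_using_kernelopts /boot/loader/entries
```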

This is going to make it more interesting (by which I mean annoying) if and when I need to change my standard kernel options. I think I'm going to have to change all of /etc/default/grub, the kernelopts in grubenv, and the default_kernelopts in grub.cfg, just to be sure. If I was happy with the auto-generated grub.cfg, I could just change /etc/default/grub and force a regeneration, but I'm not and I have not yet worked out how to make its handling of the video modes and the menus agree with what I want (which is a basic text experience).

(While I was initially tempted to leave my system as a non-BLS system, I changed my mind because of long term issues. Fedora will probably drop support for grubby based setups sooner or later, so I might as well get on the BLS train now.)

To give credit where it's due, one (lucky) reason that I was able to eventually work out all of this is that I'd already heard about problems with the BLS transition in Fedora 30 in things like Fedora 30: When grub2-mkconfig Doesn’t Work, and My experiences upgrading to Fedora 30. Without that initial awareness of the existence of the BLS transition in Fedora 30 (and the problems it caused people), I might have been flailing around for even longer than I was.

PS: As a result of all of this, I've discovered that you no longer need to specify the root device in the kernel command line arguments. I assume the necessary information for that is in the dracut-built initramfs. As far as the blank screen and kernel panics go, I suspect that the cause is either or both of 'amdgpu.dpm=0' and 'logo.nologo', which were still present in the /etc/default/grub arguments but which I'd long since removed from my actual kernel command lines.

(I could conduct more experiments to try to find out which kernel argument is the fatal one, but my interest in more reboots is rather low.)

Fedora30GrubBLSGotcha written at 20:58:09

Systemd and waiting until network interfaces or addresses are configured

One of the things that systemd is very down on is the idea of running services after 'the network is up', whatever that means; the systemd people have an entire web page on the subject. This is all well and good in theory, but in practice there are plenty of situations where I need to only start certain things after either a named network interface is present or an IP address exists. For a concrete example, you can't set up various pieces of policy based routing for an interface until the interface actually exists. If you're configuring this on boot in a systemd based system (especially one using networkd), you need some way to ensure the ordering. Similarly, sometimes you need to listen only on some specific IP addresses and the software you're using doesn't have Linux specific hacks to do that when the IP address doesn't exist yet.

(As a grumpy sysadmin, I actually don't like the behavior of binding to an IP address that doesn't exist, because it means that daemons will start and run even if the system will never have the IP address. I would much rather delay daemon startup until the IP address exists.)

Systemd does not have direct native support for any of this, of course. There's no way to directly say that you depend on an interface or an IP address, and in general the dependency structure has long been under-documented. The closest you can get to waiting until a named network interface exists is to specify an After= and perhaps a Wants= or a Requires= on the pseudo-unit for the network interface, 'sys-subsystem-net-devices-<iface>.device'. However, as I found out, the lack of a .device unit doesn't always mean that the interface doesn't exist.
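To make the shape of such a dependency concrete, here is a small sketch that prints a drop-in for a hypothetical interface 'em0'. It assumes the interface name needs no systemd escaping (no '-' or other special characters); the function name and the drop-in file name are invented.

```shell
# Print a drop-in that orders a unit after a network interface's
# .device unit. Assumes the interface name needs no systemd escaping.
device_dropin() {
    iface=$1
    dev="sys-subsystem-net-devices-${iface}.device"
    printf '[Unit]\nAfter=%s\nWants=%s\n' "$dev" "$dev"
}

dropin=$(device_dropin em0)
printf '%s\n' "$dropin"

# This could be saved as something like
#   /etc/systemd/system/<your-service>.service.d/wait-iface.conf
# followed by 'systemctl daemon-reload'.
```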

You might think that in order to wait for an IP address to exist, you could specify an After= for the .device unit it's created in and by. However, this has historically had issues for me; under at least some versions of systemd, the .device unit would be created before the IP address was configured. In my particular situation, what worked at the time was to wait for a VLAN interface .device that was on top of the real interface that had the IP address (and yes, I mix tagged VLANs with an untagged network). By the time the VLAN .device existed, the IP address had relatively reliably been set up.

If you're using systemd-networkd and care about network interfaces, the easiest approach is probably to rely on systemd-networkd-wait-online.service; how it works and what it waits for is probably about as good as you can get. For IP addresses, as far as I know there's no native thing that specifically waits until some or all of your static IP addresses are present. Waiting for systemd-networkd-wait-online is probably going to be good enough for most circumstances, but if I needed better I would probably write a shell script (and a .service unit for it) that simply waited until the IP addresses I needed were present.
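A sketch of what such a waiting script might look like, assuming the iproute2 'ip' command and its one-line-per-address '-o' output. The function names, the timeout default, and the sample address are all mine; the self-check runs on canned output rather than the live system.

```shell
# Succeed if the address in $1 appears in 'ip -o addr show' style
# output on standard input. In that format the fourth field is the
# address with its prefix length, e.g. 192.0.2.10/24.
addr_present() {
    awk -v want="$1" '
        { split($4, a, "/"); if (a[1] == want) found = 1 }
        END { exit(found ? 0 : 1) }'
}

# Poll until the address shows up or the timeout (seconds) expires.
wait_for_addr() {
    addr=$1
    left=${2:-60}
    while [ "$left" -gt 0 ]; do
        if ip -o addr show | addr_present "$addr"; then
            return 0
        fi
        sleep 1
        left=$((left - 1))
    done
    return 1
}

# Self-check on made-up output rather than the live system.
sample='2: em0    inet 192.0.2.10/24 brd 192.0.2.255 scope global em0\       valid_lft forever preferred_lft forever'
if printf '%s\n' "$sample" | addr_present 192.0.2.10; then
    result=present
else
    result=absent
fi
echo "192.0.2.10 is $result in the sample"
```

The .service unit would presumably be a oneshot that runs 'wait_for_addr' for each needed address, with the units that care ordered After= it.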

(I continue to think that it's a real pity that you can't configure networkd .network files to have 'network up' and 'network down' scripts, especially since their stuff for routing and policy based routing is really very verbose.)

PS: One of the unfortunate effects of the under-documented dependency structure and the lack of clarity of what to wait on is a certain amount of what I will call 'superstitious dependencies', things that you've put into your systemd units without fully understanding whether or not you needed them, and why (often also without fully documenting them). This is fine most of the time, but then one day an unnecessary dependency fails to start or perhaps exist and then you're unhappy. That's part of why I would like explicit and reliable ways to do all of this.

SystemdNetworkThereIssue written at 00:26:26


Linux can run out of memory without triggering the Out-Of-Memory killer

If you have a machine with strict overcommit turned on, your memory allocation requests will start to fail once enough virtual address space has been committed, because that's what you told the kernel to do. Hitting your strict overcommit limit doesn't trigger the Out-Of-Memory killer, because the two care about different things; strict memory overcommit cares about committed address space, while the global OOM killer cares about physical RAM. Hitting the commit limit may kill programs anyway, because many programs die if their allocations fail. Also, under the right situations, you can trigger the OOM killer on a machine set to strict overcommit.

Until recently, if you had asked me about how Linux behaved in the default 'heuristic overcommit' mode, I would have told you that ordinary memory allocations would never fail in it; instead, if you ran out of memory (really RAM), the OOM killer would trigger. We've recently found out that this is not the case, at least in the Ubuntu 18.04 LTS '4.15.0' kernel. Under (un)suitable loads, various of our systems can run out of memory without triggering the OOM killer and persist in this state for some time. When it happens, the symptoms are basically the same as what happens under strict overcommit; all sorts of things can't fork, can't map shared libraries, and so on. Sometimes the OOM killer is eventually invoked, other times the situation resolves itself, and every so often we have to reboot a machine to recover it.

I would like to be able to tell you why and how this happens, but I can't. Based on the kernel code involved, the memory allocations aren't being refused because of heuristic overcommit, which still has its very liberal limits on how much memory you can ask for (see __vm_enough_memory in mm/util.c). Instead something else is causing forks, mmap()s of shared libraries, and so on to fail with 'out of memory' errno values, and whatever that something is it doesn't trigger the OOM killer during the failure and doesn't cause the kernel to log any other messages, such as the ones you can see for page allocation failures.

(Well, the messages you see for certain page allocations. Page allocations can be flagged as __GFP_NOWARN, which suppresses these.)

PS: Unlike the first time we saw this, the recent cases have committed address space rising along with active anonymous pages, and the kernel's available memory dropping in sync and hitting zero at about the time we see failures start.
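For watching these numbers, a small sketch that extracts fields from /proc/meminfo style text. The helper name is invented and the sample values are made up; only the field names are real.

```shell
# Print one field's numeric value (in kB) from /proc/meminfo style
# input on standard input.
meminfo_field() {
    awk -v key="$1:" '$1 == key { print $2 }'
}

# Made-up numbers, shaped like real /proc/meminfo lines.
sample='MemTotal:       16303428 kB
MemAvailable:     204800 kB
CommitLimit:    12345678 kB
Committed_AS:   11111111 kB
Active(anon):    9000000 kB'

avail=$(printf '%s\n' "$sample" | meminfo_field MemAvailable)
commit=$(printf '%s\n' "$sample" | meminfo_field Committed_AS)
echo "available=${avail}kB committed=${commit}kB"

# On a live machine:
#   cat /proc/sys/vm/overcommit_memory   # 0 heuristic, 1 always, 2 strict
#   meminfo_field Committed_AS </proc/meminfo
```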

NoMemoryButNoOOM written at 22:23:33


Roughly when the Linux Out-Of-Memory killer triggers (as of mid-2019)

For reasons beyond the scope of this entry, I've recently become interested in understanding more about when the Linux OOM killer does and doesn't trigger, and why. Detailed documentation on this is somewhat sparse and some of it is outdated. I can't add detailed documentation, because doing that requires fully understanding kernel memory management code, but I can at least write down some broad overviews for my own use.

(All of this is as of the current Linux kernel git tree, because that's what I have on hand. The specific details change over time, although the code seems broadly unchanged between git tip and the Ubuntu 18.04 LTS kernel, which claims to be some version of 4.15.)

These days there are two sort of different OOM killers in the kernel; there is the global OOM killer and then there is cgroup-based OOM through the cgroup memory controller, either cgroup v1 or cgroup v2. I'm primarily interested in when the global OOM killer triggers, partly because the cgroup OOM killer is relatively more predictable.

The simple answer is that the global OOM killer triggers when the kernel has problems allocating pages of physical RAM. When the kernel is attempting to allocate pages of RAM (for whatever use, either for kernel usage or for processes that need pages) and initially fails, it will try various ways to reclaim and compact memory. If this works or at least makes some progress, the kernel keeps retrying the allocation (as far as I can tell from the code); if these attempts fail to free up pages or make progress, it triggers the OOM killer under many (but not all) circumstances.

(The OOM killer is not triggered if, for instance, the kernel is asking for a sufficiently large number of contiguous pages. At the moment, the OOM killer is still only invoked for contiguous allocations of 32 KB or less (order 3), which is the same as it was back in 2012; in fact, 'git blame' says this dates from 2007.)
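The order-to-size arithmetic is simple enough to write down. This sketch assumes 4 KiB pages (as on x86-64); the function name is mine.

```shell
# Size in KiB of an order-N allocation (2^N contiguous pages),
# assuming 4 KiB pages unless a page size in KiB is given as $2.
order_to_kib() {
    order=$1
    page_kib=${2:-4}
    echo $(( (1 << order) * page_kib ))
}

order3=$(order_to_kib 3)
echo "order 3 = ${order3} KiB"
```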

As far as I can tell, there's nothing that stops the OOM killer being triggered repeatedly for the same attempted page allocation. If the OOM killer says it made progress, the page allocation is retried, but there's probably no guarantee that you can get memory now (any freed memory might have been grabbed by another request, for example). Similarly, as far as I can tell the OOM killer can be invoked repeatedly in close succession; there doesn't seem to be any 'must be X time between OOM kills' limits in the current code. The trigger is simply that the kernel needs pages of RAM and it can't seem to get them any other way.

(Of course you hope that triggering the OOM killer once frees up a bunch of pages of RAM, since that's what it's there for.)

The global OOM killer is not particularly triggered when processes simply allocate (virtual) memory, because this doesn't necessarily allocate physical pages of RAM. Decisions about whether or not to grant such memory allocation requests are not necessarily independent of the state of the machine's physical RAM, but I'm pretty sure you can trigger the OOM killer without having reached strict overcommit limits and you can definitely have memory allocation requests fail without triggering the OOM killer.

In the current Linux tree, you can see this sausage being made in mm/page_alloc.c's __alloc_pages_slowpath. mm/oom_kill.c is concerned with actually killing processes.

PS: I will avoid speculating about situations where this approach might fail to trigger the OOM killer when it really should, but depending on how reclaim is implemented, there seem to be some relatively obvious possibilities.

Sidebar: When I think cgroups OOM is triggered

If you're using the memory cgroup controller (v1 or v2) and you set a maximum memory limit, this is (normally) a limit on how much RAM the cgroup can use. As the cgroup's RAM usage grows towards this limit, the kernel memory system will attempt to evict the cgroup's pages from RAM in various ways (such as swapping them out). If it fails to evict enough pages fast enough and the cgroup runs into its hard limit on RAM usage, the kernel triggers the OOM killer against the cgroup.

This particular sausage seems to be made in mm/memcontrol.c. You want to look for the call to out_of_memory and work backward. I believe that all of this is triggered by any occasion when a page of RAM is charged to a cgroup, which includes more than just the RAM directly used by processes.

(In common configurations, I believe that a cgroup with such a hard memory limit can consume all of your swap space before it triggers the OOM killer.)

If you want to know whether an OOM kill was global or from a cgroup limit, this is in the kernel message. For a cgroup OOM kill, the kernel message will look like this:

Memory cgroup out of memory: Kill process ... score <num> or sacrifice child

For a global out of memory, the kernel message will look like this:

Out of memory: Kill process ... score <num> or sacrifice child

I sort of wish the global version specifically mentioned that it was a non-cgroup OOM kill, but you can see how it wound up this way; when cgroup OOM kills were introduced, I suspect that no one wanted to change the existing global OOM kill message for various reasons.
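If you want to sort log lines by which kind of OOM kill they report, a sketch like this would do it, going purely by the two message prefixes. The function name and the sample log lines (PIDs, process names, scores) are invented.

```shell
# Classify kernel log lines on standard input as cgroup or global
# OOM kills, based on the message prefixes.
classify_oom() {
    while IFS= read -r line; do
        case $line in
            *"Memory cgroup out of memory: "*) echo cgroup ;;
            *"Out of memory: "*)               echo global ;;
        esac
    done
}

# Made-up log lines in the two formats.
sample='kernel: Memory cgroup out of memory: Kill process 4242 (demo) score 900 or sacrifice child
kernel: Out of memory: Kill process 4243 (demo2) score 800 or sacrifice child'

kinds=$(printf '%s\n' "$sample" | classify_oom | tr '\n' ' ')
echo "kinds: $kinds"

# Live use might be: dmesg | classify_oom
```

The message wording has changed across kernel versions, so the patterns would need checking against whatever kernel you actually run.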

OOMKillerWhen written at 23:13:12


Rewriting my iptables rules using ipsets

On Mastodon, I was tempted:

My home and office workstation have complicated networking, but their firewall rules are actually relatively simple. Maybe it's time to switch them over from annoying iptables to the new shiny nftables stuff, which might at least be more readable (and involve less repetition).

Feedback convinced me to not go that far. Instead, today I rewrote my iptables rules in terms of ipsets (with multiple set matches), which eliminated a great deal of their prior annoyance (although not all of it).

My workstation firewall rules did not previously use ipsets because I first wrote them before ipsets were a thing; in fact, they date from the days of ipchains and Linux 2.2. In the pre-ipset world, this meant a separate iptables rule for each combination of source IP, destination port, and protocol that I wanted to block (or allow). On my office workstation, this wound up with over 180 INPUT chain rules (most of them generated automatically).

Contrary to what I asserted a few years ago, most of the actual firewall rules being expressed by all of these iptables rules are pretty straightforward. Once I simplified things a bit, there are some ports that only my local machine can access, some ports that only 'friendly' machines can access, and some machines I don't like that should be blocked from a large collection of ports, even ones that are normally generally accessible. This has an obvious translation to ipset based rules, especially if I don't try to be too clever, and the result is a lot fewer rules that are a lot easier to look over. There's still some annoying repetition because I want to match both the TCP and UDP versions of most ports, but I can live with that.

(Enough of the ports that I want to block access to come in both TCP and UDP versions that it's not worth making a finer distinction. That would lead to more ipsets, which is more annoying in practice.)
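To give a flavour of what ipset based rules with multiple set matches look like, here is a dry-run sketch. It is not my actual script; the set names, addresses, and ports are invented, and by default it only prints the commands instead of running them.

```shell
# Dry run by default: 'run' only prints each command. Set APPLY=1
# (as root) to execute them instead.
run() {
    if [ "${APPLY:-0}" = 1 ]; then "$@"; else echo "$*"; fi
}

setup_firewall() {
    # Set names, addresses and ports here are invented for illustration.
    run ipset create badhosts hash:ip
    run ipset create friendly hash:ip
    run ipset create blockports bitmap:port range 0-65535
    run ipset create friendports bitmap:port range 0-65535
    run ipset add badhosts 192.0.2.66
    run ipset add friendly 198.51.100.7
    run ipset add blockports 22
    run ipset add friendports 2049
    for proto in tcp udp; do
        # Two set matches in one rule: source host and destination port.
        run iptables -A INPUT -p "$proto" -m set --match-set badhosts src \
            -m set --match-set blockports dst -j DROP
        # Only 'friendly' machines may reach the friendly-only ports.
        run iptables -A INPUT -p "$proto" -m set ! --match-set friendly src \
            -m set --match-set friendports dst -j DROP
    done
}

rules=$(setup_firewall)
printf '%s\n' "$rules"
```

The tcp/udp loop is where the remaining repetition shows up: each logical rule still becomes two iptables rules.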

When I did the rewrite, I did simplify some of the fine distinctions I had previously made between various ports and various machines. I also dropped some things that were obsolete, both in terms of ports that I was blocking and things like preventing unencrypted GRE traffic, since I no longer use IPsec. I could have done this sort of reform without a rewrite, but I had nothing to push me to do it until now and it wouldn't have been as much of a win. The actual rewrite was a pretty quick process and the resulting shell script is what I consider to be straightforward.

(The new rules also have some improvements; for example, I now have some IPv6 blocks on my home machine. Since I already had an ipset of ports, I could say 'block incoming IPv6 traffic from my external interface to these ports' in a single ip6tables rule.)

As far as I'm concerned, so far the three big wins of the rewrite are that 'iptables -nL INPUT' no longer scrolls excessively, I'm no longer dependent on my ancient automation to generate iptables rules, and I've wound up writing a shell script to totally clear out all of my iptables rules and tables (because I kept wanting to re-run my setup script as I changed it). That my ancient automation silently broke for a while (again) and left my office workstation without most of its blocks since late March is one thing that pushed me into making this change now.

(Late March is when I updated to my first Fedora 5.x kernel, and guess what my ancient automation threw up its hands at. If you're curious why, it was (still) looking at the kernel version to decide whether to use ipchains or iptables.)

IptablesRewriteUsingIpset written at 01:16:36


Some notes on understanding how to use flock(1)

The flock(1) command has rather complicated usage, with a bunch of options, which makes it not entirely clear how to use it for shell script locking in various different circumstances. Here are some notes on this, starting with understanding what it's doing and the implications of that.

The key to understanding all of flock(1)'s weird options is to know that flock(2) automatically releases your lock when the last copy of the file descriptor you locked is closed, and that file descriptors are shared with child processes. Given this, we can start with the common basic flock(1) usage of:

flock LOCKFILE SHELL-SCRIPT [ARGS ...]

flock(1) opens LOCKFILE, locks it with flock(2), and then starts the shell script, which will inherit an open (and locked) file descriptor for LOCKFILE. As long as the shell script process or any sub-process it starts still exists with that file descriptor open, the lock is held, even if flock(1) itself is killed for some reason.

This is generally what you want; so long as any component of the shell script and the commands it runs is still running, it's potentially not safe to start another copy. Only when everything has exited, flock included, is the lock released.

However, this is perhaps not what you want if flock is used to start a daemon that doesn't close all of its file descriptors, because then the daemon will inherit the open (and locked) file descriptor for LOCKFILE and the lock will never be released. If this is the case, you want to start flock with the -o option, which does not pass the open file descriptor for LOCKFILE to the commands that flock winds up running:

flock -o LOCKFILE SHELL-SCRIPT [ARGS ...]

Run this way, the only thing holding the lock is flock itself. When flock exits (for whatever reason), the file descriptor will be closed and the lock released, even if SHELL-SCRIPT is still running.

(Of course, having a daemon inherit an open and locked file descriptor for LOCKFILE is a convenient way to only have one copy of the daemon running. As long as the first copy is still running, further attempts to get the lock will fail; if it exits, the lock is released.)
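The difference between the two modes can be seen in a small experiment, sketched here with util-linux flock and throwaway lock files. A backgrounded 'sleep' stands in for a daemon that keeps its inherited file descriptors; the variable names are mine.

```shell
lock1=$(mktemp)
lock2=$(mktemp)

# Default: the backgrounded sleep inherits the locked descriptor, so
# the lock stays held after flock and the sh it ran have both exited.
flock "$lock1" sh -c 'sleep 5 >/dev/null 2>&1 &'
if flock -n "$lock1" true; then default_case=released; else default_case=held; fi

# With -o the descriptor is closed before the command runs, so only
# flock itself held the lock, and it is gone once flock exits.
flock -o "$lock2" sh -c 'sleep 5 >/dev/null 2>&1 &'
if flock -n "$lock2" true; then close_case=released; else close_case=held; fi

echo "default: lock $default_case; with -o: lock $close_case"
rm -f "$lock1" "$lock2"
```

Separate lock files are used for the two halves so that the second half doesn't simply block behind the first half's lingering sleep.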

The final usage is that flock(1) can be directly told the file descriptor number to lock. In order to be useful, this requires some shared file descriptor that will live on after flock exits; the usual place to get this is by redirecting some file descriptor of your choice to or from a file for an entire block of a shell script, like this:

(
flock -n 9 || exit 1
... # locked commands
) 9>/var/lock/mylockfile

This is convenient if you only want to lock some portions of a shell script or don't want to split a shell script into two, especially since the first will just be 'flock -n /var/lock/mylockfile script-part-2'. On the other hand, it is sort of tricky and clever, perhaps too clever. I'd certainly want to comment it heavily in any shell script I wrote.
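Here is a runnable sketch of the file descriptor form that also checks the lock from 'outside' while the block is running and again after it ends. It uses temporary files instead of /var/lock, and the variable names are invented.

```shell
lock=$(mktemp)
out=$(mktemp)

(
    flock -n 9 || exit 1
    # ... locked commands would go here ...
    # While inside the block, a fresh attempt on the same file should
    # fail, because descriptor 9 still holds the lock.
    if flock -n "$lock" true; then echo free; else echo busy; fi >"$out"
) 9>"$lock"

inside=$(cat "$out")
# The subshell has exited and closed descriptor 9, releasing the lock.
if flock -n "$lock" true; then after=free; else after=busy; fi

echo "inside block: $inside, after block: $after"
rm -f "$lock" "$out"
```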

However, you don't necessarily have to go all the way to doing this if you just want to flock some stuff that involves shell operations like redirecting files and so on, because you can use 'flock -c' to run a shell command line instead of just a program:

flock -n LOCKFILE -c '(command1 | command2 >/some/where) && command3'

This can also get too tricky, of course. There's only so much that's sensible to wedge into a single shell command line, regardless of what's technically possible.

Once you're locking file descriptors, you can also unlock file descriptors with 'flock -u'. This is probably useful mostly if you're going to unlock and then re-lock, and that probably wants you to be using flock without the '-n' option for at least the re-lock. I imagine you could use this in a shell script loop, for example something like:

(
for file in "$@"; do
  flock 9; big-process "$file"; flock -u 9
  more-work ...
done
) 9>/var/lock/mylockfile

This would allow more-work to run in parallel with another invocation's big-process, while not allowing two big-process's to be running at once.

(This feels even more tricky and clever than the basic usage of flock'ing a file descriptor in a shell '( ... )' block, so I suspect I'll never use it.)

FlockUsageNotes written at 22:19:29


If you can, you should use flock(1) for shell script locking

I have in the past worked out and used complicated but portable approaches for doing locking in shell scripts. These approaches get much more complicated if processes can die abruptly without undoing their lock. You can generally arrange things so that your locks are cleared if the entire machine reboots, but that's about it as far as simple approaches go. Sometimes this is what you want, but often it isn't.

As a result of a series of issues with our traditional shell script locking, I have been more and more moving to using Linux's flock(1) when I can, which is to say for scripts that only have to run on our Linux machines (which is almost all of our machines today). flock is sufficiently useful and compelling here that I might actually port it over to other Unixes if we had to integrate such systems into our current Linux environment.

(Anything we want to use should have flock(2), and hopefully that's the only thing the flock program really depends on.)

There are two strongly appealing sides to flock. The first is that it provides basically the usage that we want; in normal operation, it runs something with the lock held and releases the lock when the thing exits. The second is that it automatically releases the lock if something goes wrong, because flock(2) locks evaporate when the file descriptor is closed.

(The manpage's description of '-o' may make you confused about this; what flock means is that the open file descriptor of the lock is not inherited by the command flock runs. Normally you want the command to inherit the open file descriptor, because it means that so long as any process involved is still running, the lock is held, even if flock itself gets killed for some reason.)

Generally I want to use 'flock -n', because we mostly use locking for 'only one of these should ever be running at once'; if the lock is held, a previous cron job or whatever is still active, so the current one should just give up.
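A minimal sketch of that 'give up if busy' pattern as a wrapper function, with a self-test that simulates an earlier job still holding the lock. The function name and the temporary files are mine; in a real cron job the lock file would live somewhere stable.

```shell
# Run a command under a lock, giving up at once if the lock is held.
# Returns flock's status: by default 1 when the lock is busy,
# otherwise the command's own exit status.
run_exclusive() {
    lockfile=$1
    shift
    flock -n "$lockfile" "$@"
}

lock=$(mktemp)
status=$(mktemp)

# Nothing holds the lock yet, so this just runs the command.
run_exclusive "$lock" true
idle=$?

(
    flock 9                      # pretend an earlier job is still running
    run_exclusive "$lock" true   # the overlapping job should give up
    echo $? >"$status"
) 9>"$lock"
busy=$(cat "$status")

echo "idle run exit=$idle, overlapping run exit=$busy"
rm -f "$lock" "$status"
```

One wrinkle is that an exit status of 1 could also come from the command itself; flock's -E option can pick a different status for the lock-busy case if you need to tell them apart.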

We have one script using a traditional shell script approach to locking that I very carefully and painfully revised to be more or less safe in the face of getting killed abruptly. Since it logs diagnostics if it detects a stale lock, there's a certain amount of use in having it around, but I definitely don't want to ever have to do another script like it, and it's a special case in some other ways that might make it awkward to use with flock. The experience of revising that script is part of what pushed me very strongly to using flock for others.

LockShellScriptsWithFlock written at 22:06:18

Getting NetworkManager to probably verify TLS certificates for 802.1x networks

I'll start with my tweets:

We have an 802.1.X WPA2 Enterprise university-wide wireless network, using PEAPv0 authentication, which involves a TLS certificate. I do not appear to be able to get NetworkManager to verify the TLS certificate in a way that will let me actually connect.

The only way I can connect to our university wifi is by setting 'No CA certificate is required'. I cannot supply a CA certificate that works (I've tried), and I cannot turn on 802-1x.system-ca-certs ; nmcli just doesn't save it, no matter what, without any reported error.

With the aid of some replies from @grawity, I was able to navigate to a solution that allows me to connect without that 'No CA certificate is required' having to be set, and probably even verifies the TLS certificate.

The magic trick for me was telling NetworkManager that it should use the system bundle of TLS certificates as the 'CA certificate' it wants. The one important trick is that NetworkManager wants the PEM format certificate bundle (and/or certificate), not the DER form. How you tell them apart is that the PEM form is base64 ASCII while the DER form is binary. Anything with a .pem extension had better be a PEM file, but a .crt extension can be either.

On Fedora 29, the system certificate bundle is found as either /etc/ssl/certs/ca-bundle.crt or /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem; the former is a symlink to the latter. On Ubuntu and Debian, you want /etc/ssl/certs/ca-certificates.crt. I don't know if there are any special SELinux considerations that apply depending on the path you select, because I turned that off long ago on my laptop.
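Since a .crt file can be either form, a quick check is handy. This sketch just looks for a PEM '-----BEGIN ' armor line; the function name is mine and the test files are fakes, not real certificates. The nmcli line in the comment is how I'd expect to set the property from the command line, with the connection name left for you to fill in.

```shell
# Guess whether a certificate file is PEM: PEM files are text and
# contain a '-----BEGIN ' armor line, while DER files are raw binary.
looks_like_pem() {
    grep -q -- '-----BEGIN ' "$1"
}

tmp=$(mktemp -d)
# Fake stand-ins: a PEM-shaped text file and a few DER-like bytes.
printf '%s\n' '-----BEGIN CERTIFICATE-----' 'MIIBplaceholder' \
    '-----END CERTIFICATE-----' >"$tmp/fake.pem"
printf '\060\202\001\012' >"$tmp/fake.der"

if looks_like_pem "$tmp/fake.pem"; then pem_result=pem; else pem_result=other; fi
if looks_like_pem "$tmp/fake.der"; then der_result=pem; else der_result=other; fi
echo "fake.pem: $pem_result, fake.der: $der_result"
rm -rf "$tmp"

# Then, on Fedora, something like:
#   nmcli connection modify "<connection name>" \
#       802-1x.ca-cert /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem
```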

I don't know if this setup makes NetworkManager actually verify the TLS certificates (or perhaps wpa_supplicant, which is apparently the thing that really does the work even when NetworkManager is being the frontend). But at least I'm not telling NetworkManager to maybe ignore TLS security entirely.

(When I was looking at logs through journalctl, they were sufficiently ambiguous to me that I couldn't be sure.)

Sidebar: A further puzzle

At this point I don't have my laptop and its logs of TLS certificate information handy, but the more I look at our university page for campus wireless and the certificates it lists, the more puzzled I get. My attempts to verify the TLS certificate started with the TLS certificate listed there and proceeded through what 'certigo dump' told me were the CA certificates for that TLS certificate. However, now that I look more carefully, the page also has a CA bundle that is supposed to be current, but that CA bundle has a rather different set of CA certificates. It's possible that had I gotten and used that CA bundle, the actual 802.1x TLS certificate I was presented with would have verified.

(It's apparently possible to capture the 802.1x server TLS certificate, but it may not be easy. And you have to be on the wireless network in question, which I'm not as I write this entry.)

NetworkManagerTLSFor8021x written at 00:26:42


How mountd and exportfs handle NFS export permissions on Linux

While the Linux kernel NFS server maintains an authentication cache, the final authority on what filesystems are exported to who and with what permissions is rpc.mountd. Mountd gets this information from /var/lib/nfs/etab, which conveniently is a plain text file. However, mountd reads the information from etab into an internal data structure and only re-does this when etab's inode number changes. As far as I can tell there's nothing else to it, which means that you can create a new version of etab by hand if you want to.

(While lsof will tell you that rpc.mountd has etab open, mountd does this purely so that etab's current inode number can't be reused for a new file. It never re-reads the opened file.)

Normally, new versions of /var/lib/nfs/etab are created only by exportfs (which writes the new version to an 'etab.tmp' file and then renames it). Because exportfs allows you to make NFS export changes through the command line that are not present in /etc/exports and /etc/exports.d/ files, in normal operation exportfs determines the new contents of /var/lib/nfs/etab in part by merging the current contents in with your new changes. Your new changes can come from the command line, for things like 'exportfs -u <client>:<path>' and 'exportfs -i -o <options> <client>:<path>', or from /etc/exports and company for things like 'exportfs -a'. This behavior of merging in the exports from the current etab is why 'exportfs -a' doesn't remove exports that are no longer in /etc/exports.
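The write-then-rename approach is what changes the inode number that mountd is watching. A small sketch with a scratch directory standing in for /var/lib/nfs shows the effect, using GNU stat; it also shows that rewriting the file in place keeps the inode number, which by the logic above mountd presumably wouldn't notice.

```shell
dir=$(mktemp -d)
echo 'old contents' >"$dir/etab"
before=$(stat -c %i "$dir/etab")

# Write the replacement beside the original, then rename it into place.
echo 'new contents' >"$dir/etab.tmp"
mv "$dir/etab.tmp" "$dir/etab"
after=$(stat -c %i "$dir/etab")

# An in-place rewrite, by contrast, keeps the same inode number.
echo 'edited in place' >"$dir/etab"
inplace=$(stat -c %i "$dir/etab")

echo "before=$before after-rename=$after after-in-place-write=$inplace"
rm -rf "$dir"
```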

(A plain 'exportfs -au' has the obvious behavior of writing an empty /var/lib/nfs/etab.)

For exports that exist in both etab and exports, this merging process will replace export options from the old etab with the versions from /etc/exports and company, including things like 'ro' versus 'rw'. This means that an 'exportfs -a' will at least update client access permissions for existing exports, even if it won't cut off clients who have been entirely removed from the export permissions.
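
To make the merging concrete, here is a toy Python model of it. This is only a sketch of the behavior described above, not a translation of exportfs.c; the function name, the dictionary representation, and the paths and client names are all invented for illustration.

```python
def merge_like_exportfs_a(current_etab, exports):
    """Toy model of 'exportfs -a': keep the current etab, overlay the exports files.

    Both arguments map (path, client) to an option string.
    """
    merged = dict(current_etab)   # everything already exported stays
    merged.update(exports)        # options from the exports files win
    return merged


# Hypothetical state: client2 has been removed from /etc/exports and
# client1 has been changed from rw to ro.
current_etab = {
    ("/export/data", "client1"): "rw",
    ("/export/data", "client2"): "rw",
}
exports = {
    ("/export/data", "client1"): "ro",
}

after_a = merge_like_exportfs_a(current_etab, exports)
```

In this model client1 ends up with 'ro' but client2 keeps its old 'rw' entry, which is the 'permissions get updated but removed clients aren't cut off' result.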

Exportfs also has a '-r' option, which is described by the manpage as:

Reexport all directories, synchronizing /var/lib/nfs/etab with /etc/exports. This option removes entries in /var/lib/nfs/etab which have been deleted from /etc/exports, and removes any entries from the kernel export table which are no longer valid.

Although the code in exportfs.c is hard to follow, the first part of 'exportfs -r' is implemented by generating the new etab purely from /etc/exports and company, without merging in the current contents of /var/lib/nfs/etab. This does exactly what you want once the kernel caches are flushed, and does it without un-exporting anything that should stay exported. If something is exported in both the old etab and the updated etab, obviously rpc.mountd will always permit access; there will never be a period where rpc.mountd is using an etab without it listed.
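
In the same toy terms, the etab side of 'exportfs -r' can be sketched like this. Again this is an illustrative Python model under my own assumptions, not the real code; the function name and the sample data are hypothetical, and it ignores the kernel cache flush entirely.

```python
def rebuild_like_exportfs_r(current_etab, exports):
    """Toy model of the etab half of 'exportfs -r'.

    The new etab comes purely from the exports files; the current etab
    is only consulted here to report what disappears.
    """
    new_etab = dict(exports)
    dropped = sorted(set(current_etab) - set(new_etab))
    return new_etab, dropped


# Hypothetical state: client2 has been removed from /etc/exports.
current_etab = {
    ("/export/data", "client1"): "rw",
    ("/export/data", "client2"): "rw",
}
exports = {
    ("/export/data", "client1"): "ro",
}

new_etab, dropped = rebuild_like_exportfs_r(current_etab, exports)
```

Anything present in both the old and new versions simply carries over (with its new options), so there is no intermediate etab that lacks it.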

(The second claim in the description is what I would call not entirely correct. The actual code simply flushes the caches in general. As covered in this entry, in modern kernels any flush is a total flush, which means that 'exportfs -fr' and 'exportfs -r' do the same thing in the end. In older kernels, what gets flushed without '-f' is somewhat chancy, so I think you probably want to use '-f' to be sure.)

Back in January I mentioned 'exportfs -r' and wondered why our system for ZFS NFS export permissions wasn't using it. Given what I now know about how all of this works, we definitely should be using 'exportfs -r' instead of our current approach of first un-exporting a filesystem and then re-exporting it (and we'll be changing our scripts to implement this). 'exportfs -r' does exactly what we want when changing the NFS sharing for an existing NFS export. In general, what you want to do for seamless persistent NFS export changes is to write the new information to /etc/exports or some file in /etc/exports.d and then run 'exportfs -r' to resynchronize /var/lib/nfs/etab to your new reality.
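
A minimal sketch of that 'write the file, then resynchronize' workflow in Python might look like the following. This is not our actual script; the function names, the 'run' parameter (there so the command can be stubbed out), and the sample path and clients are assumptions for the example. It relies on exports(5) only considering files in /etc/exports.d that end in '.exports' and ignoring files that begin with a dot.

```python
import os
import subprocess
import tempfile


def exports_line(path, clients):
    """Format one exports(5) line: '/path client1(opts) client2(opts)'."""
    return path + " " + " ".join(f"{c}({o})" for c, o in clients.items())


def apply_exports(exports_dir, name, lines, run=subprocess.run):
    """Atomically write <exports_dir>/<name>.exports, then run 'exportfs -r'."""
    target = os.path.join(exports_dir, name + ".exports")
    # The temporary name starts with a dot and lacks the .exports suffix,
    # so it should not be picked up if exportfs runs at the wrong moment.
    fd, tmp = tempfile.mkstemp(dir=exports_dir, prefix=".tmp-")
    with os.fdopen(fd, "w") as f:
        f.write("\n".join(lines) + "\n")
    os.chmod(tmp, 0o644)
    os.replace(tmp, target)
    # On older kernels you may prefer ["exportfs", "-fr"] to force a full flush.
    run(["exportfs", "-r"], check=True)
    return target
```

Real use would be something like apply_exports("/etc/exports.d", "data", [exports_line("/export/data", {"client1": "rw"})]), run as root.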

(I believe that you want to do this even when removing an export entirely, or at least that doing so is the simplest way of un-exporting something.)

To put it another way, consistently using 'exportfs -r' turns /var/lib/nfs/etab into a processed cache instead of another source of truth. That's certainly what we want in general operation (perhaps not in emergencies, but emergencies are special cases).

NFSExportPermsHandling written at 22:18:39


I think I like systemd's DynamicUser feature (under the right circumstances)

Our Prometheus metrics system involves a lot of daemons that do things like generate metrics, both official daemons and various third party ones. Many of these daemons and the things they do are essentially stateless, because they can be and it makes them simpler. I was recently setting up such a daemon (a new one) on my office workstation, and as part of that I wanted to pick a UID for it to run as. I consider this particular daemon slightly more risky than usual, so I didn't want to go with either root (it doesn't need that much power) or the 'prometheus' user, which is not root but does wind up owning things like the Prometheus metrics data storage. At about the time I was looking through my /etc/passwd to try to find a user that I was comfortable with and that would work, a little light went on in my mind and I remembered that systemd services can use dynamic users.

Stateless daemons with no special permissions requirements are an ideal match for dynamic users, because basically I just want a generic non-root user. The user doesn't need to have any special privileges, and it doesn't need a stable UID or GID because it will never have anything on disk (or at least, not outside of brief uses of /tmp). Systemd making these UIDs and GIDs up on the fly saves me the effort of creating one or more new users, and as you can see, having to explicitly create new users is enough annoyance that I might not do it at all.

(The other advantage of dynamic users here is that if I decide to stop using a daemon, I'm not left with a stray user and group to clean up at some indefinite point in the future.)

Switching my .service file from a 'User=' line to 'DynamicUser=yes' basically just worked. After a daemon restart, the daemon was running happily under its new unique UID with everything working fine. The daemons I converted had no problems running other programs or making network connections, either.

You don't have to restrict a DynamicUser service to standard Unix 'regular UID' permissions (well, somewhat less than that, since systemd adds extra restrictions). I run Prometheus's Blackbox exporter as a non-root user but explicitly augment its capabilities with CAP_NET_RAW so that it can send and receive ICMP packets:
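
As a sketch, the relevant part of such a .service file looks something like the following. AmbientCapabilities= and CapabilityBoundingSet= are standard systemd directives, but this particular combination is an assumption on my part about what you'd want here, not a copy of a real unit.

```ini
[Service]
# Run as a transient UID and GID that systemd allocates at start time.
DynamicUser=yes
# Give the non-root process CAP_NET_RAW so it can do ICMP.
AmbientCapabilities=CAP_NET_RAW
# Assumed extra hardening: don't allow any other capability.
CapabilityBoundingSet=CAP_NET_RAW
```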


This still works fine with it converted from 'User=prometheus' to 'DynamicUser=yes'.

After this positive experience, I'm probably going to start making more use of 'DynamicUser=yes'. If a stateless thing doesn't have to run as root, switching it to using a dynamic user is both pretty trivial and a bit more secure.

(Systemd theoretically supports dynamic users for services with some state, but there can be problems with that.)

SystemdDynamicUserLike written at 22:32:38

