A gotcha with Fedora 30's switch of Grub to BootLoaderSpec based configuration
I upgraded my office workstation from Fedora 29 to Fedora 30 yesterday.
In the past, such upgrades have been problem free, but this time around
things went fairly badly. The first and largest problem was that after
the upgrade, booting any kernel gave me a brief burst of kernel messages,
then a blank screen, and after a few minutes a return to the BIOS and
Grub main menu. To get my desktop to boot at all, I had to add
'nomodeset' to the kernel command line; among other consequences, this
made my desktop a single display machine instead of a dual display one.
(It was remarkably disorienting to have my screen mirrored across both displays. I kept trying to change to the 'other' display and having things not work.)
The short version of the root cause is that my grub.cfg was
rebuilt using outdated kernel command line arguments that came from
/etc/default/grub, instead of the current command line arguments
that had previously been used in my original grub.cfg. Because of
how the Fedora 30 grub.cfg is implemented, these wrong command
line arguments were then remarkably sticky and it wasn't clear how
to change them.
In Fedora 29 and earlier, your grub.cfg is probably being maintained
by grubby, Fedora's program for this. When grubby adds a
menu entry for a new kernel, it more or less copies the kernel
command line arguments from your current one. While there is a
GRUB_CMDLINE_LINUX setting in /etc/default/grub, its contents
are ignored until and unless you rebuild your grub.cfg from
scratch, and there's nothing that tries to update it from what the
kernels in your current grub.cfg are actually using. This
means that your /etc/default/grub version can wind up being very
different from what you're currently using and actually need to
make your kernels work.
One of the things that usually happens by default when you upgrade
to Fedora 30 is that Fedora switches how
grub.cfg is created and
updated from the old way of doing it itself via
grubby to using
a Boot Loader Specification (BLS) based scheme; you
can read about this switch in the Fedora wiki.
This switch regenerates your
grub.cfg using a shell script called
grub2-switch-to-blscfg, and this shell script of
course uses /etc/default/grub's
GRUB_CMDLINE_LINUX as the source
of the kernel arguments.
(This is controlled by whether GRUB_ENABLE_BLSCFG is set to true or
false in your /etc/default/grub. If it's not set at all,
grub2-switch-to-blscfg adds a 'GRUB_ENABLE_BLSCFG=true'
setting to /etc/default/grub for you, and of course goes on to
regenerate your grub.cfg. grub2-switch-to-blscfg itself is run
from the Fedora 30 grub2-tools RPM posttrans scriptlet if
GRUB_ENABLE_BLSCFG is not already set to something in your
/etc/default/grub.)
The regenerated grub.cfg has a default_kernelopts setting, and
that looks like it should be what you want to change. However, it
is not. The real kernel command line for normal BLS entries is
actually in the Grub2 $kernelopts environment variable, which is
loaded from the grubenv file, normally /boot/grub2/grubenv
(which may be a symlink to /boot/efi/EFI/fedora/grubenv, even if
you're not actually using EFI boot). The
best way to change this is to use 'grub2-editenv - list' and
'grub2-editenv - set kernelopts="..."'. I assume that
default_kernelopts is magically used by the generated grub.cfg if
$kernelopts is unset, and possibly gets written back to
grubenv by Grub2 in that case.
(You can check that your kernels are using
$kernelopts by inspecting
an entry in
/boot/loader/entries and seeing that it has '$kernelopts' instead
of anything else. You can manually change
that for a specific entry if you want to.)
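As a small illustration of that check, here's a sketch (the file names in it are hypothetical) that reports whether a given BLS entry file takes its kernel arguments from $kernelopts:

```shell
#!/bin/sh
# Sketch: does a BLS entry file under /boot/loader/entries take its
# kernel arguments from $kernelopts? The entry file name is whatever
# your particular entry happens to be called.
uses_kernelopts() {
    grep -q '^options.*\$kernelopts' "$1"
}

# Hypothetical usage:
#   uses_kernelopts /boot/loader/entries/SOMEID-5.0.16.conf &&
#       echo "entry uses \$kernelopts"
```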
This is going to make it more interesting (by which I mean annoying)
if and when I need to change my standard kernel options. I think
I'm going to have to change all of /etc/default/grub, the kernelopts
setting in grubenv, and the default_kernelopts in grub.cfg, just to
be sure. If I was happy with the auto-generated grub.cfg, I could
just change /etc/default/grub and force a regeneration, but I'm not,
and I have not yet worked out how to make its handling of the video
modes and the menus agree with what I want (which is a basic text
menu).
(While I was initially tempted to leave my system as a non-BLS
system, I changed my mind because of long term issues. Fedora will
probably drop support for
grubby based setups sooner or later,
so I might as well get on the BLS train now.)
To give credit where it's due, one (lucky) reason that I was able to eventually work out all of this is that I'd already heard about problems with the BLS transition in Fedora 30 in things like Fedora 30: When grub2-mkconfig Doesn’t Work, and My experiences upgrading to Fedora 30. Without that initial awareness of the existence of the BLS transition in Fedora 30 (and the problems it caused people), I might have been flailing around for even longer than I was.
PS: As a result of all of this, I've discovered that you no longer
need to specify the root device in the kernel command line arguments.
I assume the necessary information for that is in the dracut-built
initramfs. As far as the blank screen and kernel panics go, I suspect
that the cause is 'amdgpu.dpm=0' and/or a second kernel argument,
both of which were still present in the /etc/default/grub arguments but
which I'd long since removed from my actual kernel command lines.
(I could conduct more experiments to try to find out which kernel argument is the fatal one, but my interest in more reboots is rather low.)
Systemd and waiting until network interfaces or addresses are configured
One of the things that systemd is very down on is the idea of running services after 'the network is up', whatever that means; the systemd people have an entire web page on the subject. This is all well and good in theory, but in practice there are plenty of situations where I need to only start certain things after either a named network interface is present or an IP address exists. For a concrete example, you can't set up various pieces of policy based routing for an interface until the interface actually exists. If you're configuring this on boot in a systemd based system (especially one using networkd), you need some way to ensure the ordering. Similarly, sometimes you need to listen only on some specific IP addresses and the software you're using doesn't have Linux specific hacks to do that when the IP address doesn't exist yet.
(As a grumpy sysadmin, I actually don't like the behavior of binding to an IP address that doesn't exist, because it means that daemons will start and run even if the system will never have the IP address. I would much rather delay daemon startup until the IP address exists.)
Systemd does not have direct native support for any of this, of course. There's no way to directly say that you depend on an interface or an IP address, and in general the dependency structure has long been under-documented. The closest you can get to waiting until a named network interface exists is to specify an After= and perhaps a Wants= or a Requires= on the pseudo-unit for the network interface, 'sys-subsystem-net-devices-<iface>.device'. However, as I found out, the lack of a .device unit doesn't always mean that the interface doesn't exist.
You might think that in order to wait for an IP address to exist, you could specify an After= for the .device unit it's created in and by. However, this has historically had issues for me; under at least some versions of systemd, the .device unit would be created before the IP address was configured. In my particular situation, what worked at the time was to wait for a VLAN interface .device that was on top of the real interface that had the IP address (and yes, I mix tagged VLANs with an untagged network). By the time the VLAN .device existed, the IP address had relatively reliably been set up.
If you're using systemd-networkd and care about network interfaces, the easiest approach is probably to rely on systemd-networkd-wait-online.service; how it works and what it waits for is probably about as good as you can get. For IP addresses, as far as I know there's no native thing that specifically waits until some or all of your static IP addresses are present. Waiting for systemd-networkd-wait-online is probably going to be good enough for most circumstances, but if I needed better I would probably write a shell script (and a .service unit for it) that simply waited until the IP addresses I needed were present.
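The 'wait until my IP addresses are present' shell script I mention would be a sketch along these lines; the address, the timeout, and the function name are all my own inventions here, and it assumes the iproute2 'ip' command:

```shell
#!/bin/sh
# Sketch: wait until a given IPv4 address is configured on some interface,
# polling once a second up to a timeout.
# Usage: wait_for_ip ADDRESS [TIMEOUT-SECONDS]
wait_for_ip() {
    addr="$1"; timeout="${2:-60}"
    while [ "$timeout" -gt 0 ]; do
        # 'ip -o addr show' prints one line per configured address.
        if ip -o addr show | grep -qF "inet $addr/"; then
            return 0
        fi
        sleep 1
        timeout=$((timeout - 1))
    done
    return 1
}
```

A Type=oneshot .service unit running this, ordered Before= the real daemon, would give you the 'delay daemon startup until the IP exists' behavior I'd prefer.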
(I continue to think that it's a real pity that you can't configure networkd .network files to have 'network up' and 'network down' scripts, especially since their stuff for routing and policy based routing is really very verbose.)
PS: One of the unfortunate effects of the under-documented dependency structure and the lack of clarity of what to wait on is a certain amount of what I will call 'superstitious dependencies', things that you've put into your systemd units without fully understanding whether or not you needed them, and why (often also without fully documenting them). This is fine most of the time, but then one day an unnecessary dependency fails to start or perhaps exist and then you're unhappy. That's part of why I would like explicit and reliable ways to do all of this.
Linux can run out of memory without triggering the Out-Of-Memory killer
If you have a machine with strict overcommit turned on, your memory allocation requests will start to fail once enough virtual address space has been committed, because that's what you told the kernel to do. Hitting your strict overcommit limit doesn't trigger the Out-Of-Memory killer, because the two care about different things; strict memory overcommit cares about committed address space, while the global OOM killer cares about physical RAM. Hitting the commit limit may kill programs anyway, because many programs die if their allocations fail. Also, under the right situations, you can trigger the OOM killer on a machine set to strict overcommit.
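You can watch the relevant bookkeeping yourself. This is a sketch (the function name is mine) that compares the committed address space against the commit limit from /proc/meminfo; the limit is only enforced under strict overcommit (vm.overcommit_memory=2), but the numbers are maintained regardless:

```shell
#!/bin/sh
# Sketch: report the committed address space versus the commit limit.
# Values in /proc/meminfo are in kB.
commit_status() {
    if [ -r /proc/meminfo ]; then
        awk '/^CommitLimit:/  { cl = $2 }
             /^Committed_AS:/ { ca = $2 }
             END { printf "Committed_AS %d kB of CommitLimit %d kB\n", ca, cl }' \
            /proc/meminfo
    else
        echo "no /proc/meminfo here"
    fi
}
commit_status
```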
Until recently, if you had asked me about how Linux behaved in the default 'heuristic overcommit' mode, I would have told you that ordinary memory allocations would never fail in it; instead, if you ran out of memory (really RAM), the OOM killer would trigger. We've recently found out that this is not the case, at least in the Ubuntu 18.04 LTS '4.15.0' kernel. Under (un)suitable loads, various of our systems can run out of memory without triggering the OOM killer and persist in this state for some time. When it happens, the symptoms are basically the same as what happens under strict overcommit; all sorts of things can't fork, can't map shared libraries, and so on. Sometimes the OOM killer is eventually invoked, other times the situation resolves itself, and every so often we have to reboot a machine to recover it.
I would like to be able to tell you why and how this happens, but
I can't. Based on the kernel code involved, the memory allocations
aren't being refused because of heuristic overcommit, which still
has its very liberal limits on how much memory you can ask for (see
__vm_enough_memory in mm/util.c).
Instead something else is causing forks, mmap()s of shared
libraries, and so on to fail with 'out of memory',
and whatever that something is, it doesn't trigger the OOM killer
during the failure and doesn't cause the kernel to log any other
messages, such as the ones you can see for page allocation failures.
(Well, the messages you see for certain page allocation failures. Page
allocations can be flagged as __GFP_NOWARN, which suppresses
these messages.)
PS: Unlike the first time we saw this, the recent cases have committed address space rising along with active anonymous pages, and the kernel's available memory dropping in sync and hitting zero at about the time we see failures start.
Roughly when the Linux Out-Of-Memory killer triggers (as of mid-2019)
For reasons beyond the scope of this entry, I've recently become interested in understanding more about when the Linux OOM killer does and doesn't trigger, and why. Detailed documentation on this is somewhat sparse and some of it is outdated (eg). I can't add detailed documentation, because doing that requires fully understanding kernel memory management code, but I can at least write down some broad overviews for my own use.
(All of this is as of the current Linux kernel git tree, because that's what I have on hand. The specific details change over time, although the code seems broadly unchanged between git tip and the Ubuntu 18.04 LTS kernel, which claims to be some version of 4.15.)
These days there are two sort of different OOM killers in the kernel; there is the global OOM killer and then there is cgroup-based OOM through the cgroup memory controller, either cgroup v1 or cgroup v2. I'm primarily interested in when the global OOM killer triggers, partly because the cgroup OOM killer is relatively more predictable.
The simple answer is that the global OOM killer triggers when the kernel has problems allocating pages of physical RAM. When the kernel is attempting to allocate pages of RAM (for whatever use, either for kernel usage or for processes that need pages) and initially fails, it will try various ways to reclaim and compact memory. If this works or at least makes some progress, the kernel keeps retrying the allocation (as far as I can tell from the code); if they fail to free up pages or make progress, it triggers the OOM killer under many (but not all) circumstances.
(The OOM killer is not triggered if, for instance, the kernel is
asking for a sufficiently large number of contiguous pages. At the
moment, the OOM killer is still only invoked for contiguous
allocations of 32 KB or less (order 3), which is the same as it was
back in 2012; in fact, a comment in the code says this dates from 2007.)
As far as I can tell, there's nothing that stops the OOM killer being triggered repeatedly for the same attempted page allocation. If the OOM killer says it made progress, the page allocation is retried, but there's probably no guarantee that you can get memory now (any freed memory might have been grabbed by another request, for example). Similarly, as far as I can tell the OOM killer can be invoked repeatedly in close succession; there doesn't seem to be any 'must be X time between OOM kills' limits in the current code. The trigger is simply that the kernel needs pages of RAM and it can't seem to get them any other way.
(Of course you hope that triggering the OOM killer once frees up a bunch of pages of RAM, since that's what it's there for.)
The global OOM killer is not particularly triggered when processes simply allocate (virtual) memory, because this doesn't necessarily allocate physical pages of RAM. Decisions about whether or not to grant such memory allocation requests are not necessarily independent of the state of the machine's physical RAM, but I'm pretty sure you can trigger the OOM killer without having reached strict overcommit limits and you can definitely have memory allocation requests fail without triggering the OOM killer.
PS: I will avoid speculating about situations where this approach might fail to trigger the OOM killer when it really should, but depending on how reclaim is implemented, there seem to be some relatively obvious possibilities.
Sidebar: When I think cgroups OOM is triggered
If you're using the memory cgroup controller (v1 or v2) and you set a maximum memory limit, this is (normally) a limit on how much RAM the cgroup can use. As the cgroup's RAM usage grows towards this limit, the kernel memory system will attempt to evict the cgroup's pages from RAM in various ways (such as swapping them out). If it fails to evict enough pages fast enough and the cgroup runs into its hard limit on RAM usage, the kernel triggers the OOM killer against the cgroup.
This particular sausage seems to be made in mm/memcontrol.c.
You want to look for the call to
out_of_memory and work backward.
I believe that all of this is triggered by any occasion when a page
of RAM is charged to a cgroup, which includes more than just the RAM
directly used by processes.
(In common configurations, I believe that a cgroup with such a hard memory limit can consume all of your swap space before it triggers the OOM killer.)
If you want to know whether an OOM kill was global or from a cgroup limit, this is in the kernel message. For a cgroup OOM kill, the kernel message will look like this:
Memory cgroup out of memory: Kill process ... score <num> or sacrifice child
For a global out of memory, the kernel message will look like this:
Out of memory: Kill process ... score <num> or sacrifice child
I sort of wish the global version specifically mentioned that it was a non-cgroup OOM kill, but you can see how it wound up this way; when cgroup OOM kills were introduced, I suspect that no one wanted to change the existing global OOM kill message for various reasons.
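If you're scanning logs for these, the distinction is easy to make mechanically; here's a tiny sketch (the function name is mine) that classifies an OOM kill message line using exactly the two forms above:

```shell
#!/bin/sh
# Sketch: classify an OOM kill kernel message line as a cgroup kill,
# a global kill, or neither. Matches anywhere in the line so that
# timestamp or syslog prefixes don't matter.
classify_oom() {
    case "$1" in
        *"Memory cgroup out of memory: Kill process"*) echo cgroup ;;
        *"Out of memory: Kill process"*)               echo global ;;
        *)                                             echo other  ;;
    esac
}

# e.g.: journalctl -k | while read -r line; do classify_oom "$line"; done
```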
Rewriting my iptables rules using ipsets
On Mastodon, I was tempted:
My home and office workstation have complicated networking, but their firewall rules are actually relatively simple. Maybe it's time to switch them over from annoying iptables to the new shiny nftables stuff, which might at least be more readable (and involve less repetition).
Feedback convinced me to not go that far. Instead, today I rewrote my iptables rules in terms of ipsets (with multiple set matches), which eliminated a great deal of their prior annoyance (although not all of it).
My workstation firewall rules did not previously use ipsets because
I first wrote them before ipsets were a thing; in fact, they date
from the days of ipchains
and Linux 2.2. In the pre-ipset world, this meant a separate iptables
rule for each combination of source IP, destination port, and
protocol that I wanted to block (or allow). On my office workstation,
this wound up with over 180 INPUT table rules (most of them generated
by my ancient automation).
Contrary to what I asserted a few years ago, most of the actual firewall rules being expressed by all of these iptables rules are pretty straightforward. Once I simplified things a bit, there are some ports that only my local machine can access, some ports that only 'friendly' machines can access, and some machines I don't like that should be blocked from a large collection of ports, even ones that are normally generally accessible. This has an obvious translation to ipset based rules, especially if I don't try to be too clever, and the result is a lot fewer rules that are a lot easier to look over. There's still some annoying repetition because I want to match both the TCP and UDP versions of most ports, but I can live with that.
(Enough of the ports that I want to block access to come in both TCP and UDP versions that it's not worth making a finer distinction. That would lead to more ipsets, which is more annoying in practice.)
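The shape of the translation is simple enough to sketch. This is not my actual setup script; it's a hypothetical fragment (set names, addresses, and ports are all made up) that emits the ipset and iptables commands for 'these machines may not talk to these ports', with one rule per protocol:

```shell
#!/bin/sh
# Sketch: emit the commands for blocking a set of hosts from a set of
# ports. Printed rather than executed, since the real thing needs root.
emit_blocklist() {
    echo "ipset create badhosts hash:ip"
    echo "ipset add badhosts 192.0.2.10"
    echo "ipset create blockedports bitmap:port range 1-65535"
    for p in 25 53 515 631; do
        echo "ipset add blockedports $p"
    done
    # One rule per protocol instead of one rule per host x port pair.
    for proto in tcp udp; do
        echo "iptables -A INPUT -p $proto" \
             "-m set --match-set badhosts src" \
             "-m set --match-set blockedports dst -j DROP"
    done
}
emit_blocklist
```

Run for real, this collapses what would otherwise be (hosts x ports x protocols) individual iptables rules into two.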
When I did the rewrite, I did simplify some of the fine distinctions I had previously made between various ports and various machines. I also dropped some things that were obsolete, both in terms of ports that I was blocking and things like preventing unencrypted GRE traffic, since I no longer use IPsec. I could have done this sort of reform without a rewrite, but I had nothing to push me to do it until now and it wouldn't have been as much of a win. The actual rewrite was a pretty quick process and the resulting shell script is what I consider to be straightforward.
(The new rules also have some improvements; for example, I now have some IPv6 blocks on my home machine. Since I already had an ipset of ports, I could say 'block incoming IPv6 traffic from my external interface to these ports' in a single ip6tables rule.)
As far as I'm concerned, so far the three big wins of the rewrite
are that 'iptables -nL INPUT' no longer scrolls excessively, I'm
no longer dependent on my ancient automation to generate iptables
rules, and I've wound up writing a shell script to totally clear
out all of my iptables rules and tables (because I kept wanting to
re-run my setup script as I changed it). That my ancient automation
silently broke for a while (again) and left my office workstation
without most of its blocks since late March is one thing that pushed
me into making this change now.
(Late March is when I updated to my first Fedora 5.x kernel, and guess what my ancient automation threw up its hands at. If you're curious why, it was (still) looking at the kernel version to decide whether to use ipchains or iptables.)
Some notes on understanding how to use flock(1)
The flock(1) command has rather complicated usage, with a bunch of
options, which makes it not entirely clear how to use it for shell
script locking in various different circumstances. Here are some
notes on this, starting with understanding what it's doing and the
implications of that.
The key to understanding all of flock(1)'s weird options is to know
that flock(2) automatically releases your lock when the last copy
of the file descriptor you locked is closed, and that file descriptors
are shared with child processes.
Given this, we can start with the common basic
flock(1) usage of:
flock -n LOCKFILE SHELL-SCRIPT [ARGS ...]
This opens LOCKFILE, locks it with flock(2), and then
starts the shell script, which will inherit an open (and locked)
file descriptor for LOCKFILE. As long as the shell script process
or any sub-process it starts still exists with that file descriptor
open, the lock is held, even if flock(1) itself is killed for some
reason.
This is generally what you want; so long as any component of the
shell script and the commands it runs is still running, it's
potentially not safe to start another copy. Only when everything
exits, flock included, is the lock released.
However, this is perhaps not what you want if flock is used to
start a daemon that doesn't close all of its file descriptors,
because then the daemon will inherit the open (and locked) file
descriptor for LOCKFILE and the lock will never be released. If
this is the case, you want to start flock with the '-o' option,
which does not pass the open file descriptor for LOCKFILE to the
command flock winds up running:
flock -n -o LOCKFILE SHELL-SCRIPT [ARGS ...]
Run this way, the only thing holding the lock is flock itself. When
flock exits (for whatever reason), the file descriptor will
be closed and the lock released, even if SHELL-SCRIPT is still
running.
(Of course, having a daemon inherit an open and locked file descriptor
for LOCKFILE is a convenient way to only have one copy of the daemon
running. As long as the first copy is still running, further attempts to
get the lock will fail; if it exits, the lock is released.)
The final usage is that flock(1) can be directly told the file
descriptor number to lock. In order to be useful, this requires
some shared file descriptor that will live on after flock itself exits;
the usual place to get this is by redirecting some file descriptor
of your choice to or from a file for an entire block of a shell
script, like this:
(
    flock -n 9 || exit 1
    ...   # locked commands
) 9>/var/lock/mylockfile
This is convenient if you only want to lock some portions of a shell
script or don't want to split a shell script into two, especially
since the first script would then mostly just be 'flock -n
/var/lock/mylockfile script-part-2'. On the other hand, it is sort
of tricky and clever, perhaps too clever. I'd certainly want to
comment it heavily in any shell script I wrote.
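Here's a runnable sketch of that pattern (the lock file path is arbitrary) which also demonstrates that the lock really is held while the block runs, by showing a second, independent attempt failing:

```shell
#!/bin/sh
# Sketch: fd-based flock in a '( ... )' block, plus a demonstration
# that a second flock attempt on the same file is refused.
lockfile="${TMPDIR:-/tmp}/flockdemo.lock"

result=$(
    (
        flock -n 9 || exit 1
        # We hold the lock via fd 9; an independent attempt must now fail.
        flock -n "$lockfile" -c 'echo got it anyway' ||
            echo "second lock refused"
    ) 9>"$lockfile"
)
echo "$result"
```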
However, you don't necessarily have to go all the way to doing this
if you just want to flock some stuff that involves shell operations
like redirecting files and so on, because you can use '-c'
to run a shell command line instead of just a program:
flock -n LOCKFILE -c '(command1 | command2 >/some/where) && command3'
This can also get too tricky, of course. There's only so much that's sensible to wedge into a single shell command line, regardless of what's technically possible.
Once you're locking file descriptors, you can also unlock file
descriptors with 'flock -u'. This is probably useful mostly if
you're going to unlock and then re-lock, and that probably wants
you to be using
flock without the '
-n' option for at least the
re-lock. I imagine you could use this in a shell script loop, for
example something like:
(
    for file in "$@"; do
        flock 9
        big-process "$file"
        flock -u 9
        more-work ...
    done
) 9>/var/lock/mylockfile
This would allow more-work to run in parallel with another
big-process, while not allowing two big-process commands to be
running at once.
(This feels even more tricky and clever than the basic usage of
flock'ing a file descriptor in a shell '( ... )' block, so I
suspect I'll never use it.)
If you can, you should use flock(1) for shell script locking
I have in the past worked out and used complicated but portable approaches for doing locking in shell scripts. These approaches get much more complicated if processes can die abruptly without undoing their lock. You can generally arrange things so that your locks are cleared if the entire machine reboots, but that's about it as far as simple approaches go. Sometimes this is what you want, but often it isn't.
As a result of a series of issues with our traditional shell script
locking, I have been more and more moving to using Linux's
flock(1) when I can,
which is to say for scripts that only have to run on our Linux
machines (which is almost all of our machines today). flock is
sufficiently useful and compelling here that I might actually port
it over to other Unixes if we had to integrate such systems into
our current Linux environment.
(Anything we want to use should have the flock(2) system call, and
hopefully that's the only thing the flock program really depends on.)
There are two strongly appealing sides to
flock. The first is
that it provides basically the usage that we want; in normal
operation, it runs something with the lock held and releases the
lock when the thing exits. The second is that it automatically
releases the lock if something goes wrong, because flock(2) locks
evaporate when the file descriptor is closed.
(The manpage's description of '-o' may make you confused about it.
What it means is that the open file descriptor of the
lock is not inherited by the command
flock runs. Normally you
want the command to inherit the open file descriptor, because it
means that so long as any process involved is still running, the
lock is held, even if
flock itself gets killed for some reason.)
Generally I want to use 'flock -n', because we mostly use locking
for 'only one of these should ever be running at once'; if the lock
is held, a previous cron job or whatever is still active, so the
current one should just give up.
We have one script using a traditional shell script approach to
locking that I very carefully and painfully revised to be more or
less safe in the face of getting killed abruptly. Since it logs
diagnostics if it detects a stale lock, there's a certain amount
of use in having it around, but I definitely don't want to ever
have to do another script like it, and it's a special case in some
other ways that might make it awkward to use with flock. The
experience of revising that script is part of what pushed me very
strongly to using flock for others.
Getting NetworkManager to probably verify TLS certificates for 802.1x networks
I'll start with my tweets:
We have an 802.1X WPA2 Enterprise university-wide wireless network, using PEAPv0 authentication, which involves a TLS certificate. I do not appear to be able to get NetworkManager to verify the TLS certificate in a way that will let me actually connect.
The only way I can connect to our university wifi is by setting 'No CA certificate is required'. I cannot supply a CA certificate that works (I've tried), and I cannot turn on 802-1x.system-ca-certs ; nmcli just doesn't save it, no matter what, without any reported error.
With the aid of some replies from @grawity, I was able to navigate to a solution that allows me to connect without that 'No CA certificate is required' having to be set, and probably even verifies the TLS certificate.
The magic trick for me was telling NetworkManager that it should
use the system bundle of TLS certificates as the 'CA certificate'
it wants. The one important trick is that NetworkManager wants the
PEM format certificate bundle (and/or certificate), not the DER form.
How you tell them apart is that the PEM form is base64 ASCII while
the DER form is binary. Anything with a
.pem extension had better
be a PEM file, but a
.crt extension can be either.
On Fedora 29, the system certificate bundle is found as either /etc/ssl/certs/ca-bundle.crt or /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem; the former is a symlink to the latter. On Ubuntu and Debian, you want /etc/ssl/certs/ca-certificates.crt. I don't know if there are any special SELinux considerations that apply depending on the path you select, because I turned that off long ago on my laptop.
I don't know if this setup makes NetworkManager actually verify the TLS certificates (or perhaps wpa_supplicant, which is apparently the thing that really does the work even when NetworkManager is being the frontend). But at least I'm not telling NetworkManager to maybe ignore TLS security entirely.
(When I was looking at logs through journalctl, they were sufficiently ambiguous to me that I couldn't be sure.)
Sidebar: A further puzzle
At this point I don't have my laptop and its logs of TLS certificate
information handy, but the more I look at our university page for
campus wireless and the certificates it lists, the more puzzled I get.
My attempts to verify the TLS certificate started with the TLS
certificate listed there and proceeded through what a certificate
dump told me were the CA
certificates for that TLS certificate. However, now that I look
more carefully, the page also has a CA bundle that is supposed to
be current, but that CA bundle has a rather different set of CA
certificates. It's possible that had I gotten and used that CA
bundle, the actual 802.1x TLS certificate I was presented with would
have verified.
(It's apparently possible to capture the 802.1x server TLS certificate, but it may not be easy. And you have to be on the wireless network in question, which I'm not as I write this entry.)
How exportfs and rpc.mountd handle NFS export permissions on Linux
While the Linux kernel NFS server maintains an authentication
cache, the final authority on what filesystems are exported to who
and with what permissions is rpc.mountd. It gets this information
from /var/lib/nfs/etab, which conveniently
is a plain text file. However, mountd reads the information from
etab into an internal data structure and only re-does this when
etab's inode number changes. As far as I can tell there's
nothing else to it, which means that you can create a new version
of etab by hand if you want to.
(lsof will tell you that mountd holds etab open, but I believe it
does this purely so that etab's current inode number can't
be reused for a new file. It never re-reads the opened file.)
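The write-to-a-temporary-file-and-rename approach guarantees that mountd sees a new inode number every time. A standalone sketch of why (GNU stat is assumed, and the demo file path is made up):

```shell
#!/bin/sh
# Sketch: renaming a new file over an old one always changes the
# inode number the name refers to, which is exactly the signal
# mountd watches for on etab.
f="${TMPDIR:-/tmp}/etab.demo"
echo "old contents" > "$f"
ino1=$(stat -c %i "$f")

echo "new contents" > "$f.tmp"
mv "$f.tmp" "$f"          # the same rename step exportfs uses for etab.tmp
ino2=$(stat -c %i "$f")

[ "$ino1" != "$ino2" ] && echo "inode changed: $ino1 -> $ino2"
```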
Normally, new versions of /var/lib/nfs/etab are created only by
exportfs (which writes
the new version to an 'etab.tmp' file and then renames it). Because
exportfs allows you to make NFS export changes through the command
line that are not present in /etc/exports and company, in normal
operation exportfs determines the new contents of
/var/lib/nfs/etab in part by merging the current contents in
with your new changes. Your new changes can come from the command
line, for things like 'exportfs -u <client>:<path>' and 'exportfs
-i -o <options> <client>:<path>', or from /etc/exports and company
for things like 'exportfs -a'. This behavior of merging in the
exports from the current etab is why 'exportfs -a' doesn't
remove exports that are no longer in /etc/exports.
(A plain 'exportfs -au' has the obvious behavior of writing an
empty etab.)
For exports that exist in both etab and /etc/exports, this merging
process will replace export options from the old etab with the
current ones from /etc/exports and company, including things like
'ro' versus 'rw'. This means that an 'exportfs -a' will at
least update client access permissions for existing exports, even
if it won't cut off clients who have been entirely removed from the
exports file.
Exportfs also has a '-r' option, which is described by the manpage
as:
Reexport all directories, synchronizing /var/lib/nfs/etab with
/etc/exports. This option removes entries in /var/lib/nfs/etab
which have been deleted from /etc/exports, and removes any entries from the kernel export table which are no longer valid.
Although the code in exportfs.c
is hard to follow, the first part of '
exportfs -r' is implemented
by generating the new
etab purely from
/etc/exports and company,
without merging in the current contents of the old etab.
This does exactly what you want once the kernel caches are flushed, and does it without un-exporting
anything that should stay exported. If
something is exported in both the old
etab and the updated one,
rpc.mountd will always permit access; there will
never be a period where
rpc.mountd is using an
etab without the export in it.
(The second claim in the description is what I would call not
entirely correct. The actual code
simply flushes the caches in general. As covered in this entry, in modern kernels any flush is a
total flush, which means that '
exportfs -fr' and 'exportfs -r'
do the same thing in the end. In older kernels, what gets flushed
without '-f' is somewhat chancy, so I think you probably want to
use '-f' to be sure.)
Back in January I mentioned '
exportfs -r' and wondered why our
system for ZFS NFS export permissions wasn't using it. Given what I now know about how all of
this works, we definitely should be using '
exportfs -r' instead
of our current approach of first un-exporting a filesystem and then
re-exporting it (and we'll be changing our scripts to do this). '
exportfs -r' does exactly what we want when changing the
NFS sharing for an existing NFS export. In general, what you want
to do for seamless persistent NFS export changes is to write the
new information to
/etc/exports or some file in /etc/exports.d,
and then run '
exportfs -r' to resynchronize etab and the kernel's exports
to your new reality.
(I believe that you want to do this even when removing an export entirely, or at least that doing so is the simplest way of un-exporting something.)
To put it another way, consistently using '
exportfs -r' turns
/var/lib/nfs/etab into a processed cache instead of another source
of truth. That's certainly what we want in general operation (perhaps
not in emergencies, but emergencies are special cases).
I think I like systemd's
DynamicUser feature (under the right circumstances)
Our Prometheus metrics system involves
a lot of daemons that do things like generate metrics, both official
daemons and various
third party ones. Many of these
daemons and the things they do are essentially stateless, because
they can be and it makes life simpler.
I was recently setting up such a daemon (a new one) on my office
workstation, and as part of that I wanted to pick a UID for it to
run as. I consider this particular daemon slightly more risky than
usual, so I didn't want to go with either
root (it doesn't need
that much power) or the '
prometheus' user, which is not root but
does wind up owning things like the Prometheus metrics data storage.
At about the time I was looking through my
/etc/passwd to try to
find a user that I was comfortable with and that would work, a
little light went on in my mind and I remembered that systemd
services can use dynamic users.
Stateless daemons with no special permissions requirements are an
ideal match for dynamic users, because basically I just want a
generic non-root user. The user doesn't need to have any special
privileges, and it doesn't need a stable UID or GID because it will
never have anything on disk (or at least, not outside of brief uses
of /tmp). Systemd making these UIDs and GIDs up on the fly saves
me the effort of creating one or more new users, and as you can
see, having to explicitly create new users is enough annoyance that
I might not do it at all.
(The other advantage of dynamic users here is that if I decide to stop using a daemon, I'm not left with a stray user and group to clean up at some indefinite point in the future.)
Switching these daemons' .service files from a '
User=' line to 'DynamicUser=yes'
basically just worked. After a daemon restart, the daemon was running
happily under its new unique UID with everything working fine. The
daemons I converted had no problems running other programs or making
network connections, either.
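For concreteness, here is a sketch of what such a unit file might look like (the 'myexporter' name and its path are made-up examples, not anything from a real daemon):

```ini
[Unit]
Description=Example stateless metrics exporter

[Service]
# systemd allocates a UID/GID when the service starts and releases
# it when the service stops; no User= line or useradd is needed.
DynamicUser=yes
ExecStart=/usr/local/bin/myexporter
Restart=on-failure

[Install]
WantedBy=multi-user.target
```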
You don't have to restrict a
DynamicUser service to standard Unix
'regular UID' permissions (well, somewhat less than that, since
systemd adds extra restrictions). I run Prometheus's Blackbox
exporter as a
non-root user but explicitly augment its capabilities with
CAP_NET_RAW so that it can send and receive ICMP packets:
[Service]
[...]
CapabilityBoundingSet=CAP_NET_RAW
AmbientCapabilities=CAP_NET_RAW
This still works fine with it converted from 'User=' to 'DynamicUser=yes'.
After this positive experience, I'm probably going to start making
more use of '
DynamicUser=yes'. If a stateless thing doesn't have
to run as root, switching it to using a dynamic user is both pretty
trivial and a bit more secure.
(Systemd theoretically supports dynamic users for services with some state, but there can be problems with that.)