How Linux starts non-system software RAID arrays during boot under systemd
In theory, you do not need to care about how your Linux software RAID arrays get assembled and started during boot because it all just works. In practice, sometimes you do, and on a modern systemd-based Linux this seems to be an unusually tangled situation. So here is what I can determine so far about how it works for software RAID arrays that are assembled and started outside of the initramfs, after your system has mounted your real root filesystem and is running from it.
(How things work for starting software RAID arrays in the initramfs is quite varied between Linux distributions. There is some distribution variation even for post-initramfs booting, but these days the master version of mdadm ships canonical udev and systemd scripts, services, and so on and I think most distributions use them almost unchanged.)
As has been the case for some time, the basic work is done through udev rules. On a typical Linux system, the main udev rule file for assembly will be called something like 64-md-raid-assembly.rules and be basically the upstream mdadm version. Udev itself identifies block devices that are potentially Linux RAID members (probably mostly based on the presence of RAID superblocks), and mdadm's udev rules then run mdadm in a special incremental assembly mode on them. To quote the manpage:
This mode is designed to be used in conjunction with a device discovery system. As devices are found in a system, they can be passed to mdadm --incremental to be conditionally added to an appropriate array.
As array components become visible to udev and cause it to run
mdadm --incremental on them,
mdadm progressively adds them to
the array. When the final device is added,
mdadm will start the
array. This makes the software RAID array and its contents visible to
udev and to systemd, where it will be used to satisfy dependencies for
/etc/fstab mounts and thus trigger them happening.
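To make this concrete, here is roughly the by-hand equivalent of what the udev rules run for each new component device, as far as I can tell from reading them; the device name is just an example.

# Roughly the by-hand equivalent of what the udev rule runs for each new
# RAID component it sees; /dev/sdb1 is only an example device name.
# The --export output (MD_STARTED, MD_FOREIGN, and so on) is what later
# rules match against.
mdadm --incremental --export /dev/sdb1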
(There are additional mdadm udev rules for setting up device names, starting mdadm monitoring, and so on. And then there's a whole collection of general udev rules and other activities to do things like read the UUIDs of filesystems from new block devices.)
However, all of this only happens if all of the array component devices show up in udev (and show up fast enough); if only some of the devices show up, the software RAID will be partially assembled by mdadm --incremental but not started, because it's not complete.
To deal with this situation and eventually start software RAID arrays in degraded mode, mdadm's udev rules start a systemd timer when enough of the array is present to let it run degraded, specifically the templated timer unit mdadm-last-resort@.timer (so for md0 the specific unit is mdadm-last-resort@md0.timer). If the RAID array isn't assembled and the timer goes off, it triggers the corresponding templated systemd service unit, mdadm-last-resort@.service, which runs 'mdadm --run' on your degraded array to start it.
(The timer unit is only started when mdadm's incremental assembly reports back that it's 'unsafe' to assemble the array, as opposed to impossible. Mdadm reports this only once there are enough component devices present to run the array in a degraded mode; how many devices are required (and what devices) depends on the specific RAID level. RAID-1 arrays, for example, only require one component device to be 'unsafe'.)
Because there's an obvious race potential here, the systemd timer
and service both work hard to not act if the RAID array is actually
present and already started. The timer conflicts with
'sys-devices-virtual-block-<array>.device', the systemd device unit
representing the RAID array, and as an extra safety measure the
service refuses to run if the RAID array appears to be present in
/sys/devices. In addition, the udev rule that triggers systemd
starting the timer unit will only act on software RAID devices that
appear to belong to this system, either because they're listed in
mdadm.conf or because their home host is this host.
(This is the MD_FOREIGN match in the udev rules. The environment variables come from mdadm's --export option, which is used during udev incremental assembly. Mdadm's code for incremental assembly, which also generates these environment variables, is in Incremental.c; the enough() function is in util.c.)
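If you want to see your distribution's exact versions of the pieces involved, udev and systemd will show them to you; these are ordinary query commands and shouldn't change anything (the rules file path may vary between distributions).

# Show the assembly rules and the last-resort units as installed.
cat /usr/lib/udev/rules.d/64-md-raid-assembly.rules
systemctl cat mdadm-last-resort@.timer
systemctl cat mdadm-last-resort@.service

# See whether a last-resort timer is currently pending for some array.
systemctl list-timers 'mdadm-last-resort@*'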
As far as I know, none of this is documented or official; it's just how mdadm, udev, and systemd all behave and interact at the moment. However this appears to be pretty stable and long standing, so it's probably going to keep being the case in the future.
PS: As far as I can tell, all of this means that there are no real
user-accessible controls for whether or not degraded software RAID
arrays are started on boot. If you want to specifically block
degraded starts of some RAID arrays, it might work to 'mask' either or both of the last-resort timer and service unit for
the array. If you want to always start degraded arrays, well, the
good news is that that's supposed to happen automatically.
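If you do want to experiment with blocking degraded starts, the masking approach would look something like this; md0 is just an example array name, and I haven't verified that this is actually sufficient.

# Block the 'last resort' degraded start for md0 (example name only).
systemctl mask mdadm-last-resort@md0.timer
systemctl mask mdadm-last-resort@md0.service

# To undo it later:
systemctl unmask mdadm-last-resort@md0.timer mdadm-last-resort@md0.service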
WireGuard was pleasantly easy to get working behind a NAT (or several)
Normally, my home machine is directly connected to the public Internet by its DSL connection. However, every so often this DSL connection falls over, and these days my backup method of Internet connectivity is that I tether my home machine through my phone. This tethering gives me an indirect Internet connection; my desktop is on a little private network provided by my phone and then my phone NAT's my outgoing traffic. Probably my cellular provider adds another level of NAT as well, and certainly the public IP address that all of my traffic appears from can hop around between random IPs and random networks.
Most of the time this works well enough for basic web browsing and even SSH sessions, but it has two problems when I'm connecting to things at work. The first is that my public IP address can change even while I have a SSH connection present (but perhaps not active enough), which naturally breaks the SSH connection. The second is that I only have 'outside' access to our servers; I can only SSH to or otherwise access machines that are accessible from the Internet, which excludes most of the interesting and important ones.
Up until recently I've just lived with this, because the whole issue just doesn't come up often enough to get me to do anything about it. Then this morning my home DSL connection died at a fairly inopportune time, when I was scheduled to do something from home that involved both access to internal machines and things that very much shouldn't risk having my SSH sessions cut off in mid-flight (and that I couldn't feasibly do from within a screen session, because it involved multiple windows). I emailed a co-worker to have them take over, which they fortunately were able to do, and then I decided to spend a little time to see if I could get my normal WireGuard tunnel up and running over my tethered and NAT'd phone connection, instead of its usual DSL setup. If I could bring up my WireGuard tunnel, I'd have both a stable IP for SSH sessions and access to our internal systems even when I had to use my fallback Internet option.
(I won't necessarily have uninterrupted SSH sessions, because if my phone changes public IPs there will be a pause while WireGuard re-connects and so on. But at least I'll have the chance to have sessions continue afterward, instead of their being intrinsically broken.)
Well, the good news is that my WireGuard setup basically just worked as-is when I brought it up behind however many layers of NAT'ing are going on. The actual WireGuard configuration needed no changes and I only had to do some minor tinkering with my setup for policy-based routing (and one of the issues was my own fault). It was sufficiently easy that now I feel a bit silly for having not tried it before now.
(Things would not have been so easy if I'd decided to restrict what IP addresses could talk to WireGuard on my work machine, as I once considered doing.)
This is of course how WireGuard is supposed to work. Provided that you can pass its UDP traffic in both ways (which fortunately seems to work through the NAT'ing involved in my case), WireGuard doesn't care where your traffic comes from if it has the right keys, and your server will automatically update its idea of what (external) IP your client has right now when it gets new traffic, which makes everything work out.
(WireGuard is actually symmetric; either end will update its idea of the other end's IP when it gets appropriate traffic. It's just that under most circumstances your server end rarely changes its outgoing IP.)
I knew that in theory all of this should work, but it's still nice to have it actually work out in practice, especially in a situation with at least one level of NAT going on. I'm actually a little bit amazed that it does work through all of the NAT magic going on, especially since WireGuard is just UDP packets flying back and forth instead of a TCP connection (which any NAT had better be able to handle).
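One way to watch this endpoint updating happen is with the wg command on the server end; 'wg0' here is just an assumed interface name.

# On the server: show the client's current public endpoint and when the
# last handshake happened ('wg0' is an example interface name).
wg show wg0 endpoints
wg show wg0 latest-handshakes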
On a side note, although I did everything by hand this morning, in
theory I could automate all of this through
dhclient hook scripts, which I'm
already using to manage my resolv.conf (as covered in this entry). Of course this brings up a little issue,
because if the WireGuard tunnel is up and working I actually want
to use my regular resolv.conf instead of the one I switch to when
I'm tethering (without WireGuard). Probably I'm going to defer all
of this until the next time my DSL connection goes down.
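As a sketch of what such automation might look like (the hook location, the interface name, the file names, and the use of wg-quick are all assumptions for illustration, not my actual setup):

# Hypothetical dhclient exit hook; everything here is illustrative.
# dhclient-script sets $interface and $reason for hook scripts.
if [ "$interface" = "enp0s20u1" ] && [ "$reason" = "BOUND" ]; then
    # We just got a lease on the tethering interface; bring up WireGuard.
    wg-quick up wg0 || true
    # Since traffic can now go over the tunnel, switch back to the regular
    # resolv.conf (this file name is made up for the sketch).
    cp /etc/resolv.conf-normal /etc/resolv.conf
fi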
I won't be trying out ZFS's new TRIM support for a while
ZFS on Linux's development version has
just landed support for using
TRIM commands on SSDs in order
to keep their performance up as you write more data to them and the
SSD thinks it's more and more full; you can see the commit here
and there's more discussion in the pull request. This is an exciting
development in general, and since ZoL 0.8.0 is in the release
candidate stage at the moment, this TRIM support might even make
its way into a full release in the not too distant future.
Normally, you might expect me to give this a try, as I have with other new things like sequential scrubs. I've tracked the ZoL development tree on my own machines for years basically without problems, and I definitely have fairly old pools on SSDs that could likely benefit from being TRIM'd. However, I haven't so much as touched the new TRIM support and probably won't for some time.
Some projects have a relatively unstable development tree where running it can routinely or periodically destabilize your environment and expose you to bugs. ZFS on Linux is not like this; historically the code that has landed in the development version has been quite stable and problem free. Code in the ZoL tree is almost always less 'in development' and more 'not in a release yet', partly because ZoL has solid development practices along with significant amounts of automated tests. As you can read in the 'how has this been tested?' section of the pull request, the TRIM code has been carefully exercised both through specific new tests and random invocation of TRIM through other tests.
All of this is true, but then there is the small fact that in practice, ZFS encryption is not ready yet despite having been in the ZoL development tree for some time. This isn't because ZFS encryption is bad code (or untested code); it's because ZFS encryption turns out to be complicated and to interact with lots of other things. The TRIM feature is probably less complicated than encryption, but it's not simple, there are plenty of potential corner cases, and life is complicated by potential issues in how real SSDs do or don't cope well with TRIM commands being issued in the way that ZoL will. Also, an errant TRIM operation inherently destroys some of your data, because that's what TRIM does.
All of this makes me feel that TRIM is inherently much more dangerous than the usual ZoL new feature, sufficiently dangerous that I don't feel confident enough to try it. This time around, I'm going to let other people do the experimentation and collect the arrows in their backs. I will probably only start using ZFS TRIM once it's in a released version and a number of people have used it for a while without explosions.
If you feel experimental despite this, I note that according to
the current manpage
an explicit '
zpool trim' can apparently be limited to a single
disk. I would definitely suggest using it that way (on a pool with
redundancy); TRIM a single disk, wait for the disk to settle and
finish everything, and then scrub your pool to verify that nothing
got damaged in your particular setup. This is definitely how I'm
going to start with ZFS TRIM, when I eventually do.
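Concretely, I expect that cautious sequence to look something like the following; the pool and disk names are made up, and I believe 'zpool status -t' is how you watch TRIM progress in the new code.

# Pool and device names are invented; this is just the cautious sequence.
zpool trim tank sda       # TRIM only this one disk of the pool
zpool status -t tank      # -t should show per-device TRIM progress
# ...wait for the TRIM to finish and the disk to settle, then:
zpool scrub tank
zpool status tank         # verify the scrub found nothing damaged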
(On my work machine, I'm still tracking the ZoL tree so I'm using a version with TRIM available; I'm just not enabling it. On my home machine, for various reasons, I've currently frozen my ZoL version at a point just before TRIM landed, just in case. I have to admit that stopping updating ZoL does make the usual kernel update dance an easier thing, especially since WireGuard has stopped updating so frequently.)
Erasing SSDs with blkdiscard (on Linux)
Our approach to upgrading servers by reinstalling them from scratch on new hardware means that we have a slow flow of previously used servers that we're going to reuse, and thus that need their disks cleaned up from their previous life. Some places would do this for data security reasons, but here we mostly care that lingering partitioning, software RAID superblocks, and so on don't cause us problems on new OS installs.
In the old days of HDs, we generally did this by zeroing out the old drives with dd (on a machine dedicated to the purpose which was just left running in the corner, since this takes some time with HDs), or sometimes with a full badblocks scan. When we started using SSDs in our servers, this didn't seem like such a good idea any more. We didn't really want to use up some of the SSD write endurance just to blank them out or, worse, to write over them repeatedly with badblocks.
Our current solution to this is blkdiscard, which basically sends a TRIM command to the SSD.
Conveniently, the Ubuntu 18.04 server CD image that we use as the
base for our install images contains
blkdiscard, so we can boot
a decommissioned server from install media, wait for the Ubuntu
installer to initialize and find all the disks, and then switch
over to a text console to
blkdiscard its SSDs. In the process
of doing this a few times, I have developed a process and learned
some useful lessons.
First, just to be sure and in an excess of caution, I usually explicitly zero the very start of each disk with 'dd if=/dev/zero of=/dev/sdX bs=1024k count=128; sync' (the count can vary). This at least zeroes out the MBR partition table no matter what. Then when I blkdiscard, I generally background it because I've found that it can take a while to finish and I may have more than one disk to do:

# blkdiscard /dev/sda &
# blkdiscard /dev/sdb &
# wait
I could do them one at a time, but precisely because it can take a while I usually wander away from the server to do other things. This gets everything done all at once, so I don't have to wait twice.
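If we were doing this often enough to care, the whole sequence could be wrapped up in a little script along these lines; the disk list is obviously per-machine, and you want to be very sure it's right before running anything like this.

#!/bin/sh
# Hypothetical wrapper for the sequence above; double-check the disk list.
for d in /dev/sda /dev/sdb; do
    dd if=/dev/zero of=$d bs=1024k count=128
    sync
    blkdiscard $d &
done
wait
echo "blkdiscard finished; consider letting the SSDs sit for a while"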
Finally, after I've run blkdiscard and it's finished, I usually let the server sit there running for a while. This is probably superstition, but I feel like giving the SSDs time to process the TRIM operations before either resetting them with a system reboot or powering the server off (with a 'poweroff', which is theoretically orderly). If I had a bunch of SSDs to work through this would be annoying, but usually we're only recycling one server at a time.
I don't know if SSDs commonly implement
TRIM to return zero sectors
for the TRIM'd space, but for our purposes it's sufficient if they're
random garbage that won't be recognized as anything meaningful. And I
think that SSDs do do that, at least so far, and that we can probably
count on them to do it.
(SSDs might be smart enough to recognize blocks of zeros and turn them into TRIM internally, but why take chances, and if nothing else blkdiscard is easier and faster, even with the waiting afterward.)
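If you're curious about a particular SSD, you can at least see what it claims and spot-check the result; neither of these is a guarantee, just a quick look.

# Does the drive claim deterministic zeroes after TRIM?
hdparm -I /dev/sda | grep -i trim

# Spot-check that the start of the disk now reads back as zeros
# (1 MiB here is an arbitrary amount).
cmp -n 1048576 /dev/zero /dev/sda && echo "first 1 MiB reads as zeros"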
A new and exciting failure mode for Linux UEFI booting
My work laptop boots with UEFI (and Secure Boot) instead of the traditional MBR BIOS booting, because that's what makes modern laptops (and modern versions of Windows) happy. Since it only has a single disk anyway, some of the drawbacks of UEFI booting don't apply to it. However, today I got to discover a new and exciting failure mode of UEFI booting (at least in the Fedora configuration), which is a damaged UEFI system partition FAT32 filesystem. Unfortunately both identifying this problem and fixing it are much harder than you would like, partly because GRUB 2 seems to omit reporting error messages when things go wrong loading grub.cfg.
What happened is that I powered on my laptop as normal this morning,
and when I looked back at it a bit later it was sitting there with
just a '
grub>' prompt. Some flailing around with picking an
alternate UEFI boot entry in the Dell BIOS established that my
Windows install could boot. Some poking around in the GRUB 2 shell
established that GRUB could see everything that I expected, but it couldn't load its grub.cfg from the UEFI system partition, although nothing seemed to complain (including when I manually used the 'configfile' command to try to load it). Eventually I used Grub's
cat command to just dump the
grub.cfg, even though trying to
load it was producing no errors, and at that point GRUB printed
part of the file and stopped with an error about FAT32 problems.
(I don't remember the exact message at this point.)
Recovery from this started with putting together a Fedora 29 live
USB stick (a more irritating process than it should be) and booting
from it. My first step was to run
fsck against the UEFI system
partition, in which I made a mistake; when it identified various
problems, including with
grubenv, I confidently
told it to go ahead and fix things without carefully reading its
proposed fixes. The FAT32 fsck promptly truncated grub.cfg to 0 size, losing all of the somewhat intact contents that I could have used to boot the system with. Fixing that required setting up a chroot environment with enough things mounted that grub2-mkconfig could run but not so many that it would hang when run (apparently having /sys present made this happen), rebooting with this somewhat
damaged grub.cfg, and then re-doing the
grub2-mkconfig to get a
new fully proper GRUB 2 config file.
(To recreate a proper grubenv, the magic incantation is 'grub2-editenv grubenv create'. GRUB 2 will complain on every boot if you don't do this.)
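For my own future reference, the rough shape of the recovery from the live environment was something like the following; the device names and the exact grub.cfg path are from memory and may not match your system.

# From the Fedora live USB environment; device names are illustrative.
mount /dev/mapper/fedora-root /mnt
mount /dev/sda2 /mnt/boot          # if /boot is separate (device is a guess)
mount /dev/sda1 /mnt/boot/efi      # the (repaired) UEFI system partition
mount --bind /dev /mnt/dev
mount --bind /proc /mnt/proc
# deliberately no /sys; having it present made grub2-mkconfig hang for me
chroot /mnt grub2-mkconfig -o /boot/efi/EFI/fedora/grub.cfg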
As far as I can remember, I did nothing unusual with my laptop recently, although I did do a Fedora kernel upgrade (and reboot) and boot Windows to check for updates to it. There were no crashes, no abrupt or forced power-offs, no nothing that ought to have corrupted any filesystem, much less an infrequently touched UEFI system partition. But it did get corrupted. Sadly, in one sense this doesn't surprise me, because FAT32 has a reputation as a fragile file system, especially if different things update it, since various different OSes (and tools) have different FAT32 filesystem code.
(One of the strong recommendations for the FAT32 formatted memory cards used in digital cameras, for example, is that they should be formatted in the camera and only ever written to by the camera. Otherwise you risk the camera not coping with something your computer does to the filesystem or vice versa.)
Part of this issue is due to the choice to put grub.cfg into the UEFI system partition (which is not universal, see the comments on that entry). Grub.cfg is a frequently updated file, and the more often you modify a fragile filesystem the more chances you have for a problem. I don't think it's a coincidence that both grub.cfg and grubenv were damaged.
Sidebar: Why I didn't try to boot a kernel by hand from GRUB
I had two reasons for this. First, at the time I wasn't sure if
my root filesystem was intact either or if I had more widespread
issues than a problem on the UEFI system partition. Second, have
you looked at the command lines required to boot a modern kernel?
I can't possibly remember everything that goes on one or produce
it from scratch, and my laptop's
/proc/cmdline seems to be one
of the shorter ones. Specifically, it is:
BOOT_IMAGE=/vmlinuz-5.0.3-200.fc29.x86_64 root=/dev/mapper/fedora-root ro rd.lvm.lv=fedora/root rd.lvm.lv=fedora/swap rhgb quiet
Some of that I could probably leave out, and in this situation I would certainly want to leave out 'rhgb quiet'. But
the rest clearly matters, and I didn't have another stock Fedora
system around for reference on what it should look like.
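For completeness, booting by hand from the grub> prompt would have meant reconstructing something like the following, with the right partition found through 'ls' poking around; the initramfs name and the linuxefi/initrdefi commands are my assumptions about Fedora's UEFI GRUB (other builds use plain linux and initrd).

grub> set root=(hd0,gpt2)    # a guess; use 'ls' to find the /boot partition
grub> linuxefi /vmlinuz-5.0.3-200.fc29.x86_64 root=/dev/mapper/fedora-root ro rd.lvm.lv=fedora/root rd.lvm.lv=fedora/swap
grub> initrdefi /initramfs-5.0.3-200.fc29.x86_64.img
grub> boot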
ZFS Encryption is still under development (as of March 2019)
One of the big upcoming features that a bunch of people are looking forward to in ZFS is natively encrypted filesystems. This is already in the main development tree of ZFS On Linux, will likely propagate to FreeBSD (since FreeBSD ZFS will be based on ZoL), and will make it to Illumos if the Illumos people want to pull it in. People are looking forward to native encryption so much, in fact, that some of them have started using it in ZFS On Linux already, using either the development tip or one of the 0.8.0 release candidate pre-releases (ZoL is up to 0.8.0-rc3 as of now). People either doing this or planning to do this show up on the ZoL mailing list every so often.
Unfortunately this is not a good idea (despite ZoL being in the 0.8.0 release candidate stage). Instead, you should avoid using ZFS encryption until it's part of an official release, and maybe even past that. Unlike garden variety features and changes in ZoL, where the development tree has historically been almost completely solid and problem free, ZFS encryption is such a significant change that people are still routinely finding bugs and needing to make serious changes, including changes to the on disk data format that require you to back up and restore any encrypted filesystems you may have (yes, really, and see also).
(This particular change is far from the only encryption related problem that has come up. I follow the development tree and read every commit's description, and I've seen quite a lot of commits that fix various encryption related issues. It really seems that people are still frequently finding corner cases that hadn't been considered or previously encountered, despite ZFS On Linux's relatively extensive test suite. ZFS sends and receives seem to be an especial problem area, but my memory is that even ordinary use hasn't been trouble free.)
If you have a strong need for combining encryption and ZFS today, I think that you're going to need to stick to the old approaches of things like ZFS on top of a LUKS encrypted volume. Otherwise, you should wait. The most that people should be doing with ZFS encryption today is taking it for a test drive to gain experience with it; you should definitely not use it for anything you care about.
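The LUKS approach is well-trodden; a minimal sketch of it looks like this, with invented device and pool names.

# Minimal ZFS-on-LUKS sketch; device and pool names are invented.
cryptsetup luksFormat /dev/sdb1
cryptsetup open /dev/sdb1 cryptzfs
zpool create tank /dev/mapper/cryptzfs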
I know, this seems odd given that ZFS On Linux is up to 0.8.0-rc3, but it is what it is. I am a little bit surprised that ZoL has been doing a -rcN series with encryption so apparently unstable, but I'm sure they have their reasons. It's even possible that ZFS On Linux 0.8.0 will not include encryption as a production-ready feature; at this point the developers probably won't disable the code outright, but they might fence it off behind warnings.
(It's possible that encryption has turned out to be more tangled and troublesome than anyone initially expected when the feature first landed, and that it's only through the early enthusiastic people jumping on it that all of these problems have been found.)
PS: I expect that FreeBSD people won't have to worry about this unless you're tracking FreeBSD-CURRENT or FreeBSD-STABLE, since I doubt that FreeBSD will enable ZFS encryption in a FreeBSD release until it's established a solid track record for stability in ZoL.
Our problem with Netplan and routes on Ubuntu 18.04
Today I've wound up getting back to our netplan dysfunction, so I think it's time to write a blog entry. Spoiler: highly specific network device names and configurations that can only be attached to or specified for a named network device interact very badly at scale.
We have a bunch of internal 'sandbox' networks, which connect to our main server subnet through various routing firewalls. Obviously the core router on our server subnet knows how to route to every sandbox; however, we also like to have the actual servers to have specific sandbox subnet routes to the appropriate routing firewall. For some servers this is merely vaguely nice; for high traffic servers (such as some NFS fileservers) it may be pretty important to avoid a pointless traffic bottleneck on the router. So many years ago we built a too-smart system to automatically generate the appropriate routes for any given host from a central set of information about what subnets were behind which gateway, and we ran it on boot to set things up. The result is a bunch of routing commands:
ip route add 10.63.0.0/16 via 188.8.131.52
ip route add 10.70.0.0/16 via 184.108.40.206
ip route add 172.31.0.0/16 via 220.127.116.11
[...]
This system is completely indifferent to what the local system's network interface is called, which is good because in our environment there is a huge assortment of interface names; what an interface is called varies from machine to machine depending on the hardware and its age, and on and on.
All of this worked great for the better part of a decade, until Ubuntu 18.04 came along with netplan. Netplan has two things that together combine to be quietly nearly fatal to what we want to do. First, the netplan setup on Ubuntu 18.04 will wipe out any 'foreign' routes it finds if and when it is re-run, which happens every so often during things like package upgrades. Second, the 18.04 version of netplan has no way to specify routes that are attached to a subnet instead of a specific named interface. If you want netplan to add extra routes to an interface, you cannot say 'associate the routes with whatever interface is on subnet <X>'; instead, you must associate the routes with an interface called <Y>, for whatever specific <Y> is in use on this system. As mentioned, <Y> is what you could call highly variable across our systems.
(Netplan claims to have some support for wildcards, but I couldn't get it to work and I don't think it ever would because it is wildcarding network interface names alone. Many of our machines have more than one network interface, and obviously only one of them is on the relevant subnet (and most of the others aren't connected to anything).)
The result is that there appears to be no good way for our perfectly
sensible desire for generic routing to interact well with netplan.
In a netplan world it appears that we should be writing and re-writing an /etc/netplan/02-cslab-routes.yaml file, but that file has to
have the name of the system's current network interface burned into
it instead of being generic. We do shuffle network interfaces around
every so often (for instance to move a system from 1G to 10G-T),
which would require us remembering that there is an additional magic
step to regenerate this file.
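If we have to live inside netplan's model, the least bad option I can see is to have our tool work out which interface is currently on the server subnet and then regenerate the netplan file itself, something like this sketch (the gateway, the single route, and the interface detection one-liner are all just illustrative):

#!/bin/sh
# Hypothetical sketch: regenerate the routes file for whatever interface
# currently reaches the (made up) sandbox gateway, then re-apply netplan.
GW=192.168.200.1
IF=$(ip route get "$GW" | sed -n 's/.* dev \([^ ]*\).*/\1/p')

cat >/etc/netplan/02-cslab-routes.yaml <<EOF
network:
  version: 2
  ethernets:
    $IF:
      routes:
        - to: 10.63.0.0/16
          via: $GW
EOF
# Note that this redoes all of the networking, not just the routes.
netplan apply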
There are various additional problems here too, of course. First,
there appears to be no way to get netplan to redo just your routes
without touching anything else about interfaces, and we very much
want that. Second, on most systems we establish these additional
sandbox routes only after basic networking has come up and we've
NFS mounted our central administrative filesystem that has the data
file on it, which is far too late for normal netplan. I guess we'd
have to rewrite this file and then run 'netplan apply' to make the routes take effect.
(Ubuntu may love netplan a whole lot but I certainly hope no one else does.)
Taking advantage of the Linux kernel NFS server's group membership cache
Yesterday I wrote about looking at and flushing the NFS server's group membership cache, whose current contents are visible in /proc/net/rpc/auth.unix.gid/content. At the time I was simply thinking about how to manage it, but afterward
it struck me that since it can get reasonably large, the group
membership cache will tell you some potentially quite valuable
information. Specifically, the group membership cache will often
tell you who has used your NFS server recently.
Every time an NFS(v3) request comes in from a NFS client, the kernel
needs to know the group membership of the request's UID, which means
that the request's UID will acquire an entry in auth.unix.gid. As I've seen, this happens even for
UIDs that don't exist locally and so have no group membership; these
UIDs get entries of the form '
123 0:', instead of the regular
group count and group list. Meanwhile, UIDs that have not recently
made a request to your NFS server will have their
entry expire out after no more than 30 minutes from the last use.
If you just look at
auth.unix.gid/content in normal operation,
you're not quite guaranteed to see every recent user of your NFS
server; it could be that some active UID has just hit its 30 minute
expiry and is in the process of being refreshed. If you want to be
sure you know who's using your NFS server, you can flush the group
membership cache, wait an appropriate amount of time (less than 30
minutes), and look; since you flushed the cache, you know that no
current entry is old enough to expire on you in this way.
(As you'd expect and want for an authentication cache, entries always expire 30 minutes from when they're added, regardless of whether or not they're still being used.)
Flushing the cache is also one way to see who's using your NFS server over a short timespan. If you flush the cache, wait 30 seconds, and look at the contents, you have a list of all of the UIDs that made NFS requests in the last 30 seconds. If you think you have a user who's hammering away on your NFS server but you're not sure who, this could give you valuable clues. I suspect that we're going to wind up using this at some point.
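In concrete terms, the 'who has used NFS in the last little while' check is just this (the 30 seconds is arbitrary, and the flush incantation is from yesterday's entry):

cd /proc/net/rpc/auth.unix.gid
date -d tomorrow +%s >flush    # flush all current entries
sleep 30
cat content                    # every UID listed made an NFS request just now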
(On sufficiently modern kernels you could probably extract this information and much more through eBPF, probably using bpftrace (also). Unfortunately for us, Ubuntu 18.04 and bpftrace are not currently a good combination, at least not with only stock Ubuntu repos.)
PS: Contrary to what I assumed and wrote yesterday, there doesn't seem to be any particular
size limit for the NFS server's group request cache. Perhaps there's
some sort of memory pressure lurking somewhere, but I certainly
can't see any limit on the number of entries. This means that your
auth.unix.gid really should hold absolutely everyone
who's done NFS requests recently, especially after you flush the
cache to reset all of the entry expiry times.
How to see and flush the Linux kernel NFS server's group membership cache
One of the long standing limits with NFS v3 is that the protocol
only uses up to 16 groups. In order to
get around this and properly support people in more than 16 groups,
various Unixes have various fixes.
Linux has supported this for many years (since at least 2011) if you run rpc.mountd with the --manage-gids option. If you do use this option, well, I'll just quote the rpc.mountd manpage:

Accept requests from the kernel to map user id numbers into lists of group id numbers for use in access control. [...] If you use the -g flag, then the list of group ids received from the client will be replaced by a list of group ids determined by an appropriate lookup on the server. Note that the 'primary' group id is not affected so a newgroup command on the client will still be effective. [...]
As this mentions, the 'appropriate lookup' is performed by rpc.mountd when the kernel asks it to do one. As you'd expect, rpc.mountd uses whatever normal group membership lookup methods are configured on the NFS server in nsswitch.conf (it just calls the standard C library group lookup functions).
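(On Ubuntu and Debian, I believe the usual place to turn this on is /etc/default/nfs-kernel-server, along these lines:)

# In /etc/default/nfs-kernel-server (Debian/Ubuntu), I believe:
RPCMOUNTDOPTS="--manage-gids"
# then restart the NFS server so rpc.mountd picks it up:
# systemctl restart nfs-kernel-server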
As you might expect, the kernel maintains a cache of this group membership information so that it doesn't have to flood rpc.mountd with lookups of the same information (and slow down handling NFS requests as it waits for answers), much like it maintains a client
authentication cache. The group
membership cache is handled with the same general mechanisms as
the client authentication cache,
which are sort of covered in the nfsd(7) manpage.
The group cache's various control files are found in /proc/net/rpc/auth.unix.gid, and they work the same as the ones for the client authentication cache. There is a content file that lets you see the currently cached data, which comes in the form:

#uid cnt: gids...
915 11: 125 832 930 1010 1615 30062 30069 30151 30216 31061 31091
Occasionally you may see an entry like '123 0:'. I believe that this is generally an NFS request from a UID that wasn't known on the NFS fileserver; since it wasn't known, it has no local groups and rpc.mountd reported to the kernel that it's in no groups.
All entries have a TTL, which is unfortunately not reported in the content file; rpc.mountd uses its standard TTL of 30 minutes when adding entries and then they count down from there, with the practical effect that anything you see will expire at some unpredictable time within the next 30 minutes. You can flush all entries by writing a future time in Unix seconds to the flush file. For example:
date -d tomorrow +%s >auth.unix.gid/flush
This may be useful if you have added someone to a group, propagated the group update to your Linux NFS servers, and want them to immediately have NFS client access to files that are group-restricted to that group.
On sufficiently modern kernels, this behavior has been loosened (for the flush files of these caches in general) so that writing any number at all to flush will flush the entire cache. This change was introduced in early 2018 by Neil Brown, in this commit.
Based on its position in the history of the kernel tree, I believe
that this was first present in 4.17.0 (which unfortunately means
that it's a bit too late to be in our Ubuntu 18.04 NFS fileservers).
Presumably there is a size limit on how large the kernel's group
cache can be, but I don't know what it is. At the moment, there are
just over 550 entries in
content on our most broadly used Linux
NFS fileserver (it holds
/var/mail, so a lot of people access
things from it).
Process states from /proc/stat's running and blocked numbers
We recently updated to a version of the Prometheus host agent that can report on
how many processes are in various process states.
The host agent has also long reported node_procs_running and node_procs_blocked metrics, which ultimately come from /proc/stat. Naturally, I cross-compared the two different sets of numbers. To my surprise, in our environment they could be significantly different from each other. There turn out to be two reasons for this, one for each number.
As far as procs_running goes, it was always higher than the number of processes that Prometheus reported as being in state 'R'. This turns out to be because Prometheus was counting only processes, because it looks at what appears in /proc, while procs_running counts all threads. When you have a multi-threaded program, only the main process (or thread) shows up directly in /proc and so has its /proc/[pid]/stat inspected. Depending on how the threading in your program is constructed, this can give you all sorts of running threads but an idle main process.
(This seems to be what happens with Go programs, including the Prometheus host agent itself. On otherwise idle machines, the host agent will routinely report no processes in state 'R' but anywhere from 5 to 10 threads in procs_running. On the same machine, directly 'cat'ing /proc/stat consistently reports one process running, presumably the cat itself.)
The difference between procs_blocked and processes in state 'D' is partly this difference between processes and threads, but they are also measuring slightly different things. procs_blocked counts threads that are blocked on real disk IO (technically block IO), while the 'D' process state is really counting processes that are in an uninterruptible sleep (in the state TASK_UNINTERRUPTIBLE, with a caveat about 'I' processes from my earlier entry). Most processes in state 'D' are waiting on IO in some form, but there are other reasons processes can wind up in this state.
In particular, processes waiting on NFS IO will be in state 'D' but not be counted in procs_blocked. Processes waiting for NFS IO are part of %iowait but since they are not performing actual block IO, they are not counted in procs_blocked. You can use this to tell why processes (or threads) are in an IO wait state; if procs_blocked is high, they are waiting on block IO, and if they are just in state 'D', they are waiting for something else.
(I believe that anything that operates at the block IO layer will
show up in
procs_blocked. I suspect that this includes iSCSI,
among other things.)
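You can see all of these views side by side with something like the following, which counts process states both per process and per thread:

# The /proc/stat view (counted over threads):
grep -E '^procs_(running|blocked)' /proc/stat

# Per-process states, roughly the way the host agent counts them:
ps -eo state= | sort | uniq -c

# Per-thread states, for comparison:
ps -eLo state= | sort | uniq -c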
Since we make a lot of use of NFS and some machines can be waiting on either NFS or local IO (or sometimes both), I suspect that we're going to have uses for this knowledge. It definitely means that we want to show both metrics in our Grafana dashboards.
The modern danger of locales when you combine sort and cron
It never fails. Every time I use 'sort' in a shell script to be run from cron on a modern Linux machine, it blows up in my face because $LANG defaults to some locale that screws up traditional sort order. I need to start all scripts with:
LANG=C; export LANG
(I wound up elaborating on this, because the trap here is not obvious.)
On modern Linux machines, cron runs your cron jobs with the system's
default locale set (as does systemd when running scripts as part
of systemd units), and that locale is almost certainly one that
screws up the traditional sort order (because almost all of them
do), both for
sort and for other things. If you've set your personal shell
environment up appropriately, your scripts work in your environment
and they even work when you
sudo to root, but then you
deploy them to cron and they fail. If you're lucky, they fail loudly.
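A quick way to see the trap is to compare the two collations directly; the exact locale order varies, but the point is that it's not the byte order your script probably assumes.

# Traditional byte order: all uppercase sorts before all lowercase.
printf 'B\na\nA\nb\n' | LC_ALL=C sort

# In a UTF-8 locale you get roughly case-insensitive dictionary order
# instead, which is what surprises scripts (and Pushgateway).
printf 'B\na\nA\nb\n' | LC_ALL=en_US.UTF-8 sort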
This is especially clever and dangerous if you're adding a sort to a script that didn't previously need it, as I was today. The sort-less version of your script worked even from cron; the new
version is only a small change and it works when you run it by
hand, but now it fails when cron runs it. In my case the failure
was silent and I had to notice from side effects.
(Prometheus's Pushgateway very much
dislikes the order that
sort gives you in the en_US.UTF-8
locale. It turns out that my use of
sort here was actually
superstitious in my particular situation, although there are
other closely related ones where I've needed it.)
I should probably make immediately setting either $LANG or $LC_COLLATE a standard part of every shell script that I write
or modify. Even if it's not necessary today, it will be someday in
the future, and it's pretty clear that I won't always remember to
add it when it becomes necessary.