Wandering Thoughts

2019-04-15

How Linux starts non-system software RAID arrays during boot under systemd

In theory, you do not need to care about how your Linux software RAID arrays get assembled and started during boot because it all just works. In practice, sometimes you do, and on a modern systemd-based Linux this seems to be an unusually tangled situation. So here is what I can determine so far about how it works for software RAID arrays that are assembled and started outside of the initramfs, after your system has mounted your real root filesystem and is running from it.

(How software RAID arrays are started in the initramfs varies quite a bit between Linux distributions. There is some distribution variation even for post-initramfs booting, but these days the master version of mdadm ships canonical udev and systemd scripts, services, and so on, and I think most distributions use them almost unchanged.)

As has been the case for some time, the basic work is done through udev rules. On a typical Linux system, the main udev rule file for assembly will be called something like 64-md-raid-assembly.rules and be basically the upstream mdadm version. Udev itself identifies block devices that are potentially Linux RAID members (probably mostly based on the presence of RAID superblocks), and mdadm's udev rules then run mdadm in a special incremental assembly mode on them. To quote the manpage:

This mode is designed to be used in conjunction with a device discovery system. As devices are found in a system, they can be passed to mdadm --incremental to be conditionally added to an appropriate array.

As array components become visible to udev and cause it to run mdadm --incremental on them, mdadm progressively adds them to the array. When the final device is added, mdadm will start the array. This makes the software RAID array and its contents visible to udev and to systemd, where it can be used to satisfy dependencies for things like /etc/fstab mounts and thus trigger them.
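
If you're curious, you can run the incremental assembly step by hand (as root) on a component device to watch this happen; this is just a sketch, with /dev/sdb1 standing in for a real RAID member device:

mdadm --incremental /dev/sdb1
cat /proc/mdstat

Each such invocation either adds the device to a partially assembled array or, if it was the last missing piece, starts the array outright.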

(There are additional mdadm udev rules for setting up device names, starting mdadm monitoring, and so on. And then there's a whole collection of general udev rules and other activities to do things like read the UUIDs of filesystems from new block devices.)

However, all of this only happens if all of the array component devices show up in udev (and show up fast enough); if only some of the devices show up, the software RAID will be partially assembled by mdadm --incremental but not started because it's not complete. To deal with this situation and eventually start software RAID arrays in degraded mode, mdadm's udev rules start a systemd timer unit when enough of the array is present to let it run degraded, specifically the templated timer unit mdadm-last-resort@.timer (so for md0 the specific unit is mdadm-last-resort@md0.timer). If the RAID array still hasn't been started when the timer goes off, the timer triggers the corresponding templated service unit, mdadm-last-resort@.service, which runs 'mdadm --run' on your degraded array to start it.

(The timer unit is only started when mdadm's incremental assembly reports back that it's 'unsafe' to assemble the array, as opposed to impossible. Mdadm reports this only once there are enough component devices present to run the array in a degraded mode; how many devices are required (and what devices) depends on the specific RAID level. A RAID-1 array, for example, needs only one component device present before assembly counts as 'unsafe' rather than impossible.)

Because there's an obvious race potential here, the systemd timer and service both work hard to not act if the RAID array is actually present and already started. The timer conflicts with 'sys-devices-virtual-block-<array>.device', the systemd device unit representing the RAID array, and as an extra safety measure the service refuses to run if the RAID array appears to be present in /sys/devices. In addition, the udev rule that triggers systemd starting the timer unit will only act on software RAID devices that appear to belong to this system, either because they're listed in your mdadm.conf or because their home host is this host.

(This is the MD_FOREIGN match in the udev rules. The environment variables come from mdadm's --export option, which is used during udev incremental assembly. Mdadm's code for incremental assembly, which also generates these environment variables, is in Incremental.c. The important enough() function is in util.c.)
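
Put together, the two last-resort units look roughly like this. This is a sketch reconstructed from memory and from the behavior described above, not a verbatim copy of the upstream units; in particular the 30 second delay and the exact paths are my assumptions:

# mdadm-last-resort@.timer (sketch)
[Unit]
Conflicts=sys-devices-virtual-block-%i.device

[Timer]
OnActiveSec=30

# mdadm-last-resort@.service (sketch)
[Unit]
ConditionPathExists=!/sys/devices/virtual/block/%i/md/sync_action

[Service]
Type=oneshot
ExecStart=/sbin/mdadm --run /dev/%i

The Conflicts= line is what makes systemd cancel the timer if the array's device shows up normally, and the ConditionPathExists= line is the service's extra check that the array isn't already present.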

As far as I know, none of this is documented or official; it's just how mdadm, udev, and systemd all behave and interact at the moment. However, this behavior appears to be pretty stable and long-standing, so it will probably keep working this way in the future.

PS: As far as I can tell, all of this means that there are no real user-accessible controls for whether or not degraded software RAID arrays are started on boot. If you want to specifically block degraded starts of some RAID arrays, it might work to 'systemctl mask' either or both of the last-resort timer and service unit for the array. If you want to always start degraded arrays, well, the good news is that that's supposed to happen automatically.
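
For md0, for example, that would be something like:

systemctl mask mdadm-last-resort@md0.timer
systemctl mask mdadm-last-resort@md0.service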

SoftwareRaidAssemblySystemd written at 22:37:24

2019-04-13

WireGuard was pleasantly easy to get working behind a NAT (or several)

Normally, my home machine is directly connected to the public Internet by its DSL connection. However, every so often this DSL connection falls over, and these days my backup method of Internet connectivity is that I tether my home machine through my phone. This tethering gives me an indirect Internet connection; my desktop is on a little private network provided by my phone and then my phone NAT's my outgoing traffic. Probably my cellular provider adds another level of NAT as well, and certainly the public IP address that all of my traffic appears from can hop around between random IPs and random networks.

Most of the time this works well enough for basic web browsing and even SSH sessions, but it has two problems when I'm connecting to things at work. The first is that my public IP address can change even while I have an SSH connection present (but perhaps not active enough), which naturally breaks the SSH connection. The second is that I only have 'outside' access to our servers; I can only SSH to or otherwise access machines that are accessible from the Internet, which excludes most of the interesting and important ones.

Up until recently I've just lived with this, because the whole issue just doesn't come up often enough to get me to do anything about it. Then this morning my home DSL connection died at a fairly inopportune time, when I was scheduled to do something from home that involved both access to internal machines and things that very much shouldn't risk having my SSH sessions cut off in mid-flight (and that I couldn't feasibly do from within a screen session, because it involved multiple windows). I emailed a co-worker to have them take over, which they fortunately were able to do, and then I decided to spend a little time to see if I could get my normal WireGuard tunnel up and running over my tethered and NAT'd phone connection, instead of its usual DSL setup. If I could bring up my WireGuard tunnel, I'd have both a stable IP for SSH sessions and access to our internal systems even when I had to use my fallback Internet option.

(I won't necessarily have uninterrupted SSH sessions, because if my phone changes public IPs there will be a pause as WireGuard re-connects and so on. But at least I'll have the chance to have sessions continue afterward, instead of having them be intrinsically broken.)

Well, the good news is that my WireGuard setup basically just worked as-is when I brought it up behind however many layers of NAT'ing are going on. The actual WireGuard configuration needed no changes and I only had to do some minor tinkering with my setup for policy-based routing (and one of the issues was my own fault). It was sufficiently easy that I now feel a bit silly for not having tried it sooner.

(Things would not have been so easy if I'd decided to restrict what IP addresses could talk to WireGuard on my work machine, as I once considered doing.)

This is of course how WireGuard is supposed to work. Provided that you can pass its UDP traffic in both directions (which fortunately seems to work through the NAT'ing involved in my case), WireGuard doesn't care where your traffic comes from if it has the right keys, and your server will automatically update its idea of what (external) IP your client currently has when it gets new traffic, which makes everything work out.

(WireGuard is actually symmetric; either end will update its idea of the other end's IP when it gets appropriate traffic. It's just that under most circumstances your server end rarely changes its outgoing IP.)
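
As an illustration, the client end of such a setup in wg-quick style looks something like the following. This is a made-up sketch (the names, keys, and IP ranges are all invented, and my real setup is hand-rolled with policy-based routing rather than wg-quick); the important thing is that nothing in it has to mention the client's own external IP:

[Interface]
# the home machine's own key and tunnel-internal IP
PrivateKey = <home machine private key>
Address = 192.168.200.2/24

[Peer]
# the work end: its public key, its public endpoint, and what gets routed over the tunnel
PublicKey = <work machine public key>
Endpoint = work-gw.example.org:51820
AllowedIPs = 192.168.200.1/32, 10.0.0.0/8
# optional, but helps keep NAT state alive while the tunnel is otherwise idle
PersistentKeepalive = 25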

I knew that in theory all of this should work, but it's still nice to have it actually work out in practice, especially in a situation with at least one level of NAT going on. I'm actually a little bit amazed that it does work through all of the NAT magic going on, especially since WireGuard is just UDP packets flying back and forth instead of a TCP connection (which any NAT had better be able to handle).

On a side note, although I did everything by hand this morning, in theory I could automate all of this through dhclient hook scripts, which I'm already using to manage my resolv.conf (as covered in this entry). Of course this brings up a little issue, because if the WireGuard tunnel is up and working I actually want to use my regular resolv.conf instead of the one I switch to when I'm tethering (without WireGuard). Probably I'm going to defer all of this until the next time my DSL connection goes down.
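
For illustration only, one hypothetical shape for such a hook is below; the hook directory, the interface name, and the use of wg-quick are all assumptions (and the details of dhclient hook scripts vary between distributions):

# /etc/dhcp/dhclient-exit-hooks.d/wireguard-tether (hypothetical)
# dhclient-script sets $reason and $interface for us.
if [ "$reason" = "BOUND" ] && [ "$interface" = "enp0s20u1" ]; then
    # the tethering interface just got a lease; bring the WireGuard tunnel up over it
    wg-quick up wg0
fi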

WireGuardBehindNAT written at 00:16:23

2019-04-05

I won't be trying out ZFS's new TRIM support for a while

ZFS on Linux's development version has just landed support for using TRIM commands on SSDs in order to keep their performance up as you write more data to them and the SSD thinks it's more and more full; you can see the commit here and there's more discussion in the pull request. This is an exciting development in general, and since ZoL 0.8.0 is in the release candidate stage at the moment, this TRIM support might even make its way into a full release in the not too distant future.

Normally, you might expect me to give this a try, as I have with other new things like sequential scrubs. I've tracked the ZoL development tree on my own machines for years basically without problems, and I definitely have fairly old pools on SSDs that could likely benefit from being TRIM'd. However, I haven't so much as touched the new TRIM support and probably won't for some time.

Some projects have a relatively unstable development tree where running it can routinely or periodically destabilize your environment and expose you to bugs. ZFS on Linux is not like this; historically the code that has landed in the development version has been quite stable and problem-free. Code in the ZoL tree is almost always less 'in development' and more 'not in a release yet', partly because ZoL has solid development practices along with a significant amount of automated testing. As you can read in the 'how has this been tested?' section of the pull request, the TRIM code has been carefully exercised both through specific new tests and random invocation of TRIM through other tests.

All of this is true, but then there is the small fact that in practice, ZFS encryption is not ready yet despite having been in the ZoL development tree for some time. This isn't because ZFS encryption is bad code (or untested code); it's because ZFS encryption turns out to be complicated and to interact with lots of other things. The TRIM feature is probably less complicated than encryption, but it's not simple, there are plenty of potential corner cases, and life is further complicated by the question of how well real SSDs do or don't cope with TRIM commands being issued in the way that ZoL will issue them. Also, an errant TRIM operation inherently destroys some of your data, because that's what TRIM does.

All of this makes me feel that TRIM is inherently much more dangerous than the usual ZoL new feature, sufficiently dangerous that I don't feel confident enough to try it. This time around, I'm going to let other people do the experimentation and collect the arrows in their backs. I will probably only start using ZFS TRIM once it's in a released version and a number of people have used it for a while without explosions.

If you feel experimental despite this, I note that according to the current manpage an explicit 'zpool trim' can apparently be limited to a single disk. I would definitely suggest using it that way (on a pool with redundancy); TRIM a single disk, wait for the disk to settle and finish everything, and then scrub your pool to verify that nothing got damaged in your particular setup. This is definitely how I'm going to start with ZFS TRIM, when I eventually do.
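
With a made-up pool and disk name, I'd expect that to look something like:

zpool trim tank sda
zpool status -t tank
zpool scrub tank
zpool status tank

Here 'zpool status -t' is what shows per-device TRIM progress, so you can tell when the TRIM has actually finished before you move on to the scrub.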

(On my work machine, I'm still tracking the ZoL tree so I'm using a version with TRIM available; I'm just not enabling it. On my home machine, for various reasons, I've currently frozen my ZoL version at a point just before TRIM landed, just in case. I have to admit that stopping updating ZoL does make the usual kernel update dance an easier thing, especially since WireGuard has stopped updating so frequently.)

ZFSNoTrimForMeYet written at 21:19:02

2019-03-31

Erasing SSDs with blkdiscard (on Linux)

Our approach to upgrading servers by reinstalling them from scratch on new hardware means that we have a slow flow of previously used servers that we're going to reuse, and whose disks thus need to be cleaned up from their previous life. Some places would do this for data security reasons, but here we mostly care that lingering partitioning, software RAID superblocks, and so on don't cause us problems on new OS installs.

In the old days of HDs, we generally did this by zeroing out the old drives with dd (on a machine dedicated to the purpose, which was just left running in the corner, since this takes some time with HDs), or sometimes with a full badblocks scan. When we started using SSDs in our servers, this didn't seem like such a good idea any more. We didn't really want to use up some of the SSD write endurance just to blank them out or, worse, to write over them repeatedly with badblocks.

Our current solution to this is blkdiscard, which basically sends a TRIM command to the SSD. Conveniently, the Ubuntu 18.04 server CD image that we use as the base for our install images contains blkdiscard, so we can boot a decommissioned server from install media, wait for the Ubuntu installer to initialize and find all the disks, and then switch over to a text console to blkdiscard its SSDs. In the process of doing this a few times, I have developed a process and learned some useful lessons.

First, just to be sure and in an excess of caution, I usually explicitly zero the very start of each disk with 'dd if=/dev/zero of=/dev/sdX bs=1024k count=128; sync' (the count can vary). This at least zeroes out the MBR and partition table no matter what. Then when I use blkdiscard, I generally background it because I've found that it can take a while to finish and I may have more than one disk to blank out:

# blkdiscard /dev/sda &
# blkdiscard /dev/sdb &
# wait

I could do them one at a time, but precisely because it can take a while I usually wander away from the server to do other things. This gets everything done all at once, so I don't have to wait twice.

Finally, after I've run blkdiscard and it's finished, I usually let the server sit there running for a while. This is probably superstition, but I feel like giving the SSDs time to process the TRIM operation before either resetting them with a system reboot or powering the server off (with a 'poweroff', which is theoretically orderly). If I had a bunch of SSDs to work through this would be annoying, but usually we're only recycling one server at a time.

I don't know if SSDs commonly implement TRIM so that the TRIM'd space reads back as zeroes, but for our purposes it's sufficient if it reads back as garbage that won't be recognized as anything meaningful. I think that SSDs do at least that much, at least so far, and that we can probably count on them to keep doing it.

(SSDs might be smart enough to recognize writes of all-zero blocks and treat them as TRIM, but why take chances; if nothing else, blkdiscard is easier and faster, even with the waiting afterward.)

ErasingSSDsWithBlkdiscard written at 00:58:06

2019-03-27

A new and exciting failure mode for Linux UEFI booting

My work laptop boots with UEFI (and Secure Boot) instead of the traditional MBR BIOS booting, because that's what makes modern laptops (and modern versions of Windows) happy. Since it only has a single disk anyway, some of the drawbacks of UEFI booting don't apply to it. However, today I got to discover a new and exciting failure mode of UEFI booting (at least in the Fedora configuration), which is a damaged UEFI system partition FAT32 filesystem. Unfortunately both identifying this problem and fixing it are much harder than you would like, partly because GRUB 2 seems not to report error messages when things go wrong while loading grub.cfg.

What happened is that I powered on my laptop as normal this morning, and when I looked back at it a bit later it was sitting there with just a 'grub>' prompt. Some flailing around with picking an alternate UEFI boot entry in the Dell BIOS established that my Windows install could boot. Some poking around in the GRUB 2 shell established that GRUB could see everything that I expected, but it wasn't loading grub.cfg from the UEFI system partition, although nothing seemed to complain (including when I manually used the 'configfile' command to try to load it). Eventually I used GRUB's cat command to just dump the grub.cfg, even though trying to load it was producing no errors, and at that point GRUB printed part of the file and stopped with an error about FAT32 problems.

(I don't remember the exact message at this point.)

Recovery from this started with putting together a Fedora 29 live USB stick (a more irritating process than it should be) and booting from it. My first step was to run fsck against the UEFI system partition, where I made a mistake; when it identified various problems, including with grub.cfg and grubenv, I confidently told it to go ahead and fix things without carefully reading its proposed fixes. The FAT32 fsck promptly truncated grub.cfg to 0 size, losing all of the somewhat intact contents that I could have used to boot the system with. Fixing that required setting up a chroot environment with enough things mounted that grub2-mkconfig could run but not so many that it would hang (apparently having /sys present made it hang), rebooting with the somewhat damaged grub.cfg this produced, and then re-running grub2-mkconfig to get a new, fully proper GRUB 2 config file.
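
For the record, the eventual chroot dance was roughly along these lines (reconstructed after the fact, so the device name for the UEFI system partition and the exact set of bind mounts are approximations):

mount /dev/mapper/fedora-root /mnt
mount /dev/nvme0n1p1 /mnt/boot/efi
mount --bind /dev /mnt/dev
mount --bind /proc /mnt/proc
chroot /mnt grub2-mkconfig -o /boot/efi/EFI/fedora/grub.cfg

Note the deliberate absence of /sys, which is what seemed to make grub2-mkconfig hang.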

(To recreate a proper grubenv, the magic incantation is 'grub2-editenv grubenv create'. GRUB 2 will complain on every boot if you don't do this.)

As far as I can remember, I did nothing unusual with my laptop recently, although I did do a Fedora kernel upgrade (and reboot) and boot Windows to check for updates to it. There were no crashes, no abrupt or forced power-offs, nothing that ought to have corrupted any filesystem, much less an infrequently touched UEFI system partition. But it did get corrupted. Sadly, in one sense this doesn't surprise me, because FAT32 has a reputation as a fragile filesystem, especially if different things update it, since various OSes (and tools) each have their own FAT32 filesystem code.

(One of the strong recommendations for the FAT32 formatted memory cards used in digital cameras, for example, is that they should be formatted in the camera and only ever written to by the camera. Otherwise you risk the camera not coping with something your computer does to the filesystem or vice versa.)

Part of this issue is due to the choice to put grub.cfg into the UEFI system partition (which is not universal, see the comments on that entry). Grub.cfg is a frequently updated file, and the more often you modify a fragile filesystem the more chances you have for a problem. I don't think it's a coincidence that both grub.cfg and grubenv were damaged.

Sidebar: Why I didn't try to boot a kernel by hand from GRUB

I had two reasons for this. First, at the time I wasn't sure if my root filesystem was intact either or if I had more widespread issues than a problem on the UEFI system partition. Second, have you looked at the command lines required to boot a modern kernel? I can't possibly remember everything that goes on one or produce it from scratch, and my laptop's /proc/cmdline seems to be one of the shorter ones. Specifically, it is:

BOOT_IMAGE=/vmlinuz-5.0.3-200.fc29.x86_64 root=/dev/mapper/fedora-root ro rd.lvm.lv=fedora/root rd.lvm.lv=fedora/swap rhgb quiet

Some of that I could probably leave out, like BOOT_IMAGE, and in this situation I probably want to leave out 'rhgb quiet'. But the rest clearly matters, and I didn't have another stock Fedora system around for reference on what it should look like.

UEFIPartitionCorruption written at 22:05:15

2019-03-19

ZFS Encryption is still under development (as of March 2019)

One of the big upcoming features that a bunch of people are looking forward to in ZFS is natively encrypted filesystems. This is already in the main development tree of ZFS On Linux, will likely propagate to FreeBSD (since FreeBSD ZFS will be based on ZoL), and will make it to Illumos if the Illumos people want to pull it in. People are looking forward to native encryption so much, in fact, that some of them have started using it in ZFS On Linux already, using either the development tip or one of the 0.8.0 release candidate pre-releases (ZoL is up to 0.8.0-rc3 as of now). People either doing this or planning to do this show up on the ZoL mailing list every so often.

Unfortunately this is not a good idea (despite ZoL being in the 0.8.0 release candidate stage). Instead, you should avoid using ZFS encryption until it's part of an official release, and maybe even past that. Unlike garden-variety features and changes in ZoL, where the development tree has historically been almost completely solid and problem-free, ZFS encryption is such a significant change that people are still routinely finding bugs and needing to make serious changes, including changes to the on-disk data format that require you to back up and restore any encrypted filesystems you may have (yes, really, and see also).

(This particular change is far from the only encryption related problem that has come up. I follow the development tree and read every commit's description, and I've seen quite a lot of commits that fix various encryption related issues. It really seems that people are still frequently finding corner cases that hadn't been considered or previously encountered, despite ZFS On Linux's relatively extensive test suite. ZFS sends and receives seem to be an especial problem area, but my memory is that even ordinary use hasn't been trouble free.)

If you have a strong need for combining encryption and ZFS today, I think that you're going to need to stick to the old approaches of things like ZFS on top of a LUKS encrypted volume. Otherwise, you should wait. The most that people should be doing with ZFS encryption today is taking it for a test drive to gain experience with it; you should definitely not use it for anything you care about.
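
If all you want is that sort of test drive, something like the following is enough to start poking at it; 'scratch' here is a made-up pool that you don't keep anything important on:

zfs create -o encryption=on -o keyformat=passphrase scratch/enctest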

I know, this seems odd given that ZFS On Linux is up to 0.8.0-rc3, but it is what it is. I am a little bit surprised that ZoL has been doing a -rcN series with encryption so apparently unstable, but I'm sure they have their reasons. It's even possible that ZFS On Linux 0.8.0 will not include encryption as a production-ready feature; at this point the developers probably won't disable the code outright, but they might fence it off behind warnings.

(It's possible that encryption has turned out to be more tangled and troublesome than anyone initially expected when the feature first landed, and that it's only through the early enthusiastic people jumping on it that all of these problems have been found.)

PS: I expect that FreeBSD people won't have to worry about this unless they're tracking FreeBSD-CURRENT or FreeBSD-STABLE, since I doubt that FreeBSD will enable ZFS encryption in a FreeBSD release until it's established a solid track record for stability in ZoL.

ZFSEncryptionNotReady written at 23:05:24

2019-03-07

Our problem with Netplan and routes on Ubuntu 18.04

I tweeted:

Today I've wound up getting back to our netplan dysfunction, so I think it's time to write a blog entry. Spoiler: highly specific network device names and configurations that can only be attached to or specified for a named network device interact very badly at scale.

We have a bunch of internal 'sandbox' networks, which connect to our main server subnet through various routing firewalls. Obviously the core router on our server subnet knows how to route to every sandbox; however, we also like the actual servers themselves to have specific sandbox subnet routes to the appropriate routing firewall. For some servers this is merely vaguely nice; for high-traffic servers (such as some NFS fileservers) it may be pretty important to avoid a pointless traffic bottleneck on the router. So many years ago we built a too-smart system to automatically generate the appropriate routes for any given host from a central set of information about what subnets were behind which gateway, and we ran it on boot to set things up. The result is a bunch of routing commands:

ip route add 10.63.0.0/16 via 128.100.3.5
ip route add 10.70.0.0/16 via 128.100.3.4
ip route add 172.31.0.0/16 via 128.100.3.6
[...]

This system is completely indifferent to what the local system's network interface is called, which is good because in our environment there is a huge assortment of interface names. We have eno1, enp3s0f0, enp4s0f0, enp4s0, enp11s0f0, enp7s0, enp1s0f0, and on and on.

All of this worked great for the better part of a decade, until Ubuntu 18.04 came along with netplan. Netplan has two things that combine to be quietly nearly fatal to what we want to do. First, the netplan setup on Ubuntu 18.04 will wipe out any 'foreign' routes it finds if and when it is re-run, which happens every so often during things like package upgrades. Second, the 18.04 version of netplan has no way to specify routes that are attached to a subnet instead of a specific named interface. If you want netplan to add extra routes to an interface, you cannot say 'associate the routes with whatever interface is on subnet <X>'; instead, you must associate the routes with an interface called <Y>, for whatever specific <Y> is in use on this system. As mentioned, <Y> is what you could call highly variable across our systems.

(Netplan claims to have some support for wildcards, but I couldn't get it to work and I don't think it ever would because it is wildcarding network interface names alone. Many of our machines have more than one network interface, and obviously only one of them is on the relevant subnet (and most of the others aren't connected to anything).)

The result is that there appears to be no good way for our perfectly sensible desire for generic routing to interact well with netplan. In a netplan world it appears that we should be writing and re-writing a /etc/netplan/02-cslab-routes.yaml file, but that file has to have the name of the system's current network interface burned into it instead of being generic. We do shuffle network interfaces around every so often (for instance to move a system from 1G to 10G-T), which would require us to remember that there is an additional magic step to regenerate this file.
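
Concretely, such a file would have to look something like this sketch, where enp3s0f0 has to be whatever this particular machine's relevant interface is actually called:

network:
  version: 2
  ethernets:
    enp3s0f0:
      routes:
        - to: 10.63.0.0/16
          via: 128.100.3.5
        - to: 10.70.0.0/16
          via: 128.100.3.4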

There are various additional problems here too, of course. First, there appears to be no way to get netplan to redo just your routes without touching anything else about interfaces, and we very much want that. Second, on most systems we establish these additional sandbox routes only after basic networking has come up and we've NFS mounted our central administrative filesystem that has the data file on it, which is far too late for normal netplan. I guess we'd have to rewrite this file and then run 'netplan apply'.

(Ubuntu may love netplan a whole lot but I certainly hope no one else does.)

NetplanRoutesProblem written at 01:22:52

2019-02-28

Taking advantage of the Linux kernel NFS server's group membership cache

Yesterday I wrote about looking at and flushing the NFS server's group membership cache, whose current contents are visible in /proc/net/rpc/auth.unix.gid/content. At the time I was simply thinking about how to manage it, but afterward it struck me that since it can get reasonably large, the group membership cache will tell you some potentially quite valuable information. Specifically, the group membership cache will often tell you who has used your NFS server recently.

Every time an NFS(v3) request comes in from an NFS client, the kernel needs to know the group membership of the request's UID, which means that the request's UID will acquire an entry in auth.unix.gid. As I've seen, this happens even for UIDs that don't exist locally and so have no group membership; these UIDs get entries of the form '123 0:', instead of the regular group count and group list. Meanwhile, UIDs that have not recently made a request to your NFS server will have their auth.unix.gid entry expire out after no more than 30 minutes from the last use.

If you just look at auth.unix.gid/content in normal operation, you're not quite guaranteed to see every recent user of your NFS server; it could be that some active UID has just hit its 30-minute expiry and is in the process of being refreshed. If you want to be sure you know who's using your NFS server, you can flush the group membership cache, wait an appropriate amount of time (less than 30 minutes), and look; since you flushed the cache, you know that no current entry is old enough to expire on you in this way.

(As you'd expect and want for an authentication cache, entries always expire 30 minutes from when they're added, regardless of whether or not they're still being used.)

Flushing the cache is also one way to see who's using your NFS server over a short timespan. If you flush the cache, wait 30 seconds, and look at the contents, you have a list of all of the UIDs that made NFS requests in the last 30 seconds. If you think you have a user who's hammering away on your NFS server but you're not sure who, this could give you valuable clues. I suspect that we're going to wind up using this at some point.
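
In concrete terms, that's something like this (run as root on the NFS server):

cd /proc/net/rpc/auth.unix.gid
date -d tomorrow +%s >flush
sleep 30
cat content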

(On sufficiently modern kernels you could probably extract this information and much more through eBPF, probably using bpftrace (also). Unfortunately for us, Ubuntu 18.04 and bpftrace are not currently a good combination, at least not with only stock Ubuntu repos.)

PS: Contrary to what I assumed and wrote yesterday, there doesn't seem to be any particular size limit for the NFS server's group request cache. Perhaps there's some sort of memory pressure lurking somewhere, but I certainly can't see any limit on the number of entries. This means that your server's auth.unix.gid really should hold absolutely everyone who's done NFS requests recently, especially after you flush the cache to reset all of the entry expiry times.

NFSServerUsingGroupCache written at 23:32:17

2019-02-27

How to see and flush the Linux kernel NFS server's group membership cache

One of the long-standing limits of NFS v3 is that the protocol only carries up to 16 groups in each request. In order to get around this and properly support people in more than 16 groups, various Unixes have various fixes. Linux has supported this for many years (since at least 2011) if you run rpc.mountd with -g, aka --manage-gids. If you do use this option, well, I'll just quote the rpc.mountd manpage:

Accept requests from the kernel to map user id numbers into lists of group id numbers for use in access control. [...] If you use the -g flag, then the list of group ids received from the client will be replaced by a list of group ids determined by an appropriate lookup on the server. Note that the 'primary' group id is not affected so a newgroup command on the client will still be effective. [...]

As this mentions, the 'appropriate lookup' is performed by rpc.mountd when the kernel asks it to do one. As you'd expect, rpc.mountd uses whatever normal group membership lookup methods are configured on the NFS server in nsswitch.conf (it just calls getpwuid(3) and getgrouplist(3) in mountd/cache.c).
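
(On Debian and Ubuntu, for example, I believe the place to set this is normally the mountd options in /etc/default/nfs-kernel-server, along the lines of the following; the exact file and variable name are distribution details, so check your own setup.)

RPCMOUNTDOPTS="--manage-gids"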

As you might expect, the kernel maintains a cache of this group membership information so that it doesn't have to flood rpc.mountd with lookups of the same information (and slow down handling NFS requests as it waits for answers), much like it maintains a client authentication cache. The group membership cache is handled with the same general mechanisms as the client authentication cache, which are sort of covered in the nfsd(7) manpage.

The group cache's various control files are found in /proc/net/rpc/auth.unix.gid, and they work the same as auth.unix.ip. There is a content file that lets you see the currently cached data, which comes in the form:

#uid cnt: gids...
915 11: 125 832 930 1010 1615 30062 30069 30151 30216 31061 31091

Occasionally you may see an entry like '123 0:'. I believe that this is generally an NFS request from a UID that wasn't known on the NFS fileserver; since it wasn't known, it has no local groups and so rpc.mountd reported to the kernel that it's in no groups.

All entries have a TTL, which is unfortunately not reported in the content pseudo-file; rpc.mountd uses its standard TTL of 30 minutes when adding entries and then they count down from there, with the practical effect that anything you see will expire at some unpredictable time within the next 30 minutes. You can flush all entries by writing a future time in Unix seconds to the flush file. For example:

date -d tomorrow +%s >auth.unix.gid/flush

This may be useful if you have added someone to a group, propagated the group update to your Linux NFS servers, and want them to immediately have NFS client access to files that are group-restricted to that group.

On sufficiently modern kernels, this behavior has been loosened (for all flush files of caches) so that writing any number at all to flush will flush the entire cache. This change was introduced in early 2018 by Neil Brown, in this commit. Based on its position in the history of the kernel tree, I believe that this was first present in 4.17.0 (which unfortunately means that it's a bit too late to be in our Ubuntu 18.04 NFS fileservers).

Presumably there is a size limit on how large the kernel's group cache can be, but I don't know what it is. At the moment, there are just over 550 entries in content on our most broadly used Linux NFS fileserver (it holds /var/mail, so a lot of people access things from it).

NFSFlushingServerGroupCache written at 23:02:33

2019-02-24

Process states from /proc/[pid]/stat versus /proc/stat's running and blocked numbers

We recently updated to a version of the Prometheus host agent that can report on how many processes are in various process states. The host agent has also long reported node_procs_running and node_procs_blocked metrics, which ultimately come from /proc/stat's procs_running and procs_blocked fields. Naturally, I cross-compared the two different sets of numbers. To my surprise, in our environment they could be significantly different from each other. There turn out to be two reasons for this, one for each /proc/stat field.

As far as procs_running goes, it was always higher than the number of processes that Prometheus reported as being in state 'R'. This turns out to be because Prometheus was counting only processes (it looks at what appears in /proc), while procs_running counts all threads. When you have a multi-threaded program, only the main process (or thread) shows up directly in /proc and so has its /proc/[pid]/stat inspected. Depending on how the threading in your program is constructed, this can give you all sorts of running threads but an idle main process.

(This seems to be what happens with Go programs, including the Prometheus host agent itself. On otherwise idle machines, the host agent will routinely report no processes in state R but anywhere from 5 to 10 threads in procs_running. On the same machine, directly 'cat'ing /proc/stat consistently reports one process running, presumably the cat itself.)
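
You can see the process versus thread difference for yourself with a quick sketch like this (note that the grep itself will show up as one running process):

grep -s '^State:.*R (running)' /proc/[0-9]*/status | wc -l
grep procs_running /proc/stat

The first number only looks at top-level /proc entries, that is processes, while the second line reports every runnable thread in the system.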

The difference between procs_blocked and processes in state 'D' is partly this difference between processes and threads, but they are also measuring slightly different things. procs_blocked counts threads that are blocked on real disk IO (technically block IO), while the 'D' process state is really counting processes that are in an uninterruptible sleep (in the state TASK_UNINTERRUPTIBLE, with a caveat about 'I' processes from my earlier entry). Most processes in state 'D' are waiting on IO in some form, but there are other reasons processes can wind up in this state.

In particular, processes waiting on NFS IO will be in state 'D' but not be counted in procs_blocked. Processes waiting for NFS IO are part of %iowait but since they are not performing actual block IO, they are not counted in procs_blocked. You can use this to tell why processes (or threads) are in an IO wait state; if procs_blocked is high, they are waiting on block IO, and if they are just in state 'D', they are waiting for something else.

(I believe that anything that operates at the block IO layer will show up in procs_blocked. I suspect that this includes iSCSI, among other things.)

Since we make a lot of use of NFS and some machines can be waiting on either NFS or local IO (or sometimes both), I suspect that we're going to have uses for this knowledge. It definitely means that we want to show both metrics in our Grafana dashboards.

ProcessStatesAndProcStat written at 23:37:57

The modern danger of locales when you combine sort and cron

I tweeted:

It never fails. Every time I use 'sort' in a shell script to be run from cron on a modern Linux machine, it blows up in my face because $LANG defaults to some locale that screws up traditional sort order. I need to start all scripts with:

LANG=C; export LANG

(I wound up elaborating on this, because the trap here is not obvious.)

On modern Linux machines, cron runs your cron jobs with the system's default locale set (as does systemd when running scripts as part of systemd units), and that locale is almost certainly one that screws up the traditional sort order (because almost all of them do), both for sort and for other things. If you've set your personal shell environment up appropriately, your scripts work in your environment and they even work when you su or sudo to root, but then you deploy them to cron and they fail. If you're lucky, they fail loudly.
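
The difference is easy to demonstrate, assuming en_US.UTF-8 is the locale in question:

printf 'B\na\n' | LC_ALL=en_US.UTF-8 sort
printf 'B\na\n' | LC_ALL=C sort

The first prints 'a' before 'B', because the locale's collation mostly ignores case; the second prints 'B' before 'a', the traditional byte order.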

This is especially clever and dangerous if you're adding a sort to a script that didn't previously need it, as I was today. The pre-sort version of your script worked even from cron; the new version is only a small change and it works when you run it by hand, but now it fails when cron runs it. In my case the failure was silent and I had to notice it from side effects.

(Prometheus's Pushgateway very much dislikes the order that sort gives you in the en_US.UTF-8 locale. It turns out that my use of sort here was actually superstitious in my particular situation, although there are other closely related ones where I've needed it.)

I should probably make immediately setting either $LANG or $LC_COLLATE a standard part of every shell script that I write or modify. Even if it's not necessary today, it will be someday in the future, and it's pretty clear that I won't always remember to add it when it becomes necessary.

SortCronLocaleDanger written at 01:43:06
