Wandering Thoughts archives

2019-04-24

How we're making updated versions of a file rapidly visible on our Linux NFS clients

Part of our automounter replacement is a file with a master list of all NFS mounts that client machines should have, which we hold in our central administrative filesystem that all clients NFS mount. When we migrate filesystems from our old fileservers to our new fileservers, one of the steps is to regenerate this list with the old filesystem mount not present, then run a mount update on all of the NFS clients to actually unmount the filesystem from the old fileserver. For a long time, we almost always had to wait a while before all of the NFS clients would reliably see the new version of the NFS mounts file, which had the unfortunate effect of slowing down filesystem migrations.

(The NFS mount list is regenerated on the NFS fileserver for our central administrative filesystem, so the update is definitely known to the server once it's finished. Any propagation delays are purely on the side of the NFS clients, who are holding on to some sort of cached information.)

In the past, I've made a couple of attempts to find a way to reliably get the NFS clients to see that there was a new version of the file by doing things like flock(1)'ing it before reading it. These all failed. Recently, one of my co-workers discovered a reliable way of making this work, which was to regenerate the NFS mount list twice instead of once. You didn't have to delay between the two regenerations; running them back to back was fine. At first this struck me as pretty mysterious, but then I came up with a theory for what's probably going on and why this makes sense.

You see, we update this file in an NFS-safe way that leaves the old version of the file around under a different name so that programs on NFS clients that are reading it at the time don't have it yanked out from underneath them. As I understand it, Linux NFS clients cache the mapping from file names to NFS filehandles for some amount of time, to reduce various sorts of NFS lookup traffic (now that I look, there is a discussion pointing to this in the nfs(5) manpage). When we do one regeneration of our nfs-mounts file, the cached filehandle that clients have for that name mapping is still valid (and the file's attributes are basically unchanged); it's just that it's for the file that is now nfs-mounts.bak instead of the new file that is now nfs-mounts. Client kernels are apparently still perfectly happy to use it, and so they read and use the old NFS mount information. However, when we regenerate the file twice, that old file is removed outright and the cached filehandle is no longer valid. My theory and assumption is that modern Linux kernels detect this situation and trigger some kind of revalidation that winds up with them looking up and using the correct nfs-mounts file (instead of, say, failing with an error).
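
To make that concrete, here is a minimal Python sketch of the general pattern involved; the file names and the regenerate() helper are illustrative, not our actual tooling:

    import os
    import tempfile

    def regenerate(path, new_contents):
        # Write the new version to a temporary file in the same directory,
        # keep the current version around as <path>.bak, and then rename
        # the new version into place.
        dirname = os.path.dirname(path) or "."
        fd, tmpname = tempfile.mkstemp(dir=dirname)
        with os.fdopen(fd, "w") as f:
            f.write(new_contents)
        if os.path.exists(path):
            # Renaming over an existing <path>.bak unlinks the old .bak file.
            os.rename(path, path + ".bak")
        os.rename(tmpname, path)

    # First regeneration: the original nfs-mounts becomes nfs-mounts.bak, so
    # a client's cached name-to-filehandle mapping still points at a live
    # file (just the old one). Second regeneration: the original file gets
    # unlinked when the first run's output is renamed over nfs-mounts.bak,
    # which is what finally makes the cached filehandle stale.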

(It feels ironic that apparently the way to make this work for us here in our NFS environment is to effectively update the file in an NFS-unsafe way for once.)

PS: All of our NFS clients here are using either Ubuntu 16.04 or 18.04, using their stock (non-HWE) kernels, so various versions of what Ubuntu calls '4.4.0' (16.04) and '4.15.0' (18.04). Your mileage may vary on different kernels and in different Linux environments.

NFSClientFileVisible written at 23:20:58

2019-04-15

How Linux starts non-system software RAID arrays during boot under systemd

In theory, you do not need to care about how your Linux software RAID arrays get assembled and started during boot because it all just works. In practice, sometimes you do, and on a modern systemd-based Linux this seems to be an unusually tangled situation. So here is what I can determine so far about how it works for software RAID arrays that are assembled and started outside of the initramfs, after your system has mounted your real root filesystem and is running from it.

(How things work for starting software RAID arrays in the initramfs is quite varied between Linux distributions. There is some distribution variation even for post-initramfs booting, but these days the master version of mdadm ships canonical udev and systemd scripts, services, and so on and I think most distributions use them almost unchanged.)

As has been the case for some time, the basic work is done through udev rules. On a typical Linux system, the main udev rule file for assembly will be called something like 64-md-raid-assembly.rules and be basically the upstream mdadm version. Udev itself identifies block devices that are potentially Linux RAID members (probably mostly based on the presence of RAID superblocks), and mdadm's udev rules then run mdadm in a special incremental assembly mode on them. To quote the manpage:

This mode is designed to be used in conjunction with a device discovery system. As devices are found in a system, they can be passed to mdadm --incremental to be conditionally added to an appropriate array.

As array components become visible to udev and cause it to run mdadm --incremental on them, mdadm progressively adds them to the array. When the final device is added, mdadm will start the array. This makes the software RAID array and its contents visible to udev and to systemd, where it will be used to satisfy dependencies for things like /etc/fstab mounts and thus trigger them happening.
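
As a way of picturing the flow (and not how mdadm actually implements it), here is a toy Python model of incremental assembly, where component devices trickle in one at a time and the array only starts once the last expected one shows up; the array and device names are made up:

    # Toy model only: real mdadm keeps this state in the kernel and its own
    # map file, and is invoked by udev as 'mdadm --incremental <device>'.
    expected = {"md0": {"sda1", "sdb1"}}   # hypothetical two-disk mirror
    seen = {}                              # array -> components seen so far

    def device_appeared(array, dev):
        components = seen.setdefault(array, set())
        components.add(dev)
        if components == expected[array]:
            # The final component has arrived; the array is started, which
            # makes it visible to udev and systemd (and thus fstab mounts).
            print(array, "complete, starting array")
        else:
            print(array, "partially assembled, waiting for more components")

    device_appeared("md0", "sda1")   # partial, not started
    device_appeared("md0", "sdb1")   # complete, array starts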

(There are additional mdadm udev rules for setting up device names, starting mdadm monitoring, and so on. And then there's a whole collection of general udev rules and other activities to do things like read the UUIDs of filesystems from new block devices.)

However, all of this only happens if all of the array component devices show up in udev (and show up fast enough); if only some of the devices show up, the software RAID will be partially assembled by mdadm --incremental but not started because it's not complete. To deal with this situation and eventually start software RAID arrays in degraded mode, mdadm's udev rules start a systemd timer unit when enough of the array is present to let it run degraded, specifically the templated timer unit mdadm-last-resort@.timer (so for md0 the specific unit is mdadm-last-resort@md0.timer). If the RAID array isn't assembled and the timer goes off, it triggers the corresponding templated systemd service unit, using mdadm-last-resort@.service, which runs 'mdadm --run' on your degraded array to start it.

(The timer unit is only started when mdadm's incremental assembly reports back that it's 'unsafe' to assemble the array, as opposed to impossible. Mdadm reports this only once there are enough component devices present to run the array in a degraded mode; how many devices are required (and what devices) depends on the specific RAID level. RAID-1 arrays, for example, only require one component device to be 'unsafe'.)

Because there's an obvious race potential here, the systemd timer and service both work hard to not act if the RAID array is actually present and already started. The timer conflicts with 'sys-devices-virtual-block-<array>.device', the systemd device unit representing the RAID array, and as an extra safety measure the service refuses to run if the RAID array appears to be present in /sys/devices. In addition, the udev rule that triggers systemd starting the timer unit will only act on software RAID devices that appear to belong to this system, either because they're listed in your mdadm.conf or because their home host is this host.

(This is the MD_FOREIGN match in the udev rules. The environment variables come from mdadm's --export option, which is used during udev incremental assembly. Mdadm's code for incremental assembly, which also generates these environment variables, is in Incremental.c. The important enough() function is in util.c.)
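
Put together, the last-resort mechanism behaves roughly like the following Python sketch. This is an approximation of what the timer and service units do, not their actual implementation; the 30-second delay and the exact sysfs path checked are my assumptions about the stock units:

    import os
    import subprocess
    import time

    def last_resort_start(array="md0", delay=30):
        # Wait out the timer period; if the array has started on its own in
        # the meantime, do nothing. Otherwise force the degraded array to
        # run. In real life the timer is also cancelled outright (via its
        # Conflicts=) if the array's device unit shows up.
        time.sleep(delay)
        # The real service refuses to run if the array already appears in
        # /sys/devices; the specific path used here is an assumption.
        if os.path.exists("/sys/devices/virtual/block/%s/md/sync_action" % array):
            return
        subprocess.run(["mdadm", "--run", "/dev/" + array], check=False)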

As far as I know, none of this is documented or official; it's just how mdadm, udev, and systemd all behave and interact at the moment. However, this appears to be pretty stable and long-standing, so it's probably going to keep being the case in the future.

PS: As far as I can tell, all of this means that there are no real user-accessible controls for whether or not degraded software RAID arrays are started on boot. If you want to specifically block degraded starts of some RAID arrays, it might work to 'systemctl mask' either or both of the last-resort timer and service unit for the array. If you want to always start degraded arrays, well, the good news is that that's supposed to happen automatically.

SoftwareRaidAssemblySystemd written at 22:37:24

2019-04-13

WireGuard was pleasantly easy to get working behind a NAT (or several)

Normally, my home machine is directly connected to the public Internet by its DSL connection. However, every so often this DSL connection falls over, and these days my backup method of Internet connectivity is that I tether my home machine through my phone. This tethering gives me an indirect Internet connection; my desktop is on a little private network provided by my phone and then my phone NAT's my outgoing traffic. Probably my cellular provider adds another level of NAT as well, and certainly the public IP address that all of my traffic appears from can hop around between random IPs and random networks.

Most of the time this works well enough for basic web browsing and even SSH sessions, but it has two problems when I'm connecting to things at work. The first is that my public IP address can change even while I have a SSH connection present (but perhaps not active enough), which naturally breaks the SSH connection. The second is that I only have 'outside' access to our servers; I can only SSH to or otherwise access machines that are accessible from the Internet, which excludes most of the interesting and important ones.

Up until recently I've just lived with this, because the whole issue just doesn't come up often enough to get me to do anything about it. Then this morning my home DSL connection died at a fairly inopportune time, when I was scheduled to do something from home that involved both access to internal machines and things that very much shouldn't risk having my SSH sessions cut off in mid-flight (and that I couldn't feasibly do from within a screen session, because it involved multiple windows). I emailed a co-worker to have them take over, which they fortunately were able to do, and then I decided to spend a little time to see if I could get my normal WireGuard tunnel up and running over my tethered and NAT'd phone connection, instead of its usual DSL setup. If I could bring up my WireGuard tunnel, I'd have both a stable IP for SSH sessions and access to our internal systems even when I had to use my fallback Internet option.

(I won't necessarily have uninterrupted SSH sessions, because if my phone changes public IPs there will be a pause while WireGuard re-connects and so on. But at least I'll have the chance to have sessions continue afterward, instead of being intrinsically broken.)

Well, the good news is that my WireGuard setup basically just worked as-is when I brought it up behind however many layers of NAT'ing are going on. The actual WireGuard configuration needed no changes and I only had to do some minor tinkering with my setup for policy-based routing (and one of the issues was my own fault). It was sufficiently easy that now I feel a bit silly for having not tried it before now.

(Things would not have been so easy if I'd decided to restrict what IP addresses could talk to WireGuard on my work machine, as I once considered doing.)

This is of course how WireGuard is supposed to work. Provided that you can pass its UDP traffic in both directions (which fortunately seems to work through the NAT'ing involved in my case), WireGuard doesn't care where your traffic comes from if it has the right keys, and your server will automatically update its idea of what (external) IP your client has right now when it gets new traffic, which makes everything work out.

(WireGuard is actually symmetric; either end will update its idea of the other end's IP when it gets appropriate traffic. It's just that under most circumstances your server end rarely changes its outgoing IP.)
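
One way to picture this roaming is as a tiny bit of per-peer state that authenticated traffic keeps refreshed. The following Python fragment is a cartoon of the idea, not WireGuard's actual implementation; the key names and addresses are made-up examples:

    # Each peer is identified by its public key; the remembered endpoint is
    # simply replaced whenever an authenticated packet arrives from a new
    # source address.
    endpoints = {}   # peer public key -> (ip, port) last seen

    def packet_from(peer_key, src_addr, authenticated):
        if not authenticated:
            return   # traffic that doesn't verify against the key is ignored
        if endpoints.get(peer_key) != src_addr:
            # The peer moved (new NAT mapping, new public IP, ...); replies
            # now go to the new address, so the tunnel keeps working.
            endpoints[peer_key] = src_addr

    packet_from("client-pubkey", ("203.0.113.7", 51820), authenticated=True)
    packet_from("client-pubkey", ("198.51.100.9", 40000), authenticated=True)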

I knew that in theory all of this should work, but it's still nice to have it actually work out in practice, especially in a situation with at least one level of NAT going on. I'm actually a little bit amazed that it does work through all of the NAT magic going on, especially since WireGuard is just UDP packets flying back and forth instead of a TCP connection (which any NAT had better be able to handle).

On a side note, although I did everything by hand this morning, in theory I could automate all of this through dhclient hook scripts, which I'm already using to manage my resolv.conf (as covered in this entry). Of course this brings up a little issue, because if the WireGuard tunnel is up and working I actually want to use my regular resolv.conf instead of the one I switch to when I'm tethering (without WireGuard). Probably I'm going to defer all of this until the next time my DSL connection goes down.

WireGuardBehindNAT written at 00:16:23

2019-04-05

I won't be trying out ZFS's new TRIM support for a while

ZFS on Linux's development version has just landed support for using TRIM commands on SSDs in order to keep their performance up as you write more data to them and the SSD thinks it's more and more full; you can see the commit here and there's more discussion in the pull request. This is an exciting development in general, and since ZoL 0.8.0 is in the release candidate stage at the moment, this TRIM support might even make its way into a full release in the not too distant future.

Normally, you might expect me to give this a try, as I have with other new things like sequential scrubs. I've tracked the ZoL development tree on my own machines for years basically without problems, and I definitely have fairly old pools on SSDs that could likely benefit from being TRIM'd. However, I haven't so much as touched the new TRIM support and probably won't for some time.

Some projects have a relatively unstable development tree where running it can routinely or periodically destabilize your environment and expose you to bugs. ZFS on Linux is not like this; historically the code that has landed in the development version has been quite stable and problem free. Code in the ZoL tree is almost always less 'in development' and more 'not in a release yet', partly because ZoL has solid development practices along with significant amounts of automated tests. As you can read in the 'how has this been tested?' section of the pull request, the TRIM code has been carefully exercised both through specific new tests and random invocation of TRIM through other tests.

All of this is true, but then there is the small fact that in practice, ZFS encryption is not ready yet despite having been in the ZoL development tree for some time. This isn't because ZFS encryption is bad code (or untested code); it's because ZFS encryption turns out to be complicated and to interact with lots of other things. The TRIM feature is probably less complicated than encryption, but it's not simple, there are plenty of potential corner cases, and life is complicated by potential issues in how real SSDs do or don't cope well with TRIM commands being issued in the way that ZoL will. Also, an errant TRIM operation inherently destroys some of your data, because that's what TRIM does.

All of this makes me feel that TRIM is inherently much more dangerous than the usual ZoL new feature, sufficiently dangerous that I don't feel confident enough to try it. This time around, I'm going to let other people do the experimentation and collect the arrows in their backs. I will probably only start using ZFS TRIM once it's in a released version and a number of people have used it for a while without explosions.

If you feel experimental despite this, I note that according to the current manpage an explicit 'zpool trim' can apparently be limited to a single disk. I would definitely suggest using it that way (on a pool with redundancy); TRIM a single disk, wait for the disk to settle and finish everything, and then scrub your pool to verify that nothing got damaged in your particular setup. This is definitely how I'm going to start with ZFS TRIM, when I eventually do.
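
When I do get around to it, the procedure will probably look something like this Python sketch. The pool and disk names are made up, and using 'zpool status -t' to watch per-vdev TRIM progress is my reading of the new manpage rather than something I've actually done:

    import subprocess

    POOL = "tank"            # hypothetical pool name
    DISK = "ata-SSD_SERIAL"  # hypothetical single disk in that pool

    # TRIM just the one disk, leaving the rest of the (redundant) pool alone.
    subprocess.run(["zpool", "trim", POOL, DISK], check=True)

    # ... wait for the TRIM to finish and the disk to settle; my understanding
    # is that 'zpool status -t' reports per-vdev TRIM progress.
    subprocess.run(["zpool", "status", "-t", POOL], check=True)

    # Then scrub the whole pool to verify that nothing got damaged.
    subprocess.run(["zpool", "scrub", POOL], check=True)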

(On my work machine, I'm still tracking the ZoL tree so I'm using a version with TRIM available; I'm just not enabling it. On my home machine, for various reasons, I've currently frozen my ZoL version at a point just before TRIM landed, just in case. I have to admit that stopping updating ZoL does make the usual kernel update dance an easier thing, especially since WireGuard has stopped updating so frequently.)

ZFSNoTrimForMeYet written at 21:19:02

