2019-04-24
How we're making updated versions of a file rapidly visible on our Linux NFS clients
Part of our automounter replacement is a file with a master list of all NFS mounts that client machines should have, which we hold in our central administrative filesystem that all clients NFS mount. When we migrate filesystems from our old fileservers to our new fileservers, one of the steps is to regenerate this list with the old filesystem mount not present, then run a mount update on all of the NFS clients to actually unmount the filesystem from the old fileserver. For a long time, we almost always had to wait a bit of time before all of the NFS clients would reliably see the new version of the NFS mounts file, which had the unfortunate effect of slowing down filesystem migrations.
(The NFS mount list is regenerated on the NFS fileserver for our central administrative filesystem, so the update is definitely known to the server once it's finished. Any propagation delays are purely on the side of the NFS clients, who are holding on to some sort of cached information.)
In the past, I've made a couple of attempts to find a way to reliably get the NFS clients to see that there was a new version of the file, by doing things like flock(1)'ing it before reading it. These all failed. Recently, one of my co-workers discovered a reliable way of making this work: regenerate the NFS mount list twice instead of once. You don't have to delay between the two regenerations; running them back to back is fine. At first this struck me as pretty mysterious, but then I came up with a theory for what's probably going on and why it makes sense.
You see, we update this file in an NFS-safe way that leaves the old version of the file around under a different name, so that programs on NFS clients that are reading it at the time don't have it yanked out from underneath them.
As I understand it, Linux NFS clients cache the mapping from filesystem names to NFS filehandles for some amount of time, to reduce various sorts of NFS lookup traffic (now that I look, there is a discussion pointing to this in the nfs(5) manpage). When we do one regeneration of our nfs-mounts file, the cached filehandle that clients have for that name mapping is still valid (and the file's attributes are basically unchanged); it's just that it's for the file that is now nfs-mounts.bak instead of the new file that is now nfs-mounts. Client kernels are apparently still perfectly happy to use it, and so they read and use the old NFS mount information. However, when we regenerate the file twice, this file is removed outright and the cached filehandle is no longer valid. My theory and assumption is that modern Linux kernels detect this situation and trigger some kind of revalidation that winds up with them looking up and using the correct nfs-mounts file (instead of, say, failing with an error).
(It feels ironic that apparently the way to make this work for us here in our NFS environment is to effectively update the file in an NFS-unsafe way for once.)
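The rename dance and why a second regeneration invalidates the cached filehandle can be sketched with plain local files. This is just an illustration of the pattern; the real regeneration script and file locations are different, and on the real system these files live on an NFS filesystem:

```shell
#!/bin/sh
# Sketch of the NFS-safe update pattern described above, using plain
# local files. Each regeneration keeps the previous version around as
# nfs-mounts.bak so NFS clients mid-read don't lose it.
set -e
dir=$(mktemp -d)
cd "$dir"

regenerate() {
    echo "$1" >nfs-mounts.new
    # Keep the old version under another name, then rename the new
    # file into place atomically.
    [ -f nfs-mounts ] && mv -f nfs-mounts nfs-mounts.bak
    mv nfs-mounts.new nfs-mounts
}

regenerate "version 1"
regenerate "version 2"   # v1 survives as nfs-mounts.bak; a client's
                         # cached filehandle for 'nfs-mounts' can
                         # still reach it.
regenerate "version 3"   # v1's inode is now unlinked entirely, so a
                         # stale cached filehandle forces clients to
                         # revalidate and look up the current file.
cat nfs-mounts           # prints: version 3
cat nfs-mounts.bak       # prints: version 2
```

The key point is that after a single regeneration the old inode still exists (as nfs-mounts.bak), so an NFS client's cached name-to-filehandle mapping keeps working; after two regenerations that inode is gone.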
PS: All of our NFS clients here are using either Ubuntu 16.04 or 18.04, using their stock (non-HWE) kernels, so various versions of what Ubuntu calls '4.4.0' (16.04) and '4.15.0' (18.04). Your mileage may vary on different kernels and in different Linux environments.
2019-04-15
How Linux starts non-system software RAID arrays during boot under systemd
In theory, you do not need to care about how your Linux software RAID arrays get assembled and started during boot because it all just works. In practice, sometimes you do, and on a modern systemd-based Linux this seems to be an unusually tangled situation. So here is what I can determine so far about how it works for software RAID arrays that are assembled and started outside of the initramfs, after your system has mounted your real root filesystem and is running from it.
(How things work for starting software RAID arrays in the initramfs is quite varied between Linux distributions. There is some distribution variation even for post-initramfs booting, but these days the master version of mdadm ships canonical udev and systemd scripts, services, and so on and I think most distributions use them almost unchanged.)
As has been the case for some time, the basic work is done through udev rules. On a typical Linux system, the main udev rule file for assembly will be called something like 64-md-raid-assembly.rules and will be basically the upstream mdadm version. Udev itself identifies block devices that are potentially Linux RAID members (probably mostly based on the presence of RAID superblocks), and mdadm's udev rules then run mdadm in a special incremental assembly mode on them. To quote the manpage:
This mode is designed to be used in conjunction with a device discovery system. As devices are found in a system, they can be passed to mdadm --incremental to be conditionally added to an appropriate array.
As array components become visible to udev and cause it to run 'mdadm --incremental' on them, mdadm progressively adds them to the array. When the final device is added, mdadm will start the array. This makes the software RAID array and its contents visible to udev and to systemd, where it will be used to satisfy dependencies for things like /etc/fstab mounts and thus trigger them happening.
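The heart of the rule file is a line along these lines (paraphrased from memory of the upstream rules; the exact text and mdadm path vary by version and distribution):

```
# Paraphrase of the key rule in 64-md-raid-assembly.rules: when a new
# block device tagged as a Linux RAID member appears, hand it to
# mdadm's incremental assembly mode.
SUBSYSTEM=="block", ACTION=="add", ENV{ID_FS_TYPE}=="linux_raid_member", \
    IMPORT{program}="/sbin/mdadm --incremental --export $devnode --offroot $env{DEVLINKS}"
```

The --export option makes mdadm print information about what it did as environment variables, which later udev rules (and the last-resort handling below) then act on.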
(There are additional mdadm udev rules for setting up device names, starting mdadm monitoring, and so on. And then there's a whole collection of general udev rules and other activities to do things like read the UUIDs of filesystems from new block devices.)
However, all of this only happens if all of the array component devices show up in udev (and show up fast enough); if only some of the devices show up, the software RAID will be partially assembled by 'mdadm --incremental' but not started, because it's not complete. To deal with this situation and eventually start software RAID arrays in degraded mode, mdadm's udev rules start a systemd timer unit when enough of the array is present to let it run degraded, specifically the templated timer unit mdadm-last-resort@.timer (so for md0 the specific unit is mdadm-last-resort@md0.timer). If the RAID array isn't assembled and the timer goes off, it triggers the corresponding templated systemd service unit, mdadm-last-resort@.service, which runs 'mdadm --run' on your degraded array to start it.
(The timer unit is only started when mdadm's incremental assembly reports back that it's 'unsafe' to assemble the array, as opposed to impossible. Mdadm reports this only once there are enough component devices present to run the array in a degraded mode; how many devices are required (and what devices) depends on the specific RAID level. RAID-1 arrays, for example, only require one component device to be 'unsafe'.)
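Abbreviated and paraphrased, the shipped template units look something like the following (exact contents vary with your mdadm version; this is a sketch of the important directives, not a verbatim copy):

```ini
# mdadm-last-resort@.timer (abbreviated): wait a bit, and stand down
# if the array's device unit appears in the meantime.
[Unit]
Conflicts=sys-devices-virtual-block-%i.device

[Timer]
OnActiveSec=30

# mdadm-last-resort@.service (abbreviated): refuse to run if the
# array already seems to exist in /sys, otherwise force-start it.
[Unit]
ConditionPathExists=!/sys/devices/virtual/block/%i/md/sync_action

[Service]
Type=oneshot
ExecStart=/sbin/mdadm --run /dev/%i
```

The %i template specifier is the array name (md0 and so on), which is how one pair of template units covers every array.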
Because there's an obvious race potential here, the systemd timer and service both work hard to not act if the RAID array is actually present and already started. The timer conflicts with 'sys-devices-virtual-block-<array>.device', the systemd device unit representing the RAID array, and as an extra safety measure the service refuses to run if the RAID array appears to be present in /sys/devices. In addition, the udev rule that triggers systemd starting the timer unit will only act on software RAID devices that appear to belong to this system, either because they're listed in your mdadm.conf or because their home host is this host.
(This is the MD_FOREIGN match in the udev rules. The environment variables come from mdadm's --export option, which is used during udev incremental assembly. Mdadm's code for incremental assembly, which also generates these environment variables, is in Incremental.c. The important enough() function is in util.c.)
As far as I know, none of this is documented or official; it's just how mdadm, udev, and systemd all behave and interact at the moment. However this appears to be pretty stable and long standing, so it's probably going to keep being the case in the future.
PS: As far as I can tell, all of this means that there are no real user-accessible controls for whether or not degraded software RAID arrays are started on boot. If you want to specifically block degraded starts of some RAID arrays, it might work to 'systemctl mask' either or both of the last-resort timer and service unit for the array. If you want to always start degraded arrays, well, the good news is that that's supposed to happen automatically.
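Concretely, the masking would look something like this. The 'md0' array name is a placeholder for whatever array you care about, and the systemctl commands need root, so this sketch just prints what you would run (and, to repeat, I haven't verified that masking actually blocks degraded starts):

```shell
#!/bin/sh
# Print the commands that would mask the last-resort units for one
# array; 'md0' is a hypothetical example name.
array=md0
printf 'systemctl mask mdadm-last-resort@%s.timer\n' "$array"
printf 'systemctl mask mdadm-last-resort@%s.service\n' "$array"
```

Masking the timer should stop the degraded start from ever being scheduled; masking the service is belt-and-suspenders on top of that.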
2019-04-13
WireGuard was pleasantly easy to get working behind a NAT (or several)
Normally, my home machine is directly connected to the public Internet by its DSL connection. However, every so often this DSL connection falls over, and these days my backup method of Internet connectivity is that I tether my home machine through my phone. This tethering gives me an indirect Internet connection; my desktop is on a little private network provided by my phone and then my phone NAT's my outgoing traffic. Probably my cellular provider adds another level of NAT as well, and certainly the public IP address that all of my traffic appears from can hop around between random IPs and random networks.
Most of the time this works well enough for basic web browsing and even SSH sessions, but it has two problems when I'm connecting to things at work. The first is that my public IP address can change even while I have a SSH connection present (but perhaps not active enough), which naturally breaks the SSH connection. The second is that I only have 'outside' access to our servers; I can only SSH to or otherwise access machines that are accessible from the Internet, which excludes most of the interesting and important ones.
Up until recently I've just lived with this, because the whole issue just doesn't come up often enough to get me to do anything about it. Then this morning my home DSL connection died at a fairly inopportune time, when I was scheduled to do something from home that involved both access to internal machines and things that very much shouldn't risk having my SSH sessions cut off in mid-flight (and that I couldn't feasibly do from within a screen session, because it involved multiple windows). I emailed a co-worker to have them take over, which they fortunately were able to do, and then I decided to spend a little time to see if I could get my normal WireGuard tunnel up and running over my tethered and NAT'd phone connection, instead of its usual DSL setup. If I could bring up my WireGuard tunnel, I'd have both a stable IP for SSH sessions and access to our internal systems even when I had to use my fallback Internet option.
(I won't necessarily have uninterrupted SSH sessions, because if my phone changes public IPs there will be a pause while WireGuard re-connects and so on. But at least I'll have the chance to have sessions continue afterward, instead of being intrinsically broken.)
Well, the good news is that my WireGuard setup basically just worked as-is when I brought it up behind however many layers of NAT'ing are going on. The actual WireGuard configuration needed no changes and I only had to do some minor tinkering with my setup for policy-based routing (and one of the issues was my own fault). It was sufficiently easy that now I feel a bit silly for having not tried it before now.
(Things would not have been so easy if I'd decided to restrict what IP addresses could talk to WireGuard on my work machine, as I once considered doing.)
This is of course how WireGuard is supposed to work. Provided that you can pass its UDP traffic in both ways (which fortunately seems to work through the NAT'ing involved in my case), WireGuard doesn't care where your traffic comes from if it has the right keys, and your server will automatically update its idea of what (external) IP your client has right now when it gets new traffic, which makes everything work out.
(WireGuard is actually symmetric; either end will update its idea of the other end's IP when it gets appropriate traffic. It's just that under most circumstances your server end rarely changes its outgoing IP.)
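In configuration terms, this roaming behavior falls out of a simple asymmetry. Here is a minimal wg-quick style sketch of a client config; the keys, addresses, port, and hostname are all placeholders, not my actual setup:

```ini
# Minimal client-side wg0.conf sketch (placeholder values throughout).
[Interface]
PrivateKey = <client private key>
Address = 192.168.200.2/24

[Peer]
PublicKey = <server public key>
# The client needs to know where the server is...
Endpoint = wgserver.example.org:51820
AllowedIPs = 192.168.200.0/24
# Periodic keepalives help hold NAT mappings open.
PersistentKeepalive = 25
```

The server's [Peer] section for this client simply omits 'Endpoint ='; the server learns (and keeps updating) the client's current address from whatever authenticated traffic arrives, which is exactly why NAT and changing public IPs don't break things.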
I knew that in theory all of this should work, but it's still nice to have it actually work out in practice, especially in a situation with at least one level of NAT going on. I'm actually a little bit amazed that it does work through all of the NAT magic going on, especially since WireGuard is just UDP packets flying back and forth instead of a TCP connection (which any NAT had better be able to handle).
On a side note, although I did everything by hand this morning, in theory I could automate all of this through dhclient hook scripts, which I'm already using to manage my resolv.conf (as covered in this entry). Of course this brings up a little issue, because if the WireGuard tunnel is up and working I actually want to use my regular resolv.conf instead of the one I switch to when I'm tethering (without WireGuard). Probably I'm going to defer all of this until the next time my DSL connection goes down.
2019-04-05
I won't be trying out ZFS's new TRIM support for a while
ZFS on Linux's development version has just landed support for using TRIM commands on SSDs in order to keep their performance up as you write more data to them and the SSD thinks it's more and more full; you can see the commit here and there's more discussion in the pull request. This is an exciting development in general, and since ZoL 0.8.0 is in the release candidate stage at the moment, this TRIM support might even make its way into a full release in the not too distant future.
Normally, you might expect me to give this a try, as I have with other new things like sequential scrubs. I've tracked the ZoL development tree on my own machines for years basically without problems, and I definitely have fairly old pools on SSDs that could likely benefit from being TRIM'd. However, I haven't so much as touched the new TRIM support and probably won't for some time.
Some projects have a relatively unstable development tree where running it can routinely or periodically destabilize your environment and expose you to bugs. ZFS on Linux is not like this; historically the code that has landed in the development version has been quite stable and problem free. Code in the ZoL tree is almost always less 'in development' and more 'not in a release yet', partly because ZoL has solid development practices along with significant amounts of automated tests. As you can read in the 'how has this been tested?' section of the pull request, the TRIM code has been carefully exercised both through specific new tests and random invocation of TRIM through other tests.
All of this is true, but then there is the small fact that in practice, ZFS encryption is not ready yet despite having been in the ZoL development tree for some time. This isn't because ZFS encryption is bad code (or untested code); it's because ZFS encryption turns out to be complicated and to interact with lots of other things. The TRIM feature is probably less complicated than encryption, but it's not simple, there are plenty of potential corner cases, and life is complicated by potential issues in how real SSDs do or don't cope well with TRIM commands being issued in the way that ZoL will. Also, an errant TRIM operation inherently destroys some of your data, because that's what TRIM does.
All of this makes me feel that TRIM is inherently much more dangerous than the usual ZoL new feature, sufficiently dangerous that I don't feel confident enough to try it. This time around, I'm going to let other people do the experimentation and collect the arrows in their backs. I will probably only start using ZFS TRIM once it's in a released version and a number of people have used it for a while without explosions.
If you feel experimental despite this, I note that according to the current manpage an explicit 'zpool trim' can apparently be limited to a single disk. I would definitely suggest using it that way (on a pool with redundancy); TRIM a single disk, wait for the disk to settle and finish everything, and then scrub your pool to verify that nothing got damaged in your particular setup. This is definitely how I'm going to start with ZFS TRIM, when I eventually do.
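Spelled out as commands, the cautious sequence looks something like this. The pool name 'tank' and device 'sda3' are hypothetical, and the commands need root and a real pool, so this sketch just prints the sequence rather than running it:

```shell
#!/bin/sh
# Print the cautious single-disk TRIM sequence described above;
# 'tank' and 'sda3' are placeholder names.
pool=tank
disk=sda3
cat <<EOF
zpool trim $pool $disk
zpool status -t $pool   # repeat until the TRIM has finished
zpool scrub $pool
zpool status -v $pool   # after the scrub: check nothing was damaged
EOF
```

Doing one disk of a redundant pool at a time means that even if your SSD mishandles TRIM, the subsequent scrub can detect (and repair) any damage from the pool's redundancy.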
(On my work machine, I'm still tracking the ZoL tree so I'm using a version with TRIM available; I'm just not enabling it. On my home machine, for various reasons, I've currently frozen my ZoL version at a point just before TRIM landed, just in case. I have to admit that stopping updating ZoL does make the usual kernel update dance an easier thing, especially since WireGuard has stopped updating so frequently.)