Wandering Thoughts

2018-08-10

Fetching really new Fedora packages with Bodhi

Normal Fedora updates that have been fully released are available through the regular updates repository, which is (or should be) already configured into dnf on your Fedora system. More recent (and less well tested) updates are available through the updates-testing repository, which you can selectively enable in order to see if what you're looking for is there. Right now I'm interested in Rust 1.28, because it's now required to build the latest Firefox from source, so:

# dnf --enablerepo=updates-testing check-update 'rust*'
Last metadata expiration check: 0:00:56 ago on Fri 10 Aug 2018 02:12:32 PM EDT.
#

However sometimes, as in this case and past ones, any update that actually exists is too new to even have made it into the updates-testing DNF repo. Fedora does their packaging stuff through Fedora Bodhi (see also), and as part of this packages can be built and available in Bodhi even before they're pushed to updates-testing, so if you want the very freshest bits you want to check in Bodhi.

There are two ways to check Bodhi; through the command line using the bodhi client (which comes from the bodhi-client package), or through the website. Perhaps I should use the client all the time, but I tend to reach for the website as my first check. The URL for a specific package on the website is of the form:

https://bodhi.fedoraproject.org/updates/?packages=<source package>

For example, https://bodhi.fedoraproject.org/updates/?packages=rust is the URL for Rust (and there's an RSS feed if you care a lot about a particular package). For casual use, it's probably easier to just search from Bodhi's main page.
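If you check several packages regularly, the URL pattern is simple enough to wrap in a tiny shell function. This is only a sketch; the function name bodhi_url is made up for illustration, and it assumes the URL form shown above stays the same.

```shell
# Print the Bodhi updates URL for a source package name.
# (bodhi_url is a hypothetical helper name, not part of bodhi-client.)
bodhi_url() {
    printf 'https://bodhi.fedoraproject.org/updates/?packages=%s\n' "$1"
}
```

Running 'bodhi_url kernel' then prints the kernel's Bodhi page URL, ready to paste into a browser.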

Through the command line, checking for and downloading an update looks like this:

; bodhi updates query --packages rust --releases f28 --status pending
============================= [...]
     rust-1.28.0-2.fc28
============================= [...]
   Update ID: FEDORA-2018-42024244f2
[...]
       Notes: New versions of Rust and related tools -- see the release notes
            : for [1.28](https://blog.rust-lang.org/2018/08/02/Rust-1.28.html).
   Submitter: jistone
   Submitted: 2018-08-10 14:35:56
[...]

We insist on the pending status because that cuts the listing down and normally gives us only one package, where we get to see detailed information about it; I believe that there's normally only one package in pending status for a particular Fedora release. If there are multiple ones, you get a less helpful summary listing that will give you only the full package name instead of the update ID. If you can't get the update ID through bodhi, you can always get it through the website by clicking on the link to the specific package version on the package's page.
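When the detailed listing does appear, the update ID can be pulled out of it mechanically. The following is a minimal sketch that assumes the 'Update ID:' line looks the way it does in the output above; other versions of the bodhi client may format things differently, and extract_update_id is a name invented here.

```shell
# Read 'bodhi updates query' output on stdin and print any update IDs,
# assuming lines of the form "   Update ID: FEDORA-...".
# (extract_update_id is a hypothetical helper name.)
extract_update_id() {
    sed -n 's/^[[:space:]]*Update ID:[[:space:]]*\([^[:space:]]*\).*$/\1/p'
}
```

You would pipe the query into it, as in 'bodhi updates query --packages rust --releases f28 --status pending | extract_update_id'. With the summary listing there is no such line, so it prints nothing.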

To fetch all of the binary RPMs for an update:

; cd /tmp/scratch
; bodhi updates download --updateid FEDORA-2018-42024244f2
[...]

Or:

; cd /tmp/scratch
; bodhi updates download --builds rust-1.28.0-2.fc28
[...]

Both versions of the bodhi command download things to the current directory, which is why I change to a scratch directory first. Then you can do 'dnf update /tmp/scratch/*.rpm'. If the resulting packages work and you feel like it, you can leave feedback on the Bodhi page for the package, which may help get it released into the updates-testing repo and then eventually the updates repo.
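The whole fetch step can be sketched as a small shell function. This is an illustration under assumptions, not a polished tool: the names run and fetch_update and the DRY_RUN variable are made up here, and it uses a fresh mktemp directory rather than /tmp/scratch. The only bodhi invocation is the 'bodhi updates download --updateid' form shown above.

```shell
# Run a command, or just print it when DRY_RUN is set (hypothetical helper).
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Download all binary RPMs for one update ID into a fresh scratch directory.
fetch_update() {
    updateid=$1
    scratch=$(mktemp -d) || return 1
    # bodhi downloads into the current directory, so work inside the scratch dir.
    ( cd "$scratch" && run bodhi updates download --updateid "$updateid" ) || return 1
    printf 'RPMs are in %s; install with: dnf update %s/*.rpm\n' "$scratch" "$scratch"
}
```

Calling 'DRY_RUN=1 fetch_update FEDORA-2018-42024244f2' shows what would be run without touching Bodhi. I've left the dnf step as a printed suggestion so that installing stays a deliberate, separate action.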

(In theory you can leave feedback through the bodhi command too, but it requires more setup and I think has somewhat fewer options than the website.)

As far as I've seen, installing RPMs this way will cause things to remember that you installed them by hand, even when they later become available through the updates-testing or the updates repo. This is probably not important to you.

(I decided I wanted an actual entry on this process that I can find easily later, instead of having to hunt around for my postscript in this entry the next time I need it.)

PS: For my future use, here is the Bodhi link for the kernel, which is probably the package I'm most likely to want to fish out of Bodhi regularly. And just in case, openssl and OpenSSH.

FedoraBodhiGetPackages written at 14:58:56

2018-08-08

Systemd's DynamicUser feature is (currently) dangerous

Yesterday I described how timesyncd couldn't be restarted on one of our Ubuntu 18.04 machines, where the specific thing that caused the failure was timesyncd attempting to access /var/lib/private/systemd/timesync and failing because /var/lib/private is only accessible by root, not the UID that timesyncd was running as. My diagnostic efforts left me puzzled as to how this was supposed to work at all, but Trent Lloyd (@lathiat) pointed me to the answer, which is in Lennart Poettering's article Dynamic Users with systemd, which introduces the overall system, explains the role of /var/lib/private, and covers how timesyncd is supposed to get access through an inaccessible directory. I'll quote the explanation for that:

[Access through /var/lib/private] is achieved by invoking the service process in a slightly modified mount name-space: it will see most of the file hierarchy the same way as everything else on the system ([...]), except for /var/lib/private, which is over-mounted with a read-only tmpfs file system instance, with a slightly more liberal access mode permitting the service read access. [...]

Since timesyncd is not able to get access through /var/lib/private, you might guess that something has gone wrong in the process of setting up this slightly modified mount namespace. Indeed this turned out to be the case. The machine that this happened on is an NFS client and (as is usual) its UID 0 is mapped to an unprivileged UID on our fileservers. On this machine there were some FUSE mounts in the home directories of users who have their $HOME not world readable (our default $HOME permissions are owner-only, to avoid accidents). When systemd was setting up the 'slightly modified mount name-space' it attempted to access these FUSE mounts as part of binding them into the namespace, but it failed because UID 0 had no permissions to look inside user home directories.

This failure caused systemd to give up attempting to set up the namespace. However, systemd did not abort unit activation or even log an error message. Instead it continued on to try to start timesyncd without this special namespace, despite the fact that timesyncd uses both DynamicUser and StateDirectory and so starting it normally was essentially guaranteed to fail.

(Although my initial case was dangling FUSE mounts, it soon developed that any FUSE mounts would do it, for example a sshfs or smbfs mount in a user's NFS mounted home directory when the home directory isn't world-accessible.)
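To see whether a machine currently has any FUSE mounts that could trigger this, you can look through /proc/mounts for filesystem types that start with 'fuse' (sshfs mounts show up as 'fuse.sshfs', for example). A minimal sketch, where list_fuse_mounts is a name invented here and the optional file argument exists only to make it easy to try out:

```shell
# Print the mount points of all FUSE filesystems.
# Reads /proc/mounts by default; another file in the same format can be given.
# (list_fuse_mounts is a hypothetical helper name.)
list_fuse_mounts() {
    awk '$3 ~ /^fuse/ { print $2 }' "${1:-/proc/mounts}"
}
```

This only tells you that such mounts exist, not whether UID 0 can actually reach them; for that you would still have to check the permissions of the directories above each mount point.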

Systemd's failure to handle errors in setting up the namespace here has been raised as systemd issue 9835. However, merely logging an error or aborting the unit activation would not actually fix the core problem; it would merely let you see exactly why your timesyncd or whatever service is failing to start. The core problem is that systemd's current design for DynamicUser intrinsically blows up if systemd and UID 0 don't have full access to every mount that's visible on the system.

(Well, DynamicUser plus StateDirectory, but the idea seems to be that pretty much every service using dynamic users will have a systemd managed state directory.)
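If you want to find out which services on a machine are exposed to this, 'systemctl show' can report both properties for a unit. The sketch below assumes the usual 'Key=Value' output of 'systemctl show -p DynamicUser -p StateDirectory UNIT'; the function name uses_dynamic_state is made up, and I haven't checked this against every systemd version.

```shell
# Read "Key=Value" lines (as printed by
#   systemctl show -p DynamicUser -p StateDirectory UNIT
# ) on stdin and succeed only if the unit has DynamicUser=yes and a
# non-empty StateDirectory. (uses_dynamic_state is a hypothetical name.)
uses_dynamic_state() {
    awk -F= '
        $1 == "DynamicUser" && $2 == "yes" { dyn = 1 }
        $1 == "StateDirectory" && $2 != "" { state = 1 }
        END { exit !(dyn && state) }
    '
}
```

For one unit this would be used as 'systemctl show -p DynamicUser -p StateDirectory systemd-timesyncd.service | uses_dynamic_state && echo at risk'.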

In my opinion, this makes using DynamicUser surprisingly dangerous. A systemd service that is set to use it can't be reliably started or restarted on all systems; it only works on some systems, some of the time (but those happen to be the common case). If there's ever a problem setting up the special namespace that each such service requires, things fail. Machines that are NFS clients are the obvious case, since the client's UID 0 often has limited privileges, but I believe that there are likely to be others.

(And of course services can be restarted for random and somewhat unpredictable reasons, such as package updates or other services being restarted. You should not assume that you can always control these circumstances, or completely predict the state of the system when they happen.)

SystemdDynamicUserDangerous written at 21:51:36

A timesyncd total failure and systemd's complete lack of debuggability

Last November, I wrote an entry about how we were switching to using systemd's timesyncd on our Ubuntu machines. Ubuntu 18.04 defaults to using timesyncd just as 16.04 does, and when we set up our standard Ubuntu 18.04 environment we stuck with that default behavior (although we customize the list of NTP servers). Then today I discovered that timesyncd had silently died on one of our 18.04 servers back on July 20th, and worse it couldn't be restarted.

Specifically, it reported:

systemd-timesyncd[10940]: Failed to create state directory: Permission denied

The state directory it's complaining about is /var/lib/systemd/timesync, which is actually a symlink to /var/lib/private/systemd/timesync (at least on systems that are in good order; if the symlink has had something happen to it, you can apparently get other errors from timesyncd). I had a clever informed theory about what was wrong with things, but it turns out strace says I'm wrong.

(To my surprise, doing 'strace -f -p 1' on this system did not produce either explosions or an impossibly large amount of output. This would have been a very different thing on a system that was actually in use; this is basically an almost idle server being used as part of our testing of 18.04 before we upgrade our production servers to it.)

According to strace, what is failing is timesyncd's attempts to access /var/lib/private/systemd/timesync as its special UID (and GID) 'systemd-timesync'. This is failing for the prosaic reason that /var/lib/private is owner-only and owned by root. Since this works on all of our other Ubuntu 18.04 machines, presumably the actual failure is somewhere else.

The real problem here is that it is impossible to diagnose or debug this situation. Simply to get this far I had to read the systemd source code (to find the code in timesyncd that printed this specific error message) and then search through 25,000 lines of strace output. And I still don't know what the problem is or how to fix it. I'm not even confident that rebooting the server will change anything, especially when all the relevant pieces on this server seem to be just the same as the pieces on other, working servers.

(I do know that according to logs this failure started happening immediately after the systemd package was upgraded and re-executed itself. On the other hand, the systemd upgrade also happened on other Ubuntu 18.04 machines, and they didn't have their timesyncds explode.)

Since systemd has no clear diagnostic information here, I spent a great deal of time chasing the red herring that if you look at /var/lib/private/systemd/timesync on such a failing system, it will be owned by a numeric UID and GID, while on working systems it will be the magically special login and group 'systemd-timesync'. This is systemd's 'dynamic user' facility in action, combined with systemd itself creating the /var/lib/private/systemd/timesync directory (with the right login and group) before exec'ing the timesyncd binary. When timesyncd fails to start, systemd removes the login and group but leaves the directory behind, now not owned by any existing login or group.

(You might think that the 'failed to create state directory' error message would mean that timesyncd was the one actually creating the state directory, but strace says otherwise; the mkdir() happens before the exec() does, while the new process that will become timesyncd is still in systemd's code. timesyncd's code does try to create the directory, but presumably the internal systemd functions it's using are fine if the directory is already there with the right ownership and so on.)

I am rather unhappy about this situation, and I am even unhappier that there is effectively nothing that we can do about any aspect of it except to stop using timesyncd (which is now something that I will be arguing for, especially since this server drifted more than half a second out of synchronization before I found this issue entirely by coincidence). Reporting a bug to either systemd or to Ubuntu is hopeless (systemd will tell me to reproduce on the latest version, Ubuntu will ignore it as always). This is simply what happens when the systemd developers produce a design and an implementation that doesn't explain how it actually works and doesn't contain any real support for field diagnosis. Once again we get to return to the era of 'reboot the server, maybe that will fix it'. Given systemd's general current attitude, I don't expect this to change any time soon. Adding documentation of systemd's internals and diagnosis probes would be admitting that the internals can have bugs, problems, and issues, and that's just not supposed to happen.

PS: The extra stupid thing about the whole situation is that the only thing /var/lib/systemd/timesync is used for is to hold a zero-length file whose timestamp is used to track the last time the clock was synchronized, and non-root users can't even see this file on Ubuntu 18.04.

Update: I've identified the cause of this problem, which is described in my new entry on how systemd's DynamicUser feature is dangerous. The short version is that systemd silently failed to set up a custom namespace that would have given timesyncd access to /var/lib/private because it could not deal with FUSE mounts in NFS mounted user home directories that were not world-accessible.

SystemdTimesyncdFailure written at 01:52:59

2018-08-06

Linux's /dev/disk/by-path names for disks change over time

I have in the past written about the many names of SATA disks and on the names of SAS drives, and in both cases one of the sorts of names I talked about was the /dev/disk/by-path names. Unlike the various other names of disks, which are generally kernel based, these names come from the inscrutable depths of udev. It will probably not surprise you to hear that udev periodically changes its mind about what to call things (or, sometimes, has problems figuring things out).

Due to our new fileserver hardware, I can give you two examples of how this has changed, one for SAS devices and one for SATA ones. First, for SATA disks that are directly attached to SAS ports, udev now provides disk names that use the SAS PHY number instead of the nominal 'SAS address', resulting in names like pci-0000:19:00.0-sas-phy2-lun-0. There is still a /sys/block/sdX/device/sas_address file, I believe with the same contents as before; udev simply uses the PHY number now. This is convenient for us, since SAS PHY numbers seem to be the best way of identifying the physical disk slot on our hardware. Udev's SAS PHY numbers start from 0.

For SATA disks that are directly attached to SATA ports, udev now uses names that directly refer to the ataN names of the drives (at least for drives that aren't behind port multipliers; udev probably still mangles the names of SATA disks behind port multipliers). This gives you names such as pci-0000:00:17.0-ata-2. Much like the kernel, udev's ATA numbers start from one, and they're relative to the controller; our new systems have both pci-0000:00:11.5-ata-1 and pci-0000:00:17.0-ata-1 disks.

(This switch may be partly due to ATA numbers now appearing in sysfs, as very helpfully noted by Georg Sauthoff in a comment from last year on my old entry. This sysfs change happened sometime between CentOS 6's kernel (some version of 2.6.32) and CentOS 7's kernel (some version of 3.10.0).)

Notice that udev is not necessarily consistent with itself in naming standards. Directly connected SATA disks use 'ata-N', with a dash between the fixed name and the number, while SAS disks use 'phyN', with no dash. I suspect that different people write the code for different sorts of devices, and do whatever they feel is the best option.
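If you need to pull the port number out of these names in a script, you have to handle both spellings. Here is a minimal sketch that only knows about the two name forms shown in this entry; anything else (older udev naming, disks behind port multipliers with different names, and so on) produces no output. The function name bypath_port is invented for illustration.

```shell
# Given a /dev/disk/by-path name, print "sas-phy N" or "ata N".
# Only handles the '-sas-phyN-lun-M' and '-ata-N' forms described above;
# prints nothing for names it does not recognize.
# (bypath_port is a hypothetical helper name.)
bypath_port() {
    printf '%s\n' "$1" | sed -n \
        -e 's/^.*-sas-phy\([0-9][0-9]*\)-lun-[0-9].*$/sas-phy \1/p' \
        -e 's/^.*-ata-\([0-9][0-9]*\).*$/ata \1/p'
}
```

Remember that the two numbers don't mean the same thing: SAS PHY numbers start from 0, while ATA numbers start from one and are relative to the controller.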

(I believe that all of these names are hard-coded in udev itself, not set up through udev rules.)

Generally any competently run Linux distribution is not going to cause your /dev/disk/by-path names to change over the lifetime of any particular release of the distribution. They may well change from release to release, though, especially for major jumps (for example, between Ubuntu LTS releases). This is a potential issue if you have things that use these names and rely on them staying constant. One possible case is ZFS on Linux, especially given how it handles disk names; however, the usual recommendation for ZoL is to use /dev/disk/by-id names, which should really be stable over the long term.

(I don't know if they actually have been, although my ZoL pools haven't suffered any explosions due to this over the several years and fair number of Fedora releases that I've been running them.)

PS: To my surprise, none of our Ubuntu 14.04 systems even have a /dev/disk/by-path directory. I suspect that this is some 14.04 peculiarity, since CentOS 6 is even older and does have a by-path directory, and this old entry says that at least some of our 12.04 systems also had it. We don't normally use any of the /dev/disk/by-* directories on our regular Ubuntu servers, which is probably why I didn't notice before now.

LinuxDiskNamesChange written at 21:21:46

2018-08-02

Ubuntu 18.04's problem with Amanda's amrecover

If you use Amanda to back up your machines (as we do), and you have just added some Ubuntu 18.04 LTS machines to your fleet and installed the usual amanda-client Ubuntu package to get the necessary client programs, you may some day fire up amrecover on one of them to restore some of those backups. Well, to attempt to restore those backups:

# amrecover -s <server> -t <server> -C <s_config>
AMRECOVER Version 3.5.1. Contacting server on <server> ...
[request failed: amrecover: error [exec /usr/lib/amanda/ambind: No such file or directory]]

Our Amanda servers are running Ubuntu 16.04 LTS, with Amanda 3.3.6. Given this error message (and also the fact that amrecover generally takes several seconds to produce it), we concluded that the 3.5.1 amrecover now requires the Amanda server to have this new ambind program (which only appeared in 3.5). This seemed about par for the course for Ubuntu in 18.04, given issues like libreadline6.

This turns out not to be the case (to my disgusted surprise). Despite how the error message looks, it's the Amanda client (the 18.04 machine) that needs ambind, not the server; amrecover itself is trying to directly execute ambind and failing because indeed ambind's not there. The reason that it's not there is that Ubuntu put ambind into the amanda-server package instead of either amanda-client (which would be appropriate if it's only needed by amrecover) or amanda-common (if it's also needed by Amanda server programs). You probably haven't installed the amanda-server package on your Amanda client machines because, really, why would you?

The good news is that this is easily fixed. Just install amanda-server as well as amanda-client on all of your Ubuntu 18.04 Amanda clients, and everything should be fine. As far as I can tell, installing the server package doesn't do anything dangerous like enable services; it just adds some more programs and manpages.
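A quick way to check whether a client is affected is to see if the file that amrecover tries to run is actually there. This is a small sketch; the path is the one from the error message above, and have_ambind is a name I've made up.

```shell
# Succeed if the ambind helper that amrecover tries to exec exists and is
# executable. The default path is taken from the amrecover error message.
# (have_ambind is a hypothetical helper name.)
have_ambind() {
    [ -x "${1:-/usr/lib/amanda/ambind}" ]
}
```

On an 18.04 Amanda client, 'have_ambind || echo "ambind is missing, install amanda-server"' would then flag machines that still need the extra package.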

This packaging issue appears to be inherited from Debian, where the current 'buster (testing)' packages of 3.5.1 also put ambind in the amanda-server package. However, Debian testing is the rolling 'latest development state' of Debian, not shipping as an official LTS release the way Ubuntu 18.04 is.

PS: This is a terrible error message from amrecover, especially under the circumstances. If your program talks to a server, you should always make it completely unambiguous about when you're reporting a local error compared to when you're just relaying an error from the server. If there is any chance of confusion in your error messages, you're doing it wrong.

Sidebar: How I worked this out (mostly grim determination and flailing)

We thought we had a workaround in the form of hacking up the Ubuntu 16.04 Amanda 3.3.6 packages and installing them on 18.04, but then we started to run into various troublesome issues and I decided to see if there was some way of turning off this 'invoke ambind on the server' behavior with an Amanda configuration setting or amrecover command line option. So I went off to look at just what was happening on the Amanda server.

I started by looking at the Amanda server logs. Well, trying to look at them, because there was absolutely nothing being logged about this (which is unusual, the Amanda server stuff is usually quite verbose). My next step was to get out the big hammer and run 'strace -f -e trace=file -o /tmp/st-out -p <xinetd's PID>' on the Amanda server while I invoked amrecover on the client. This too was completely empty, so I spent a while wondering if there was some security setting that was making the strace not work.

Interspersed with trying to trace the server's actions I was also reading through the Amanda source code to try to follow the control flow that sent a message from the client to invoke ambind on the server. The problem was that I couldn't really find anything that looked like this; the only use of ambind I could see seemed entirely inside one file, not the back and forth exchange to the Amanda server stuff that I'd expect. However, I could find something that looked a lot like the 'error [exec ...]' error message that was ultimately being printed out.

All of this led me to run strace on amrecover itself, and lo and behold there was the smoking gun:

20360 execve("/usr/lib/amanda/ambind", ["/usr/lib/amanda/ambind", "5"], 0x7ffe980502e8 /* 113 vars */) = -1 ENOENT (No such file or directory)
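In a large strace output file, lines like this are easy to miss. A minimal sketch of a filter that prints the path of every execve() that failed with ENOENT, assuming strace's usual one-call-per-line output format as shown above (failed_execs is a hypothetical name):

```shell
# Print the program path of each execve() call that failed with ENOENT.
# Reads strace output from the named files, or from stdin if none are given.
# (failed_execs is a hypothetical helper name.)
failed_execs() {
    sed -n 's/^.*execve("\([^"]*\)".*= -1 ENOENT.*$/\1/p' "$@"
}
```

If the trace was saved with something like 'strace -f -o /tmp/st-out amrecover ...', then 'failed_execs /tmp/st-out' would have printed /usr/lib/amanda/ambind here. Note that a shell searching $PATH for a program also produces failed execve() calls, so not every line of output is a problem.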

Then it was just a matter of using packages.ubuntu.com to verify that ambind was in the amanda-server package and some testing to verify that installing it on an 18.04 test machine appeared to make amrecover happy with life.

Ubuntu1804AmandaProblem written at 00:43:48

2018-07-22

The problem with some non-HiDPI aware applications (is that they're very small)

I tweeted:

One problem with a HiDPI monitor is the occasional application that absolutely doesn't upscale. For example, the Java that I need to access the KVM-over-IP console of this locked up NFS fileserver.

Our current fileservers are old enough Supermicro machines that their onboard IPMIs only support KVM-over-IP through a Java Web Start application. Today, I needed to use it from home, and I was only a little bit surprised when the resulting virtual console was, well, tiny on my new home HiDPI display.

(I was a bit surprised at how visually tiny it came out, but then I keep being surprised at how small 'half their normal size' has actually been on my new display. And the virtual console wasn't really a giant window to start with on my non-HiDPI work displays, at least in the basic 'text' VGA mode.)

Some modern applications are HiDPI aware, and others at least provide settings for the fonts and font sizes that they use. It's possible that Supermicro's Java program has settings for this (I was in a hurry so I didn't look, although here's Arch's Java information), but I have a sneaking suspicion that it doesn't. For applications like this, the end result is tiny, hard to read or use application windows, either permanently or until I can find how to adjust the application (which may not be worth it if I only use the app occasionally). Since I'm likely to run into this periodically, I should work out a decent general solution to it someday.

In the bright future of Wayland, it will presumably be theoretically possible to have your Wayland compositor automatically scale windows according to your desires, so old non-HiDPI X applications being run in some sort of compatibility mode can just be zoomed up however much you want (since all of this is OpenGL based, and my understanding is that OpenGL has good support for that kind of thing). In the current reality of X Windows (at least, it's my current reality and hopefully future reality), I need a different solution. To date, I know of two.

The easiest option and one that's probably already available, even in basic X environments like mine, is a screen magnifier such as KMag. KMag has the moderate inconvenience that it can't be told to magnify a given (X) window, although you can awkwardly set it to magnify an area of the screen instead of wherever your mouse cursor is. Since I already have KMag installed, this is probably my default choice.

Other than that, the Arch wiki has a section on unsupported apps which led me to run_scaled, a shell script that uses xpra to relatively transparently run programs with forced scaling. Run_scaled is a pretty big hammer and it has some drawbacks, partly because your program is running in a separate X server. I could probably make it work for Java Web Start stuff with some effort, but it's more awkward than just resorting to KMag; I'd need to get my browser to run my javaws cover script instead of the real javaws.

(Fedora is already using its alternatives system to pick who gets to be /usr/bin/javaws, so in theory I could just set my script up as that and then pass things off to the IcedTea version.)

(I initially thought of playing tricks with Xephyr plus xrandr for scaling the 'display' inside Xephyr, but the more I think about it the less sense that approach makes. I think I'd be better off using run_scaled.)

HiDPITinyAppProblem written at 03:01:52

2018-07-21

How we're handling NFS exports for our ZFS on Linux systems

If you have ZFS based NFS fileservers you're normally supposed to handle setting up NFS export/sharing permissions through ZFS by setting and updating the sharenfs property on ZFS filesystems. ZFS then worries about keeping the system's NFS export permissions in sync with what (ZFS) filesystems you have mounted, where you have them mounted, and what their sharenfs settings are. There are all sorts of convenient aspects of this and it's what we've done for years on our current fileservers.

Unfortunately this is not an option for us in ZFS on Linux. I sort of covered why in my entry on ZoL's sharenfs problem, but I didn't mention the core issue for us, which is that ZoL's handling of sharenfs has no support for the Illumos 'root=' option to provide root access to the NFS filesystem for only certain systems (instead of all of them).

In that entry I speculated that we'd embed our NFS export options as a ZFS user property on ZFS filesystems. This is sort of the intellectually pure option, but we've decided to take another way. We're going to be managing our NFS export permissions entirely outside of ZFS, but reusing 'sanfs', our existing local filesystem management program.

Sanfs's job is to set up and operate filesystems according to our local policies and specifications; it handles things like filesystem quotas and reservations, knows whether the filesystem should be visible on our deliberately restricted web server, and so on. Since its configuration file is the central point that knows about all of our NFS-visible filesystems, we also use it to automatically generate the NFS mount list for our local automounter. The sanfs configuration file is where we specify (Illumos) NFS export options, including any special additions for particular filesystems:

# Global default
shareopts  nosuid,sec=sys,rw=nfs_ssh,root=nfs_root

# Individual filesystems
fs3 /h/281 fs3-staff-01   rw+=cks_dev

(The += syntax is something that I'm unreasonably happy about; it exactly captures the change we almost always want to make to NFS export permissions.)
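To illustrate the idea (this is not sanfs's actual code), here is a sketch of how a '+=' modifier could be folded into the default options. It assumes that '+=' appends to the option's colon-separated list, which is how Illumos access lists are written; the function name apply_shareopt is made up, and an option that isn't already in the defaults is simply left alone.

```shell
# Apply one 'name+=value' modifier to a comma-separated share option string,
# appending the value to that option's colon-separated list.
# Assumption: this is only a guess at the semantics, not sanfs's real code.
# (apply_shareopt is a hypothetical helper name.)
apply_shareopt() {
    awk -v base="$1" -v mod="$2" 'BEGIN {
        split(mod, kv, /\+=/)            # kv[1] = option name, kv[2] = addition
        n = split(base, opts, ",")
        out = ""
        for (i = 1; i <= n; i++) {
            if (index(opts[i], kv[1] "=") == 1)
                opts[i] = opts[i] ":" kv[2]
            out = out (i > 1 ? "," : "") opts[i]
        }
        print out
    }'
}
```

With the example configuration above, "apply_shareopt 'nosuid,sec=sys,rw=nfs_ssh,root=nfs_root' 'rw+=cks_dev'" gives 'nosuid,sec=sys,rw=nfs_ssh:cks_dev,root=nfs_root'.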

In the Linux ZoL world, the Linux version of sanfs will still use the same configuration file and the same format for NFS export permissions, but instead of just setting the sharenfs ZFS property with the final calculated share options, it will convert them over to Linux NFS export permissions (using some local knowledge and the general equivalences) and then directly manage Linux NFS export permissions using an auxiliary script. This script does two things. First, it writes or updates a per-filesystem file in /etc/exports.d that records the current permissions, and then it pokes exportfs to update the actual live permissions to reflect their new state. Among other reasons, recording the state of things in /etc/exports.d makes our NFS export permissions automatically persist over reboots.

(All our NFS exports will use the mountpoint option, so they're not active until and unless the ZFS filesystem is mounted.)
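The two steps of such an auxiliary script can be sketched roughly as follows. This is an illustration and not our real script: the function name write_export and the file naming scheme are invented, and the only things it relies on are that exportfs reads files named '*.exports' in /etc/exports.d and that 'exportfs -ra' re-syncs the live exports with those files.

```shell
# Write (or replace) the per-filesystem exports file for one filesystem.
#   $1 = exported path, $2 = Linux client/option list, $3 = directory
# The file is written under a temporary name and renamed into place so that
# exportfs never sees a half-written file.
# (write_export and the path-to-filename mapping are hypothetical.)
write_export() {
    fs=$1 opts=$2 dir=${3:-/etc/exports.d}
    name=$(printf '%s' "$fs" | sed -e 's|^/||' -e 's|/|-|g')
    printf '%s %s\n' "$fs" "$opts" > "$dir/$name.exports.tmp" &&
        mv "$dir/$name.exports.tmp" "$dir/$name.exports"
}

# After updating the files, poke the live export state:
#   exportfs -ra
```

For instance, "write_export /h/281 '@nfs_ssh(rw,mountpoint)'" would create /etc/exports.d/h-281.exports.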

One significant part of what makes this work is that we never actually use any of the convenient things that ZFS's handling of sharenfs gives you. We always export ZFS filesystems individually, we never move them around, and we don't export and import pools (at least not without explicitly unmounting things on clients, for good reasons). Without a SAN we definitely can't ever move a pool between physical machines without a lot of intervention. Basically, once pools and filesystems are created, they stay there more or less forever.

PS: The Linux version of sanfs and the current Illumos version will in fact literally be using the same configuration file, since we're inevitably going to be operating both at once for a while. We have terabytes of data to move across in a couple hundred filesystems, and that's not exactly going to happen fast, especially when we haven't even finished developing the Linux fileservers.

ZFSOnLinuxNFSExportsSolution written at 00:28:33

2018-07-20

Linux's NFS exports permissions model compared to Illumos's

As part of our move from OmniOS based fileservers to ones based on Linux, I've recently been looking into how to map our current NFS export permissions into Linux's NFS export permissions. As part of this I've been looking into the similarities and differences between the Linux model of NFS export permissions and the Illumos one. The end results you can get are mostly similar (with one difference that may matter for us someday), but Linux gets there in a significantly different way.

To simplify a bit, in Illumos you have permissions that apply to things, such as netgroups. If a host would match multiple things, whichever read or read/write permission is listed first takes priority (more or less). If you write 'rw=...,ro=...', rw permissions take priority for any host in both. In Linux, this is inverted; you have things (aka NFS clients), such as netgroups, that have permissions and other options specified for them. If a host would match multiple netgroups, the first matching one wins and specifies all of the host's permissions and options. This can duplicate the Illumos read versus read/write behavior but it gives you more flexibility in general. However, it's more verbose if you have several netgroups.

To see this extra verbosity, consider an Illumos share of 'rw=A:B:C,ro=D:E', where all of these are netgroups. In Linux, you turn this inside out and wind up writing:

@A(rw,...) @B(rw,...) @C(rw,...) @D(ro,...) @E(ro,...)

As far as I know Linux has no way to specify 'any of these N netgroups' in a single match, so you have to have a separate entry for each netgroup. If you do this a lot you presumably create yourself a superset netgroup, but that doesn't necessarily scale if you're doing this on an ad-hoc basis with various different shares, as we are.

The one place where Illumos and Linux are different in an important way for us is remapping or not remapping UID 0. Illumos supports a 'root=' option, where hosts specified in it don't remap UID 0, and this is applied to them separately from whether they have read or read/write permissions. In Linux, UID 0's mapping (or lack of it) is part of an NFS client's options, and so it must be specified together with whether the client has read or read/write permission. This makes it impossible to translate some Illumos root= settings without changing your netgroups and makes translating others require local knowledge (for example, of what netgroups are a subset of what other ones).

(Linux is more flexible here in some ways, but you have to want to map UID 0 to different UIDs for different clients.)

We're fortunately not doing anything tricky with our Illumos root= permissions; the machines that we give root access to are always a subset of the machines that we give read/write access to. With this local knowledge in hand, it's easy (but verbose) to automatically translate Illumos ZFS sharenfs settings to equivalent Linux ones, although we can't manage them through ZFS on Linux's sharenfs property.
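A rough Python sketch of such an automatic translation follows. It leans entirely on the local knowledge described above (every root= netgroup is a subset of the hosts with read/write access), so each root netgroup is emitted as a read/write client with no_root_squash, and these entries go first so that Linux's first-match rule picks them ahead of the plain rw entry. The function name translate_share is hypothetical, and only rw=, ro= and root= lists of netgroups are handled.

```python
def translate_share(sharenfs: str) -> str:
    """Translate 'rw=..,ro=..,root=..' netgroup lists to Linux exports clients.

    Assumes every root= netgroup only contains hosts that also have
    read/write access, which is local knowledge, not a general rule.
    """
    root = []    # netgroups listed in root=
    others = []  # (netgroup, "rw" or "ro") in their listed order
    for item in sharenfs.split(","):
        key, sep, names = item.partition("=")
        if key not in ("rw", "ro", "root") or not sep or not names:
            raise ValueError(f"unsupported share option: {item!r}")
        for ng in names.split(":"):
            if key == "root":
                root.append(ng)
            else:
                others.append((ng, key))
    # Root netgroups come first: the first matching client entry
    # supplies all of a host's options, including UID 0 handling.
    clients = [f"@{ng}(rw,no_root_squash)" for ng in root]
    # Skip netgroups that already got a root entry above.
    clients += [f"@{ng}({perm})" for ng, perm in others if ng not in root]
    return " ".join(clients)


print(translate_share("rw=A:B,ro=C,root=A"))
```

If a root= netgroup could contain read-only hosts, this sketch would silently give them read/write access, which is exactly the case where the translation needs a human.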

PS: The Linux NFS(v3) server doesn't support the sort of general UID and GID remapping that Illumos does; it only remaps UID 0. This fortunately doesn't matter for us in general, although it's very slightly inconvenient for me.

PPS: For NFS exporting ZFS filesystems specifically, you probably want to include the Linux crossmnt share option because, if I'm reading the tea leaves correctly, it allows NFS clients to have access to the filesystem's .zfs/snapshot pseudo-directory of ZFS snapshots, which are independent sub-filesystems. This is automatic on Illumos.

NFSExportPermsModel written at 01:03:56

2018-07-19

Sometimes it actually is a Linux kernel bug

For my sins, I use a number of third party 'out of kernel' kernel modules in my Fedora kernels, especially on my office workstation. I don't use a binary GPU driver, but there's the latest git tip of ZFS on Linux, VMWare's kernel modules, and an out of tree it87 module in order to support my motherboard's sensors (for as long as that keeps working). Usually this works fine. Usually. For the past few days, my office machine has been panicking during our nightly backups when Amanda runs a big tar over some of my filesystems.

There's a long standing saying in programming that 'it's never a compiler bug'. I have a similar rule of thumb about kernel panics; given that I use a number of third party modules, especially the VMWare modules, any kernel panics I run into are caused by them.

(I mean, apart from the system lockups, which are AMD's fault, and the amdgpu problems, which were a graphics driver issue. It's a rule of thumb, and it's mostly true about core kernel code. Linux kernel driver code is a little bit more likely to have bugs.)

So I assumed that the cause of my sudden panics was probably ZFS on Linux (an assumption helped along when I accidentally ran my machine without the VMWare modules and it still panicked). After some diagnostic work, I reduced things down to a belief of 'ZoL and the latest Fedora kernels don't like each other', went to report an issue, and found ZoL issue #7723 and thus Fedora #1598462. Which led to my tweet:

Today I learned that Fedora 27 and 28 kernels after 4.17.3 are known to panic under high IO load. Better late than never, but I could have used that knowledge before upgrading the office machine to 4.17.5.

So, yeah. To my surprise, this actually is a (general) Linux kernel bug, not a bug in any of the third party modules I use. This feels like the equivalent of finding a genuine compiler bug.

(I can be pretty sure that I'm hitting the same bug, because I have basic netconsole (also) set up, and my panic messages match the bug report's. They also run through ZFS functions, which didn't help my initial suspicions.)

PS: What made this more peculiar is that I've been running the Fedora 27 4.17.5 kernel at home without problems. But then, I don't have good home backups and I don't think I've done anything else recently to stress the home machine's IO. I should revert back to 4.17.3 anyway.

PPS: 4.17.7 kernels are in Fedora Bodhi but apparently not yet in the updates-testing DNF repo for Fedora 27. It looks like the most convenient way to get things from Bodhi is with the bodhi client program. I'm using it like so:

bodhi updates query --packages kernel --releases f27 --status pending
bodhi updates download --builds kernel-4.17.7-100.fc27

(You could leave out the '--status pending', but if you do you get a big list of past updates that aren't very interesting. If I'm fetching something from Bodhi it's because I can't get it anywhere else, so it's probably an update that's so new that it's not even in the updates-testing repo.)
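If you do this often, the two steps are easy to wrap. The following Python sketch only builds the command lines, using exactly the subcommands and flags shown above; the helper names bodhi_query_cmd and bodhi_download_cmd are made up, and nothing here talks to Bodhi itself.

```python
import shlex


def bodhi_query_cmd(package, release, status="pending"):
    """Build the 'bodhi updates query' command for one package and release."""
    return ["bodhi", "updates", "query",
            "--packages", package,
            "--releases", release,
            "--status", status]


def bodhi_download_cmd(build):
    """Build the 'bodhi updates download' command for one build."""
    return ["bodhi", "updates", "download", "--builds", build]


# Print the commands; pass the lists to subprocess.run() to execute them.
print(shlex.join(bodhi_query_cmd("kernel", "f27")))
print(shlex.join(bodhi_download_cmd("kernel-4.17.7-100.fc27")))
```

The build name still has to be read off the query output by hand; this sketch doesn't attempt to parse it.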

KernelBugSometimes written at 00:26:59

2018-07-13

ZFS on Linux's sharenfs problem (of what you can and can't put in it)

ZFS has an idea of 'properties' for both pools and filesystems. To quote from the Illumos zfs manpage:

Properties are divided into two types, native properties and user-defined (or "user") properties. Native properties either export internal statistics or control ZFS behavior. [...]

Filesystem properties are used to control things like whether compression is on, where a ZFS filesystem is mounted, if it is read-only or not, and so on. One of those properties is called sharenfs; it controls whether or not the filesystem is NFS exported, and what options it's exported with. One of the advantages of having ZFS manage this for you through the sharenfs property is that ZFS will automatically share and unshare things as the ZFS pool and filesystem are available or not available; you don't have to try to coordinate the state of your NFS shares and your ZFS filesystem mounts.

As I write this, the current ZFS on Linux zfs manpage says this about sharenfs:

Controls whether the file system is shared via NFS, and what options are to be used. [...] If the property is set to on, the dataset is shared using the default options:

sec=sys,rw,crossmnt,no_subtree_check,no_root_squash

See exports(5) for the meaning of the default options. Otherwise, the exportfs(8) command is invoked with options equivalent to the contents of this property.

That's very interesting wording. It's also kind of a lie, because ZFS on Linux caught itself in a compatibility bear trap (or so I assume).

This wording is essentially the same as the wording in Illumos (and in the original Solaris manpages). On Solaris, the sharenfs property is passed more or less straight to share_nfs as the NFS share options in its -o argument, and as a result what you put in sharenfs is just those options. This makes sense; the original Solaris version of ZFS was not created to be portable to other Unixes, so it made no attempt to have its sharenfs (or sharesmb) be Unix-independent. It was part of Solaris, so what went into sharenfs was Solaris NFS share options, including obscure ones.

It would have been natural for ZFS on Linux to take the same attitude towards what went into sharenfs on Linux, and indeed the current wording of the manpage sort of implies that this is what's happening and that you can simply use what you'd put in exports(5). Unfortunately, this is not the case. Instead, ZFS on Linux attempts to interpret your sharenfs setting as Illumos NFS share options and tries to convert them to equivalent Linux options.

(I assume that this was done to make it theoretically easier to move pools and filesystems between ZoL and Illumos/Solaris ZFS, because the sharenfs property would mean the same thing and be interpreted the same way on both systems. Moving filesystems back and forth is not as crazy as it sounds, given zfs send and zfs receive.)

There are two problems with this. The first is that the conversion process doesn't handle all of the Illumos NFS share options. Some it will completely reject or fail on (they're just totally unsupported), while others it will accept but produce incorrect conversions that don't work. The set of accepted and properly handled conversions is not documented and is unlikely to ever be. The second problem is that Linux can do things with NFS share options that Illumos doesn't support (the reverse is true too, but less directly relevant). Since ZFS on Linux provides you no way to directly set Linux share options, you can't use these Linux specific NFS share options at all through sharenfs.

Effectively what the current ZFS on Linux approach does is that it restricts you to an undocumented subset of the Illumos NFS share options that are supported by Linux and correctly converted by ZoL. If you're doing anything at all sophisticated with your NFS sharing options (as we are), this means that using sharenfs on Linux is simply not an option. We're going to have to roll our own NFS share option handling and management system, which is a bit irritating.

(We're also going to have to make sure that we block or exclude sharenfs properties from being transferred from our OmniOS fileservers to our ZoL fileservers during 'zfs send | zfs receive' copies, which is a problem that hadn't occurred to me until I wrote this entry.)

PS: There is an open ZFS on Linux issue to fix the documentation; it includes mentions of some mis-parsing sharenfs bugs. I may even have the time and energy to contribute a patch at some point.

PPS: Probably what we should do is embed our Linux NFS share options as a ZFS filesystem user property. This would at least allow our future management system to scan the current ZFS filesystems to see what the active NFS shares and share options should be, as opposed to having to also consult and trust some additional source of information for that.
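As a rough idea of what that scan could look like, here is a Python sketch. The property name org.example:nfs-exports is purely hypothetical (ZFS user property names just need to contain a colon), as are the helper names. It assumes the property value holds the Linux client list verbatim and that the listing comes from 'zfs get -H -o name,value', which prints tab-separated fields and shows '-' for filesystems where the property isn't set.

```python
# Hypothetical user property name; any name with a ':' in it would do.
PROP = "org.example:nfs-exports"


def zfs_set_cmd(filesystem, exports_opts):
    """Build the 'zfs set' command that records the share options."""
    return ["zfs", "set", f"{PROP}={exports_opts}", filesystem]


def zfs_get_cmd():
    """Build the command that lists the property for all filesystems."""
    return ["zfs", "get", "-H", "-o", "name,value", "-t", "filesystem", PROP]


def parse_zfs_get(output):
    """Parse 'zfs get -H -o name,value' output into {filesystem: options}."""
    shares = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        name, value = line.split("\t", 1)
        # '-' means the property is not set on this filesystem.
        if value != "-":
            shares[name] = value
    return shares


sample = "tank/fs1\t@A(rw) @D(ro)\ntank/fs2\t-\n"
print(parse_zfs_get(sample))
```

A management system would still have to turn the result into actual exports (and cope with inherited values), but the filesystems themselves would be the source of truth.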

ZFSOnLinuxSharenfsProblem written at 01:03:29
