Fetching really new Fedora packages with Bodhi
Normal Fedora updates that have been fully released are available
through the regular
updates repository, which is (or should be)
already configured into
dnf on your Fedora system. More recent
(and less well tested) updates are available through the updates-testing
repository, which you can selectively enable in order to see if
what you're looking for is there. Right now I'm interested in Rust
1.28, because it's now required to build the latest Firefox from
source:
  # dnf --enablerepo=updates-testing check-update 'rust*'
  Last metadata expiration check: 0:00:56 ago on Fri 10 Aug 2018 02:12:32 PM EDT.
  #
However sometimes, as in this case and past ones, any update that actually exists is too new to have even made it into the updates-testing DNF repo. Fedora does its packaging through Fedora Bodhi, and as part of this, packages can be built and available in Bodhi even before they're pushed to updates-testing. So if you want the very freshest bits, you want to check in Bodhi.
There are two ways to check Bodhi; through the command line using
bodhi client (which comes from the bodhi-client package), or
through the website. Perhaps
I should use the client all the time, but I tend to reach for the
website as my first check. The URL for a specific package on the
website is of the form:

  https://bodhi.fedoraproject.org/updates/?packages=<package>
For example, https://bodhi.fedoraproject.org/updates/?packages=rust is the URL for Rust (and there's a RSS feed if you care a lot about a particular package). For casual use, it's probably easier to just search from Bodhi's main page.
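If you find yourself checking several packages, the URL construction is simple enough to script. A minimal sketch using only the standard library (the base URL is the one shown above; everything else here is my own):

```python
# Sketch: build the Bodhi web UI query URL for a given source package.
# The base URL comes from the example above; urlencode handles quoting.
from urllib.parse import urlencode

def bodhi_package_url(package):
    """Return the Bodhi updates page URL for a source package."""
    return "https://bodhi.fedoraproject.org/updates/?" + urlencode({"packages": package})

print(bodhi_package_url("rust"))
# -> https://bodhi.fedoraproject.org/updates/?packages=rust
```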
Through the command line, checking for and downloading an update looks like this:
  ; bodhi updates query --packages rust --releases f28 --status pending
  =============================
  [...]
  rust-1.28.0-2.fc28
  =============================
  [...]
  Update ID: FEDORA-2018-42024244f2
  [...]
  Notes: New versions of Rust and related tools -- see the release notes
       : for [1.28](https://blog.rust-lang.org/2018/08/02/Rust-1.28.html).
  Submitter: jistone
  Submitted: 2018-08-10 14:35:56
  [...]
We insist on the
pending status because that cuts the listing
down and normally gives us only one package, where we get to see
detailed information about it; I believe that there's normally
only one package in pending status for a particular Fedora release.
If there are multiple, you get a less helpful summary listing
that gives you only the full package name instead of the update
ID. If you can't get the update ID through
bodhi, you can always
get it through the website by clicking on the link to the
specific package version on the package's page.
To fetch all of the binary RPMs for an update:
  ; cd /tmp/scratch
  ; bodhi updates download --updateid FEDORA-2018-42024244f2
  [...]
  ; cd /tmp/scratch
  ; bodhi updates download --builds rust-1.28.0-2.fc28
  [...]
Both versions of the
bodhi command download things to the current
directory, which is why I change to a scratch directory first. Then
you can do '
dnf update /tmp/scratch/*.rpm'. If the resulting
packages work and you feel like it, you can leave feedback on the
Bodhi page for the package, which may help get it released into the
updates-testing repo and then eventually the updates repo.
(In theory you can leave feedback through the
bodhi command too,
but it requires more setup and I think offers somewhat fewer options
than the website.)
As far as I've seen, installing RPMs this way will cause things to
remember that you installed them by hand, even when they later
become available through the
updates-testing or the updates
repo. This is probably not important to you.
(I decided I wanted an actual entry on this process that I can find easily later, instead of having to hunt around for my postscript in this entry the next time I need it.)
Systemd's DynamicUser feature is (currently) dangerous
Yesterday I described how timesyncd couldn't be restarted on one of our Ubuntu 18.04 machines, where the specific thing that caused the failure was timesyncd attempting to access /var/lib/private/systemd/timesync and failing because /var/lib/private is only accessible by root, not the UID that timesyncd was running as. My diagnostic efforts left me puzzled as to how this was supposed to work at all, but Trent Lloyd (@lathiat) pointed me to the answer, which is in Lennart Poettering's article Dynamic Users with systemd, which introduces the overall system, explains the role of /var/lib/private, and covers how timesyncd is supposed to get access through an inaccessible directory. I'll quote the explanation for that:
[Access through /var/lib/private] is achieved by invoking the service process in a slightly modified mount name-space: it will see most of the file hierarchy the same way as everything else on the system ([...]), except for
/var/lib/private, which is over-mounted with a read-only
tmpfs file system instance, with a slightly more liberal access mode permitting the service read access. [...]
Since timesyncd is not able to get access through /var/lib/private,
you might guess that something has gone wrong in the process of
setting up this slightly modified mount namespace. Indeed this
turned out to be the case. The machine that this happened on is an
NFS client and (as is usual) its UID 0 is mapped to an unprivileged
UID on our fileservers. On this
machine there were some FUSE mounts in the home directories of users
who have their
$HOME not world readable (our default
permissions are owner-only, to avoid accidents). When systemd was
setting up the 'slightly modified mount name-space' it attempted
to access these FUSE mounts as part of binding them into the
namespace, but it failed because UID 0 had no permissions to look
inside user home directories.
This failure caused systemd to give up attempting to set up the
namespace. However, systemd did not abort unit activation or even
log an error message. Instead it continued on to try to start
timesyncd without this special namespace, despite the fact that
timesyncd uses both DynamicUser and
StateDirectory, and so
starting it normally was essentially guaranteed to fail.
(Although my initial case was dangling FUSE mounts, it soon developed that any FUSE mounts would do it, for example a sshfs or smbfs mount in a user's NFS mounted home directory when the home directory isn't world-accessible.)
Systemd's failure to handle errors in setting up the namespace here
has been raised as systemd issue 9835. However, merely
logging an error or aborting the unit activation would not actually
fix the core problem; it would merely let you see exactly why your
timesyncd or whatever service is failing to start. The core problem
is that systemd's current design for DynamicUser
blows up if systemd and UID 0 don't have full access to every mount
that's visible on the system.
(Well, DynamicUser plus StateDirectory, but the idea seems to be that pretty much every service using dynamic users will have a systemd managed state directory.)
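For reference, the combination in question looks something like the following in a unit file. This is a generic sketch, not timesyncd's actual unit; the service name and binary path are invented:

```ini
# Hypothetical unit fragment showing the DynamicUser + StateDirectory
# combination discussed above. 'myserviced' is an invented name.
[Service]
ExecStart=/usr/sbin/myserviced
# Allocate a transient UID/GID at start time instead of a static system user.
DynamicUser=yes
# systemd creates /var/lib/myserviced for the service; with DynamicUser
# this really lives under /var/lib/private and is reached through the
# modified mount namespace described in the quoted explanation.
StateDirectory=myserviced
```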
In my opinion, this makes using
DynamicUser surprisingly dangerous.
A systemd service that is set to use it can't be reliably started
or restarted on all systems; it only works on some systems, some
of the time (but those happen to be the common case). If there's
ever a problem setting up the special namespace that each such
service requires, things fail. Machines that are NFS clients are
the obvious case, since the client's UID 0 often has limited
privileges, but I believe that there are likely to be others.
(And of course services can be restarted for random and somewhat unpredictable reasons, such as package updates or other services being restarted. You should not assume that you can always control these circumstances, or completely predict the state of the system when they happen.)
A timesyncd total failure and systemd's complete lack of debugability
Last November, I wrote an entry about how we were switching to using systemd's timesyncd on our Ubuntu machines. Ubuntu 18.04 defaults to using timesyncd just as 16.04 does, and when we set up our standard Ubuntu 18.04 environment we stuck with that default behavior (although we customize the list of NTP servers). Then today I discovered that timesyncd had silently died on one of our 18.04 servers back on July 20th, and worse it couldn't be restarted.
Specifically, it reported:
  systemd-timesyncd: Failed to create state directory: Permission denied
The state directory it's complaining about is /var/lib/systemd/timesync,
which is actually a symlink to /var/lib/private/systemd/timesync
(at least on systems that are in good order; if the symlink has had
something happen to it, you can apparently get other errors from
timesyncd). I had a clever informed theory about what was wrong
with things, but it turns out
strace says I'm wrong.
(To my surprise, doing '
strace -f -p 1' on this system did not
produce either explosions or an impossibly large amount of output.
This would have been a very different thing on a system that was
actually in use; this is basically an almost idle server being used
as part of our testing of 18.04 before we upgrade our production
servers to it.)
According to strace, what is failing is timesyncd's attempts to
access /var/lib/private/systemd/timesync as its special UID (and
GID) 'systemd-timesync'. This is failing for the prosaic reason
that /var/lib/private is owner-only and owned by root. Since this
works on all of our other Ubuntu 18.04 machines, presumably the
actual failure is somewhere else.
The real problem here is that it is impossible to diagnose or debug
this situation. Simply to get this far I had to read the systemd
source code (to find the code in timesyncd that printed this specific
error message) and then search through 25,000 lines of
output. And I still don't know what the problem is or how to fix
it. I'm not even confident that rebooting the server will change
anything, especially when all the relevant pieces on this server
seem to be just the same as the pieces on other, working servers.
(I do know that according to logs this failure started happening immediately after the systemd package was upgraded and re-executed itself. On the other hand, the systemd upgrade also happened on other Ubuntu 18.04 machines, and they didn't have their timesyncds explode.)
Since systemd has no clear diagnostic information here, I spent a great deal of time chasing the red herring that if you look at /var/lib/private/systemd/timesync on such a failing system, it will be owned by a numeric UID and GID, while on working systems it will be the magically special login and group 'systemd-timesync'. This is systemd's 'dynamic user' facility in action, combined with systemd itself creating the /var/lib/private/systemd/timesync directory (with the right login and group) before exec'ing the timesyncd binary. When timesyncd fails to start, systemd removes the login and group but leaves the directory behind, now not owned by any existing login or group.
(You might think that the 'failed to create state directory' error
message would mean that timesyncd was the one actually creating the
state directory, but strace says otherwise; the systemd code run
before the final exec() does, while the new process that will become
timesyncd is still in systemd's code. timesyncd's code does try to
create the directory, but presumably the internal systemd functions
it's using are fine if the directory is already there with the right
ownership and so on.)
I am rather unhappy about this situation, and I am even unhappier that there is effectively nothing that we can do about any aspect of it except to stop using timesyncd (which is now something that I will be arguing for, especially since this server drifted more than half a second out of synchronization before I found this issue entirely by coincidence). Reporting a bug to either systemd or to Ubuntu is hopeless (systemd will tell me to reproduce on the latest version, Ubuntu will ignore it as always). This is simply what happens when the systemd developers produce a design and an implementation that doesn't explain how it actually works and doesn't contain any real support for field diagnosis. Once again we get to return to the era of 'reboot the server, maybe that will fix it'. Given systemd's general current attitude, I don't expect this to change any time soon. Adding documentation of systemd's internals and diagnosis probes would be admitting that the internals can have bugs, problems, and issues, and that's just not supposed to happen.
PS: The extra stupid thing about the whole situation is that the only thing /var/lib/systemd/timesync is used for is to hold a zero-length file whose timestamp is used to track the last time the clock was synchronized, and non-root users can't even see this file on Ubuntu 18.04.
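The way a zero-length stamp file like that works can be sketched in a few lines: the file's modification time is the only data. This is an illustration of the idea, not systemd's actual code, and the example deliberately uses a scratch file instead of /var/lib/systemd/timesync:

```python
# Sketch: track 'last synchronized' time with a zero-length stamp file
# whose mtime is the data. Illustration only, not systemd's code.
import os
import tempfile
import time

def touch_clock_stamp(path):
    """Update the stamp file's mtime to now, creating it if needed."""
    open(path, "a").close()      # create empty file if absent
    os.utime(path, None)         # set mtime (and atime) to now

def seconds_since_sync(path):
    """Return how many seconds ago the stamp was last touched."""
    return time.time() - os.stat(path).st_mtime

# Example with a scratch file standing in for the real stamp file:
stamp = os.path.join(tempfile.mkdtemp(), "clock")
touch_clock_stamp(stamp)
print(os.path.getsize(stamp), round(seconds_since_sync(stamp)))
```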
Update: I've identified the cause of this problem, which is
described in my new entry on how systemd's DynamicUser feature
is dangerous. The short version is
that systemd silently failed to set up a custom namespace that would
have given timesyncd access to /var/lib/private because it could
not deal with FUSE mounts in NFS mounted user home directories that
were not world-accessible.
/dev/disk/by-path names for disks change over time
I have in the past written about the many names of SATA disks and on the names of SAS drives,
and in both cases one of the sorts of names I talked about was the
/dev/disk/by-path names. Unlike the various other names of disks,
which are generally kernel based, these names come from the inscrutable
depths of udev. It will
probably not surprise you to hear that udev periodically changes its
mind about what to call things (or, sometimes, has problems figuring
out names at all).
Due to our new fileserver hardware, I can give you two examples
of how this has changed, one for SAS devices and one for SATA ones.
First, for SATA disks that are directly attached to SAS ports, udev
now provides disk names that use the SAS PHY number instead of the
nominal 'SAS address', resulting in names like
pci-0000:19:00.0-sas-phy2-lun-0. There is still a
/sys/block/sdX/device/sas_address file, I believe with the same
contents as before; it's just that udev now uses the PHY number.
This is convenient for us, since SAS PHY numbers seem to be the best
way of identifying the physical disk slot on our hardware. Udev's
SAS PHY numbers start from 0.
For SATA disks that are directly attached to SATA ports, udev now
uses names that directly refer to the
ataN names of the drives
(at least for drives that aren't behind port multipliers; udev probably still mangles the
names of SATA disks behind port multipliers). This gives you names
like pci-0000:00:17.0-ata-2. Much like the kernel, udev's ATA
numbers start from one, and they're relative to the controller; our
new systems have both
(This switch may be partly due to ATA numbers now appearing in sysfs, as very helpfully noted by Georg Sauthoff in a comment from last year on my old entry. This sysfs change happened sometime between CentOS 6's kernel (some version of 2.6.32) and CentOS 7's kernel (some version of 3.10.0).)
Notice that udev is not necessarily consistent with itself in naming standards. Directly connected SATA disks use 'ata-N', with a dash between the fixed name and the number, while SAS disks use 'phyN', with no dash. I suspect that different people write the code for different sorts of devices, and do whatever they feel is the best option.
(I believe that all of these names are hard-coded in udev itself, not set up through udev rules.)
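If you need to pull these names apart programmatically, the two forms shown above are regular enough to parse. A sketch, with the caveat that the patterns are inferred from the example names and udev emits other forms (for instance for port multipliers) that this does not handle:

```python
# Parse the two /dev/disk/by-path forms shown above into components.
# The regexes are inferred from the example names; udev emits other
# forms too (e.g. for port multipliers), which this doesn't cover.
import re

BY_PATH = re.compile(
    r"pci-(?P<pci>[0-9a-f:.]+)-"
    r"(?:sas-phy(?P<phy>\d+)-lun-(?P<lun>\d+)|ata-(?P<ata>\d+))"
)

def parse_by_path(name):
    """Return the recognized components of a by-path name, or None."""
    m = BY_PATH.fullmatch(name)
    if not m:
        return None
    return {k: v for k, v in m.groupdict().items() if v is not None}

print(parse_by_path("pci-0000:19:00.0-sas-phy2-lun-0"))
# {'pci': '0000:19:00.0', 'phy': '2', 'lun': '0'}
print(parse_by_path("pci-0000:00:17.0-ata-2"))
# {'pci': '0000:00:17.0', 'ata': '2'}
```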
Generally any competently run Linux distribution is not going to
allow /dev/disk/by-path names to change over the lifetime
of any particular release of the distribution. They may well change
from release to release, though, especially for major jumps (for
example, between Ubuntu LTS releases). This is a potential issue
if you have things that use these names and rely on them staying
constant. One possible case is ZFS on Linux,
especially given how it handles disk names;
however, the usual recommendation for ZoL is to use /dev/disk/by-id
names, which should really be stable over the long term.
(I don't know if they actually have been, although my ZoL pools haven't suffered any explosions due to this over the several years and fair number of Fedora releases that I've been running them.)
PS: To my surprise, none of our Ubuntu 14.04 systems even have a
/dev/disk/by-path directory. I suspect that this is some 14.04
peculiarity, since CentOS 6 is even older and does have a
/dev/disk/by-path directory, and this old entry
says that at least some of our 12.04 systems also had it. We don't
normally use any of the /dev/disk/by-* directories on our regular
Ubuntu servers, which is probably why I didn't notice before now.
Ubuntu 18.04's problem with Amanda's amrecover
If you use Amanda to back up your machines (as we do), and you have
just added some Ubuntu 18.04 LTS machines to your fleet and installed
amanda-client Ubuntu package to get the necessary client
programs, you may some day fire up
amrecover on one of them to
restore some of those backups. Well, to attempt to restore those
backups:

  # amrecover -s <server> -t <server> -C <s_config>
  AMRECOVER Version 3.5.1. Contacting server on <server> ...
  [request failed: amrecover: error [exec /usr/lib/amanda/ambind: No such file or directory]]
Our Amanda servers are running Ubuntu 16.04 LTS, with Amanda 3.3.6.
Given this error message (and also the fact that amrecover
takes several seconds to produce it), we concluded that the 3.5.1
amrecover now requires the Amanda server to have this new ambind
program (which only appeared in 3.5). This seemed about par for the
course for Ubuntu in 18.04, given issues like libreadline6.
This turns out not to be the case (to my disgusted surprise). Despite
how the error message looks, it's the Amanda client (the 18.04
machine) that needs
ambind, not the server; amrecover
is trying to directly execute
ambind and failing because indeed
ambind's not there. The reason that it's not there is that Ubuntu
puts ambind into the amanda-server package instead of either
amanda-client (which would be appropriate if it's only needed by
clients) or amanda-common (if it's also needed by Amanda
server programs). You probably haven't installed the amanda-server
package on your Amanda client machines because, really, why would
you?
The good news is that this is easily fixed. Just install amanda-server
as well as amanda-client on all of your Ubuntu 18.04 Amanda
clients, and everything should be fine. As far as I can tell,
installing the server package doesn't do anything dangerous like
enable services; it just adds some more programs and manpages.
This packaging issue appears to be inherited from Debian, where the
current 'buster (testing)'
packages of 3.5.1 also put
ambind in the amanda-server package.
However, Debian testing is the rolling 'latest development state'
of Debian, not shipped as an official LTS release the way Ubuntu
18.04 is.
PS: This is a terrible error message from amrecover
under the circumstances.
should always make it completely unambiguous about when you're
reporting a local error compared to when you're just relaying an
error from the server. If there is any chance of confusion in your
error messages, you're doing it wrong.
Sidebar: How I worked this out (mostly grim determination and flailing)
We thought we had a workaround in the form of hacking up the Ubuntu
16.04 Amanda 3.3.6 packages and installing them on 18.04, but then
we started to run into various troublesome issues and I decided to
see if there was some way of turning off this 'invoke
the server' behavior with an Amanda configuration setting or
amrecover command line option. So I went off to look at just what
was happening on the Amanda server.
I started by looking at the Amanda server logs. Well, trying to
look at them, because there was absolutely nothing being logged
about this (which is unusual, the Amanda server stuff is usually
quite verbose). My next step was to get out the big hammer and run
'strace -f -e trace=file -o /tmp/st-out -p <xinetd's PID>' on the
Amanda server while I invoked
amrecover on the client. This too
was completely empty, so I spent a while wondering if there was
some security setting that was making the
strace not work.
Interspersed with trying to trace the server's actions I was also
reading through the Amanda source code
to try to follow the control flow that sent a message from the
client to invoke
ambind on the server. The problem was that I
couldn't really find anything that looked like this; the only use
of ambind I could see seemed entirely inside one file, not part of
the back and forth exchange with the Amanda server code that I'd
expect. However, I could find something that looked a lot like the
'error [exec ...]' error message that was ultimately being printed
out. All of this led me to run strace on amrecover itself, and lo
and behold there was the smoking gun:
  20360 execve("/usr/lib/amanda/ambind", ["/usr/lib/amanda/ambind", "5"], 0x7ffe980502e8 /* 113 vars */) = -1 ENOENT (No such file or directory)
Then it was just a matter of using packages.ubuntu.com to verify that
ambind was in the
amanda-server package and some testing to verify that installing
it on an 18.04 test machine appeared to make
amrecover happy again.
The problem with some non-HiDPI aware applications (is that they're very small)
One problem with a HiDPI monitor is the occasional application that absolutely doesn't upscale. For example, the Java that I need to access the KVM-over-IP console of this locked up NFS fileserver.
Our current fileservers are old enough Supermicro machines that their onboard IPMIs only support KVM-over-IP through a Java Web Start application. Today, I needed to use it from home, and I was only a little bit surprised when the resulting virtual console was, well, tiny on my new home HiDPI display.
(I was a bit surprised at how visually tiny it came out, but then I keep being surprised at how small 'half their normal size' has actually been on my new display. And the virtual console wasn't really a giant window to start with on my non-HiDPI work displays, at least in the basic 'text' VGA mode.)
Some modern applications are HiDPI aware, and others at least provide settings for the fonts and font sizes that they use. It's possible that Supermicro's Java program has settings for this (I was in a hurry so I didn't look, although here's Arch's Java information), but I have a sneaking suspicion that it doesn't. For applications like this, the end result is tiny, hard to read or use application windows, either permanently or until I can find how to adjust the application (which may not be worth it if I only use the app occasionally). Since I'm likely to run into this periodically, I should work out a decent general solution to it someday.
In the bright future of Wayland, it will presumably be theoretically possible to have your Wayland compositor automatically scale windows according to your desires, so old non-HiDPI X applications being run in some sort of compatibility mode can just be zoomed up however much you want (since all of this is OpenGL based, and my understanding is that OpenGL has good support for that kind of thing). In the current reality of X Windows (at least, it's my current reality and hopefully future reality), I need a different solution. To date, I know of two.
The easiest option and one that's probably already available, even in basic X environments like mine, is a screen magnifier such as KMag. KMag has the moderate inconvenience that it can't be told to magnify a given (X) window, although you can awkwardly set it to magnify an area of the screen instead of wherever your mouse cursor is. Since I already have KMag installed, this is probably my default choice.
Other than that, the Arch wiki has a section on unsupported apps which
led me to run_scaled,
a shell script that uses xpra to relatively
transparently run programs with forced scaling. Run_scaled is
a pretty big hammer and it has some drawbacks, partly because your
program is running in a separate X server. I could probably make it
work for Java Web Start stuff with some effort, but it's more awkward
than just resorting to KMag; I'd need to get my browser to run my
javaws cover script instead of the real javaws.
(Fedora is already using its alternatives system to pick who gets
/usr/bin/javaws, so in theory I could just set my script
up as that and then pass things off to the IcedTea version.)
(I initially thought of playing tricks with Xephyr plus
xrandr for scaling
the 'display' inside Xephyr, but the more I think about it the less
sense that approach makes. I think I'd be better off using xpra.)
How we're handling NFS exports for our ZFS on Linux systems
If you have ZFS based NFS fileservers you're normally supposed to
handle setting up NFS export/sharing permissions through ZFS by
setting and updating the
sharenfs property on ZFS filesystems.
ZFS then worries about keeping the system's NFS export permissions
in sync with what (ZFS) filesystems you have mounted, where you
have them mounted, and what their
sharenfs settings are. There
are all sort of convenient aspects of this and it's what we've done
for years on our current fileservers.
Unfortunately this is not an option for us in ZFS on Linux. I sort
of covered why in my entry on ZFS on Linux's sharenfs problem, but I didn't
mention the core issue for us, which is that ZoL's handling of
sharenfs has no support for the Illumos '
root=' option to provide
root access to the NFS filesystem for only certain systems (instead
of all of them).
In that entry I speculated that we'd
embed our NFS export options as a ZFS user property on ZFS filesystems.
This is sort of the intellectually pure option, but we've decided
to take another way. We're going to be managing our NFS export
permissions entirely outside of ZFS, but reusing '
sanfs', our existing
local filesystem management program.
Sanfs's job is to set up and operate filesystems according to our local policies and specifications; it handles things like filesystem quotas and reservations, knows whether the filesystem should be visible on our deliberately restricted web server, and so on. Since its configuration file is the central point that knows about all of our NFS-visible filesystems, we also use it to automatically generate the NFS mount list for our local automounter. The sanfs configuration file is where we specify (Illumos) NFS export options, including any special additions for particular filesystems:
  # Global default
  shareopts    nosuid,sec=sys,rw=nfs_ssh,root=nfs_root

  # Individual filesystems
  fs3    /h/281    fs3-staff-01    rw+=cks_dev
(The += syntax is something that I'm unreasonably happy about;
it exactly captures the change we almost always want to make to our
NFS export options.)
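My reading of the += merge can be sketched in a few lines. This models my description of the behavior; the real sanfs is a local, unpublished tool whose exact semantics may differ:

```python
# Sketch of merging per-filesystem 'opt+=value' additions into global
# share options, as described above. The real (local, unpublished)
# sanfs tool may differ in details.

def parse_opts(s):
    """'nosuid,sec=sys,rw=a:b' -> {'nosuid': '', 'sec': 'sys', 'rw': 'a:b'}"""
    out = {}
    for piece in s.split(","):
        k, _, v = piece.partition("=")
        out[k] = v
    return out

def merge_opts(global_opts, overrides):
    """Apply 'rw+=cks_dev' style additions to the global option string."""
    opts = parse_opts(global_opts)
    for piece in overrides.split(","):
        if "+=" in piece:
            # += appends another member to a colon-separated list.
            k, v = piece.split("+=", 1)
            opts[k] = (opts[k] + ":" + v) if opts.get(k) else v
        else:
            # A plain setting replaces the global value outright.
            k, _, v = piece.partition("=")
            opts[k] = v
    return ",".join(k if v == "" else k + "=" + v for k, v in opts.items())

print(merge_opts("nosuid,sec=sys,rw=nfs_ssh,root=nfs_root", "rw+=cks_dev"))
# nosuid,sec=sys,rw=nfs_ssh:cks_dev,root=nfs_root
```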
In the Linux ZoL world, the Linux version of
sanfs will still use
the same configuration file and the same format for NFS export
permissions, but instead of just setting the
sharenfs ZFS property
with the final calculated share options, it will convert them over
to Linux NFS export permissions (using some local knowledge and
the general equivalences) and then directly
manage Linux NFS export permissions using an auxiliary script. This
script does two things. First, it writes or updates a per-filesystem
file in /etc/exports.d that records the current permissions, and
then it pokes
exportfs to update the actual live permissions to
reflect their new state. Among other reasons, recording the state
of things in
/etc/exports.d makes our NFS export permissions
automatically persist over reboots.
(All our NFS exports will use the
mountpoint option, so they're
not active until and unless the ZFS filesystem is mounted.)
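The auxiliary script's two steps can be sketched as follows. This is a hypothetical Python version, not our actual (local) script; the file naming and the example exports line are illustrative:

```python
# Sketch of the two-step update: record a per-filesystem file under
# /etc/exports.d (files there must end in '.exports'), then re-sync
# the live export table with 'exportfs -r'. Illustrative only.
import os
import subprocess
import tempfile

def write_export(exports_dir, mountpoint, clients_opts, apply=True):
    """Record one filesystem's export line and (optionally) apply it."""
    name = mountpoint.strip("/").replace("/", "-") + ".exports"
    line = "%s\t%s\n" % (mountpoint, " ".join(clients_opts))
    path = os.path.join(exports_dir, name)
    with open(path, "w") as f:
        f.write(line)
    if apply:
        # exportfs -r re-reads /etc/exports and /etc/exports.d/*.exports
        # and updates the kernel's live export table to match.
        subprocess.run(["exportfs", "-r"], check=True)
    return path

# Example, writing to a scratch directory and skipping exportfs:
d = tempfile.mkdtemp()
p = write_export(d, "/h/281", ["@nfs_ssh(rw,mountpoint)"], apply=False)
print(open(p).read(), end="")
# /h/281	@nfs_ssh(rw,mountpoint)
```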
One significant part of what makes this work is that we never
actually use any of the convenient things that ZFS's handling of
sharenfs gives you. We always export ZFS filesystems individually,
we never move them around, and we don't export and import pools (at
least not without explicitly unmounting things on clients, for
good reasons). Without a SAN we definitely
can't ever move a pool between physical machines without a lot of
intervention. Basically, once pools and filesystems are created,
they stay there more or less forever.
PS: The Linux version of
sanfs and the current Illumos version
will in fact literally be using the same configuration file, since
we're inevitably going to be operating both at once for a while.
We have terabytes of data to move across in a couple hundred
filesystems, and that's not exactly going to happen fast, especially
when we haven't even finished developing the Linux fileservers.
Linux's NFS exports permissions model compared to Illumos's
As part of our move from OmniOS based fileservers to ones based on Linux, I've recently been looking into how to map our current NFS export permissions into Linux's NFS export permissions. As part of this I've been looking into the similarities and differences between the Linux model of NFS export permissions and the Illumos one. The end results you can get are mostly similar (with one difference that may matter for us someday), but Linux gets there in a significantly different way.
To simplify a bit, in Illumos you have permissions that apply to
things, such as netgroups. If a host would match multiple things,
whichever read or read/write permission is listed first takes
priority (more or less). If you write '
rw=...,ro=...', rw permissions
take priority for any host in both. In Linux, this is inverted; you
have things (aka NFS clients), such as netgroups, that have
permissions and other options specified for them. If a host would
match multiple netgroups, the first matching one wins and specifies
all of the host's permissions and options. This can duplicate the
Illumos read versus read/write behavior but it gives you more
flexibility in general. However, it's more verbose if you have
several netgroups.
To see this extra verbosity, consider an Illumos share of
'rw=A:B:C,ro=D:E', where all of these are netgroups. In
Linux, you turn this inside out and wind up writing:
  @A(rw,...) @B(rw,...) @C(rw,...) @D(ro,...) @E(ro,...)
As far as I know Linux has no way to specify 'any of these N netgroups' in a single match, so you have to have a separate entry for each netgroup. If you do this a lot you presumably create yourself a superset netgroup, but that doesn't necessarily scale if you're doing this on an ad-hoc basis with various different shares, as we are.
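Turning the Illumos form inside out this way is mechanical enough to script. A sketch, where the extra per-client options are left as a parameter because the real ones are site-specific:

```python
# Sketch: expand an Illumos 'rw=A:B:C,ro=D:E' netgroup share into the
# per-client Linux exports form shown above. 'extra' stands in for
# whatever additional Linux options (sec=sys, no_subtree_check, ...)
# your site uses.

def illumos_to_linux(share, extra=""):
    clients = []
    for piece in share.split(","):
        perm, _, groups = piece.partition("=")
        if perm not in ("rw", "ro"):
            continue  # other Illumos share options need separate handling
        opts = ",".join(o for o in (perm, extra) if o)
        clients.extend("@%s(%s)" % (g, opts) for g in groups.split(":"))
    return " ".join(clients)

print(illumos_to_linux("rw=A:B:C,ro=D:E"))
# @A(rw) @B(rw) @C(rw) @D(ro) @E(ro)
```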
The one place where Illumos and Linux are different in an important
way for us is remapping or not remapping UID 0. Illumos supports a
'root=' option, where hosts specified in it don't remap UID 0,
and this is applied to them separately from whether they have read
or read/write permissions. In Linux, UID 0's mapping (or lack of
it) is part of a NFS client's options, and so it must be specified
together with whether the client has read or read/write permission.
This makes it impossible to translate some Illumos sharenfs settings
without changing your netgroups and makes translating others require
local knowledge (for example, of what netgroups are a subset of
what other ones).
(Linux is more flexible here in some ways, but you have to want to map UID 0 to different UIDs for different clients.)
We're fortunately not doing anything tricky with our Illumos
permissions; the machines that we give root access to are always a
subset of the machines that we give read/write access to. With this
local knowledge in hand, it's easy (but verbose) to automatically
translate Illumos ZFS
sharenfs settings to equivalent Linux ones,
although we can't manage them through ZFS on Linux's sharenfs support.
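Given the local knowledge that our root-access machines are always a subset of our read/write machines, the translation can be sketched like this (a hypothetical helper, with example netgroup names; Linux's no_root_squash is the option that disables UID 0 remapping, per exports(5)):

```python
# Sketch: translate rw=/ro=/root= into Linux exports clients, relying
# on the local invariant that every root= netgroup also appears in
# rw=. Expects share strings made purely of key=value pieces.

def translate(share):
    fields = dict(p.split("=", 1) for p in share.split(","))
    root = set(fields.get("root", "").split(":")) - {""}
    out = []
    for perm in ("rw", "ro"):
        for g in filter(None, fields.get(perm, "").split(":")):
            # no_root_squash turns off UID 0 remapping for this client.
            opts = [perm] + (["no_root_squash"] if g in root else [])
            out.append("@%s(%s)" % (g, ",".join(opts)))
    return " ".join(out)

print(translate("rw=nfs_ssh:cks_dev,root=cks_dev"))
# @nfs_ssh(rw) @cks_dev(rw,no_root_squash)
```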
PS: The Linux NFS(v3) server doesn't support the sort of general UID and GID remapping that Illumos does; it only remaps UID 0. This fortunately doesn't matter for us in general, although it's very slightly inconvenient for me.
PPS: For NFS exporting ZFS filesystems specifically, you probably
want to include the Linux
crossmnt share option because, if I'm
reading the tea leaves correctly, it allows NFS clients to have
access to the filesystem's
.zfs/snapshot pseudo-directory of ZFS
snapshots, which are independent sub-filesystems. This is automatic
in Illumos.
Sometimes it actually is a Linux kernel bug
For my sins, I use a number of third party 'out of kernel' kernel
modules in my Fedora kernels, especially on my office workstation. I don't use a binary GPU driver, but there's the
latest git tip of ZFS on Linux, VMWare's kernel modules, and an out of tree it87
module in order to support my
motherboard's sensors (for as long as that keeps working).
Usually this works fine. Usually. For the past few days, my office
machine has been panicking during our nightly backups when Amanda
runs a big
tar over some of my filesystems.
There's a long standing saying in programming that 'it's never a compiler bug'. I have a similar rule of thumb about kernel panics; given that I use a number of third party modules, especially the VMWare modules, any kernel panics I run into are caused by them.
(I mean, apart from the system lockups, which are AMD's fault, and the amdgpu problems, which were a graphics driver issue. It's a rule of thumb, and it's mostly true about core kernel code. Linux kernel driver code is a little bit more likely to have bugs.)
So I assumed that the cause of my sudden panics was probably ZFS on Linux (an assumption helped along when I accidentally ran my machine without the VMWare modules and it still panicked). After some diagnostic work, I reduced things down to a belief of 'ZoL and the latest Fedora kernels don't like each other', went to report an issue, and found ZoL issue #7723 and thus Fedora #1598462. Which led to my tweet:
Today I learned that Fedora 27 and 28 kernels after 4.17.3 are known to panic under high IO load. Better late than never, but I could have used that knowledge before upgrading the office machine to 4.17.5.
So, yeah. To my surprise, this actually is a (general) Linux kernel bug, not any of the third party modules, as it happens. This feels like the equivalent of finding a genuine compiler bug.
(I can be pretty sure that I'm hitting the same bug, because I have a basic netconsole (also) setup, and my panic messages match the bug report's. They also run through ZFS functions, which didn't help my initial suspicions.)
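For reference, a basic netconsole setup of this sort is just a kernel module parameter; the interface name, IP addresses, and MAC address below are all made up:

```
# e.g. in /etc/modprobe.d/netconsole.conf:
options netconsole netconsole=6666@192.168.1.10/eno1,6666@192.168.1.2/00:11:22:33:44:55
```

(The format is srcport@srcip/interface,dstport@dstip/dstmac; the receiving machine just needs something listening for the UDP packets, even netcat.)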
PS: What made this more peculiar is that I've been running the Fedora 27 4.17.5 kernel at home without problems. But then, I don't have good home backups and I don't think I've done anything else recently to stress the home machine's IO. I should revert to 4.17.3 anyway.
PPS: 4.17.7 kernels are in Fedora Bodhi but apparently not yet in the
updates-testing DNF repo for Fedora 27. It looks like the most
convenient way to get things from Bodhi is with the
bodhi client program. I'm using it like so:
bodhi updates query --packages kernel --releases f27 --status pending
bodhi updates download --builds kernel-4.17.7-100.fc27
(You could leave out the '
--status pending', but if you do you
get a big list of past updates that aren't very interesting. If I'm
fetching something from Bodhi it's because I can't get it anywhere
else, so it's probably an update that's so new that it's not even
in the updates-testing repo.)
ZFS on Linux's
sharenfs problem (of what you can and can't put in it)
ZFS has an idea of 'properties' for both pools and filesystems. To quote
from the Illumos zfs manpage:
Properties are divided into two types, native properties and user-defined (or "user") properties. Native properties either export internal statistics or control ZFS behavior. [...]
Filesystem properties are used to control things like whether
compression is on, where a ZFS filesystem is mounted, if it is
read-only or not, and so on. One of those properties is called
sharenfs; it controls whether or not the filesystem is NFS exported,
and what options it's exported with. One of the advantages of having
ZFS manage this for you through the
sharenfs property is that ZFS
will automatically share and unshare things as the ZFS pool and
filesystem are available or not available; you don't have to try
to coordinate the state of your NFS shares and your ZFS filesystems
yourself.
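As a sketch of what this looks like in practice (the pool and filesystem names are made up):

```
# have ZFS manage the NFS export for this filesystem:
zfs set sharenfs='rw=@nfsgrp' tank/homes
# and to stop exporting it:
zfs set sharenfs=off tank/homes
```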
As I write this, the current ZFS on Linux zfs(8) manpage
says this about sharenfs:
Controls whether the file system is shared via NFS, and what options are to be used. [...] If the property is set to on, the dataset is shared using the default options:
See exports(5) for the meaning of the default options. Otherwise, the exportfs(8) command is invoked with options equivalent to the contents of this property.
That's very interesting wording. It's also kind of a lie, because ZFS on Linux caught itself in a compatibility bear trap (or so I assume).
This wording is essentially the same as the wording in Illumos (and
in the original Solaris manpages). On Solaris, the sharenfs
property is passed more or less straight to
share_nfs as the NFS share options in
its -o argument, and as a result what you put in sharenfs is
just those options. This makes sense; the original Solaris version
of ZFS was not created to be portable to other Unixes, so it made
no attempt to have its
sharenfs (and sharesmb) settings be Unix-independent.
It was part of Solaris, so what went into
sharenfs was Solaris
NFS share options, including obscure ones.
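So on Solaris/Illumos, setting sharenfs is more or less shorthand for running share with the same option string (the pool and netgroup names here are made up):

```
zfs set sharenfs='rw=@nfsgrp,root=@sysadmins' tank/fs
# is roughly equivalent to:
share -F nfs -o rw=@nfsgrp,root=@sysadmins /tank/fs
```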
It would have been natural for ZFS on Linux to take the same attitude
towards what goes into
sharenfs on Linux, and indeed the current
wording of the manpage sort of implies that this is what's happening
and that you can simply use what you'd put in exports(5). However,
this is not the case. Instead, ZFS on Linux attempts to interpret
your sharenfs setting as Illumos NFS share options and tries to
convert them to equivalent Linux options.
(I assume that this was done to make it theoretically easier to
move pools and filesystems between ZoL and Illumos/Solaris ZFS,
since the sharenfs property would mean the same thing and be
interpreted the same way on both systems. Moving filesystems back
and forth is not as crazy as it sounds, given
zfs send and
zfs receive.)
There are two problems with this. The first is that the conversion
process doesn't handle all of the Illumos NFS share options. Some
it will completely reject or fail on (they're just totally unsupported),
while others it will accept but produce incorrect conversions that
don't work. The set of accepted and properly handled conversions
is not documented and is unlikely to ever be. The second problem
is that Linux can do things with NFS share options that Illumos
doesn't support (the reverse is true too, but less directly relevant).
Since ZFS on Linux provides you no way to directly set Linux share
options, you can't use these Linux specific NFS share options at
all.
Effectively what the current ZFS on Linux approach does is that it
restricts you to the undocumented subset of the Illumos NFS share
options that are supported by Linux and correctly converted by ZoL. If
you're doing anything at all sophisticated with your NFS sharing
options (as we are), this means that using
sharenfs on Linux is
simply not an option. We're going to have to roll our own NFS share
option handling and management system, which is a bit irritating.
(We're also going to have to make sure that we block or exclude
sharenfs properties from being transferred from our OmniOS
fileservers to our ZoL fileservers in
'zfs send | zfs receive' copies, which is a problem that
hadn't occurred to me until I wrote this entry.)
PS: There is an open ZFS on Linux issue to fix the
documentation; it includes mentions of some mis-parsing
bugs. I may even have the time and energy to contribute a patch at
some point.
PPS: Probably what we should do is embed our Linux NFS share options as a ZFS filesystem user property. This would at least allow our future management system to scan the current ZFS filesystems to see what the active NFS shares and share options should be, as opposed to having to also consult and trust some additional source of information for that.
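As a sketch of how that could work: the property name ('cs:nfsoptions') is made up (ZFS user properties just need a colon in them), and in real life the input would come from something like 'zfs get -H -o name,value -r cs:nfsoptions tank', whose output is tab-separated, with '-' meaning the property isn't set:

```python
# Sketch: build exports(5) lines from a hypothetical ZFS user property
# holding each filesystem's Linux NFS share options. The input string
# mimics 'zfs get -H -o name,value' output; the mountpoint mapping
# would in practice come from 'zfs get mountpoint'.

def exports_from_zfs_get(zfs_get_output, mountpoints):
    lines = []
    for line in zfs_get_output.splitlines():
        fs, value = line.split("\t", 1)
        if value == "-":          # property not set on this filesystem
            continue
        lines.append("%s %s" % (mountpoints[fs], value))
    return "\n".join(lines)

sample = "tank/homes\t@nfsgrp(rw,root_squash)\ntank/scratch\t-"
print(exports_from_zfs_get(sample, {"tank/homes": "/h", "tank/scratch": "/s"}))
```

This keeps the desired NFS share state with the filesystem itself, so it survives 'zfs send | zfs receive' and pool imports without a separate database.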