Our ZFS spares handling system for ZFS on Linux
When we ran Solaris fileservers and then OmniOS fileservers, we ended up building our own system for replacing failed disks with spares, which I wrote about years ago in part 1, 2, 3, and 4. When we migrated to our current generation of Linux based ZFS fileservers, much of our local software from OmniOS migrated over almost completely unchanged. This included (and includes) our ZFS spares system, which remains mostly unchanged from the Solaris and OmniOS era (both in how it operates and in the actual code involved).
The first important aspect of our spares system is that it is still state driven, not event driven. Rather than trying to hook into ZED to catch and handle events, our spares driver program operates by inspecting the state of all of our pools and attempting to start any disk replacement that's necessary (and possible). We do use ZED to immediately run the spares driver in response to both ZED vdev state change events (which can be a disk failing) and pool resilvers finishing (because a resilver finishing can let us start more disk replacements). We also run the spares driver periodically from cron as a backup to ZED; even if ZED isn't running or misses events for some reason, we will eventually notice problems.
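As a sketch of the ZED side of this (with a hypothetical ZEDLET name and driver path; our real ones differ), a single ZEDLET can simply re-run the state-driven driver on the two event types that matter:

```shell
# Hypothetical ZEDLET (say, all-spares.sh) that re-runs a state-driven
# spares driver when ZED dispatches a relevant event. The driver path is
# an assumption; the driver itself inspects all pools, so the ZEDLET
# doesn't need to pass it any event details.
SPARES_DRIVER="${SPARES_DRIVER:-/local/sbin/zfs-spares-driver}"

# ZED exports ZEVENT_SUBCLASS for every event it dispatches.
relevant_event() {
    case "$1" in
        statechange|resilver_finish) return 0 ;;
        *)                           return 1 ;;
    esac
}

if relevant_event "${ZEVENT_SUBCLASS:-}" && [ -x "$SPARES_DRIVER" ]; then
    "$SPARES_DRIVER"
fi
```

The cron backup is then just a periodic entry running the same driver directly (the interval is up to you); since the driver is state-driven, running it redundantly is harmless.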
Our Solaris and OmniOS fileservers used iSCSI, so we had to carefully maintain a list of what iSCSI disks were potential spares for each fileserver (a fileserver couldn't necessarily use any iSCSI disk visible to it). Since our Linux fileservers only have local disks, we could get rid of these lists; the spares driver can now use any free disks it sees and its knowledge of available spares is always up to date.
(As before, these 'disks' are actually fixed size partitions on our SSDs, with four partitions per SSD. We are so immersed in our world that we habitually call these 'disks' even though they aren't.)
As in the iSCSI world, we don't pick replacement disks randomly; instead there is a preference system. Our fileservers have half their disks on SATA and half on SAS, and our regular mirrored pairs use the same partition from matching disks (so the first partition on the first SATA disk is in a mirror vdev with the first partition on the first SAS disk). Spare replacement tries to pick a replacement disk partition on the same type of disk (SATA or SAS) as the dead disk; if it can't find one, it falls back to 'any free partition' (which can happen if we use up almost all of the available space on a fileserver, which has already happened on one).
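The preference logic itself is simple enough to sketch in a few lines of shell; the function and the 'bus:partition' candidate encoding here are illustrative inventions, not our actual code:

```shell
# Pick a replacement partition for a dead disk: prefer a free partition
# on the same kind of disk (sata or sas), falling back to any free
# partition. Candidates are passed as "bus:partition" words, eg "sata:sdc3".
pick_spare() {
    want="$1"; shift
    fallback=""
    for cand in "$@"; do
        bus="${cand%%:*}"
        part="${cand#*:}"
        if [ "$bus" = "$want" ]; then
            printf '%s\n' "$part"
            return 0
        fi
        # Remember the first free partition of the wrong type, just in case.
        [ -n "$fallback" ] || fallback="$part"
    done
    # No same-bus spare was free; take anything (output is empty if
    # nothing at all is free).
    [ -n "$fallback" ] && printf '%s\n' "$fallback"
}
```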
In the past, with HDs over iSCSI, we had to carefully limit the number of resilvers that we did at once in order to not overwhelm the system; our normal limit was replacing only one 'disk' (a partition) at a time. Our experience with local SSDs is that this is no longer really a problem, so now we will replace up to four failed partitions at once, which normally means that if a SSD fails we immediately start resilvers for everything that was on it. This has made a certain amount of old load limiting code in the spares driver basically pointless, but we haven't bothered to remove it.
For inspecting the state of ZFS pools, we continue to rely on our local C program to read out ZFS pool state. It ported from OmniOS to ZFS on Linux with almost no changes, although getting it to compile on Ubuntu 18.04 was a bit of a pain because of how Ubuntu packages ZFS there. It's possible that ZFS on Linux now has official APIs that would provide this information, but our existing code works now so I haven't had any interest in investigating the current state of any official API for ZFS pool information.
Linux PAM leads to terrible error messages from things like
Here is a puzzle for you. Suppose that you're trying to change your password on a typical Linux system, as happens periodically (and as we make new logins on our systems do immediately), and you get the following:
; passwd
Changing password for user cks.
Current password:
passwd: Authentication token manipulation error
What has gone wrong here? What should you do to fix it? Should you try again, or instead send email to your system administrators to get them to fix it?
Well, you don't know, because
passwd and Linux's implementation
of PAM have combined to
create a terrible error message through robot logic, where the error message is completely technically
logical and correct but useless in practice. The most likely cause
of this message is that you've mis-typed your current password,
but there are other possible causes if things have gone wrong in
the depths of PAM. The only people who can start to disentangle
this are your system administrators, or in general anyone who can
look at PAM's logs (normally in syslog), because only there will
you find extremely valuable and much more precise messages like:
passwd: pam_unix(passwd:chauthtok): authentication failure; logname= uid=19 euid=0 tty=pts/6 ruser= rhost= user=cks
Even this isn't really clear, but with sufficient painful experience
you can decode this to mean that the passwd command was verifying your
password through traditional Unix
/etc/shadow encrypted passwords,
and the password you typed didn't 'authenticate', ie didn't match
the encrypted password.
One of the reasons this is a terrible error message is because normal people have essentially no chance at all of understanding it (as I can assure you from our experience of supporting the people who use our systems). The best you can do is use a wrapper script that puts a big explanatory message around the whole thing, and even then people get confused.
(And if other things go wrong and the same message gets printed out, you're really confusing people; you've claimed that the problem is that they're using the wrong password, except they know that they're not. At least they'll probably email the system administrators at that point.)
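For the system administrator's side of this, the real diagnosis lives in syslog. A minimal sketch, assuming conventional log locations (Debian and Ubuntu use /var/log/auth.log, the Red Hat family /var/log/secure; your syslog configuration may differ):

```shell
# Show the last few PAM password-change failures from a given log file.
# pam_unix logs these with a 'passwd:chauthtok' tag, as in the message
# quoted above.
pam_chauthtok_failures() {
    grep 'passwd:chauthtok' "$1" | tail -n 5
}

# eg: pam_chauthtok_failures /var/log/auth.log
```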
I'm not sure if the PAM API provides any way for PAM modules such
as pam_unix to provide a more specific error message. This
particular error message is the generic PAM error string for
PAM_AUTHTOK_ERR, which is the equally generic PAM error code
that pam_unix is forced to return in this situation. You can
see the full list in the pam(3) manpage.
Keeping backup ZFS on Linux kernel modules around
I'm a long term user of ZFS on Linux and over pretty much all of the time I've used it, I've built it from the latest development version. Generally this means I update my ZoL build at the same time as I update my Fedora kernel, since a ZoL update requires a kernel reboot anyway. This is a little bit daring, of course, although the ZoL development version has generally been quite solid (and this way I get the latest features and improvements long before I otherwise would).
One of the things I do to make it less alarming is that I always keep backup copies of previous versions of ZFS on Linux, in the form of copies of the RPMs I install and update. Naturally I keep these backup copies in a non-ZFS filesystem, because I need to be able to get at them even if the new version of ZFS isn't working (possibly just with the new kernel, possibly in general). I haven't needed these backup copies very often, but on the rare occasions when I've had to revert, I was very glad that they were there.
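The mechanics are nothing more than copying the RPMs aside into a per-version directory; a minimal sketch, with /var/local/zol-rpms as a hypothetical non-ZFS destination:

```shell
# Save a set of RPMs under a per-version directory on a non-ZFS
# filesystem, so they stay reachable even if ZFS itself won't come up.
backup_rpms() {
    destroot="$1"; version="$2"; shift 2
    mkdir -p "$destroot/$version" || return 1
    cp -p "$@" "$destroot/$version/"
}

# eg: backup_rpms /var/local/zol-rpms 0.8.4 zfs-*.rpm kmod-zfs-*.rpm
```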
(You don't always run into immediate failures to bring ZFS up; sometimes there are merely stability or other issues in a new development change, and you want to roll back to a previous one. In those cases it's okay to have the previous versions on a ZFS filesystem, because you can probably use ZFS enough to grab them.)
Not everyone uses development versions of ZFS on Linux, but I suggest that you keep backup copies of older versions even if you only use released ZoL versions. You never know when you may run into an issue and be glad that you have options.
(That I keep backup copies of previous versions and want to have them accessible outside of ZFS is one reason that I doubt I'll ever use ZFS on Linux on my root filesystem. System recovery is much easier in many scenarios if ZFS isn't required to at least boot the system, get it on the network, or access the root filesystem from a live CD.)
One obvious requirement here is that you should never update ZFS pool or filesystem features until you're absolutely sure that you'll never want to revert to a ZoL version that's too old to support those features. This generally makes me quite conservative about updating pool features; I want them to be in a ZoL release that's been out long enough to be considered fully stable.
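One way to see where a pool stands before deciding anything is to look at its feature flags; a sketch, with 'tank' as a stand-in pool name:

```shell
# List feature flags and their states from 'zpool get all <pool>' output.
# 'zpool get' columns are NAME PROPERTY VALUE SOURCE; feature flags show
# up as properties named feature@<something>.
list_features() {
    awk '$2 ~ /^feature@/ { print $2, $3 }'
}

# usage: zpool get all tank | list_features
# 'zpool upgrade' with no arguments also lists pools whose features are
# not all enabled, without changing anything.
```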
In praise of ZFS On Linux's ZED 'ZFS Event Daemon'
I've written before (here) about how our current Linux ZFS fileservers work much like our old OmniOS fileservers. However, not everything is quite the same between ZFS on Linux and traditional Solaris/OmniOS ZFS. One of the most welcome differences for us is ZED, the ZFS Event Daemon. What ZED does that is so great is that it provides a very simple way to take action when ZFS events happen.
When a ZFS event happens, ZED looks through a directory (generally
/etc/zfs/zed.d) to find scripts (or programs) that should be run
in response to the event. Each script is run with a bunch of
environment variables set to describe what's going on, and it can
use those environment variables to figure out what the event is.
ZED decides what things to run based on their names; generally you
wind up with script names like
all-cslab.sh (which is run on
all events) and
resilver_finish-cslab.sh (which is run when a pool resilver finishes).
Because these are just a collection of individual files, you're free to add your own without colliding with or having to alter the standard 'ZEDLETs' provided by ZFS on Linux. Your additions can do anything you want them to, ranging from the simple to the complex. For instance, our simplest ZEDLET simply syslogs all of the ZED environment variables:
PATH=/usr/bin:/usr/sbin:/bin:/sbin:$PATH
export PATH

if [ "$ZEVENT_SUBCLASS" = "history_event" ]; then
    exit 0
fi

unset ZEVENT_TIME
unset ZEVENT_TIME_STRING
printenv | fgrep 'ZEVENT_' | sort | fmt -999 | logger -p daemon.info -t 'cslab-zevents'
exit 0
(There's a standard 'all-syslog.sh' ZEDLET, but it doesn't syslog all of the information in the zevents. Capturing all of the information is especially useful if you want to write additional ZEDLETs and aren't quite sure what they should look for or what environment variables have useful information.)
It can take a bit of time and experimentation to sort out what ZFS events are generated (and with what information available) in response to various things happening to and in your ZFS pools. But once you have figured it out, ZED gives you a way to trigger and drive all sorts of system management activities. These can be active (like taking action if devices fail) or passive (like adding markers in your metrics system or performance dashboards for when ZFS scrubs or resilvers start and end, so you can correlate this with other things happening).
Coming from Solaris and OmniOS, where there was no such simple system for reacting to things happening in your ZFS pools, ZED was a breath of fresh air for us. More than anything else, it feels like how ZFS events should have been handled from the start, so that system administrators could flexibly meet their own local needs rather than having to accept whatever the Solaris Fault Management system wanted to give them.
PS: Because ZFS on Linux is now OpenZFS, I believe that ZED will probably eventually show up in FreeBSD (if it isn't already there). Perhaps it will even some day be ported back to Illumos.
Linux desktop application autostarting is different from systemd user units
When I wrote about how applications autostart on modern Linux
desktops, there was a Reddit discussion,
and one of the people there noted that things could also be autostarted
through systemd user units. As covered in the Arch Wiki, systems
that are systemd based generally automatically start a
'--user' systemd instance for you, and one of the things this
instance will do is start the units in
~/.config/systemd/user, which you can manipulate.
However, there are some significant differences between the two that help explain why Linux desktops don't use systemd user units. The big one is that systemd user units are per-user, not per-session. By their nature, desktop applications are a per session thing and so not a great fit for a per-user system. In fact even getting systemd user units to be able to talk to your desktop session takes what is basically a hack, as covered in the Arch wiki section on DISPLAY and XAUTHORITY, and this hack must be carefully timed so that it works correctly (it has to happen before units that need to talk to your desktop are started, and that means they have to be terminated when you log out).
Desktops also have a lot more fine control over what gets started with their current mechanisms. Obviously these things only get started for desktop sessions, not things like SSH logins, and they can be specific to certain desktops or not start in some desktops. I don't believe there is a native systemd unit option for 'run only if this environment variable is defined', so you can't readily make a systemd unit that only runs in desktop sessions, never mind only a particular sort of desktop.
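For contrast, here is what the systemd user unit approach looks like; this is a hypothetical unit with 'myapp' as a stand-in name, not a recommendation:

```ini
# Hypothetical ~/.config/systemd/user/myapp.service
[Unit]
Description=Autostart sketch for some desktop program

[Service]
ExecStart=/usr/bin/myapp

[Install]
# default.target is reached when your per-user instance starts,
# regardless of whether you have a desktop session at all.
WantedBy=default.target
```

You would enable it with 'systemctl --user enable myapp.service', and the DISPLAY and XAUTHORITY hack mentioned above generally boils down to running 'systemctl --user import-environment DISPLAY XAUTHORITY' from your session startup, at the right time.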
(Relying to any significant degree on user units would also more strongly tie desktops to systemd, although I don't know if that's something they worry about these days or if it's full steam ahead on systemd in general.)
My general impression is that systemd user .service units are not a good fit for what most users want and do with autostarting things today, whether or not they're using a desktop. Systemd user units are probably a better fit for socket and dbus units, because those are more naturally activated on the fly as needed, but I don't know if people are doing this very much (especially for desktop related things).
(As a practical matter, I'd consider it pretty obnoxious if a program decided to set itself to autostart as a systemd user unit. I suspect I'm not alone in this.)
Ubuntu, building current versions of Firefox, and snaps
Today on Twitter, I said:
Given that Ubuntu's ostensible logic for putting Chrome in a snap is 'it makes maintaining it easier', my cynical side expects Ubuntu to also do this with Firefox before too long (due to Firefox's need for steadily increasing versions of Rust).
Ubuntu ships current versions of Firefox (at the moment Firefox 78) in Ubuntu LTS releases, which means that they must build current versions of Firefox on all supported Ubuntu LTS versions. Firefox is built partly with Rust (among other things), and new releases of Firefox often require relatively recent versions of Rust; for instance, right now Firefox Nightly (which will become Firefox 80 or 81) requires Rust 1.43.0 or better. Nor is Rust the only thing that Firefox has minimum version requirements for. Firefox 78, the current release, requires nasm 2.14 or better if you want to build the AV1 codecs, and I'm sure there are others I just haven't tripped over yet.
This is a problem for Ubuntu because Ubuntu famously doesn't like
updating packages on Ubuntu LTS (or probably any Ubuntu release,
but I only have experience with LTS releases). Today, the need to
build current Firefox versions on old Ubuntu LTS releases means
that Ubuntu 16.04 has been dragged up to Rust 1.41.0 (the same Rust
version that's on 18.04 and 20.04). If current versions of Rust
weren't required to build Firefox, Rust on 16.04 would probably be
a lot like Go, where the default is version 1.6 (that's the
package version) and the most recent available one is Go 1.10 (which
actually dates from 2018, which is modern for an LTS release from
2016). When Firefox 80 or so is released and requires Rust 1.43.0
or better, Ubuntu will have to update Rust again on all of the still
supported LTS versions, which will probably still include 16.04 at that point.
Canonical can't like this. At the same time, they have to ship Firefox and they have to keep it current, for security reasons. Shipping Firefox as a Snap would deal with both problems, because Canonical would no longer need to be able to build the current Firefox from source on every supported Ubuntu release (LTS and otherwise, but the oldest ones are generally LTS releases). Given that Canonical wants to shove everyone into Snaps in general, I rather expect that they're going to do this to Firefox sooner or later.
PS: I'm not looking forward to this, because Snaps don't work with NFS mounted home directories or in our environment in general. Ubuntu moving Firefox to a Snap would probably cause us to use the official Mozilla precompiled binaries in the short term, and push us more toward another Linux release in the longer term (probably Debian).
Some thoughts on Fedora moving to btrfs as the default desktop file system
The news of the time interval for me is that there is a Fedora change proposal to make btrfs the default file system for Fedora desktop (via, itself via; see also the mailing list post). Given that in the past I've been a btrfs sceptic (eg, from 2015), long time readers might expect me to have some views here. However, this time around my views are cautiously optimistic for btrfs (and Fedora), although I will only be watching from a safe distance.
The first two things to note are that 2015 is a long time ago (in computer time) and I'm too out of touch with btrfs to have an informed opinion on its current state. I'm confident that people in Fedora wouldn't have proposed this change if there weren't good reasons to believe that btrfs is up to the task. The current btrfs status looks pretty good on a skim, although the section on device replacement leaves me a little alarmed. The Fedora proposal also covers who else is using btrfs and has been for some time, and it's a solid list that suggests btrfs is not going to explode for Fedora users.
I'm a big proponent of modern filesystems with data and metadata checksums, so I like that aspect of btrfs. As far as performance goes, most people on desktops are unlikely to notice the difference, and as a long term user of ZFS on Linux I can testify how nice it is to not have to preallocate space to specific filesystems (even if with LVM you can grow them later).
However, I do feel that this is Fedora being a bit adventurous. This is in line with Fedora's goals and general stance of being a relatively fearless leading edge distribution, but at the same time sometimes the leading edge is also the bleeding edge. I would not personally install a new Fedora machine with btrfs in the first few releases of Fedora that defaulted to it, because I expect that there will be teething problems. Some of these may be in btrfs, but others will be in system management programs and practices that don't cope with btrfs or conflict with it.
In the long run I think that this change to btrfs will be good for Fedora and for Linux as a whole. Ext4 is a perfectly decent filesystem (and software RAID works fine), but it's possible to do much better, as ZFS has demonstrated for a long time.