I need to remember to check for ZFS filesystems being mounted
Over on the Fediverse I said something:
I keep re-learning the ZFS lesson that you want to check not only for the mount point of ZFS filesystems but also that they're actually mounted, since ZFS can easily have un-mounted datasets due to eg replication in progress.
We have a variety of management scripts on our fileservers that do things on 'all ZFS
filesystems on this fileserver' or 'a specific ZFS filesystem if
it's hosted on this fileserver'. Generally they get their list of
ZFS filesystems and their locations by looking at the 'mountpoint'
property (we set an explicit mount location for all of our ZFS
filesystems, instead of using the default locations). Most of the
time this works fine, but every so often one of the scripts has
blown up and we've quietly fixed it to do better.
The problem is that ZFS filesystems can be visible in things like
'zfs list' and have a 'mountpoint' property without actually being
mounted. Most of the time all ZFS filesystems with a mountpoint
will actually be mounted, so most of the time the simpler version
works. However, every so often we're moving a filesystem around
with 'zfs send' and 'zfs receive', and either an initial
replication of the filesystem sits unmounted on its new home, or
the old version of the now migrated filesystem sits unmounted on
its old fileserver, retained for a while as a safety measure.
It's not hard to fix our scripts, but we have to find them (and
then remember not to make this mistake again when we write new
scripts). This time around I did do a sweep over all of our scripts
looking for use of 'zfs list' and the 'mountpoint' property and
so on, and didn't find anything where we (now) weren't also checking
the 'mounted' property. Hopefully it will stay that way, now that
I've written this entry to remind myself.
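
A minimal sketch of this kind of check, in Python. It is not taken from any real script; the function names are invented for illustration, and it assumes the usual tab-separated output of 'zfs list -H'. The point is simply to ask for the 'mounted' property along with 'mountpoint' and to skip anything that isn't actually mounted.

```python
import subprocess


def parse_zfs_list(output):
    """Parse 'zfs list -H -o name,mountpoint,mounted' output.

    Returns a list of (name, mountpoint) tuples for filesystems that
    are actually mounted, skipping ones that merely have a mountpoint.
    """
    mounted = []
    for line in output.splitlines():
        if not line.strip():
            continue
        name, mountpoint, is_mounted = line.split("\t")
        # 'mounted' is "yes" only if the filesystem is really mounted;
        # a received-but-unmounted replica reports "no".
        if is_mounted != "yes":
            continue
        # Skip datasets whose mountpoint property is not a usable path.
        if mountpoint in ("none", "legacy", "-"):
            continue
        mounted.append((name, mountpoint))
    return mounted


def mounted_filesystems():
    """Run 'zfs list' and return the filesystems that are really mounted."""
    result = subprocess.run(
        ["zfs", "list", "-H", "-t", "filesystem",
         "-o", "name,mountpoint,mounted"],
        check=True, capture_output=True, text=True,
    )
    return parse_zfs_list(result.stdout)
```

Keeping the parsing separate from running 'zfs list' makes it easy to test the filtering against canned output, including a line for an unmounted replica.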
Sidebar: Two reasons other filesystems mostly don't have this problem
The obvious reason that other filesystems mostly don't have this
problem is that they sort of don't have a state where they're present
with a mount point assigned but not actually mounted. The less
obvious reason is that most filesystems don't have a separate tool
to list them; instead you look at the output of 'mount' or some
other way of looking at what filesystems are mounted, and that
obviously excludes filesystems that aren't. You can do the same
with ZFS, but using 'zfs list' and so on is often more natural.
(With other filesystems, the rough equivalent is to have a filesystem
listed in /etc/fstab that's not currently mounted. If you get
your list of filesystems from fstab, you'll see the same sort of issue.
Of course in practice you mostly don't look at fstab, since it doesn't
reflect the live state of the system. Things in fstab may be unmounted,
and things not in fstab may be mounted.)
We do see ZFS checksum failures, but only infrequently
One of the questions hovering behind ZFS is how often, in practice, you actually see data corruption issues that are caught by checksums and other measures, especially on modern solid state disks. On our old OmniOS and iSCSI fileserver environment we saw somewhat regular ZFS checksum failures, but that environment had a lot of moving parts, ranging from iSCSI through spinning rust. Our current fileserver environment uses local SSDs, and initially it seemed we were simply not experiencing checksum failures any more. Over time, though, we have experienced some (well, some not associated with SSDs that failed completely minutes later).
Because there's no in-pool persistent count of errors, I have to extract this information from our worklog reports of clearing checksum errors, which means that I may well have missed some. Our current fileserver infrastructure has been running since around September of 2018, so many pools are now coming up on three and a half years old.
- In early 2019, a SSD experienced an escalating series of checksum
failures over multiple days that eventually caused ZFS to fault the
disk out. We replaced the SSD. No I/O errors were ever reported for
the disk.
- In mid 2019, a SSD with no I/O errors had a single checksum failure
found in a scrub, which might have come from a NAND block failing and
being reallocated (based on SMART data). The disk is still in service
as far as I can tell, with no other problems.
- At the end of August 2019, an otherwise problem-free SSD had one
checksum error found in a scrub. Again, SMART data suggests it
may have been some sort of NAND block failure that resulted in a
reallocation. The disk is still in service with no other problems.
- In mid 2021, a SSD reported six checksum errors during a scrub. As
in the other scrub cases, SMART data suggests there was a NAND block
failure and reallocation, and the disk didn't report any I/O errors.
The disk is still in service with no other problems.
(We also had a SSD report a genuine read failure at the end of 2019. ZFS repaired 128 KB and the pool scrubbed fine afterward.)
So we've seen three incidents of checksum failures (two of which were only for a single ZFS block) on disks that have otherwise been completely fine, and one case where checksum failures were an early warning of disk failures. We started out with six fileservers, each with 16 ZFS data disks, and added a seventh fileserver later (none of these SSD checksum reports are from the newest fileserver). Conservatively, this means that our three or four incidents are across 96 disks.
(At the same time, this means four out of 96 or so SSDs had a checksum problem at some point, which is about a 4% rate.)
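
As a rough worked version of that arithmetic, here is a small Python sketch. It assumes that each of the four incidents was on a different disk, and that all 96 disks have been in service for the full three and a half years (ignoring replacements), so the per-disk-year figure is only a ballpark.

```python
def incident_rates(incidents, disks, years):
    """Return (fraction of disks affected, incidents per 100 disk-years)."""
    per_disk = incidents / disks
    per_100_disk_years = 100 * incidents / (disks * years)
    return per_disk, per_100_disk_years


# Figures from the entry: four incidents, 96 disks, roughly 3.5 years.
per_disk, per_100dy = incident_rates(incidents=4, disks=96, years=3.5)
print(f"{per_disk:.1%} of disks, {per_100dy:.2f} incidents per 100 disk-years")
```

Under those assumptions this works out to a bit over 4% of disks and a bit over one incident per 100 disk-years.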
We have actually had a number of SSD failures on these fileservers. I'm not going to try to count how many, but I'm pretty certain that there have been more than four. This means that in our fileserver environment, SSDs seem to fail outright more often than they experience checksum failures. Having written this entry, I'm actually surprised by how infrequent checksum failures seem to be.
(I'm not going to try to count SSD failures, because that too would require going back through worklog messages.)
I wish ZFS pools kept a persistent count of various errors
Famously, ZFS pools will report a count of read, write, and checksum errors on the level of the pool, vdevs, and individual devices, counts that are persistent over reboots (and thus pool exports and imports). Equally famously, ZFS expects you to clear these counts when (and if) you resolve problems; for example, if you want to see if you have a persistent checksum problem or a one-time thing, you'll normally clear the error count and re-scrub the pool. This makes these error counts a (persistent) count of recent errors, not a persistent count of errors over the lifetime of the pool, vdev, or device.
What I've come to wish for over the time we've been running our ZFS fileservers is just such a persistent count (as well as persistent information about how many total bytes have been healed or found with unfixable errors). For long term management, it's nice to know this sort of slowly accumulating information. You can look at it at any point to see if something stands out, and you can capture it in metrics systems to track growth over time. Without it, you're relying on fallible human memory (or equally fallible human tracking) to notice that your checksum errors are increasing on this sort of disk, or on disks this old, or keep showing up once every few months on this hardware, and other things like that.
(ZFS has pool history, but even 'zpool history -i' seems to not
include fixable errors under at least some circumstances.)
In my view, the ideal implementation would have persistent error counts on all levels, from the pool down to individual devices. The individual device count would be dropped when a device was removed from the pool (through replacement, for example), but the pool persistent count would live as long as the pool did and the vdev persistent count would, in most cases, be just as long-lived. Since ZFS pool, vdev, and device data is relatively free-form (it's effectively a key-value store), it wouldn't be too hard to add this to ZFS as a new pool feature.
Today, of course, you can do this through good record keeping and perhaps good logging. On Linux at least, the ZFS event daemon provides you with an opportunity to write a persistent log of all ZFS disk errors to some place. On Illumos, probably syseventadm can be made to do the same thing. Detailed logging can also give you more information than counts do; for example, you can see if the problems reoccur in the same spot on the disk, or if they move around.
(Of course 'the same spot on the disk' is not terribly meaningful these days, especially on solid state disks.)
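
As a sketch of what such record keeping could look like on Linux, here is a small Python logger and tally. It assumes it is run as a ZED script (a 'zedlet') and that ZED passes the event details in ZEVENT_* environment variables; the specific variable names used for the tally (ZEVENT_POOL, ZEVENT_VDEV_PATH, and ZEVENT_SUBCLASS) and the function names are assumptions that should be checked against your version of ZED.

```python
import json
from collections import Counter


def record_event(environ, log_path):
    """Append one event's ZEVENT_* variables to a JSON-lines log file."""
    event = {k: v for k, v in environ.items() if k.startswith("ZEVENT_")}
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(event, sort_keys=True) + "\n")
    return event


def tally_errors(log_path):
    """Count logged events per (pool, vdev path, event subclass)."""
    counts = Counter()
    with open(log_path, encoding="utf-8") as log:
        for line in log:
            event = json.loads(line)
            # These key names are assumed, not verified against ZED.
            key = (
                event.get("ZEVENT_POOL", "?"),
                event.get("ZEVENT_VDEV_PATH", "?"),
                event.get("ZEVENT_SUBCLASS", "?"),
            )
            counts[key] += 1
    return counts
```

A zedlet would call something like record_event(os.environ, LOGFILE) with a log file of your choice, and a separate reporting script would call tally_errors() on it. Because the log keeps every ZEVENT_* variable, it can also answer more detailed questions than the counts can.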
ZFS performance and modern solid state disk systems
A certain amount of ZFS’s nominal performance issues are because ZFS does more random IOs (and from more drives) than other filesystems do. A lot of the stories about these performance issues date from the days when hard drives were dominant, with their very low IOPS figures. I don’t think anyone has done real performance studies in these days of SSDs and especially NVMe drives, but naively I would expect the relative ZFS performance to be much better these days since random IO no longer hurts so much.
There are two aspects of this. First, there are obvious areas of ZFS performance where it was limited by IOPS and bandwidth, such as deduplication and RAIDZ read speeds. Modern NVMe drives have very high values for both, high enough to absorb a lot of reads, and even SATA and SAS SSDs may be fast enough for many purposes. However, there are real uncertainties over things like what latency SSDs may have for isolated reads, so someone would want to test and measure real world performance. For deduplication, it's difficult to get a truly realistic test without actually trying to use it for real, which has an obvious chicken and egg problem.
(ZFS RAIDZ also has other unappealing aspects, like the difficult story around growing a raidz vdev.)
Second and more broadly, there is the question of what does 'good performance' mean on modern solid state disks and how much performance most people can use and care about. If ZFS has good (enough) performance on modern solid state disks, exactly how big the numbers are compared to other alternatives doesn't necessarily matter as much as other ZFS features. Related to this is the question of how does ZFS generally perform on modern solid state disks, especially without extensive tuning, and how far do you have to push programs in order for ZFS to be the performance limit.
(There is an interesting issue for NVMe read performance on Linux, although much of the discussion dates from 2019.)
Of course, possibly people have tested and measured modern ZFS on modern solid state disk setups (SSD or NVMe) and have posted that somewhere. On today's Internet, it's sadly hard to discover this sort of thing through search engines. While we've done some poking at ZFS performance on mirrored SATA SSDs, I don't think we have trustworthy numbers, partly because our primary interest was in performance over NFS on our fileservers, and we definitely observed a number of differences between local and NFS performance.
(My personal hope is that ZFS can saturate a modern SATA SSD in a simple single disk pool configuration (or a mirrored one). I suspect that ZFS can't drive NVMe drives at full speed or as fast as other filesystems can manage, but I hope that it's at least competitive for sequential and random IO. I wouldn't be surprised if ZFS compression reduced overall read speeds on NVMe drives for compressed data.)
Unfortunately, damaged ZFS filesystems can be more or less unrepairable
An unfortunate piece of ZFS news of the time interval is that Ubuntu 21.10 shipped with a serious ZFS bug that created corrupted ZFS filesystems (see the 21.10 release notes; via). This sort of ZFS bug happens from time to time and has likely happened as far back as Solaris ZFS, and there are two unfortunate aspects of them.
(For an example of Solaris ZFS corruption, Solaris ZFS could write ACL data that was bad in a way that it ignored but modern ZFS environments care about. This sort of ZFS issue is not specific to Ubuntu or modern OpenZFS development, although you can certainly blame Ubuntu for this particular case of it and for shipping Ubuntu 21.10 with it.)
The first unfortunate aspect is that many of these bugs normally panic your kernel. At one level it's great that ZFS is loaded with internal integrity and consistency checks that try to make sure the ZFS objects it's dealing with haven't been corrupted. At another level it's not so great that the error handling for integrity problems is generally to panic. Modern versions of OpenZFS have made some progress on letting the system continue past some of these problems instead of panicking, but there are still a lot left.
The second unfortunate aspect is that generally you can't repair this damage the way you can in more conventional filesystems. Because of ZFS's immutability and checksums, once something makes it to disk with a valid checksum, it's forever. If what made it to disk was broken or corrupted, it stays broken or corrupted; there's no way to fix it in place and no mechanism in ZFS to quietly fix it in a new version. Instead, the only way to get rid of the problem is to delete the corrupted data in some way, generally after copying out as much of the rest of your data as you can (and need to). If you're lucky, you can delete the affected file; if you're somewhat unfortunate, you're going to have to destroy the filesystem; if you're really unlucky, the entire pool needs to be recreated.
This creates two reasons to make regular backups (and not using
'zfs send', because that may well just copy the damage to your
backups). The first reason is of course so that you have the backup
to restore from. The second reason is because making a backup with
rsync or another user level tool of your choice will read
everything in your ZFS filesystems, which creates regular assurance
that everything is free of corruption.
PS: Even if you don't make regular backups, perhaps it's a good
idea just to read all of your ZFS filesystems every so often by
tar'ing them to /dev/null or similar things. I should probably
do this on my home machine, which I am really bad at backing up.
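
A minimal Python sketch of the 'just read everything' idea, as an alternative to tar'ing to /dev/null. The function name is invented for illustration. It assumes that unreadable data shows up as an ordinary I/O error on read, which is typically how ZFS reports an unrecoverable checksum failure; it can't do anything about corruption that panics the kernel.

```python
import os


def read_everything(top, chunk_size=1024 * 1024):
    """Read every regular file under 'top', discarding the data.

    Returns (bytes_read, errors), where errors is a list of
    (path, exception) for things that could not be fully read.
    """
    total = 0
    errors = []

    def note_walk_error(exc):
        # os.walk() ignores unreadable directories unless told otherwise.
        errors.append((exc.filename, exc))

    for dirpath, _dirnames, filenames in os.walk(top, onerror=note_walk_error):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Only read regular files; skip symlinks, devices, FIFOs.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                with open(path, "rb") as f:
                    while True:
                        chunk = f.read(chunk_size)
                        if not chunk:
                            break
                        total += len(chunk)
            except OSError as exc:
                errors.append((path, exc))
    return total, errors
```

Note that os.walk() will cross into other filesystems mounted underneath the starting directory, which is probably what you want for nested ZFS filesystems but may not be for other mounts.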
How ZFS stores symbolic links on disk
After writing about ZFS's new 'draid' vdev topology, I wound up curious about how ZFS actually stores the target of symbolic links on disk (which matters for draid, because draid has a relatively large minimum block size). The answer turns out to tie back to another ZFS concept, System Attributes. As a quick summary, ZFS system attributes (SAs) are a way for ZFS to pack a more or less arbitrary collection of additional information, such as the parent directory of things, into ZFS dnodes. Normally this is done using extra space in dnodes that's called the bonus buffer, but it can overflow into a spill block if necessary.
The answer to how ZFS stores the target of symbolic links is that
they are a System Attribute. You can see it listed in the enum of
known system attributes in the ZFS source code, along with a variety
of other ones. There's also apparently an older scheme for storing
these dnode attributes, which appears to use a more or less hard
coded structure for them based on a struct that's also defined there.
You're only going to see this older scheme if you have very old
filesystems, because system attributes were introduced in 2010 in
ZFS filesystem version 5 (which requires ZFS pool version 24 or later).
(Because we've been running ZFS for a rather long time now,
starting with Solaris 10, we actually have
some ZFS filesystems that are still version 4. Probably we should
schedule a 'zfs upgrade' one of these days, if only so all of our
filesystems are on the same version. All of our pools are recent
enough, since the pools were recreated in our move to our Linux
fileservers, but some of the
filesystems have been moved around with 'zfs send' since more or
less the beginning, which preserves at least some limitations of
the original filesystems.)
If you use 'zdb -v -O POOL PATH/TO/SYMLINK' to dump a modern,
system attribute based symbolic link, what you'll see is something
like this:
    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
   2685091    1   128K    512      0     512    512    0.00  ZFS plain file
                                               183   bonus  System attributes
    dnode flags: USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED
    dnode maxblkid: 0
    target  ../target
    uid     ..
    gid     ...
    atime   Mon Aug 30 22:06:38 2021
    [etc]
What zdb reports as the 'target' attribute is the literal text
of the target of the symbolic link, as shown by eg 'ls -l' or
'readlink'. It comes directly from the relevant system
attribute, and is reported by code in cmd/zdb.c.
(Based on a quick look at the code, I don't think zdb can dump the older format of symlinks, although I may well be missing a zdb trick.)
PS: A sufficiently long symlink target will presumably overflow the amount of space available in the dnode bonus buffer and force the allocation of a spill block to hold some of the system attributes. I'm not sure how much space is normally available and I don't plan to dig further in the source (or do experiments) to find out. This isn't very different from other Unix filesystems; ext4 can only embed symlink targets in the inode if they're less than 60 bytes long, for example.
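
To make 'the literal text of the target' concrete, here is a small Python illustration that works on any Unix filesystem, not just ZFS. It creates a symlink in a temporary directory and reads back the stored target along with its length in bytes, which is the quantity that decides whether a target can be embedded (in the dnode bonus buffer on ZFS, or in the inode on ext4). The helper name is invented for this example.

```python
import os
import tempfile


def symlink_target_info(path):
    """Return the literal target text of a symlink and its length in bytes."""
    target = os.readlink(path)
    return target, len(os.fsencode(target))


with tempfile.TemporaryDirectory() as d:
    link = os.path.join(d, "link")
    # The target does not have to exist; it is stored exactly as given.
    os.symlink("../target", link)
    target, nbytes = symlink_target_info(link)
    print(target, nbytes)
```

The target comes back as the same relative text that was stored, not a resolved path.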
Some notes on OpenZFS's new 'draid' vdev redundancy type
One piece of recent ZFS news is that OpenZFS 2.1.0 contains a new type of vdev redundancy called 'dRAID', which is short for 'distributed RAID'. OpenZFS has a dRAID HOWTO that starts with this summary:
dRAID is a variant of raidz that provides integrated distributed hot spares which allows for faster resilvering while retaining the benefits of raidz. A dRAID vdev is constructed from multiple internal raidz groups, each with D data devices and P parity devices. These groups are distributed over all of the children in order to fully utilize the available disk performance. This is known as parity declustering and it has been an active area of research. [...]
However, there are some cautions about draid, starting with this:
Another way dRAID differs from raidz is that it uses a fixed stripe width (padding as necessary with zeros). This allows a dRAID vdev to be sequentially resilvered, however the fixed stripe width significantly effects both usable capacity and IOPS. For example, with the default D=8 and 4k disk sectors the minimum allocation size is 32k. If using compression, this relatively large allocation size can reduce the effective compression ratio. [...]
Needless to say, this also means that with those defaults the minimum size of files (and symlinks, and directories) is 32 KB, unless they're so small that they can perhaps be squeezed into bonus space in ZFS dnodes.
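
The arithmetic behind that 32k figure can be sketched in a few lines of Python. This is a simplified model based only on the HOWTO's description: it assumes the data portion of every allocation is padded up to a whole number of stripes of D data sectors, and it ignores parity and any other overhead. The function names are invented for illustration.

```python
def draid_min_allocation(data_devices=8, sector_size=4096):
    """Smallest data allocation on a dRAID vdev: one full stripe of D sectors."""
    return data_devices * sector_size


def draid_allocated_size(nbytes, data_devices=8, sector_size=4096):
    """Round a block's data size up to whole stripes (parity not included)."""
    stripe = draid_min_allocation(data_devices, sector_size)
    stripes = max(1, -(-nbytes // stripe))  # ceiling division
    return stripes * stripe


print(draid_min_allocation())            # default D=8, 4k sectors
print(draid_allocated_size(20 * 1024))   # a block that compressed to 20 KB
print(draid_allocated_size(40 * 1024))   # a block that compressed to 40 KB
```

In this model a 128 KB record that compresses down to 20 KB still occupies 32 KB, so the effective compression ratio is 4:1 instead of 6.4:1, which is the sort of effect the HOWTO is warning about.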
Another caution is that you apparently can't get draid's fast rebuild speed without having configured spare space in your draid setup. This is sort of implicitly present in the description of draid, if you read it as saying that the integrated distributed hot spare space is what allows for faster resilvering. Since I believe that you can't reshape a draid vdev after creation, you had better include the spare space from the start; otherwise, you have something that's inferior to raidz with the same parity.
According to the Ars Technica article on draid, draid has been heavily tested (and hopefully heavily used in production) in "several major OpenZFS development shops". The Ars Technica article also has its own set of diagrams, and also additional numbers and information; it's well worth reading if you're potentially interested in draid, including for additional cautions about draid's survivability in the face of multi-device failures.
I don't think we're interested in draid any more than we're interested in raidz. Resilvering time is not our major concern with raidz, and draid keeps the other issues from raidz, like full stripe reads. In fact, I'm not sure very many people will be interested in draid. The Ars Technica article starts its conclusion with:
Distributed RAID vdevs are mostly intended for large storage servers—OpenZFS draid design and testing revolved largely around 90-disk systems. At smaller scale, traditional vdevs and spares remain as useful as they ever were.
dRAID is intellectually cool and I'm okay that OpenZFS has it, but I'm not sure it will ever be common, and as SATA/SAS SSDs and NVMe drives become more prevalent in storage servers, its advantages over raidz may increasingly go away except for high-capacity archival servers that still have to use HDs.
As an additional note, the actual draid data layout on disk is quite
complicated; Ars Technica points to the detailed comments in the
source code. Given that ZFS stores locations on disk in the form of
ZFS DVAs, which specify the vdev and the "byte offset" into the vdev,
you might wonder how DVA offsets work on draid vdevs. Unfortunately
I don't know because the answer appears to be rather complicated
based on the code, which isn't surprising given a complicated on disk
layout. I suspect that however draid maps DVA offsets has the same
implications for growing draid vdevs as it does for growing raidz
ones (the coming raidz expansion is carefully set up to cope with
this).
The tiny irritation of ZFS's 'zpool status' nagging you about upgrades
One of the tiny irritations of operating ZFS for a long time is
that eventually, running 'zpool status' on your pools would produce
a multi-line nag about upgrading them to the latest version of ZFS.
I assume that this was added to 'zpool status' output so that you
wouldn't be unaware of it, but the size of the message was far too
large for its actual importance. Back in the old days of Solaris,
'zpool status -x' even included pools that could be upgraded
(this was one of our Solaris 10 update 6 gotchas), but fortunately people have gotten more
sensible since then. Now it's only a multi-line message.
Perhaps you think I'm exaggerating. No, really, here is the message from the latest version of OpenZFS:
    status: Some supported and requested features are not enabled on the pool.
            The pool can still be used, but some features are unavailable.
    action: Enable all features using 'zpool upgrade'. Once this is done,
            the pool may no longer be accessible by software that does not support
            the features. See zpool-features(7) for details.
That's five lines for what should be one line. When you have multiple pools on a system, as we do in our fileserver environment, it adds up fast.
There are various operational reasons why you might not want to upgrade pools right away. Historically we didn't want to upgrade pools until we were certain we were staying on the new OS and ZFS version, and then once we were confident we were staying, we weren't certain about the impact on our NFS servers. In theory pool upgrades should be transparent; in practice, who knows.
(Right now all of our fileserver pools are up to date in some sense, because they were freshly created on our current fileservers. But the ZFS version the fileservers are running is out of date, and when we upgrade them next year we'll run into this.)
Fortunately OpenZFS 2.1.0 provides a feature that lets you shut
this up, in the form of OpenZFS's support for partial upgrades. If
you set the new pool property for this to what you already have,
'zpool status' won't nag you (although 'zpool upgrade -v' will
show you what you're missing).
A bit on ZFS's coming raidz expansion and ZFS DVAs
The ZFS news of the time interval is Ars Technica's report of raidz expansion potentially being added (via). More details and information about how it works are in the links in Matthew Ahrens' pull request, which as of yet hasn't landed in the master development version. I've previously written about ZFS DVAs and their effects on growing ZFS pools, in which I said that how DVA offsets are defined was by itself a good reason as to why you couldn't expand raidz vdevs (in addition to potential inefficiency). You might wonder how Ahrens' raidz expansion interacts with ZFS DVAs here, so that it can actually work.
As a quick summary, ZFS DVAs (Data Virtual Addresses, the ZFS equivalent of a block number) contain the byte offset of where in the entire vdev your block of data is found. In mirror vdevs (and plain disks), this byte offset is from the start of each disk. In raidz vdevs, it's striped sequentially across all disks; it starts with a chunk of disk 0, goes to a chunk of disk 1, and so on. One of the implications of this is that if you just add a disk to a raidz vdev and do nothing else, all of your striped sequential byte offsets change and you can no longer read your data.
How Ahrens' expansion deals with this is that it reflows all of the data on all of the existing drives to the new, wider raidz vdev layout, moving sectors around as necessary. Some of this reflowed data will wind up on the new drive (starting with the second sector of the first drive), but most of the data will wind up in other places on the existing drives. Both the Ars Technica article and Ahrens' slides from the 2021 FreeBSD Developer Summit have diagrams of this. The slides also share the detail that this is optimized to only copy the live data. This reflowing has the vital property that it preserves all of the DVA byte offsets, since it moves all data sectors from their old locations to where they should be in the new vdev layout.
(Thus, this raidz expansion is done without the long sought and so far mythical 'block pointer rewriting' that would allow general ZFS reshaping, including removing vdevs without the current layer of indirection.)
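
Here is a toy Python model of why reflowing preserves DVA offsets. It assumes the simplified picture from the summary above, where sector N of the raidz vdev lives on disk N modulo the number of disks, and it ignores parity, block boundaries, and everything else real raidz does; the function names are invented for illustration. The vdev-wide sector number (and so the DVA byte offset) of each piece of data never changes; only its physical location does.

```python
def raidz_locate(sector, ndisks):
    """Map a vdev-wide sector number to (disk index, sector on that disk)."""
    return sector % ndisks, sector // ndisks


def reflow_moves(nsectors, old_ndisks, new_ndisks):
    """List (sector, old_location, new_location) for sectors that must move."""
    moves = []
    for sector in range(nsectors):
        old = raidz_locate(sector, old_ndisks)
        new = raidz_locate(sector, new_ndisks)
        if old != new:
            moves.append((sector, old, new))
    return moves


# Expanding a four-disk vdev to five disks, looking at the first 8 sectors.
for sector, old, new in reflow_moves(8, old_ndisks=4, new_ndisks=5):
    print(f"sector {sector}: disk {old[0]} sector {old[1]}"
          f" -> disk {new[0]} sector {new[1]}")
```

In this model the first four sectors stay where they are, and the first sector to move is vdev sector 4: it goes from the second sector of the first drive to the first sector of the new drive, and everything after it shuffles down on the existing drives. If you added the disk without doing this copying, the same offsets would point at the wrong places.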
This copying is performed sector by sector and is blind to ZFS block boundaries. This means that raidz expansion doesn't verify checksums during the process because it doesn't know where they are. Since this expansion writes over the old data locations on your existing drives, I would definitely want to scrub your pool beforehand and have backups (to the extent that it's possible), just in case you hit previously latent disk errors during the expansion. And of course you should scrub the pool immediately after the expansion finishes.
As Ahrens covers in the slides, this reflowing also doesn't expand the old blocks to be the full new width of the raidz vdev. As a result, they (still) have a higher parity overhead than newly written blocks would. To eliminate this overhead you need to explicitly force ZFS to rewrite all of the data in some way (and obviously this is impossible if you have snapshots that you can't delete and recreate).
Storing ZFS send streams is not a good backup method
One of the eternally popular ideas for people using ZFS is doing backups
by using 'zfs send' and storing the resulting send streams. Although
appealing, this idea is a mistake, because ZFS send streams do not
have the properties you want for a backup format.
A good backup format is designed for availability. No matter what happens, it should let you extract as much from it as possible, from both full backups and incremental backups. If your backup stream is damaged, you should still be able to find and restore as much as possible, both before and after the damage. If a full backup is missing or destroyed, you should still be able to recover something from whatever incrementals you have. This requires incremental backups to have more information in them than they specifically need, but that's a tradeoff you make for availability.
A good backup format should also be convenient to operate, and one big aspect of this is selective restores. A lot of the time you don't need to restore absolutely everything; you just want to get back one file or some files that you need because they got removed, damaged, or whatever. If you have to do a complete restore (both full and incremental) in order to get back a single file, you don't have a convenient backup format. Other nice things are, for example, being able to readily get an index of what is captured in any particular backup stream (full or incremental).
Incremental ZFS send streams do not have any of these properties and full ZFS send streams only have a few of them. Neither full nor incremental streams have any resilience against damage to the stream; a stream is either entirely intact or it's useless. Neither has selective restores or readily available indexes. Incremental streams are completely useless without everything they're based on. All of these issues will sooner or later cause you pain if you use ZFS streams as a backup format.
ZFS send streams are great at what they're for, which is replicating ZFS
filesystems from one ZFS pool to another in an environment where you can
immediately deal with any problems that come up (whether by retrying the
send of a corrupted stream, changing what it's based on, or whatever
you need to do). The further you pull 'zfs send' away from this happy
path, the more problems you're going to have.
(The design decisions of ZFS send streams make a great deal of sense for this purpose. As a replication format they're designed to be easy to generate, easy to receive, and compact, especially for incremental send streams. They have no internal redundancy or recovery from corruption because the best recovery is 'resend the stream to get a completely good one'.)