Wandering Thoughts

2021-08-30

How ZFS stores symbolic links on disk

After writing about ZFS's new 'draid' vdev topology, I wound up curious about how ZFS actually stores the target of symbolic links on disk (which matters for draid, because draid has a relatively large minimum block size). The answer turns out to tie back to another ZFS concept, System Attributes. As a quick summary, ZFS system attributes (SAs) are a way for ZFS to pack a more or less arbitrary collection of additional information, such as the parent directory of things, into ZFS dnodes. Normally this is done using extra space in dnodes that's called the bonus buffer, but it can overflow into a spill block if necessary.

The answer to how ZFS stores the target of symbolic links is that the target is a System Attribute. You can see it listed as ZPL_SYMLINK in the enum of known system attributes in zfs_sa.h, along with a variety of other ones. There's also apparently an older scheme for storing these dnode attributes, which appears to use a more or less hard coded structure for them based on the znode_phys struct that's also defined in zfs_sa.h. You're only going to see this older scheme if you have very old filesystems, because system attributes were introduced in 2010 in ZFS filesystem version 5 (which requires ZFS pool version 24 or later).

(Because we've been running ZFS for a rather long time now, starting with Solaris 10, we actually have some ZFS filesystems that are still version 4. Probably we should schedule a 'zfs upgrade' one of these days, if only so all of our filesystems are on the same version. All of our pools are recent enough, since the pools were recreated in our move to our Linux fileservers, but some of the filesystems have been moved around with 'zfs send' since more or less the beginning, which preserves at least some limitations of the original filesystems.)

If you use 'zdb -v -O POOL PATH/TO/SYMLINK' to dump a modern, system attribute based symbolic link, what you'll see is something like this:

 Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
2685091    1   128K    512      0     512    512    0.00  ZFS plain file
                                            183   bonus  System attributes
  dnode flags: USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED 
  dnode maxblkid: 0
  target  ../target
  uid ..
  gid ...
  atime   Mon Aug 30 22:06:38 2021
[etc]

What zdb reports as the 'target' attribute is the literal text of the target of the symbolic link, as shown by eg 'ls -l' or reported by readlink. It comes directly from the relevant system attribute, and is reported by cmd/zdb.c's dump_znode_symlink().

(Based on a quick look at the code, I don't think zdb can dump the older format of symlinks, although I may well be missing a zdb trick.)

PS: A sufficiently long symlink target will presumably overflow the amount of space available in the dnode bonus buffer and force the allocation of a spill block to hold some of the system attributes. I'm not sure how much space is normally available and I don't plan to dig further in the source (or do experiments) to find out. This isn't very different from other Unix filesystems; ext4 can only embed symlink targets in the inode if they're less than 60 bytes long, for example.
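(If you want to poke at this yourself, one way is to make a symlink with a short target and one with an absurdly long target and compare what zdb says about each. This is a hedged sketch; the pool name and paths are made up, and I haven't verified exactly where the bonus buffer overflows into a spill block.)

  # A short symlink and one with a roughly 400 character target
  # ('tank' and these paths are invented for illustration).
  ln -s ../target /tank/shortlink
  long_target=$(head -c 400 /dev/zero | tr '\0' 'x')
  ln -s "$long_target" /tank/longlink

  # Compare the zdb dumps; if the long target no longer fits in the dnode's
  # bonus buffer, the dump should show a spill block being used.
  zdb -v -O tank shortlink
  zdb -v -O tank longlink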

ZFSHowSymlinksStored written at 22:32:31; Add Comment

2021-08-29

Some notes on OpenZFS's new 'draid' vdev redundancy type

One piece of recent ZFS news is that OpenZFS 2.1.0 contains a new type of vdev redundancy called 'dRAID', which is short for 'distributed RAID'. OpenZFS has a dRAID HOWTO that starts with this summary:

dRAID is a variant of raidz that provides integrated distributed hot spares which allows for faster resilvering while retaining the benefits of raidz. A dRAID vdev is constructed from multiple internal raidz groups, each with D data devices and P parity devices. These groups are distributed over all of the children in order to fully utilize the available disk performance. This is known as parity declustering and it has been an active area of research. [...]

However, there are some cautions about draid, starting with this:

Another way dRAID differs from raidz is that it uses a fixed stripe width (padding as necessary with zeros). This allows a dRAID vdev to be sequentially resilvered, however the fixed stripe width significantly effects both usable capacity and IOPS. For example, with the default D=8 and 4k disk sectors the minimum allocation size is 32k. If using compression, this relatively large allocation size can reduce the effective compression ratio. [...]

Needless to say, this also means that the minimum size of files (and symlinks, and directories) is 32 Kb, unless they're so small that they can perhaps be squeezed into bonus space in ZFS dnodes.

Another caution is that you apparently can't get draid's fast rebuild speed without having configured spare space in your draid setup. This is sort of implicitly present in the description of draid, when read to say that the integrated distributed hot spare space is what allows for faster resilvering. Since I believe that you can't reshape a draid vdev after creation, you had better include the spare space from the start; otherwise, you have something that's inferior to raidz with the same parity.
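To make the spare space point concrete, here's a hedged sketch of creating a draid pool with a distributed spare baked in from the start. The vdev specification syntax is 'draid[<parity>][:<data>d][:<children>c][:<spares>s]' as I understand it; the disk names and the specific numbers here are invented, not a recommendation.

  # Hypothetical 11-disk draid vdev: double parity, 8 data disks per group,
  # and 1 distributed spare. With D=8 and 4k sectors, the minimum allocation
  # size works out to 8 * 4k = 32k, as noted above.
  zpool create tank draid2:8d:11c:1s sda sdb sdc sdd sde sdf sdg sdh sdi sdj sdk

  # Inspect the resulting layout, including the distributed spare.
  zpool status tank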

According to the Ars Technica article on draid, draid has been heavily tested (and hopefully heavily used in production) in "several major OpenZFS development shops". The Ars Technica article also has its own set of diagrams, and also additional numbers and information; it's well worth reading if you're potentially interested in draid, including for additional cautions about draid's survivability in the face of multi-device failures.

I don't think we're interested in draid any more than we're interested in raidz. Resilvering time is not our major concern with raidz, and draid keeps the other issues from raidz, like full stripe reads. In fact, I'm not sure very many people will be interested in draid. The Ars Technica article starts its conclusion with:

Distributed RAID vdevs are mostly intended for large storage servers—OpenZFS draid design and testing revolved largely around 90-disk systems. At smaller scale, traditional vdevs and spares remain as useful as they ever were.

dRAID is intellectually cool and I'm okay that OpenZFS has it, but I'm not sure it will ever be common, and as SATA/SAS SSDs and NVMe drives become more prevalent in storage servers, its advantages over raidz may increasingly go away except for high-capacity archival servers that still have to use HDs.

As an additional note, the actual draid data layout on disk is quite complicated; Ars Technica points to the detailed comments in the code. Given that ZFS stores locations on disk in the form of ZFS DVAs, which specify the vdev and the "byte offset" into the vdev, you might wonder how DVA offsets work on draid vdevs. Unfortunately I don't know because the answer appears to be rather complicated based on vdev_draid_xlate(), which isn't surprising given a complicated on disk layout. I suspect that however draid maps DVA offsets has the same implications for growing draid vdevs as it does for growing raidz ones (the coming raidz expansion is carefully set up to cope with this).

ZFSDRaidNotes written at 23:20:12; Add Comment

2021-07-25

The tiny irritation of ZFS's 'zpool status' nagging you about upgrades

One of the tiny irritations of operating ZFS for a long time is that eventually, running 'zpool status' on your pools would produce a multi-line nag about upgrading them to the latest version of ZFS. I assume that this was added to 'zpool status' output so that you wouldn't be unaware of it, but the size of the message was far too large for its actual importance. Back in the old days of Solaris 10, 'zpool status -x' even included pools that could be upgraded (this was one of our Solaris 10 update 6 gotchas), but fortunately people have gotten more sensible since then. Now it's only a multi-line message.

Perhaps you think I'm exaggerating. No, really, here is the message from the latest version of OpenZFS:

status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.

That's five lines for what should be one line. When you have multiple pools on a system, as we do in our fileserver environment, it adds up fast.

There are various operational reasons why you might not want to upgrade pools right away. Historically we didn't want to upgrade pools until we were certain we were staying on the new OS and ZFS version, and then once we were confident we were staying we weren't certain about the impact on our NFS servers. In theory pool upgrades should be transparent; in practice, who knows.

(Right now all of our fileserver pools are up to date in some sense, because they were freshly created on our current fileservers. But the ZFS version the fileservers are running is out of date, and when we upgrade them next year we'll run into this.)

Fortunately OpenZFS 2.1.0 provides a feature that lets you shut this up, in the form of OpenZFS's support for partial upgrades. If you set the new 'compatibility' property to what you already have, 'zpool status' won't nag you (although 'zpool upgrade -v' will show you what you're missing).
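For what it's worth, I believe the minimal version of shutting the nag up looks like this, assuming your pools really are at the OpenZFS 2.0 feature level and that your version ships the 'openzfs-2.0-linux' compatibility preset (the pool name is made up):

  # Declare that this pool is only expected to have OpenZFS 2.0 features;
  # afterward, 'zpool status' should stop suggesting an upgrade.
  zpool set compatibility=openzfs-2.0-linux tank

  # You can still see the full list of features your software supports.
  zpool upgrade -v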

ZFSZpoolStatusAndUpgrades written at 00:26:38; Add Comment

2021-06-20

A bit on ZFS's coming raidz expansion and ZFS DVAs

The ZFS news of the time interval is Ars Technica's report of raidz expansion potentially being added (via). More details and information about how it works are in the links in Matthew Ahrens' pull request, which as of yet hasn't landed in the master development version. I've previously written about ZFS DVAs and their effects on growing ZFS pools, in which I said that how DVA offsets are defined was by itself a good reason as to why you couldn't expand raidz vdevs (in addition to potential inefficiency). You might wonder how Ahrens' raidz expansion interacts with ZFS DVAs here, so that it can actually work.

As a quick summary, ZFS DVAs (Data Virtual Addresses, the ZFS equivalent of a block number) contain the byte offset of where in the entire vdev your block of data is found. In mirror vdevs (and plain disks), this byte offset is from the start of each disk. In raidz vdevs, it's striped sequentially across all disks; it starts with a chunk of disk 0, goes to a chunk of disk 1, and so on. One of the implications of this is that if you just add a disk to a raidz vdev and do nothing else, all of your striped sequential byte offsets change and you can no longer read your data.

How Ahrens' expansion deals with this is that it reflows all of the data on all of the existing drives to the new, wider raidz vdev layout, moving sectors around as necessary. Some of this reflowed data will wind up on the new drive (starting with the second sector of the first drive), but most of the data will wind up in other places on the existing drives. Both the Ars Technica article and Ahrens' slides from the 2021 FreeBSD Developer Summit have diagrams of this. The slides also share the detail that this is optimized to only copy the live data. This reflowing has the vital property that it preserves all of the DVA byte offsets, since it moves all data sectors from their old locations to where they should be in the new vdev layout.

(Thus, this raidz expansion is done without the long sought and so far mythical 'block pointer rewriting' that would allow general ZFS reshaping, including removing vdevs without the current layer of indirection.)

This copying is performed sector by sector and is blind to ZFS block boundaries. This means that raidz expansion doesn't verify checksums during the process because it doesn't know where they are. Since this expansion writes over the old data locations on your existing drives, I would definitely want to scrub your pool beforehand and have backups (to the extent that it's possible), just in case you hit previously latent disk errors during the expansion. And of course you should scrub the pool immediately after the expansion finishes.

As Ahrens covers in the slides, this reflowing also doesn't expand the old blocks to be the full new width of the raidz vdev. As a result, they (still) have a higher parity overhead than newly written blocks would. To eliminate this overhead you need to explicitly force ZFS to rewrite all of the data in some way (and obviously this is impossible if you have snapshots that you can't delete and recreate).
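If I were doing such an expansion, my hedged sketch of the process would look like the following. The 'zpool attach' interface for raidz expansion is my assumption based on the pull request's description; since the feature hasn't landed yet, the details could easily change.

  # Scrub first so the expansion doesn't stumble over latent disk errors.
  zpool scrub tank

  # Attach a new disk to the existing raidz vdev (assumed interface).
  zpool attach tank raidz2-0 sdk

  # Scrub again once the expansion has finished.
  zpool scrub tank

  # Old blocks keep their old, higher parity overhead until they're rewritten.
  # One blunt way to rewrite a filesystem's data is to copy it to a new dataset.
  zfs snapshot tank/data@rewrite
  zfs send tank/data@rewrite | zfs recv tank/data-new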

ZFSRaidzExpansionAndDVAs written at 00:59:49; Add Comment

2021-05-09

Storing ZFS send streams is not a good backup method

One of the eternally popular ideas for people using ZFS is doing backups by using 'zfs send' and storing the resulting send streams. Although appealing, this idea is a mistake, because ZFS send streams do not have the properties you want for a backup format.

A good backup format is designed for availability. No matter what happens, it should let you extract as much from it as possible, from both full backups and incremental backups. If your backup stream is damaged, you should still be able to find and restore as much as possible, both before and after the damage. If a full backup is missing or destroyed, you should still be able to recover something from whatever incrementals you have. This requires incremental backups to have more information in them than they specifically need, but that's a tradeoff you make for availability.

A good backup format should also be convenient to operate, and one big aspect of this is selective restores. A lot of the time you don't need to restore absolutely everything; you just want to get back one file or some files that you need because they got removed, damaged, or whatever. If you have to do a complete restore (both full and incremental) in order to get back a single file, you don't have a convenient backup format. Other nice things are, for example, being able to readily get an index of what is captured in any particular backup stream (full or incremental).

Incremental ZFS send streams do not have any of these properties and full ZFS send streams only have a few of them. Neither full nor incremental streams have any resilience against damage to the stream; a stream is either entirely intact or it's useless. Neither has selective restores or readily available indexes. Incremental streams are completely useless without everything they're based on. All of these issues will sooner or later cause you pain if you use ZFS streams as a backup format.

ZFS send streams are great at what they're for, which is replicating ZFS filesystems from one ZFS pool to another in an environment where you can immediately deal with any problems that come up (whether by retrying the send of a corrupted stream, changing what it's based on, or whatever you need to do). The further you pull 'zfs send' away from this happy path, the more problems you're going to have.
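As an illustration of that happy path, the replication pattern looks something like this (a sketch with made-up pool, dataset, and host names):

  # Replicate a filesystem to another pool; if a stream gets corrupted in
  # transit, you simply rerun the send.
  zfs snapshot tank/data@monday
  zfs send tank/data@monday | ssh backuphost zfs recv -u backup/data

  # Later, send only the changes since the previous snapshot.
  zfs snapshot tank/data@tuesday
  zfs send -i @monday tank/data@tuesday | ssh backuphost zfs recv -u backup/data

The crucial difference from 'backup' usage is that the stream is consumed by 'zfs recv' right away, so any problem surfaces immediately instead of years later when you try to restore.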

(The design decisions of ZFS send streams make a great deal of sense for this purpose. As a replication format they're designed to be easy to generate, easy to receive, and compact, especially for incremental send streams. They have no internal redundancy or recovery from corruption because the best recovery is 'resend the stream to get a completely good one'.)

(This comes up on the ZFS on Linux mailing list periodically and I write replies (eg, also), so it's time to write this down in an entry.)

ZFSSendNotABackup written at 00:01:12; Add Comment

2021-03-06

Some views and notes on ZFS deduplication today

I recently wrote an entry about a lingering sign of old hopes for ZFS deduplication, and got a number of good comments on it that I have reactions and views about. First off, Opk said:

I could be very wrong in my understanding of how zfs dedup works but I've often turned it on for the initial data population. So I turn it on, rsync or zfs send in my data and then I turn it off again. I don't care about memory usage during the initial setup of a system so I assume this is not doing much harm. [...]

How much potential harm this does depends on what you do with the data that was written with deduplication on. If you leave the data sitting there, this is relatively harmless. However, if you delete the data (including overwriting data in files 'in place'), then ZFS must update the DDT (deduplication table) to correctly maintain the reference count of each unique data block. If you don't have enough memory to hold all of the DDT, then this is going to require disk reads to page chunks of it in and out. The amount of reading and slowdown goes up as you delete more and more data at once, for example if you delete an entire snapshot or filesystem.

(This is a classical surprise issue with deduplication, going back to early days. People are disconcerted when operations like 'zfs destroy <snapshot>' sit there for ages, or at least run in the background for ages even if the command returns immediately.)
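If you're curious how big the DDT actually is on a pool that has had deduplication turned on at some point, both zdb and zpool can summarize it (a hedged sketch, with a made-up pool name):

  # Print deduplication table statistics, including a histogram; the reported
  # on-disk and in-core sizes per entry give a rough idea of how much RAM a
  # fully in-memory DDT would need.
  zdb -DD tank

  # 'zpool status -D' prints a similar DDT summary.
  zpool status -D tank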

Brendan Long asked:

Have the massive price drops for SSD's since 2010 change your opinion on this at all? It seems like the performance hit is quite bad if you have to do random seeks on a spinning disk, but it's ok on SSD's, and you can get a 100 GB SSD for $20 these days.

I'm not sure if it's okay on SSDs, so here's my view. Reads aren't slowed by being deduplicated, but writes (and deletes) require a synchronous check of the DDT for every block, which means a synchronous SSD read IO if the necessary section of the DDT isn't in RAM. It's not clear to me what latency SSDs have for isolated synchronous reads, but my vaguely measured numbers suggest that we should assume at least a couple of milliseconds per read.

I haven't read the ZFS code, so I don't know if it performs DDT checking serially as it processes each block being written or deleted (which would be a natural approach), or if it somehow batches the checks up to issue them in parallel. If DDT checks are fully serial and you have to go to SSD on each one, you're looking at a write or delete rate of at most a thousand blocks a second. If you're dealing with 128 KB blocks (the typical maximum ZFS recordsize), that works out to about 125 MBytes a second. This is okay but not all that impressive for a SSD, and it would mean that deleting large objects could still take quite a while to complete.

(Deleting 100 GB might take over 13 minutes, for example.)

On the other hand, if we assume that a typical SATA 6 Gb/s SSD has a sustained write bandwidth of 550 Mbytes/sec, you only need around 4,400 DDT checks a second in order to hit that data rate for writing out 128 KB ZFS blocks. In practice you're probably not going to get 550 Mbytes/sec of user level write bandwidth out of a deduplicated ZFS pool on a single SSD, because both the necessary DDT writes and the DDT reads will take up some of the bandwidth to and from the SSD (even if the DDT is entirely in RAM, it gets updated on writes and deletes and those updates have to be written back to the SSD).

(This also implies that 4,400 written out DDT blocks a second is about the maximum you can do on a single SSD, for deletes. But I expect that writing out updated DDT entries for deletes is batched and generally doesn't touch that many different blocks of the DDT.)
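If you want to redo this back of the envelope arithmetic yourself, it's simple enough to do in the shell (these are the same assumed numbers as above, not measurements of anything):

  # Minutes to delete 100 GB of 128 KB blocks at roughly 1000 blocks a second.
  echo $(( (100 * 1024 * 1024 / 128) / 1000 / 60 ))    # ~13

  # DDT checks per second needed to sustain 550 Mbytes/sec of 128 KB writes.
  echo $(( 550 * 1024 / 128 ))                         # 4400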

On the whole, I think that there are enough uncertainties about the performance of deduplicated ZFS pools even on SSDs that I wouldn't want to build one for general use. I'd say 'without a good amount of testing', but I'm not sure that testing would convince me that I wouldn't run into a corner case in ordinary use after long enough.

ZFSDedupTodayNotes written at 23:07:38; Add Comment

2021-02-19

ZFS pool partial (selective) feature upgrades are coming in OpenZFS

I'm an active user of (Open)ZFS on Linux on my personal machines (office workstation and home Linux machine), where I deliberately run the very latest development versions. But if you ran 'zpool status' on either machine, you would see a lot of:

status: Some supported and requested features are not
        enabled on the pool. The pool can still be used,
        but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once
        this is done, the pool may no longer be accessible
        by software that does not support the features.
        See zpool-features(5) for details.

(This verbose message irks me for other reasons.)

The reason for this is exactly that I run the latest development versions. Right now, if you run 'zpool upgrade' you get no choice about what happens; your pools are upgraded to support absolutely all of the features that the code you're running knows about. The same thing happens by default when you create a pool (although you can specify exact features if you know what you're doing). For people like me, who have old pools but are running the very latest development versions, this is dangerous. I don't want to enable ZFS pool features that aren't enabled in any released version yet in case I have to revert back to using a stable, released version of (Open)ZFS.

The good news is that OpenZFS's development version just landed a fix for this, in fact a very general one. The simple version is that there's a new ZFS pool property called 'compatibility'; if set, it limits what features a pool will be created with or upgraded to. You can set it to a wide variety of general choices, which include things like 'OpenZFS 2.0 on Linux' and 'what Grub2 will support'.

(As a side effect, looking at the files that define these options will tell you what's supported, or believed to be supported, on various platforms.)

Since this is a ZFS pool property, I believe that the way to selectively (or partially) upgrade an existing pool created with no compatibility option set is to do 'zpool set compatibility=...' to whatever beforehand and then run 'zpool upgrade'. This is somewhat underdocumented right now (from my perspective) and I rather wish that 'zpool upgrade' itself could take what to upgrade to (well, what to limit upgrades to) as an explicit argument, the way 'zpool create' does. I suppose that setting a pool property (and then leaving it there) is safer than relying on never accidentally running a 'zpool upgrade' with no restrictions.
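Concretely, I believe the partial upgrade dance is the following (the pool names are invented, and the available compatibility presets come from the compatibility.d files shipped with your version):

  # Pin the pool to a specific feature set, then upgrade; only features
  # allowed by the compatibility setting should be enabled.
  zpool set compatibility=openzfs-2.0-linux tank
  zpool upgrade tank

  # New pools can be given a compatibility limit from the start, for example
  # to stay readable by Grub2 for a root filesystem pool.
  zpool create -o compatibility=grub2 bootpool mirror sda1 sdb1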

Another useful side effect of the compatibility pool property, at least according to the documentation, is that apparently 'zpool status' will no longer nag you if your pool supports all of the features that are allowed by its compatibility setting, even if that's a very low one. This may finally get 'zpool status' to shut up about this for me, someday.

(I won't be taking advantage of this feature to finally upgrade my pools to the OpenZFS 2.0 level until a bit more time has passed and other people have found any problems with it. The development version of ZFS is well tested, but I'm still cautious.)

This is a quality of life improvement that many OpenZFS users will never really notice, but for system administrators and people like me it's going to be great. Since one of the compatibility options is 'Grub2', it will probably also help people on Linux who want to use ZFS for their root filesystem.

PS: I don't know when (or if) this will be merged back into Illumos. I don't believe that OpenZFS is attempting to explicitly drive this; instead I believe they leave it up to the Illumos developers to pull in OpenZFS changes of interest. As far as Linux goes, I suspect that this won't be part of any 2.0.x update and will likely wait for 2.1.0, whenever that happens.

ZFSPartialUpgradeOption written at 23:27:24; Add Comment

2021-01-24

Thinking through what can go badly with databases on ZFS

Famously, if you're running a database with its storage on ZFS and you care about performance, you need to tune various ZFS parameters for the filesystem (or filesystems) that the database is on. You especially need to tune the ZFS recordsize property; generally people will say that if you change only one thing, you should change this to be either the same size as your database's block size or perhaps twice its size. But this raises a question for a certain sort of person, namely what goes badly when you leave ZFS's recordsize alone and run a database anyway. I can't answer this from experiments and experience (we've never tried to run performance sensitive databases on our ZFS fileservers), but I can work through this based on knowledge of how ZFS works. I'm going to assume SSD or NVMe storage; if you're still running a database on spinning rust and trying for performance, ZFS's recordsize setting is the least of your problems.

(Examples of tuning recommendations include this [PDF] (via) or Let's Encrypt's ZFS datastore for MariaDB (via).)
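For concreteness, the usual tuning that these references describe boils down to something like the following. This is a sketch rather than a recommendation; the dataset name, the 16 KB record size, and the extra properties are assumptions that depend on your particular database.

  # Match the dataset's logical block size to the database's page size
  # (16 KB is InnoDB's default page size; PostgreSQL uses 8 KB pages).
  zfs create -o recordsize=16K tank/mysql

  # Other settings that tuning guides commonly suggest; whether they actually
  # help depends on your workload and hardware.
  zfs set logbias=throughput tank/mysql
  zfs set primarycache=metadata tank/mysql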

The default ZFS recordsize is 128 Kb. What this means is that once a file is 128 Kb or larger, it's stored in logical blocks that are 128 Kb in size (this is the size before compression, so the physical size on disk may vary). Within ZFS, both reads and writes must be done to entire (logical) blocks at once, even if at the user level you only want to read or write a small amount of data. This 128 Kb logical block IO forces overheads on both database reads and especially database writes.

For reads, ZFS must transfer up to 128 Kb from disk (although in a single IO transaction), checksum the entire (decompressed) 128 Kb, probably hold it in the ARC (ZFS's in kernel disk cache), and finally give the database the 8 Kb or 16 Kb chunk that it really wants. I suspect that what usually hurts the most here is the extra memory overhead (assuming that the database doesn't then go back and want another 8 Kb or 16 Kb chunk out of the same 128 Kb block, which is now ready in memory). SSDs and especially NVMe drives have high bandwidth and support a lot of operations per second, so the extra data transferred probably doesn't have a big effect there, although the extra data transferred, decompressed, and checksummed may increase your read IO latency a bit.

Things are worse for database writes. To update an 8 Kb or 16 Kb chunk, ZFS must read the 128 Kb block into memory if it's not already there (taking the read overheads, including latency), checksum and likely compress the new version of the 128 Kb block, allocate new disk space for it all, and write it. Importantly, the same read, modify, and write process is required most of the time if you're appending to a file, such as a database's write-ahead log. When the database fsync()s its data (either for its log or for the main data files), ZFS may also write the full data into the ZFS Intent Log. Because a fsync() forces the disk to flush data to durable storage and the time this takes usually depends on how much data there is to flush, I think the increased data written to the ZIL will increase fsync() latency and thus transaction commit latency.

(It's not clear to me if a partial write of a block in a file that has hit the full recordsize writes only the new user-level data to the ZIL or if the ZIL includes the full block, probably out of line but still forced to disk.)

On modern SSDs and NVMe drives, there's a limited internal drive cache of fast storage for buffering writes before they have to be put on the slower main flash. If your database has a high enough write volume, the extra data that has to be written with a 128 Kb recordsize might push the drive out of that fast write storage and slow down all writes. I suspect that most people don't have that much write traffic and that this isn't a real concern; my impression is that people normally hit this drive limit with sustained asynchronous writes.

PS: Appending a small amount of data to a file that is 128 Kb or larger usually requires the same read, modify, write cycle because the last block of a file is still 128 Kb even if the file doesn't entirely fill it up. You get to skip the overhead only when you're starting a new 128 Kb block; if you're appending in 16 Kb chunks, this is every 8th chunk.

PPS: I have some thoughts about the common recommendation for a logbias of throughput on modern storage, but that needs another entry. The short version is that what throughput really does is complicated and it may not be to your benefit today on devices where random IO is free and write bandwidth is high.

(This entry was sparked by this Fediverse toot, although it doesn't in the least answer the toot's question.)

ZFSDatabasesWhatHappens written at 00:47:58; Add Comment

2021-01-21

A lingering sign of old hopes for ZFS deduplication

Over on Twitter, I said:

It's funny-sad that ZFS dedup was considered such an important feature when it launched that 'zpool list' had a DEDUP field added, even for systems with no dedup ever enabled. Maybe someday zpool status will drop that field in the default output.

For people who have never seen it, here is 'zpool list' output on a current (development) version of OpenZFS on Linux:

; zpool list
NAME     SIZE  ALLOC  FREE  CKPOINT  EXPANDSZ  FRAG  CAP  DEDUP  HEALTH  ALTROOT
ssddata  596G   272G  324G        -         -   40%  45%  1.00x  ONLINE  -

The DEDUP field is the ratio of space saved by deduplication, expressed as a multiplier (from the allocated space after deduplication to what it would be without deduplication). It's always present in default 'zpool list' output, and since almost all ZFS pools don't use deduplication, it's almost always 1.00x.
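If the always-present DEDUP column bothers you, 'zpool list' will at least let you ask for only the columns you care about (using the standard property names; 'ssddata' is the pool from the output above):

  # Print only selected columns, leaving DEDUP out.
  zpool list -o name,size,allocated,free,capacity,health ssddata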

It seems very likely that Sun and the Solaris ZFS developers had great hope for ZFS deduplication when the feature was initially launched. Certainly the feature was very attention getting and superficially attractive; back a decade ago, people had heard of it and would recommend it casually, although actual Solaris developers were more nuanced. It seems very likely that the presence of a DEDUP field in the default 'zpool list' output is a product of an assumption that ZFS deduplication would be commonly used and so showing the field was both useful and important.

However, things did not turn out that way. ZFS deduplication is almost never used, because once people tried to use it for real they discovered that it was mostly toxic, primarily because of high memory requirements. Yet the DEDUP field lingers on in the default 'zpool list' output, and people like me can see it as a funny and sad reminder of the initial hopes for ZFS deduplication.

(OpenZFS could either remove it or, if possible, replace it with the overall compression ratio multiplier for the pool, since many pools these days turn on compression. You would still want to have DEDUP available as a field in some version of 'zpool list' output, since the information doesn't seem to be readily available anywhere else.)

PS: Since I looked it up, ZFS deduplication was introduced in Oracle Solaris 11 for people using Solaris, which came out in November of 2011. It was available earlier for people using OpenSolaris, Illumos, and derivatives. Wikipedia says that it was added to OpenSolaris toward the end of 2009 and first appeared in OpenSolaris build 128, released in early December of 2009.

ZFSDedupLingeringSign written at 23:01:10; Add Comment

2020-12-21

The legibility of different versions of ZFS

I'll put the summary right up at the front: one of the refreshing things that I enjoy about OpenZFS on Linux is how comparatively legible and accessible some aspects of its operation are to me. Well, specifically how comparatively legible starting up ZFS on boot is. Now, there are two sides to that. On one side, the Linux setup to start ZFS is complicated. On the other side, this complexity has always existed in ZFS, it's just that on Solaris (and Illumos/OmniOS), the complexity was deliberately hidden away from you. You were not supposed to have to care about how ZFS started on Solaris because the deep integration of ZFS with the rest of the system should make it just work.

This was fine until the time when it didn't just work. We had some of those at some points, and because we had some of those (and as a general precaution), I wanted to understand the whole process more. When we were running Solaris and then OmniOS, I mostly failed. I never fully understood things like ZFS pool activation and iSCSI or boot time pool activation (also). I'm sure that these things are knowable, and I suspect that they are knowable even for people who aren't Illumos ZFS kernel developers, but I was never able to navigate through everything while we were still running OmniOS.

Given how we overlooked syseventadm, it's quite possible that part of what is going on is that I'm more familiar with Linux boot arcana than I am with Illumos boot arcana. I certainly like systemd much more than SMF, which left me more interested in learning systemd things than in learning SMF ones. And OpenZFS on Linux has no more documentation on the ZFS boot process in Linux than Illumos had for its ZFS boot process the last time I looked, so you're at the mercy of third party documentation like the Arch wiki. But for whatever reasons, I've been more successful at figuring out the Linux ZFS boot process than I ever was with the OmniOS one.

(Illumos also comes pre-set to have all of the ZFS things work while OpenZFS on Linux can leave you to configure things yourself, which is not exactly the greatest experience.)

I do wish that these things were documented (for Illumos and OpenZFS both). They don't have to be officially supported as 'how this will be for all time', but just knowing how things are supposed to work can be a great help when you run into problems. And beyond that, it's good to know more about how your systems operate under the surface. In the end there is no magic, only things that you don't know.

ZFSVersionsLegibility written at 22:30:48; Add Comment
