Wandering Thoughts


Thinking about why ZFS only does IO in recordsize blocks, even random IO

As I wound up experimentally verifying, in ZFS all files are stored as a single block of varying size up to the filesystem's recordsize, or using multiple recordsize blocks. As is perhaps less well known, a ZFS logical block is the minimum size of IO to a file, both for reads and especially for writes. Since the default recordsize is 128 Kb, this means that many files of interest have recordsize blocks and thus all IO to them is done in 128 Kb units, even if you're only reading or writing a small amount of data.

On the one hand, this seems a little bit crazy. The time it takes to transfer 128 Kb over a SATA link is not always something that you can ignore, and on SSDs larger writes can have a real impact. On the other hand, I think that this choice is more or less forced by some decisions that ZFS has made. Specifically, the ZFS checksum covers the entire logical block, and ZFS's data structure for 'where you find things on disk' is also based on logical blocks.

I wrote before about the ZFS DVA, which is ZFS's equivalent of a block number and tells you where to find data. ZFS DVAs are embedded into 'block pointers', which you can find described in spa.h. One of the fields of the block pointer is the ZFS block checksum. Since this is part of the block pointer, it is a checksum over all of the (logical) data in the block, which is up to recordsize. Once a file reaches recordsize bytes long, all blocks are the same size, the recordsize.

Since the ZFS checksum is over the entire logical block, ZFS has to fetch the entire logical block in order to verify the checksum on reads, even if you're only asking for 4 Kbytes out of it. For writes, even if ZFS allowed you to have different sized logical blocks in a file, you'd need to have the original recordsize block available in order to split it and you'd have to write all of it back out (both because ZFS never overwrites in place and because the split creates new logical blocks, which need new checksums). Since you need to add new logical blocks, you might have a ripple effect in ZFS's equivalent of indirect blocks, where they must expand and shuffle things around.

(If you're not splitting the logical block when you write to only a part of it, copy on write means that there's no good way to do this without rewriting the entire block.)
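
To make the read side of this concrete, here is a small sketch of the arithmetic. It is only an illustration: io_amplification is a hypothetical helper, and it assumes a file that is already made of full recordsize blocks and that nothing is cached.

```python
def io_amplification(offset, length, recordsize=128 * 1024):
    """Return (first_block, last_block, bytes_fetched) for a read of
    `length` bytes at byte `offset` in a file made of recordsize blocks."""
    if length <= 0:
        raise ValueError("length must be positive")
    first_block = offset // recordsize
    last_block = (offset + length - 1) // recordsize
    # Every logical block the request touches must be fetched whole,
    # because the checksum covers the entire block.
    bytes_fetched = (last_block - first_block + 1) * recordsize
    return first_block, last_block, bytes_fetched


# A 4 Kb read in the middle of a big file pulls in one whole 128 Kb block.
first, last, fetched = io_amplification(offset=1_000_000, length=4096)
print(first, last, fetched, fetched // 4096)
```

A request that straddles a block boundary touches two blocks, so in this model it costs 256 Kb of IO even for a 4 Kb read.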

In fact, the more I think about this, the more it seems that having multiple (logical) block sizes in a single file would be the way to madness. There are so many things that get complicated if you allow variable block sizes. These issues can be tackled, but it's simpler not to. ZFS's innovation is not that it insists that files have a single block size, it is that it allows this block size to vary. Most filesystems simply set the block size to, say, 4 Kbytes, and live with how large files have huge indirect block tables and other issues.

(The one thing that might make ZFS nicer in the face of some access patterns where this matters is the ability to set the recordsize on a per-file basis instead of just a per-filesystem basis. But I'm not sure how important this would be; the kind of environments where it really matters are probably already doing things like putting database tables on their own filesystems anyway.)

PS: This feels like an obvious thing once I've written this entry all the way through, but the ZFS recordsize issue has been one of my awkward spots for years, where I didn't really understand why it all made sense and had to be the way it was.

PPS: All of this implies that if ZFS did split logical blocks when you did a partial write, the only time you'd win would be if you then overwrote what was now a single logical block a second time. For example, suppose you created a big file, wrote 8 Kb to a spot in it (splitting a 128 Kb block into several new logical blocks, including an 8 Kb one for the write you just did), and then later wrote exactly 8 Kb again to exactly that spot; the second write would overwrite only your new 8 Kb logical block. This is probably obvious too but I wanted to write it out explicitly, if only to convince myself of the logic.

ZFSWhyIOInRecordsize written at 01:07:47


Some exciting ZFS features that are in OmniOS CE's (near) future

I recently wrote about how much better ZFS pool recovery is coming, which reported on Pavel Zakharov's Turbocharging ZFS Data Recovery. In that, Zakharov said that the first OS to get it would likely be OmniOS CE, although he didn't have a timeline. Since I just did some research on this, let's run down some exciting ZFS features that are almost certainly in OmniOS CE's near future, and where they are.

There are two big ZFS features from Delphix that have recently landed in the main Illumos tree, and a third somewhat smaller one:

  • This better ZFS pool recovery, which landed as a series of changes culminating in issue 9075 in February or so. I can't be sure, but I believe that a recovered pool is fully compatible with older ZFS versions; for major damage, though, you're going to be copying data out of the pool to a new one.

  • The long awaited feature of shrinking ZFS pools by removing vdevs, which landed as issue 7614 in January. Using this will add a permanent feature flag to your pool that makes it fully incompatible with older ZFS versions.

  • A feature for checkpointing the overall ZFS pool state before you do potentially dangerous operations, so that you can then rewind to the checkpoint. This is issue 9166, which landed just a few days ago. Since one of the purposes of better ZFS pool recovery is to provide (better) recovery over pool configuration changes, I suspect that this new pool checkpointing helps out with it. This makes your pool relatively incompatible with older ZFS versions while a checkpoint exists.

(Apparently Delphix was only able to push these upstream from their own code base to Illumos and OpenZFS relatively recently.)

All of these features have been pulled into the 'master' branch of the OmniOS CE repository from the main Illumos repo where they landed. Unless something unusual happens, I would expect them all to be included in the next release of OmniOS CE, which their release schedule says is to be r151026, expected some time this May. This will not be an LTS release; if you want to wait for an LTS release to have these features, you're waiting until next year. Given the likely magnitude of these changes and the relatively near future release of r151026, I wouldn't expect OmniOS CE to include these in a future update to the current r151024 or especially r151022 LTS.

Since OmniOS CE Bloody integrates kernel and user updates on a regular basis, I suspect that it already has many of these features and will pick up the most recent one very soon. If this is so, it gives OmniOS people a clear path if they need to recover a damaged pool; you can boot a Bloody install media or otherwise temporarily run it, repair or import the pool, possibly copying the data to another pool, and then probably revert back to running your current OmniOS with the recovered pool.

Sidebar: How to determine this sort of stuff

The most convenient way is to look at the git log for commits that involve, say, usr/src/uts/common/fs/zfs, the kernel ZFS code, in the OmniOS CE repo. In the Github interface, this is drilling down to that directory and then picking the 'History' option; a convenient link for this for the 'master' branch is here.

Each OmniOS CE release gets its own branch in the repo, named in the obvious way, and each branch thus has its own commit history for ZFS. Here is the version for r151024. Usefully, Github forms the URLs for these things in a very predictable way, making it very easy to hand-write your own URLs for specific things (eg, the same for r151022 LTS, which shows only a few recent ZFS changes).
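
As a sketch of how predictable these URLs are, here is a hypothetical helper that builds a Github 'History' URL from a repository, a branch, and a path. The repository name used below is a placeholder of mine, not the real OmniOS CE repo name; the branch name and path are the ones mentioned in the text.

```python
from urllib.parse import quote


def history_url(repo, branch, path):
    """Build a Github commit-history URL for `path` on `branch`.

    `repo` is 'owner/name'. The layout used here is Github's usual
    /commits/<branch>/<path> form.
    """
    return "https://github.com/{}/commits/{}/{}".format(
        repo, quote(branch, safe=""), quote(path)
    )


# 'example-org/example-repo' is a placeholder, not the actual repository.
for branch in ("master", "r151024", "r151022"):
    print(history_url("example-org/example-repo", branch,
                      "usr/src/uts/common/fs/zfs"))
```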

There's no branch for OmniOS CE Bloody, so I believe that it's simply built from the 'master' branch. It is the bleeding edge version, after all.

ZFSOmniosCEComingChanges written at 01:24:50


Much better ZFS pool recovery is coming (in open source ZFS)

One of the long standing issues in ZFS has been that while it's usually very resilient, it can also be very fragile if the wrong things get damaged. Classically, ZFS has had two modes of operation; either it would repair any damage or it would completely explode. There was no middle ground of error recovery, and this isn't a great experience; as I wrote once, panicking the system is not an error recovery strategy. In early versions of ZFS there was no recovery at all (you restored from backups); in later versions, ZFS added a feature where you could attempt to recover from damaged metadata by rewinding time, which was better than nothing but not a complete fix.

The good news is that that's going to change, and probably not too long from now. What you want to read about this is Turbocharging ZFS Data Recovery, by Pavel Zakharov of Delphix, which covers a bunch of work that he's done to make ZFS more resilient and more capable of importing various sorts of damaged pools. Of particular interest is the ability to likely recover at least some data from a pool that's lost an entire vdev. You can't get everything back, obviously, but ZFS metadata is usually replicated on multiple vdevs so losing a single vdev will hopefully leave you with enough left to at least get the rest of the data out of the pool.

(I saw this article via, itself via a retweet by @richardelling.)

All of this is really great news. ZFS has long needed better options for recovery from various pool problems, as well as better diagnostics for failed pool imports, and I'm quite happy that the situation is finally going to be improving.

The article is also interesting for its discussion of the current low level issues involved in pool importing. For example, until I read it I had no idea about how potentially dangerous a ZFS pool vdev change was due to how pool configurations are handled during the import process. I'd love to read more details on how pool importing really works and what the issues are (it's a long standing interest of mine), but sadly I suspect that no one with that depth of ZFS expertise has the kind of time it would take to write such an article.

As far as the timing of these features being available in your ZFS-using OS of choice goes, his article says this:

As of March 2018, it has landed on OpenZFS and Illumos but not yet on FreeBSD and Linux, where I’d expect it to be upstreamed in the next few months. The first OS that will get this feature will probably be OmniOS Community Edition, although I do not have an exact timeline.

If you have a sufficiently important damaged pool, under some circumstances it may be good enough if there is some OS, any OS, that can bring up the pool to recover the data in it. For all that I've had my issues with OmniOS's hardware support, OmniOS CE does have fairly decent hardware support and you can probably get it going on most modern hardware in an emergency.

(And if OmniOS can't talk directly to your disk hardware, there's always iSCSI, as we can testify. There's probably also other options for remote disk access that OmniOS ZFS and zdb can deal with.)

PS: If you're considering doing this in the future and your normal OS is something other than Illumos, you might want to pay attention to the ZFS feature flags you allow to be set on your pool, since this won't necessarily work if your pool uses features that OmniOS CE doesn't (yet) support. This is probably not going to be an issue for FreeBSD but might be an issue for ZFS on Linux. You probably want to compare the ZoL manpage on ZFS pool features with the Illumos version or even the OmniOS CE version.

Sidebar: Current ZFS pool feature differences

The latest Illumos tree has three new ZFS pool features from Delphix: device_removal, obsolete_counts (which enhances device removal), and zpool_checkpoint. These are all fairly recent additions; they appear to have landed in the Illumos tree this January and just recently, although the commits that implement them are dated from 2016.

ZFS on Linux has four new pool features: large_dnode, project_quota, userobj_accounting, and encryption. Both large dnodes and encryption have to be turned on explicitly, and the other two are read-only compatible, so in theory OmniOS can bring a pool up read-only even with them enabled (and you're going to want to have the pool read-only anyway).
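
The comparison above can be sketched as simple set arithmetic. This is an illustration only: the feature names come from the text, the sets are not a complete list for any real ZFS version, import_outlook is a hypothetical helper, and I'm assuming that only features that are actually in use on the pool matter.

```python
# Feature names as given in the text; illustrative, not exhaustive.
ILLUMOS_NEW = {"device_removal", "obsolete_counts", "zpool_checkpoint"}
ZOL_NEW = {"large_dnode", "project_quota", "userobj_accounting", "encryption"}
# The two ZoL features the text describes as read-only compatible.
ZOL_READONLY_COMPAT = {"project_quota", "userobj_accounting"}


def import_outlook(in_use, supported, readonly_compat):
    """Classify how a target OS could bring up a pool.

    in_use: features in use on the pool
    supported: features the target OS's ZFS understands
    readonly_compat: features that still allow a read-only import
    """
    unsupported = set(in_use) - set(supported)
    if not unsupported:
        return "read-write"
    if unsupported <= set(readonly_compat):
        return "read-only"
    return "not importable"


# A ZoL pool using only userobj_accounting, brought up on an Illumos-based OS:
print(import_outlook({"userobj_accounting"}, ILLUMOS_NEW, ZOL_READONLY_COMPAT))
# The same pool with encryption turned on:
print(import_outlook({"userobj_accounting", "encryption"},
                     ILLUMOS_NEW, ZOL_READONLY_COMPAT))
```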

ZFSPoolRecoveryComing written at 00:54:35


Some things about ZFS block allocation and ZFS (file) record sizes

As I wound up experimentally verifying, in ZFS all files are stored as a single block of varying size up to the filesystem's recordsize, or using multiple recordsize blocks. For a file under the recordsize, the block size turns out to be in a multiple of 512 bytes, regardless of the pool's ashift or the physical sector size of the drives the pool is using.

Well, sort of. While everything I've written is true, it also turns out to be dangerously imprecise (as I've seen before). There are actually three different sizes here and the difference between them matters once we start getting into the fine details.

To talk about these sizes, I'll start with some illustrative zdb output for a file data block, as before:

 0 L0 DVA[0]=<0:444bbc000:5000> [L0 ZFS plain file] [...] size=4200L/4200P [...]

The first size of the three is the logical block size, before compression. This is the first size= number ('4200L' here, in hex and L for logical). This is what grows in 512-byte units up to the recordsize and so on.

The second size is the physical size after compression, if any; this is the second size= number ('4200P' here, P for physical). It's a bit weird. If the file can't be compressed, it is the same as the logical size and because the logical size goes in 512-byte units, so does this size, even on ashift=12 pools. However, if compression happens this size appears to go by the ashift, which means it doesn't necessarily go in 512-byte units. On an ashift=9 pool you'll see it go in 512-byte units (so you can have a compressed size of '400P', ie 1 KB), but the same data written in an ashift=12 pool winds up being in 4 Kb units (so you wind up with a compressed size of '1000P', ie 4 Kb).

The third size is the actual allocated size on disk, as recorded in the DVA's asize field (which is the third subfield in the DVA[0] portion). This is always in ashift-based units, even if the physical size is not. Thus you can wind up with a 20 KB DVA but a 16.5 KB 'physical' size, as in our example (the DVA is '5000' while the block physical size is '4200').

(I assume this happens because ZFS ensures that the physical size is never larger than the logical size, although the DVA allocated size may be.)
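
Here is a rough sketch of pulling the three sizes out of a zdb line like the one above and converting them from hex. The regular expressions only cover the fields shown in this entry (with the elided '[...]' parts dropped), and rounding the physical size up to the ashift is my assumption about how the allocated size comes about on a plain (non-raidz) vdev.

```python
import re

# The example line from this entry, minus the elided parts.
ZDB_LINE = ("0 L0 DVA[0]=<0:444bbc000:5000> [L0 ZFS plain file] "
            "size=4200L/4200P")


def block_sizes(line):
    """Return the logical, physical and allocated sizes in bytes."""
    dva = re.search(r"DVA\[0\]=<(\d+):([0-9a-f]+):([0-9a-f]+)>", line)
    size = re.search(r"size=([0-9a-f]+)L/([0-9a-f]+)P", line)
    if not dva or not size:
        raise ValueError("unrecognized zdb line")
    return {
        "logical": int(size.group(1), 16),    # before compression
        "physical": int(size.group(2), 16),   # after compression
        "allocated": int(dva.group(3), 16),   # DVA asize
    }


def round_up(nbytes, ashift):
    """Round nbytes up to a multiple of 2**ashift."""
    unit = 1 << ashift
    return (nbytes + unit - 1) // unit * unit


sizes = block_sizes(ZDB_LINE)
for name, nbytes in sizes.items():
    print(name, nbytes, nbytes / 1024)
# On an ashift=12 vdev the 0x4200 physical size rounds up to the 0x5000 asize.
print(round_up(sizes["physical"], 12) == sizes["allocated"])
```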

For obvious reasons, it's the actual allocated size on disk (the DVA asize) that matters for things like rounding up raidz allocation to N+1 blocks, fragmentation, and whether you need to use a ZFS gang block. If you write a 128 KB (logical) block that compresses to a 16 KB physical block, it's 16 KB of (contiguous) space that ZFS needs to find on disk, not 128 KB.

On the one hand, how much this matters depends on how compressible your data is and much modern data isn't (because it's already been compressed in its user-level format). On the other hand, as I found out, 'sparse' space after the logical end of file is very compressible. A 160 KB file on a standard 128 KB recordsize filesystem takes up two 128 KB logical blocks, but the second logical block has 96 KB of nothingness at the end and that compresses down to almost nothing.
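
The 160 KB example works out as follows. This is just the arithmetic for files larger than the recordsize; multi_block_layout is a hypothetical helper name.

```python
def multi_block_layout(file_size, recordsize=128 * 1024):
    """For a file larger than recordsize, return
    (number of logical blocks, unused bytes at the end of the last block)."""
    if file_size <= recordsize:
        raise ValueError("file fits in a single block; sizes work differently")
    nblocks = -(-file_size // recordsize)      # ceiling division
    tail = nblocks * recordsize - file_size
    return nblocks, tail


nblocks, tail = multi_block_layout(160 * 1024)
print(nblocks, tail // 1024)
```

For 160 KB this gives two logical blocks with 96 KB of empty space at the end of the second one, which is the part that compresses away.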

PS: I don't know if it's possible to mix vdevs with different ashifts in the same pool. If it is, I don't know how ZFS would decide what ashift to use for the physical block size. The minimum ashift in any vdev? The maximum ashift?

(This is the second ZFS entry in a row where I thought I knew what was going on and it was simple, and then discovered that I didn't and it isn't.)

ZFSLogicalVsPhysicalBlockSizes written at 00:49:29


A surprise in how ZFS grows a file's record size (at least for me)

As I wound up experimentally verifying, in ZFS all files are stored as a single block of varying size up to the filesystem's recordsize, or using multiple recordsize blocks. If a file has more than one block, all blocks are recordsize, no more and no less. If a file is a single block, the size of this block is based on how much data has been written to the file (or technically the maximum offset that's been written to the file). However, how the block size grows as you write data to the file turns out to be somewhat surprising (which makes me very glad that I actually did some experiments to verify what I thought I knew before I wrote this entry, because I was very wrong).

Rather than involving the ashift or growing in powers of two, ZFS always grows the (logical) block size in 512-byte chunks until it reaches the filesystem recordsize. The actual physical space allocated on disk is in ashift sized units, as you'd expect, but this is not directly related to the (logical) block size used at the file level. For example, here is a 16896 byte file (of incompressible data) on an ashift=12 pool:

 Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
4780566    1   128K  16.5K    20K     512  16.5K  100.00  ZFS plain file
0 L0 DVA[0]=<0:444bbc000:5000> [L0 ZFS plain file] [...] size=4200L/4200P [...]

The DVA records an 0x5000 byte allocation (20 Kb), but the logical size and the physical size in the size= field are both only 0x4200 bytes (16.5 Kb).

In thinking about it, this makes a certain amount of sense because the ashift is really a vdev property, not a pool property, and can vary from vdev to vdev within a single pool. As a result, the actual allocated size of a given block may vary from vdev to vdev (and a block may be written to multiple vdevs if you have copies set to more than 1 or it's metadata). The file's current block size thus can't be based on the ashift, because ZFS doesn't necessarily have a single ashift to base it on; instead ZFS bases it on 512-byte sectors, even if this has to be materialized differently on different vdevs.

Looking back, I've already sort of seen this with ZFS compression. As you'd expect, a file's (logical) block size is based on its uncompressed size, or more exactly on the highest byte offset in the file. You can write something to disk that compresses extremely well, and it will still have a large logical block size. Here's an extreme case:

; dd if=/dev/zero of=testfile bs=128k count=1
# zdb -vv -bbbb -O ssddata/homes cks/tmp/testfile

 Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
956361    1   128K   128K      0     512   128K    0.00  ZFS plain file

This turns out to have no data blocks allocated at all, because the 128 Kb of zeros can be recorded entirely in magic flags in the dnode. But it still has a 128 Kb logical block size. 128 Kb of the character 'a' does wind up requiring a DVA allocation, but the size difference is drastic:

Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
956029    1   128K   128K     1K     512   128K  100.00  ZFS plain file
0 L0 DVA[0]=<0:3bbd1c00:400> [L0 ZFS plain file] [...] size=20000L/400P [...]

We have a compressed size of 1 Kb (and a 1 Kb allocation on disk, as this is an ashift=9 vdev), but once again the file block size is 128 Kb.

(If we wrote 127.5 Kb of 'a' instead, we'd wind up with a file block size of 127.5 Kb. I'll let interested parties do that experiment themselves.)
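
To get a feel for the gap between logical and compressed size without a ZFS pool to hand, here is a stand-in sketch. It uses Python's zlib, which is not one of the compressors ZFS uses here, so the compressed numbers will not match zdb's; the point is only that the logical size follows the data length in 512-byte units while the compressed size does not. sizes_for is a hypothetical helper.

```python
import zlib


def sizes_for(data, sector=512):
    """Return (logical, physical) sizes in bytes for a single-block file.

    logical: data length rounded up to 512-byte units
    physical: compressed length rounded up the same way, but never
              larger than the logical size
    """
    logical = -(-len(data) // sector) * sector
    compressed = len(zlib.compress(data))      # zlib as a stand-in only
    physical = min(logical, -(-compressed // sector) * sector)
    return logical, physical


print(sizes_for(b"a" * (128 * 1024)))   # 128 Kb of 'a'
print(sizes_for(b"a" * 130560))         # 127.5 Kb of 'a'
```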

What this means is that ZFS has much less wasted space than I thought it did for files that are under the recordsize. Since such files grow their logical block size in 512-byte chunks, even with no compression they waste at most almost all of one physical block on disk (if you have a file that is, say, 32 Kb plus one byte, you'll have a physical block on disk with only one byte used). This has some implications for other areas of ZFS, but those are for another entry.
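
The wasted space for a small file can be sketched like this. It assumes incompressible data on a plain (non-raidz) vdev, where I'm taking the allocated size to be the logical size rounded up to the ashift; small_file_sizes is a hypothetical helper.

```python
def round_up(nbytes, unit):
    """Round nbytes up to a multiple of unit."""
    return (nbytes + unit - 1) // unit * unit


def small_file_sizes(file_size, ashift, recordsize=128 * 1024):
    """For a file of at most recordsize bytes, return
    (logical block size, allocated size, bytes allocated but unused)."""
    if not 0 < file_size <= recordsize:
        raise ValueError("file must be between 1 byte and recordsize")
    logical = round_up(file_size, 512)          # grows in 512-byte chunks
    allocated = round_up(logical, 1 << ashift)  # disk space is per-ashift
    return logical, allocated, allocated - file_size


# 32 Kb plus one byte, on ashift=12 and ashift=9 vdevs.
print(small_file_sizes(32 * 1024 + 1, ashift=12))
print(small_file_sizes(32 * 1024 + 1, ashift=9))
# The 16896 byte file from earlier in this entry.
print(small_file_sizes(16896, ashift=12))
```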

(This is one of those entries that I'm really glad that I decided to write. I set out to write it as a prequel to another entry just to have how ZFS grew the block size of files written down explicitly, but wound up upending my understanding of the whole area. The other lesson for me is that verifying my understanding with experiments is a really good idea, because every so often my folk understanding is drastically wrong.)

ZFSRecordsizeGrowth written at 22:28:55


What ZFS gang blocks are and why they exist

If you read up on ZFS internals, sooner or later you will run across references to 'gang blocks'. For instance, they came up when I talked about what's in a DVA, where DVAs have a flag to say that they point to a gang block instead of a regular block. Gang blocks are vaguely described as being a way of fragmenting a large logical block into a bunch of separate sub-blocks.

A more on-point description can be found in the (draft) ZFS on-disk specification (PDF, via) or the source code comments about them in zio.c. I'll selectively quote from zio.c because it's easier to follow:

A gang block is a collection of small blocks that looks to the DMU like one large block. When zio_dva_allocate() cannot find a block of the requested size, due to either severe fragmentation or the pool being nearly full, it calls zio_write_gang_block() to construct the block from smaller fragments.

A gang block consists of a gang header and up to three gang members. The gang header is just like an indirect block: it's an array of block pointers. It consumes only one sector and hence is allocatable regardless of fragmentation. The gang header's bps point to its gang members, which hold the data.


Gang blocks can be nested: a gang member may itself be a gang block. Thus every gang block is a tree in which root and all interior nodes are gang headers, and the leaves are normal blocks that contain user data. The root of the gang tree is called the gang leader.

A 'gang header' contains three full block pointers, some padding, and then a trailing checksum. The whole thing is sized so that it takes up only a single 512-byte sector; I believe this means that gang headers in ashift=12 vdevs waste a bunch of space, or at least leave the remaining 3.5 Kb unused.

To understand more about gang blocks, we need to understand why they're needed. As far as I know, this comes down to the fact that ZFS files only ever have a single (logical) block size. If a file is less than the recordsize (usually 128 Kb), it's in a single logical block whose size is a multiple of 512 bytes; once it hits recordsize or greater, it's in a number of recordsize'd blocks. This means that writing new data to most files normally requires allocating some size of contiguous block (up to 128 Kb, but less if the data you're writing is compressible).

(I believe that there is also metadata that's always unfragmented and may be in blocks up to 128 Kb.)

However, ZFS doesn't guarantee that a pool always has free 128 Kb chunks available, or in fact any particular size of chunk. Instead, free space can be fragmented; you might be unfortunate enough to have many gigabytes of free space, but all of it in fragments that were, say, 32 Kb and smaller. This is where ZFS needs to resort to gang blocks, basically in order to lie to itself about still writing single large blocks.

(Before I get too snarky, I should say that this lie probably simplifies the life of higher level code a fair bit. Rather than have a whole bunch of data and metadata handling code that has to deal with all sorts of fragmentation, most of ZFS can ignore the issue and then lower level IO code quietly makes it all work. Actually using gang blocks should be uncommon.)

All of this explains why the gang block bit is a property of the DVA, not of anything else. The DVA is where space gets allocated, so the DVA is where you may need to shim in a gang block instead of getting a contiguous chunk of space. Since different vdevs generally have different levels of fragmentation, whether or not you have a contiguous chunk of the necessary size will often vary from vdev to vdev, which is the DVA level again.

One quiet complication created by gang blocks is that according to comments in the source code, the gang members may not wind up on the same vdev as the gang header (although ZFS tries to keep them on the same vdev because it makes life easier). This is different from regular blocks, which are always only on a single vdev (although they may be spread across multiple disks if they're on a raidz vdev).

Gang blocks have some space overhead compared to regular blocks (in addition to being more fragmented on disk), but how much is quite dependent on the situation. Because each gang header can only point to three gang member blocks, you may wind up needing multiple levels of nested gang blocks if you have an unlucky combination of fragmented free space and a large block to write. As an example, suppose that you need to write a 128 Kb block and the pool only has 32 Kb chunks free. 128 Kb requires four 32 Kb chunks, which is more than a single gang header can point to, so you need a nested gang block; your overhead is two sectors for the two gang headers needed. If the pool was more heavily fragmented, you'd need more nested gang blocks and the overhead would go up. If the pool had a single 64 Kb chunk left, you could have written the 128 Kb with two 32 Kb chunks and the 64 Kb chunk and thus not needed the nested gang block with its additional gang header.
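
The counting in this example can be sketched as follows. This is a simplified model that mirrors the reasoning above, not how ZFS's allocator actually picks fragments: it greedily uses the largest free chunks, then works out the minimum number of gang headers from the three-pointer limit. Both function names are hypothetical.

```python
POINTERS_PER_HEADER = 3   # gang members per gang header, per the zio.c comment


def fragments_for(block_size, free_chunks):
    """Greedily cover block_size using the largest free chunks first."""
    remaining = block_size
    used = []
    for chunk in sorted(free_chunks, reverse=True):
        if remaining <= 0:
            break
        take = min(chunk, remaining)
        used.append(take)
        remaining -= take
    if remaining > 0:
        raise ValueError("not enough free space for the block")
    return used


def gang_headers_needed(n_fragments):
    """Minimum gang headers for n leaf fragments.

    A tree of h headers has 3*h pointer slots, of which h-1 point at
    nested headers, leaving 2*h + 1 slots for data fragments.
    """
    if n_fragments < 1:
        raise ValueError("need at least one fragment")
    if n_fragments == 1:
        return 0          # a regular block, no gang header at all
    return -(-(n_fragments - 1) // (POINTERS_PER_HEADER - 1))


K = 1024
only_32k = fragments_for(128 * K, [32 * K] * 8)
with_64k = fragments_for(128 * K, [64 * K, 32 * K, 32 * K, 32 * K])
print(len(only_32k), gang_headers_needed(len(only_32k)))
print(len(with_64k), gang_headers_needed(len(with_64k)))
```

In this model the all-32 Kb case needs four fragments and two gang headers, while having one 64 Kb chunk available brings it down to three fragments and a single header.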

(Because ZFS only uses a gang block when the space required isn't available in a contiguous block, gang blocks are absolutely sure to be scattered on the disk.)

PS: As far as I can see, a pool doesn't keep any statistics on how many times gang blocks have been necessary or how many there currently are in the pool.

ZFSGangBlocks written at 02:55:39


Confirming the behavior of file block sizes in ZFS

ZFS filesystems have a property called their recordsize, which is usually described as something like the following (from here):

All files are stored either as a single block of varying sizes (up to the recordsize) or using multiple recordsize blocks.

A while back I wrote about using zdb to peer into how ZFS stores files on disk, where I looked into how ZFS stored a 160 Kb file and specifically if it really did use two 128 Kb blocks to hold it, instead of a 128 Kb block and a 32 Kb block. The answer was yes, with some additional discoveries about ZFS compression and partial blocks.

Today I wound up wondering once again if that informal description of how ZFS behaves was really truly the case. Specifically, I wondered if there were situations where ZFS could wind up with a mixture of block sizes, say a 4 Kb block that was written initially at the start of the file and then a larger block written later after a big hole in the file. If ZFS really always stored sufficiently large files with only recordsize blocks, it would have to go back to rewrite the initial 4 Kb block, which seemed a bit odd to me given ZFS's usual reluctance to rewrite things.

So I did this experiment. We start out with a 4 Kb file, sync it, verify (with zdb) that it's there on disk and looks like we expect, and then extend the file with a giant hole, writing 32 Kb inside the 128 Kb region that starts 512 Kb into the file:

dd if=/dev/urandom of=testfile bs=4k count=1
[wait, check with zdb]
dd if=/dev/urandom of=testfile bs=32k seek=19 count=1 conv=notrunc

The first write creates a testfile that has a ZFS file block size of 4 Kb (which zdb prints as the dblk field); these are the initial conditions we expect. We can also see a single 4 Kb data block at offset 0:

# zdb -vv -bbbb -O ssddata/homes cks/tmp/testfile
Indirect blocks:
     0 L0 0:204ea46a00:1000 1000L/1000P F=1 B=5401327/5401327

After writing the additional 32 Kb, zdb reports that the file's block size has jumped up to 128 Kb, the standard ZFS dataset recordsize; this again is what we expect. However, it also reports a change in the indirect blocks. They are now:

Indirect blocks:
     0 L1  0:200fdf4200:400 20000L/400P F=2 B=5401362/5401362
     0  L0 0:200fdf2e00:1400 20000L/1400P F=1 B=5401362/5401362
 80000  L0 0:200fdeaa00:8400 20000L/8400P F=1 B=5401362/5401362

The L0 indirect block that starts at file offset 0 has changed. It's been rewritten from a 4 Kb logical / 4 Kb physical block to being 128 Kb logical and 5 Kb physical (this is still an ashift=9 pool), and the TXG it was created in (the B= field) is the same as the other blocks.
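
As a cross-check on the offsets and sizes in the zdb output, here is the arithmetic as a small sketch. It takes the dd command exactly as written above (bs=32k, seek=19) and the default 128 Kb recordsize; the helper names are hypothetical.

```python
def dd_write_range(bs, seek, count=1):
    """Byte range [start, end) that dd writes with of= and seek=."""
    start = bs * seek
    return start, start + bs * count


def record_start(offset, recordsize=128 * 1024):
    """Start offset of the recordsize block that contains `offset`."""
    return offset // recordsize * recordsize


start, end = dd_write_range(bs=32 * 1024, seek=19)
print(start // 1024, hex(record_start(start)), hex(record_start(end - 1)))

# The physical sizes zdb reports for the two L0 blocks, in Kb.
print(0x1400 / 1024, 0x8400 / 1024)
```

The write lands entirely inside the block that zdb shows at offset 80000 (hex), and the 8400P physical size is 33 Kb, a bit over the 32 Kb of random data that was written.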

So what everyone says about the ZFS recordsize is completely true. ZFS files only ever have one (logical) block size, which starts out as small as it can be and then expands out as the file gets more data (or, more technically, as the maximum offset of data in the file increases). If you push it, ZFS will rewrite existing data you're not touching in order to expand the (logical) block size out to the dataset recordsize.

If you think about it, this rewriting is not substantially different from what happens if you write 4 Kb and then write another 4 Kb after it. Just as here, ZFS will replace your initial 4 Kb data block with an 8 Kb data block; it just feels a bit more expected because both the old and the new data fall within the first full 128 Kb recordsize block of the file.

(Apparently, every so often something in ZFS feels sufficiently odd to me that I have to go confirm it for myself, just to be sure and so I can really believe in it without any lingering doubts.)

ZFSFileRecordsizeGrowth written at 01:33:44


Some details of ZFS DVAs and what some of their fields store

One piece of ZFS terminology is DVA and DVAs, which is short for Data Virtual Address. For ZFS, a DVA is the equivalent of a block number in other filesystems; it tells ZFS where to find whatever data we're talking about. DVAs are generally embedded into 'block pointers', and you can find a big comment laying out the entire structure of all of this in spa.h. The two fields of a DVA that I'm interested in today are the vdev and the offset.

(The other three fields are a reserved field called GRID, a bit to say whether the DVA is for a gang block, and asize, the allocated size of the block on its vdev. The allocated size has to be a per-DVA field for various reasons. The logical size of the block and its physical size after various sorts of compression are not DVA or vdev dependent, so they're part of the overall block pointer.)

The vdev field of a DVA is straightforward; it is the index of the vdev that the block is on, starting from zero for the first vdev and counting up. Note that this is not the GUID of the vdev involved, which is what you might sort of expect given a comment that calls it the 'virtual device ID'. Using the index means that ZFS can never shuffle the order of vdevs inside a pool, since these indexes are burned into DVAs stored on disk (as far as I know, and this matches what zdb prints, eg).

The offset field tells you where to find the start of the block on the vdev in question. Because this is an offset into the vdev, not a device, different sorts of vdevs have different ways of translating this into specific disk addresses. Specifically, RAID-Z vdevs must generally translate a single incoming IO at a single offset to the offsets on multiple underlying disk devices for multiple IOs.

At this point we arrive at an interesting question, namely what units the offset is in (since there are a bunch of possible options). As far as I can tell from looking at the ZFS kernel source code, the answer is that the DVA offset is in bytes. Some sources say that it's in 512-byte sectors, but as far as I can tell this is not correct (and it's certainly not in larger units, such as the vdev's ashift).

(This doesn't restrict the size of vdevs in any important way, since the offset is a 63-bit field.)
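
Here is a minimal sketch of reading a DVA as zdb prints it, under the interpretation above that the offset is in bytes. It assumes zdb's vdev:offset:asize form with the vdev in decimal and the other two fields in hex, which is consistent with the examples in these entries; the DVA class and parse_zdb_dva are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DVA:
    vdev: int     # index of the vdev within the pool, counting from 0
    offset: int   # byte offset into the vdev
    asize: int    # allocated size on the vdev, in bytes


def parse_zdb_dva(text):
    """Parse a DVA such as '<0:444bbc000:5000>' or '0:444bbc000:5000'."""
    vdev, offset, asize = text.strip("<>").split(":")
    return DVA(int(vdev), int(offset, 16), int(asize, 16))


# Largest value a 63-bit offset field can hold, in bytes.
MAX_OFFSET = (1 << 63) - 1

dva = parse_zdb_dva("<0:444bbc000:5000>")
print(dva)
print(dva.offset % 4096 == 0, dva.offset <= MAX_OFFSET)
print((MAX_OFFSET + 1) // 2**60, "EiB addressable with byte offsets")
```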

One potentially important consequence of this is that DVA offsets are independent of the sector size of the underlying disks in vdevs. Provided that your vdev ashift is large enough, it doesn't matter if you use disks with 512-byte logical sectors or the generally rarer disks with real 4k sectors (both physical and logical), and you can replace one with the other. Well, in theory, as there may be other bits of ZFS that choke on this (I don't know if ZFS's disk labels care, for example). But DVAs won't, which means that almost everything in the pool (metadata and data both) should be fine.
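Here is a small Python sketch of why byte offsets don't care about the disk's sector size. It assumes that ZFS allocates space on a vdev in units of 2^ashift bytes (ashift being the vdev's sector-size shift) and aligns allocations to that unit; `aligned_for_disk` and the sample allocations are hypothetical, purely for illustration.

```python
def aligned_for_disk(offset: int, asize: int, logical_sector: int) -> bool:
    """True if a byte range starts and ends on the disk's sector boundaries."""
    return offset % logical_sector == 0 and asize % logical_sector == 0


ashift = 12          # assumed vdev ashift, ie 4096-byte allocation units
unit = 1 << ashift

# Made-up (offset, asize) pairs, all multiples of the allocation unit.
allocations = [(n * unit, k * unit) for n, k in [(0, 1), (7, 2), (1000, 32)]]

# Every allocation is usable on both 512-byte and 4k logical sector disks.
for sector in (512, 4096):
    print(sector, all(aligned_for_disk(o, s, sector) for o, s in allocations))

# With 512-byte allocation units this no longer holds for a 4k disk.
small = (3 * 512, 512)
print(aligned_for_disk(*small, 512), aligned_for_disk(*small, 4096))
```

Anything aligned to 4096 bytes is also aligned to 512 bytes, which is why a large enough ashift lets you swap one sort of disk for the other. The reverse doesn't hold, as the last line shows.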

PS: There are additional complications for ZFS gang blocks and so on, but I'm omitting that in the interests of keeping this manageable.

ZFSDVAOffsetVdevDetails written at 01:49:19


Our next generation of fileservers will not be based on Illumos

Our current generation of ZFS NFS fileservers is based on OmniOS. We've slowly been working on the design of our next generation for the past few months, and one of the decisions we've made is that unless something really unusual happens, we won't be using any form of Illumos as the base operating system. While we're going to continue using ZFS, we'll be basing our fileservers on either ZFS on Linux or FreeBSD (preferably ZoL, because we already run lots of Linux machines and we don't have any FreeBSD ones).

This is not directly because of uncertainties around OmniOS CE's future (or the then-lack of an LTS release that I wrote about, because it now has one). There is really no single cause that could change our minds if it was fixed or changed; instead there are multiple contributing factors. Ultimately we made our decision because we are not in love with OmniOS and we no longer think we need to run it in order to get what we really want, which is ZFS with solid NFS fileservice.

However, I feel I need to mention some major contributing factors. The largest single factor is our continued lack of confidence in Illumos's support for Intel 10G-T chipsets. As far as I can tell from the master Illumos source, nothing substantial has changed here since back in 2014, and certainly I don't consider it a good sign that the ixgbe driver still does kernel busy-waits for milliseconds at a time. We consider 10G-T absolutely essential for our next generation of fileservers and we don't want to take chances.

(If you want to see how those busy-waits happen, look at the definition of msec_delay in ixgbe_osdep.h. drv_usecwait is specifically defined to busy-wait; it's designed to be used for microsecond durations, not millisecond ones.)

Another significant contributing factor is our frustrations with OmniOS's KYSTY minimalism, which makes dealing with our OmniOS machines more painful than dealing with our Linux ones (even the Linux ones that aren't Ubuntu based). And yes, having differently named commands does matter. It's possible that another Illumos based distribution could do better here, but I don't think there's a better one for our needs and it would still leave us with our broad issues with Illumos.

It's undeniable that we have more confidence in Linux on the whole than we do in Illumos. Linux is far more widely and heavily used, generally supports more hardware (and does so more promptly), and we've already seen that Intel 10G-T cards work fine in it (we have them in a number of our existing Linux machines, where they run great). Basically the only risk area is ZFS on Linux, and we have FreeBSD as a fallback.

There are some aspects of OmniOS that I will definitely miss, most notably DTrace. Modern Linux may have more or less functional equivalents, but I don't think there's anything that's half as usable. However, I have no sentimental attachments to Solaris or Illumos; I don't hate it, but on the whole I won't miss it, and an all-Linux environment will make my life simpler.

(This decision is only partly related to our decision not to use a SAN in the next generation of fileservers. While we could probably use OmniOS with the local disk setup that we want, not having to worry about Illumos's hardware support for various controller hardware does make our lives simpler.)

IllumosNoFutureHere written at 00:11:10


Sequential scrubs and resilvers are coming for (open-source) ZFS

Oracle has made a number of changes and improvements to Solaris ZFS since they took it closed source. Mostly I've been indifferent to their changes, but the one improvement I've long envied is their sequential resilvering and scrubbing (this apparently first appeared in Solaris 11.2). That ZFS scrubs and resilvers aren't sequential has long been a quiet pain point for a lot of people. Apparently it's especially bad for RAID-Z pools (perhaps because of the usual RAID-Z random read issue), but it's been an issue for us in the past with mirrors (although we managed to speed that up).

Well, there's great news here for all open source ZFS implementations, including Illumos distributions, because an implementation of sequential scrubs and resilvers just landed in ZFS on Linux in this commit (apparently it'll be included in ZoL 0.8 whenever that's released). The ZFS on Linux work was done by Tom Caputi of Datto, building on work done by Saso Kiselkov of Nexenta. Saso Kiselkov's work was presented at the 2016 OpenZFS developer summit and got an OpenZFS wiki summary page; Tom Caputi presented at the 2017 summit. Both have slides (and talk videos) if you want more information on how this works.

(It appears that the Nexenta work may be 'NEX-6068', included in NexentaStor 5.0.3. I can't find a current public source tree for Nexenta, so I don't know anything more than that.)

For how it works, I'll just quote from the commit message:

This patch improves performance by splitting scrubs and resilvers into a metadata scanning phase and an IO issuing phase. The metadata scan reads through the structure of the pool and gathers an in-memory queue of I/Os, sorted by size and offset on disk. The issuing phase will then issue the scrub I/Os as sequentially as possible, greatly improving performance.
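To illustrate the general idea in that description (and only the general idea; this is not the actual ZFS code), here is a toy Python model. It assumes that the metadata scan discovers blocks in an order that's effectively random on disk and that cost is just how far the disk head moves. It sorts purely by offset, whereas the commit message says the real queue is sorted by size and offset. All of the names are hypothetical.

```python
import random


def total_seek_distance(offsets):
    """Sum of head movement when visiting offsets in the given order."""
    position, distance = 0, 0
    for off in offsets:
        distance += abs(off - position)
        position = off
    return distance


random.seed(1)

# Phase 1, the metadata scan: blocks turn up in logical order, which
# this toy model treats as scattered randomly over the disk.
discovered = [random.randrange(0, 10 ** 9) for _ in range(10_000)]

# Phase 2, issuing: sort the gathered I/Os by on-disk offset first.
issue_order = sorted(discovered)

naive = total_seek_distance(discovered)
sequential = total_seek_distance(issue_order)
print(f"discovery order moves the head {naive / sequential:.0f}x further")
```

In this model the sorted order only ever sweeps across the disk once, so its total head movement is just the highest offset, while the discovery order bounces back and forth for every block.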

My early experience with this in the current ZoL git tree has been very positive. I saw a single-vdev mirror pool on HDs with 293 GB used go from a scrub time of two hours and 25 minutes to one hour and ten minutes.
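For a rough sense of scale, here is the arithmetic on those numbers as a short Python sketch. It assumes that the 293 GB figure is in decimal gigabytes and that the scrub ran at a steady rate throughout, neither of which I actually know, so treat the rates as ballpark averages.

```python
before_minutes = 2 * 60 + 25   # two hours and 25 minutes
after_minutes = 1 * 60 + 10    # one hour and ten minutes

speedup = before_minutes / after_minutes
print(f"speedup: {speedup:.2f}x")

used_bytes = 293 * 10 ** 9     # assumption: decimal GB


def average_mb_per_sec(minutes):
    """Average scrub rate in MB/s over the whole run."""
    return used_bytes / (minutes * 60) / 10 ** 6


print(f"before: {average_mb_per_sec(before_minutes):.1f} MB/s")
print(f"after:  {average_mb_per_sec(after_minutes):.1f} MB/s")
```

That works out to a bit more than a 2x speedup, going from roughly 34 MB/s to roughly 70 MB/s averaged over the used space.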

Although this is very early days for this feature even in ZFS on Linux, I'd expect it to get pushed (or pulled) upstream later and thus go into Illumos. I have no idea when that might happen; it might be reasonable to wait until ZFS on Linux has included it in an actual release so that it sees some significant testing in the field. Or people could find this an interesting and important enough change that they actively work to bring it upstream, if only for testing there.

(At this point I haven't spotted any open issues about this in the Illumos issue tracker, but as mentioned I don't really expect that yet unless someone wants to get a head start.)

PS: Unlike Oracle's change for Solaris 11.2, which apparently needed a pool format change (Oracle version 35, according to Wikipedia), the ZFS on Linux implementation needs no new pool feature and so is fully backward compatible. I'd expect this to be true for any eventual Illumos version unless people find some hard problem that forces the addition of a new pool feature.

ZFSSequentialScrubIsComing written at 00:08:24
