Thinking about why ZFS only does IO in
recordsize blocks, even random IO
As I wound up experimentally verifying, in ZFS all files are stored as a single block of varying size up to the filesystem's recordsize, or using multiple recordsize blocks. As is perhaps less well known, a ZFS logical block is the minimum size of IO to a file, both for reads and especially for writes. Since the default recordsize is 128 Kb, this means that many files of interest are stored in 128 Kb recordsize blocks and thus all IO to them is done in 128 Kb units, even if you're only reading or writing a small amount of data.
On the one hand, this seems a little bit crazy. The time it takes to transfer 128 Kb over a SATA link is not always something that you can ignore, and on SSDs larger writes can have a real impact. On the other hand, I think that this choice is more or less forced by some decisions that ZFS has made. Specifically, the ZFS checksum covers the entire logical block, and ZFS's data structure for 'where you find things on disk' is also based on logical blocks.
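To put rough numbers on the SATA transfer time claim, here is a back-of-the-envelope sketch (my own figures, with an assumed link speed; nothing here comes from ZFS itself):

```python
# Rough cost of always doing recordsize IO over a SATA link.
# The link speed is an illustrative assumption, not a measurement.
def transfer_ms(size_bytes, mb_per_sec):
    """Milliseconds to move size_bytes at a given MB/s."""
    return size_bytes / (mb_per_sec * 1e6) * 1e3

KB = 1024
sata3 = 600  # MB/s, roughly the SATA 3.0 ceiling (assumed)

# A 4 Kb logical read still moves the whole 128 Kb block off the disk.
print(transfer_ms(4 * KB, sata3))    # well under 0.01 ms for the data you wanted
print(transfer_ms(128 * KB, sata3))  # roughly 0.2 ms for the whole block
```

The absolute numbers are small, but the ratio (32x more data moved than asked for) is what can start to matter under sustained small random IO.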
I wrote before about the ZFS DVA, which
is ZFS's equivalent of a block number and tells you where to find
data. ZFS DVAs are embedded into 'block pointers', which you can
find described in spa.h.
One of the fields of the block pointer is the ZFS block checksum.
Since this is part of the block pointer, it is a checksum over all
of the (logical) data in the block, which is up to
recordsize. Once a file reaches recordsize bytes long, all blocks are the same size, the recordsize.
Since the ZFS checksum is over the entire logical block, ZFS has
to fetch the entire logical block in order to verify the checksum
on reads, even if you're only asking for 4 Kbytes out of it. For
writes, even if ZFS allowed you to have different sized logical
blocks in a file, you'd need to have the original block available in order to split it and you'd have to write all of it
back out (both because ZFS never overwrites in place and because
the split creates new logical blocks, which need new checksums).
Since you need to add new logical blocks, you might have a ripple
effect in ZFS's equivalent of indirect blocks, where they must
expand and shuffle things around.
(If you're not splitting the logical block when you write to only a part of it, copy on write means that there's no good way to do this without rewriting the entire block.)
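As a sketch of this logic, here is a toy model (in Python, and nothing to do with the actual ZFS code) of blocks whose checksum covers the whole logical block; it shows why small reads force whole-block reads and why small writes force whole-block rewrites under copy-on-write:

```python
import hashlib

RECORDSIZE = 128 * 1024

def make_block(data):
    # As in ZFS, the checksum lives alongside the block pointer and
    # covers the entire logical block.
    assert len(data) <= RECORDSIZE
    return {"data": data, "checksum": hashlib.sha256(data).hexdigest()}

def read_range(block, offset, length):
    # To verify the checksum we must hash the entire logical block,
    # no matter how little of it the caller asked for.
    if hashlib.sha256(block["data"]).hexdigest() != block["checksum"]:
        raise IOError("checksum mismatch")
    return block["data"][offset:offset + length]

def write_range(block, offset, new_bytes):
    # Copy-on-write: a partial write still produces a whole new
    # logical block with a freshly computed whole-block checksum.
    data = block["data"]
    new_data = data[:offset] + new_bytes + data[offset + len(new_bytes):]
    return make_block(new_data)

blk = make_block(b"\x00" * RECORDSIZE)
blk2 = write_range(blk, 4096, b"x" * 4096)  # a 4 Kb write rewrites all 128 Kb
assert read_range(blk2, 4096, 4096) == b"x" * 4096
```

The model leaves out everything else (DVAs, compression, indirect blocks), but it captures why the whole-block checksum pins IO to the logical block size.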
In fact, the more I think about this, the more it seems that having multiple (logical) block sizes in a single file would be the way to madness. There are so many things that get complicated if you allow variable block sizes. These issues can be tackled, but it's simpler not to. ZFS's innovation is not that it insists that files have a single block size, it is that it allows this block size to vary. Most filesystems simply set the block size to, say, 4 Kbytes, and live with how large files have huge indirect block tables and other issues.
(The one thing that might make ZFS nicer in the face of some access
patterns where this matters is the ability to set the recordsize on a per-file basis instead of just a per-filesystem basis. But I'm
not sure how important this would be; the kind of environments where
it really matters are probably already doing things like putting
database tables on their own filesystems anyway.)
PS: This feels like an obvious thing once I've written this entry
all the way through, but the ZFS
recordsize issue has been one
of my awkward spots for years, where I didn't really understand why
it all made sense and had to be the way it was.
PPS: All of this implies that if ZFS did split logical blocks when you did a partial write, the only time you'd win would be if you then overwrote what was now a single logical block a second time. For example, if you created a big file, wrote 8 Kb to a spot in it (splitting a 128 Kb block into several new logical blocks, including an 8 Kb one for the write you just did), then later wrote exactly 8 Kb again to exactly that spot (overwriting only your new 8 Kb logical block). This is probably obvious too but I wanted to write it out explicitly, if only to convince myself of the logic.
Some exciting ZFS features that are in OmniOS CE's (near) future
I recently wrote about how much better ZFS pool recovery is coming, which reported on Pavel Zakharov's Turbocharging ZFS Data Recovery. In that, Zakharov said that the first OS to get it would likely be OmniOS CE, although he didn't have a timeline. Since I just did some research on this, let's run down some exciting ZFS features that are almost certainly in OmniOS CE's near future, and where they are.
There are two big ZFS features from Delphix that have recently landed in the main Illumos tree, and a third somewhat smaller one:
- This better ZFS pool recovery, which landed as a series of changes
culminating in issue 9075
in February or so. Although I can't be sure, I believe that a
recovered pool is fully compatible with older ZFS versions,
although for major damage you're going to be copying data out
of the pool to a new one.
- The long awaited feature of shrinking
ZFS pools by removing vdevs, which landed as issue 7614 in January. Using this will
add a permanent feature flag to your pool that makes it fully
incompatible with older ZFS versions.
- A feature for checkpointing the overall ZFS pool state before you do potentially dangerous operations, so that you can then rewind to it: issue 9166, which landed just a few days ago. Since one of the purposes of better ZFS pool recovery is to provide (better) recovery over pool configuration changes, I suspect that this new pool checkpointing helps out with it. This makes your pool relatively incompatible with older ZFS versions while a checkpoint exists.
(Apparently Delphix was only able to push these upstream from their own code base to Illumos and OpenZFS relatively recently.)
All of these features have been pulled into the 'master' branch of the OmniOS CE repository from the main Illumos repo where they landed. Unless something unusual happens, I would expect them all to be included in the next release of OmniOS CE, which their release schedule says is to be r151026, expected some time this May. This will not be an LTS release; if you want to wait for an LTS release to have these features, you're waiting until next year. Given the likely magnitude of these changes and the relatively near future release of r151026, I wouldn't expect OmniOS CE to include these in a future update to the current r151024 or especially r151022 LTS.
Since OmniOS CE Bloody integrates kernel and user updates on a regular basis, I suspect that it already has many of these features and will pick up the most recent one very soon. If this is so, it gives OmniOS people a clear path if they need to recover a damaged pool; you can boot a Bloody install media or otherwise temporarily run it, repair or import the pool, possibly copying the data to another pool, and then probably revert back to running your current OmniOS with the recovered pool.
Sidebar: How to determine this sort of stuff
The most convenient way is to look at the git log for commits that involve, say, usr/src/uts/common/fs/zfs, the kernel ZFS code, in the OmniOS CE repo. In the Github interface, this is drilling down to that directory and then picking the 'History' option; a convenient link for this for the 'master' branch is here.
Each OmniOS CE release gets its own branch in the repo, named in the obvious way, and each branch thus has its own commit history for ZFS. Here is the version for r151024. Usefully, Github forms the URLs for these things in a very predictable way, making it very easy to hand-write your own URLs for specific things (eg, the same for r151022 LTS, which shows only a few recent ZFS changes).
There's no branch for OmniOS CE Bloody, so I believe that it's simply built from the 'master' branch. It is the bleeding edge version, after all.
Much better ZFS pool recovery is coming (in open source ZFS)
One of the long standing issues in ZFS has been that while it's usually very resilient, it can also be very fragile if the wrong things get damaged. Classically, ZFS has had two modes of operation; either it would repair any damage or it would completely explode. There was no middle ground of error recovery, and this isn't a great experience; as I wrote once, panicking the system is not an error recovery strategy. In early versions of ZFS there was no recovery at all (you restored from backups); in later versions, ZFS added a feature where you could attempt to recover from damaged metadata by rewinding time, which was better than nothing but not a complete fix.
The good news is that that's going to change, and probably not too long from now. What you want to read about this is Turbocharging ZFS Data Recovery, by Pavel Zakharov of Delphix, which covers a bunch of work that he's done to make ZFS more resilient and more capable of importing various sorts of damaged pools. Of particular interest is the ability to likely recover at least some data from a pool that's lost an entire vdev. You can't get everything back, obviously, but ZFS metadata is usually replicated on multiple vdevs so losing a single vdev will hopefully leave you with enough left to at least get the rest of the data out of the pool.
All of this is really great news. ZFS has long needed better options for recovery from various pool problems, as well as better diagnostics for failed pool imports, and I'm quite happy that the situation is finally going to be improving.
The article is also interesting for its discussion of the current low level issues involved in pool importing. For example, until I read it I had no idea about how potentially dangerous a ZFS pool vdev change was due to how pool configurations are handled during the import process. I'd love to read more details on how pool importing really works and what the issues are (it's a long standing interest of mine), but sadly I suspect that no one with that depth of ZFS expertise has the kind of time it would take to write such an article.
As far as the timing of these features being available in your ZFS-using OS of choice goes, his article says this:
As of March 2018, it has landed on OpenZFS and Illumos but not yet on FreeBSD and Linux, where I’d expect it to be upstreamed in the next few months. The first OS that will get this feature will probably be OmniOS Community Edition, although I do not have an exact timeline.
If you have a sufficiently important damaged pool, under some circumstances it may be good enough if there is some OS, any OS, that can bring up the pool to recover the data in it. For all that I've had my issues with OmniOS's hardware support, OmniOS CE does have fairly decent hardware support and you can probably get it going on most modern hardware in an emergency.
(And if OmniOS can't talk directly to your disk hardware, there's
always iSCSI, as we can testify. There's
probably also other options for remote disk access that OmniOS ZFS and zdb can deal with.)
PS: If you're considering doing this in the future and your normal OS is something other than Illumos, you might want to pay attention to the ZFS feature flags you allow to be set on your pool, since this won't necessarily work if your pool uses features that OmniOS CE doesn't (yet) support. This is probably not going to be an issue for FreeBSD but might be an issue for ZFS on Linux. You probably want to compare the ZoL manpage on ZFS pool features with the Illumos version or even the OmniOS CE version.
Sidebar: Current ZFS pool feature differences
The latest Illumos tree has three new ZFS pool features from Delphix: device_removal, obsolete_counts (which enhances device removal), and zpool_checkpoint. These are all fairly recent additions; they appear to have landed in the Illumos tree this January and just recently, although the commits that implement them are dated from 2016.
ZFS on Linux has four new pool features: large_dnode, project_quota, userobj_accounting, and encryption. Both large dnodes and encryption have to be turned on explicitly, and the other two are read-only compatible, so in theory OmniOS can bring a pool up read-only even with them enabled (and you're going to want to have the pool read-only anyway).
Some things about ZFS block allocation and ZFS (file) record sizes
As I wound up experimentally verifying,
in ZFS all files are stored as a single block of varying size up
to the filesystem's
recordsize, or using multiple recordsize
blocks. For a file under the recordsize, the block size turns out to be a multiple of 512 bytes, regardless
of the pool's
ashift or the physical sector size of the drives
the pool is using.
Well, sort of. While everything I've written is true, it also turns out to be dangerously imprecise (as I've seen before). There are actually three different sizes here and the difference between them matters once we start getting into the fine details.
To talk about these sizes, I'll start with some illustrative
output for a file data block, as before:
0 L0 DVA=<0:444bbc000:5000> [L0 ZFS plain file] [...] size=4200L/4200P [...]
The first size of the three is the logical block size, before
compression. This is the first
size= number ('4200L' here, in hex
and L for logical). This is what grows in 512-byte units up to the
recordsize and so on.
The second size is the physical size after compression, if any;
this is the second
size= number ('4200P' here, P for physical).
It's a bit weird. If the file can't be compressed, it is the same
as the logical size and because the logical size goes in 512-byte
units, so does this size, even on
ashift=12 pools. However, if compression happens, this size appears to go by the pool's ashift, which means it doesn't necessarily go in 512-byte units. On an ashift=9 pool you'll see it go in 512-byte units (so you can have a compressed size of '400P', ie 1 KB), but the same data written in an ashift=12 pool winds up being in 4 Kb units (so you wind up with a compressed size of '1000P', ie 4 Kb).
The third size is the actual allocated size on disk, as recorded
in the DVA's asize field (which
is the third subfield in the
DVA portion). This is always in
ashift-based units, even if the physical size is not. Thus you
can wind up with a 20 KB DVA but a 16.5
KB 'physical' size, as in our example (the DVA is '5000' while the
block physical size is '4200').
(I assume this happens because ZFS ensures that the physical size is never larger than the logical size, although the DVA allocated size may be.)
For obvious reasons, it's the actual allocated size on disk (the DVA asize) that matters for things like rounding up raidz allocation to N+1 blocks, fragmentation, and whether you need to use a ZFS gang block. If you write a 128 KB (logical) block that compresses to a 16 KB physical block, it's 16 KB of (contiguous) space that ZFS needs to find on disk, not 128 KB.
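As a concrete check on this arithmetic, here is a small sketch (my own reconstruction, not ZFS code) of how the allocated size relates to the physical size and the vdev's ashift:

```python
# How the three sizes in a zdb line relate, as I understand it: the
# logical (L) and physical (P) sizes come from the block pointer,
# while the DVA asize is the physical size rounded up to the vdev's
# ashift-based allocation unit.
def dva_asize(physical_size, ashift):
    unit = 1 << ashift
    return (physical_size + unit - 1) // unit * unit

# The example block from above: size=4200L/4200P with a 0x5000 DVA asize.
assert dva_asize(0x4200, 12) == 0x5000  # 16.5 KB rounds up to 20 KB on ashift=12
assert dva_asize(0x4200, 9) == 0x4200   # already a multiple of 512 on ashift=9
assert dva_asize(0x400, 12) == 0x1000   # 1 KB compressed -> 4 KB on disk
```

This is also why a '4200P' physical size can coexist with a '5000' asize in the same DVA: the 16.5 KB to 20 KB gap is rounding, not data.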
On the one hand, how much this matters depends on how compressible your data is and much modern data isn't (because it's already been compressed in its user-level format). On the other hand, as I found out, 'sparse' space after the logical end of file is very compressible. A 160 KB file on a standard 128 KB recordsize filesystem takes up two 128 KB logical blocks, but the second logical block has 96 KB of nothingness at the end and that compresses down to almost nothing.
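You can see this effect with any ordinary compressor; here is a quick demonstration using Python's zlib (not ZFS's own compression algorithms, but the effect on a zero-filled tail is the same):

```python
# The second 128 Kb logical block of a 160 Kb file: 32 Kb of real
# data followed by 96 Kb of zero padding out to the block size.
import os
import zlib

KB = 1024
block = os.urandom(32 * KB) + b"\x00" * (96 * KB)
compressed = zlib.compress(block)

# Random data is effectively incompressible, so nearly all of the
# output is the 32 Kb of real data; the 96 Kb of zeros nearly
# vanishes.
print(len(compressed))  # a bit over 32 Kb, despite 128 Kb of input
assert len(compressed) < 34 * KB
```

ZFS's zle and lz4 are even better at runs of zeros than zlib is, so the real on-disk result is at least this good.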
PS: I don't know if it's possible to mix vdevs with different ashifts in the same pool. If it is, I don't know what ashift ZFS would use for the physical block size. The minimum ashift in any vdev? The maximum one?
(This is the second ZFS entry in a row where I thought I knew what was going on and it was simple, and then discovered that I didn't and it isn't.)
A surprise in how ZFS grows a file's record size (at least for me)
As I wound up experimentally verifying,
in ZFS all files are stored as a single block of varying size up
to the filesystem's
recordsize, or using multiple recordsize blocks. If a file has more than one block, all blocks are recordsize-sized, no more and no less. If a file is a single block, the size of this
block is based on how much data has been written to the file (or
technically the maximum offset that's been written to the file).
However, how the block size grows as you write data to the file turns out
to be somewhat surprising (which makes me very glad that I actually
did some experiments to verify what I thought I knew before I wrote
this entry, because I was very wrong).
Rather than involving the
ashift or growing in powers of two,
ZFS always grows the (logical) block size in 512-byte chunks
until it reaches the filesystem
recordsize. The actual physical
space allocated on disk is in
ashift sized units, as you'd expect,
but this is not directly related to the (logical) block size used
at the file level. For example, here is a 16896 byte file (of
incompressible data) on an ashift=12 pool:

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
   4780566    1   128K  16.5K    20K     512  16.5K  100.00  ZFS plain file
[...]
     0 L0 DVA=<0:444bbc000:5000> [L0 ZFS plain file] [...] size=4200L/4200P [...]
The DVA records an 0x5000 byte allocation (20 Kb), but the logical and physical sizes are only 0x4200 bytes (16.5 Kb).
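Putting this growth rule into a few lines of Python (my own summary of the behavior, not ZFS code):

```python
# A file's (logical) block size is the maximum byte offset written,
# rounded up to 512 bytes, and capped at the filesystem recordsize.
def file_block_size(max_offset, recordsize=128 * 1024):
    if max_offset >= recordsize:
        # Past recordsize the file becomes multiple recordsize blocks.
        return recordsize
    return (max_offset + 511) // 512 * 512

assert file_block_size(16896) == 16896        # the 16.5 Kb example: already 512-aligned
assert file_block_size(16897) == 17408        # one more byte adds a whole 512-byte step
assert file_block_size(200 * 1024) == 131072  # past recordsize, everything is 128 Kb
```

Note that this is the logical block size only; the physical allocation is still rounded to the vdev's ashift, as the 20 Kb DVA above shows.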
In thinking about it, this makes a certain amount of sense because
ashift is really a vdev property, not a pool property, and
can vary from vdev to vdev within a single pool. As a result, the
actual allocated size of a given block may vary from vdev to vdev
(and a block may be written to multiple vdevs if you have copies set to more than 1 or it's metadata). The file's current block size
thus can't be based on the
ashift, because ZFS doesn't necessarily
have a single
ashift to base it on; instead ZFS bases it on 512-byte
sectors, even if this has to be materialized differently on different vdevs.
Looking back, I've already sort of seen this with ZFS compression. As you'd expect, a file's (logical) block size is based on its uncompressed size, or more exactly on the highest byte offset in the file. You can write something to disk that compresses extremely well, and it will still have a large logical block size. Here's an extreme case:
; dd if=/dev/zero of=testfile bs=128k count=1
[...]
# zdb -vv -bbbb -O ssddata/homes cks/tmp/testfile
    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
    956361    1   128K   128K      0     512   128K    0.00  ZFS plain file
[...]
This turns out to have no data blocks allocated at all, because the 128 Kb of zeros can be recorded entirely in magic flags in the dnode. But it still has a 128 Kb logical block size. 128 Kb of the character 'a' does wind up requiring a DVA allocation, but the size difference is drastic:
    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
    956029    1   128K   128K     1K     512   128K  100.00  ZFS plain file
[...]
     0 L0 DVA=<0:3bbd1c00:400> [L0 ZFS plain file] [...] size=20000L/400P [...]
We have a compressed size of 1 Kb (and a 1 Kb allocation on disk,
as this is an
ashift=9 vdev), but once again the file block size
is 128 Kb.
(If we wrote 127.5 Kb of 'a' instead, we'd wind up with a file block size of 127.5 Kb. I'll let interested parties do that experiment themselves.)
What this means is that ZFS has much less wasted space than I thought
it did for files that are under the
recordsize. Since such files
grow their logical block size in 512-byte chunks, even with no
compression they waste at most almost all of one physical block on
disk (if you have a file that is, say, 32 Kb plus one byte, you'll
have a physical block on disk with only one byte used). This has
some implications for other areas of ZFS, but those are for another entry.
(This is one of those entries that I'm really glad that I decided to write. I set out to write it as a prequel to another entry just to have how ZFS grew the block size of files written down explicitly, but wound up upending my understanding of the whole area. The other lesson for me is that verifying my understanding with experiments is a really good idea, because every so often my folk understanding is drastically wrong.)
What ZFS gang blocks are and why they exist
If you read up on ZFS internals, sooner or later you will run across references to 'gang blocks'. For instance, they came up when I talked about what's in a DVA, where DVAs have a flag to say that they point to a gang block instead of a regular block. Gang blocks are vaguely described as being a way of fragmenting a large logical block into a bunch of separate sub-blocks.
A more on-point description can be found in the (draft) ZFS on-disk specification (PDF, via) or the source code comments about them in zio.c. I'll selectively quote from zio.c because it's easier to follow:
A gang block is a collection of small blocks that looks to the DMU like one large block. When zio_dva_allocate() cannot find a block of the requested size, due to either severe fragmentation or the pool being nearly full, it calls zio_write_gang_block() to construct the block from smaller fragments.
A gang block consists of a gang header and up to three gang members. The gang header is just like an indirect block: it's an array of block pointers. It consumes only one sector and hence is allocatable regardless of fragmentation. The gang header's bps point to its gang members, which hold the data.
Gang blocks can be nested: a gang member may itself be a gang block. Thus every gang block is a tree in which root and all interior nodes are gang headers, and the leaves are normal blocks that contain user data. The root of the gang tree is called the gang leader.
A 'gang header' contains three full block pointers, some padding, and then a trailing checksum. The whole thing is sized so that it takes up only a single 512-byte sector; I believe this means that gang headers in ashift=12 vdevs waste a bunch of space, or at least leave the remaining 3.5 Kb unused.
To understand more about gang blocks, we need to understand why
they're needed. As far as I know, this comes down to the fact that
ZFS files only ever have a single (logical) block size. If a file is less than the recordsize (usually 128 Kb), it's in a single logical block of the appropriate size; once it hits recordsize or greater, it's in a series of recordsize'd blocks. This means that writing new data to most files normally requires allocating some size of contiguous block (up to 128 Kb, but less if the data you're writing is compressible).
(I believe that there is also metadata that's always unfragmented and may be in blocks up to 128 Kb.)
However, ZFS doesn't guarantee that a pool always has free 128 Kb chunks available, or in fact any particular size of chunk. Instead, free space can be fragmented; you might be unfortunate enough to have many gigabytes of free space, but all of it in fragments that were, say, 32 Kb and smaller. This is where ZFS needs to resort to gang blocks, basically in order to lie to itself about still writing single large blocks.
(Before I get too snarky, I should say that this lie probably simplifies the life of higher level code a fair bit. Rather than have a whole bunch of data and metadata handling code that has to deal with all sorts of fragmentation, most of ZFS can ignore the issue and then lower level IO code quietly makes it all work. Actually using gang blocks should be uncommon.)
All of this explains why the gang block bit is a property of the DVA, not of anything else. The DVA is where space gets allocated, so the DVA is where you may need to shim in a gang block instead of getting a contiguous chunk of space. Since different vdevs generally have different levels of fragmentation, whether or not you have a contiguous chunk of the necessary size will often vary from vdev to vdev, which is the DVA level again.
One quiet complication created by gang blocks is that according to comments in the source code, the gang members may not wind up on the same vdev as the gang header (although ZFS tries to keep them on the same vdev because it makes life easier). This is different from regular blocks, which are always only on a single vdev (although they may be spread across multiple disks if they're on a raidz vdev).
Gang blocks have some space overhead compared to regular blocks (in addition to being more fragmented on disk), but how much is quite dependent on the situation. Because each gang header can only point to three gang member blocks, you may wind up needing multiple levels of nested gang blocks if you have an unlucky combination of fragmented free space and a large block to write. As an example, suppose that you need to write a 128 Kb block and the pool only has 32 Kb chunks free. 128 Kb requires four 32 Kb chunks, which is more than a single gang header can point to, so you need a nested gang block; your overhead is two sectors for the two gang headers needed. If the pool was more heavily fragmented, you'd need more nested gang blocks and the overhead would go up. If the pool had a single 64 Kb chunk left, you could have written the 128 Kb with two 32 Kb chunks and the 64 Kb chunk and thus not needed the nested gang block with its additional gang header.
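A small sketch (my own model, not ZFS code) of the header-counting math in that example, assuming all the data fragments are the same size:

```python
# Each gang header points at up to three members, and a member may
# itself be another gang block, so ganging many fragments together
# means nesting headers.
def gang_headers_needed(fragments):
    """Headers needed to gang together `fragments` data fragments.

    Up to three fragments fit under a single header; past that, one
    of the three slots at each level must point at a nested gang
    header, leaving two slots for data at that level.
    """
    headers = 1
    while fragments > 3:
        fragments -= 2
        headers += 1
    return headers

# 128 Kb out of 32 Kb chunks = 4 fragments -> 2 headers (nested once).
assert gang_headers_needed(4) == 2
# Two 32 Kb chunks plus one 64 Kb chunk = 3 fragments -> 1 header.
assert gang_headers_needed(3) == 1
```

The overhead in sectors is simply the header count, which is why the 64 Kb chunk in the last case saves you a sector as well as a level of nesting.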
(Because ZFS only uses a gang block when the space required isn't available in a contiguous block, gang blocks are absolutely sure to be scattered on the disk.)
PS: As far as I can see, a pool doesn't keep any statistics on how many times gang blocks have been necessary or how many there currently are in the pool.
Confirming the behavior of file block sizes in ZFS
ZFS filesystems have a property called their recordsize, which is usually described as something like the following (from here):
All files are stored either as a single block of varying sizes (up to the recordsize) or using multiple recordsize blocks.
A while back I wrote about using
zdb to peer into how ZFS stores
files on disk, where I looked into how ZFS
stored a 160 Kb file and specifically if it really did use two 128
Kb blocks to hold it, instead of a 128 Kb block and a 32 Kb block.
The answer was yes, with some additional discoveries about ZFS
compression and partial blocks.
Today I wound up wondering once again if that informal description of how ZFS behaves was really truly the case. Specifically, I wondered if there were situations where ZFS could wind up with a mixture of block sizes, say a 4 Kb block that was written initially at the start of the file and then a larger block written later after a big hole in the file. If ZFS really always stored sufficiently large files with only recordsize blocks, it would have to go back to rewrite the initial 4 Kb block, which seemed a bit odd to me given ZFS's usual reluctance to rewrite things.
So I did this experiment. We start out with a 4 Kb file, sync it,
check (with zdb) that it's there on disk and looks like we expect,
and then extend the file with a giant hole, writing 32 Kb at 512
Kb into the file:
dd if=/dev/urandom of=testfile bs=4k count=1
sync
[wait, check with zdb]
dd if=/dev/urandom of=testfile bs=32k seek=19 count=1 conv=notrunc
sync
The first write creates a
testfile that had a ZFS file block size
of 4 Kb (which
zdb prints as the
dblk field); this is the initial
conditions we expect. We can also see a single 4 Kb data block at the start of the file:

# zdb -vv -bbbb -O ssddata/homes cks/tmp/testfile
[...]
Indirect blocks:
     0 L0 0:204ea46a00:1000 1000L/1000P F=1 B=5401327/5401327
After writing the additional 32 Kb,
zdb reports that the file's
block size has jumped up to 128 Kb, the standard ZFS dataset
recordsize; this again is what we expect. However, it also
reports a change in the indirect blocks. They are now:
Indirect blocks:
     0 L1  0:200fdf4200:400 20000L/400P F=2 B=5401362/5401362
     0  L0 0:200fdf2e00:1400 20000L/1400P F=1 B=5401362/5401362
 80000  L0 0:200fdeaa00:8400 20000L/8400P F=1 B=5401362/5401362
The L0 indirect block that starts at file offset 0 has changed.
It's been rewritten from a 4 Kb logical / 4 Kb physical block to
being 128 Kb logical and 5 Kb physical (this is still an ashift=9 pool), and the TXG it was created in (the B= field) is the same as the other blocks.
So what everyone says about the ZFS recordsize is completely true.
ZFS files only ever have one (logical) block size, which starts out
as small as it can be and then expands out as the file gets more
data (or, more technically, as the maximum offset of data in the
file increases). If you push it, ZFS will rewrite existing data
you're not touching in order to expand the (logical) block size out
to the dataset recordsize.
If you think about it, this rewriting is not substantially different
from what happens if you write 4 Kb and then write another 4 Kb
after it. Just as here, ZFS will replace your initial 4 Kb data
block with an 8 Kb data block; it just feels a bit more expected
because both the old and the new data falls within the first full
recordsize block of the file.
(Apparently, every so often something in ZFS feels sufficiently odd to me that I have to go confirm it for myself, just to be sure and so I can really believe in it without any lingering doubts.)
Some details of ZFS DVAs and what some of their fields store
One piece of ZFS terminology is DVA and DVAs, which is short for Data Virtual Address. For ZFS, a DVA is the equivalent of a block number in other filesystems; it tells ZFS where to find whatever data we're talking about. DVAs are generally embedded into 'block pointers', and you can find a big comment laying out the entire structure of all of this in spa.h. The two fields of a DVA that I'm interested in today are the vdev and the offset.
(The other three fields are a reserved field called GRID, a bit to say whether the DVA is for a gang block, and asize, the allocated size of the block on its vdev. The allocated size has to be a per-DVA field for various reasons. The logical size of the block and its physical size after various sorts of compression are not DVA or vdev dependent, so they're part of the overall block pointer.)
The vdev field of a DVA is straightforward; it is the index of the
vdev that the block is on, starting from zero for the first vdev
and counting up. Note that this is not the GUID of the vdev involved,
which is what you might sort of expect given a comment that calls
it the 'virtual device ID'. Using the index means that ZFS can never
shuffle the order of vdevs inside a pool, since these indexes are
burned into DVAs stored on disk (as far as I know, and this matches
zdb prints, eg).
The offset field tells you where to find the start of the block on the vdev in question. Because this is an offset into the vdev, not a device, different sorts of vdevs have different ways of translating this into specific disk addresses. Specifically, RAID-Z vdevs must generally translate a single incoming IO at a single offset to the offsets on multiple underlying disk devices for multiple IOs.
At this point we arrive at an interesting question, namely what units the offset is in (since there are a bunch of possible options). As far as I can tell from looking at the ZFS kernel source code, the answer is that the DVA offset is in bytes. Some sources say that it's in 512-byte sectors, but as far as I can tell this is not correct (and it's certainly not in larger units, such as the vdev's ashift).
(This doesn't restrict the size of vdevs in any important way, since the offset is a 63-bit field.)
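For illustration, here is a little Python that pulls apart the textual vdev:offset:asize form that zdb prints for DVAs (my own helper, treating the offset as a byte offset into the vdev, as discussed above):

```python
# Parse a DVA as zdb prints it: 'vdev:offset:asize', with the offset
# and asize in hex. The offset is taken to be in bytes into the vdev.
def parse_dva(dva_text):
    vdev, offset, asize = dva_text.strip("<>").split(":")
    return int(vdev), int(offset, 16), int(asize, 16)

# The example DVA from earlier entries.
vdev, offset, asize = parse_dva("<0:444bbc000:5000>")
assert vdev == 0
assert offset == 0x444bbc000  # roughly 18.3 GB into vdev 0
assert asize == 0x5000        # 20 Kb allocated
```

Nothing here knows about vdev types; as the entry says, what that byte offset means on disk is up to the vdev (a RAID-Z vdev will fan it out across several underlying disks).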
One potentially important consequence of this is that DVA offsets are independent of the sector size of the underlying disks in vdevs. Provided that your vdev asize is large enough, it doesn't matter if you use disks with 512-byte logical sectors or the generally rarer disks with real 4k sectors (both physical and logical), and you can replace one with the other. Well, in theory, as there may be other bits of ZFS that choke on this (I don't know if ZFS's disk labels care, for example). But DVAs won't, which means that almost everything in the pool (metadata and data both) should be fine.
PS: There are additional complications for ZFS gang blocks and so on, but I'm omitting that in the interests of keeping this manageable.
Our next generation of fileservers will not be based on Illumos
Our current generation of ZFS NFS fileservers are based on OmniOS. We've slowly been working on the design of our next generation for the past few months, and one of the decisions we've made is that unless something really unusual happens, we won't be using any form of Illumos as the base operating system. While we're going to continue using ZFS, we'll be basing our fileservers on either ZFS on Linux or FreeBSD (preferably ZoL, because we already run lots of Linux machines and we don't have any FreeBSD ones).
This is not directly because of uncertainties around OmniOS CE's future (or its lack of an LTS release at the time, which I wrote about here; it now has one). There is no single cause that would change our minds if it were fixed or changed; instead there are multiple contributing factors. Ultimately we made our decision because we are not in love with OmniOS and we no longer think we need to run it in order to get what we really want, which is ZFS with solid NFS fileservice.
However, I feel I need to mention some major contributing factors. The largest single factor is our continued lack of confidence in Illumos's support for Intel 10G-T chipsets. As far as I can tell from the master Illumos source, nothing substantial has changed here since back in 2014, and certainly I don't consider it a good sign that the ixgbe driver still does kernel busy-waits for milliseconds at a time. We consider 10G-T absolutely essential for our next generation of fileservers and we don't want to take chances.
(If you want to see how those busy-waits happen, look at msec_delay in ixgbe_osdep.h. drv_usecwait is specifically defined to busy-wait; it's designed to be used for microsecond durations, not millisecond ones.)
Another significant contributing factor is our frustrations with OmniOS's KYSTY minimalism, which makes dealing with our OmniOS machines more painful than dealing with our Linux ones (even the Linux ones that aren't Ubuntu based). And yes, having differently named commands does matter. It's possible that another Illumos based distribution could do better here, but I don't think there's a better one for our needs and it would still leave us with our broad issues with Illumos.
It's undeniable that we have more confidence in Linux on the whole than we do in Illumos. Linux is far more widely and heavily used, generally supports more hardware (and does so more promptly), and we've already seen that Intel 10G-T cards work fine in it (we have them in a number of our existing Linux machines, where they run great). Basically the only risk area is ZFS on Linux, and we have FreeBSD as a fallback.
There are some aspects of OmniOS that I will definitely miss, most notably DTrace. Modern Linux may have more or less functional equivalents, but I don't think there's anything that's half as usable. However, I have no sentimental attachment to Solaris or Illumos; I don't hate it, but I won't really miss it, and an all-Linux environment will make my life simpler.
(This decision is only partly related to our decision not to use a SAN in the next generation of fileservers. While we could probably use OmniOS with the local disk setup that we want, not having to worry about Illumos's hardware support for various controller hardware does make our lives simpler.)
Sequential scrubs and resilvers are coming for (open-source) ZFS
Oracle has made a number of changes and improvements to Solaris ZFS since they took it closed source. Mostly I've been indifferent to their changes, but the one improvement I've long envied is their sequential resilvering (and scrubbing) (this apparently first appeared in Solaris 11.2, per here and here). That ZFS scrubs and resilvers aren't sequential has long been a quiet pain point for a lot of people. Apparently it's especially bad for RAID-Z pools (perhaps because of the usual RAID-Z random read issue), but it's been an issue for us in the past with mirrors (although we managed to speed that up).
Well, there's great news here for all open source ZFS implementations, including Illumos distributions, because an implementation of sequential scrubs and resilvers just landed in ZFS on Linux in this commit (apparently it'll be included in ZoL 0.8 whenever that's released). The ZFS on Linux work was done by Tom Caputi of Datto, building on work done by Saso Kiselkov of Nexenta. Saso Kiselkov's work was presented at the 2016 OpenZFS developer summit and got an OpenZFS wiki summary page; Tom Caputi presented at the 2017 summit. Both have slides (and talk videos) if you want more information on how this works.
(It appears that the Nexenta work may be 'NEX-6068', included in NexentaStor 5.0.3. I can't find a current public source tree for Nexenta, so I don't know anything more than that.)
For how it works, I'll just quote from the commit message:
This patch improves performance by splitting scrubs and resilvers into a metadata scanning phase and an IO issuing phase. The metadata scan reads through the structure of the pool and gathers an in-memory queue of I/Os, sorted by size and offset on disk. The issuing phase will then issue the scrub I/Os as sequentially as possible, greatly improving performance.
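The two-phase structure described in that commit message can be sketched very simply. This is an illustrative toy, not ZoL's actual implementation; the function names and the block-record format are my own assumptions:

```python
def scan_phase(block_pointers):
    """Phase 1: walk the pool's metadata, collecting (offset, size)
    records for every block that needs to be scrubbed."""
    return [(bp["offset"], bp["size"]) for bp in block_pointers]

def issue_phase(queue):
    """Phase 2: issue the queued I/Os in on-disk order, so the disk
    sees a mostly sequential read stream instead of random seeks."""
    return sorted(queue)  # sort by offset (then size)

# Metadata traversal visits blocks in logical order, which can be
# scattered all over the disk ...
blocks = [{"offset": 900, "size": 128}, {"offset": 100, "size": 128},
          {"offset": 500, "size": 128}]
queue = scan_phase(blocks)
# ... but the issue phase reads them back in disk-offset order.
print(issue_phase(queue))  # [(100, 128), (500, 128), (900, 128)]
```

The win comes entirely from the reordering: the metadata walk is unavoidably random, but by decoupling it from the actual data reads, the expensive bulk of the scrub I/O can be done sequentially.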
My early experience with this in the current ZoL git tree has been very positive. I saw a single-vdev mirror pool on HDs with 293 GB used go from a scrub time of two hours and 25 minutes to one hour and ten minutes.
Although this is very early days for this feature even in ZFS on Linux, I'd expect it to get pushed (or pulled) upstream later and thus go into Illumos. I have no idea when that might happen; it might be reasonable to wait until ZFS on Linux has included it in an actual release so that it sees some significant testing in the field. Or people could find this an interesting and important enough change that they actively work to bring it upstream, if only for testing there.
(At this point I haven't spotted any open issues about this in the Illumos issue tracker, but as mentioned I don't really expect that yet unless someone wants to get a head start.)
PS: Unlike Oracle's change for Solaris 11.2, which apparently needed a pool format change (Oracle version 35, according to Wikipedia), the ZFS on Linux implementation needs no new pool feature and so is fully backward compatible. I'd expect this to be true for any eventual Illumos version unless people find some hard problem that forces the addition of a new pool feature.