What ZFS messages about 'permanent errors in <0x95>:<0x0>' mean
If you use ZFS long enough (or are unlucky enough), one of the things
you may run into is reports in
zpool status -v of permanent errors
in something (we've had that happen to us despite redundancy). If you're reasonably lucky, the error message
will have a path in it. If you're unlucky, the error message will say
errors: Permanent errors have been detected in the following files:

        <0x95>:<0x0>
The short answer of what they mean is, to quote Richard Elling directly:
The first number is the dataset id (index) and the second is the object id. For filesystems, the object id can be the same as the file's "inode" as shown by "ls -i". But a few object ids exist for all datasets. Object id 0 is the DMU dnode.
The dataset here may be a ZFS filesystem, a snapshot, or I believe a few other things. I believe that if it's still in existence, you'll normally get at least its name and perhaps the full path to the object. When it's not in existence any more (perhaps you deleted the snapshot or the whole filesystem in question since the scrub detected it), you get this hex ID and there's also no information about the path.
The reason the information is presented this way is that what the
ZFS code in the kernel saves and returns to the
zpool command is
actually just the dataset and object ID. It's up to
zpool to turn
both of these into names, which it actually does by calling back
into the kernel to find out what they're currently called, if the
kernel knows. Inspecting the relevant ZFS code
says that there are five cases:
- <metadata>:<0x...> means corruption in some object in the pool's overall metadata object set.
- <0x...>:<0x...> means that the dataset involved can't be identified (and thus ZFS has no hope of identifying the thing inside the dataset).
- /some/path/name means you have a corrupted filesystem object (a file, a directory, etc) in a currently mounted dataset and this is its full current path. (I think that ZFS's determination of the path name for a given ZFS object is pretty reliable; if I'm reading the code right, it appears to be able to scan upward in the filesystem hierarchy starting with the object itself.)
- dsname:/some/path means that the dataset is called dsname but it's not currently mounted, and /some/path is the path within it. I think this happens for snapshots.
- dsname:<0x...> means that it's in the given dataset dsname (which may or may not be mounted), but the ZFS object in question can't have its path identified for various reasons (including that it's already been deleted).
Only things in ZFS filesystems (and snapshots and so on) have path names, so an error in a ZVOL will always be reported without the path. I'm not sure what the reported dataset names are for ZVOLs, since I don't use ZVOLs.
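If you want to sort 'zpool status -v' error entries mechanically, the five cases above can be told apart by their shape. Here's a toy Python sketch (the function and its labels are my own, not anything from the zpool code):

```python
import re

def classify_zfs_error_name(name):
    """Classify a 'zpool status -v' permanent-error entry into one of
    the five forms described above.  The labels are my own shorthand."""
    hexid = r"<0x[0-9a-f]+>"
    if re.fullmatch(rf"<metadata>:{hexid}", name):
        return "pool metadata object"
    if re.fullmatch(rf"{hexid}:{hexid}", name):
        return "unidentifiable dataset"
    if name.startswith("/"):
        return "path in mounted dataset"
    if re.fullmatch(rf"[^:]+:{hexid}", name):
        return "dataset, unidentifiable object"
    if ":" in name:
        return "path in unmounted dataset"
    return "unknown"
```

For instance, the '<0x95>:<0x0>' entry from the start of this entry classifies as an unidentifiable dataset, while 'tank/home:<0x1234>' would be a known dataset with an unidentifiable object in it.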
The final detail is that you may see this error status in
'zpool status -v' even after you've cleaned it up. To quote Richard Elling again:
Finally, the error buffer for "zpool status" contains information for two scan passes: the current and previous scans. So it is possible to delete an object (eg file) and still see it listed in the error buffer. It takes two scans to completely update the error buffer. This is important if you go looking for a dataset+object tuple with zdb and don't find anything...
PS: There are some cases where
<xattrdir> will appear in the file
path. If I'm reading the code correctly, this happens when the
problem is in an extended attribute instead of the filesystem object itself.
PPS: Richard Elling's message was on the ZFS on Linux mailing list and about an issue someone was having with a ZoL system, but as far as I can see the core code is basically the same in Illumos and I would expect in FreeBSD as well, so this bit of ZFS wisdom should be cross-platform.
ZFS pushes file renamings and other metadata changes to disk quite promptly
One of the general open questions on Unix is when changes like
renaming or creating files are actually durably on disk. Famously,
some filesystems on some Unixes have been willing to delay this for
an unpredictable amount of time unless you did things like fsync()
the containing directory of your renamed file, not just
the file itself. As it happens, ZFS's design means that it offers
some surprisingly strong guarantees about this; specifically, ZFS
persists all metadata changes to disk no later than the next
transaction group commit. In ZFS today, a transaction group commit
generally happens every five seconds, so if you do something like
rename a file, your rename will be fully durable quite soon even if
you do nothing special.
However, this doesn't mean that if you create a file, write data
to the file, and then rename it (with no other special operations)
that in five or ten seconds your new file is guaranteed to be present
under its new name with all the data you wrote. Although metadata
operations like creating and renaming files go to ZFS right away
and then become part of the next txg commit, the kernel generally
holds on to written file data for a while before pushing it out.
You need some sort of
fsync() in there to force the kernel to
commit your data, not just your file creation and renaming. Because
of how the ZFS intent log works, you don't need
to do anything more than
fsync() your file here; when you fsync()
a file, all pending metadata changes are flushed out to disk along
with the file data.
(In a 'create new version, write, rename to overwrite current
version' setup, I think you want to
fsync() the file twice, once
after the write and then once after the rename. Otherwise you haven't
necessarily forced the rename itself to be written out. You don't
want to do the rename before a
fsync(), because then I think that
a crash at just the wrong time could give you an empty new file.
But the ice is thin here in portable code, including code that wants
to be portable to different filesystem types.)
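As a sketch of the 'write, fsync(), rename, fsync()' pattern discussed above (in Python, with my own function name; note that the second fsync() of the file relies on the ZFS behavior described here, and on other filesystems you may need to fsync() the containing directory instead):

```python
import os

def durable_replace(path, data):
    """Atomically replace 'path' with 'data', fsync()ing twice as the
    entry describes: once after the write, once after the rename.  On
    ZFS the second fsync() also forces out the rename itself; this is
    not guaranteed portable to other filesystems."""
    tmp = path + ".tmp"  # temporary name chosen for illustration
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)       # force the data (and pending metadata) out
    finally:
        os.close(fd)
    os.rename(tmp, path)   # atomically swap in the new version
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)       # on ZFS, forces the rename to be durable too
    finally:
        os.close(fd)
```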
My impression is that ZFS is one of the few filesystems with such a regular schedule for committing metadata changes to disk. Others may be much more unpredictable, and possibly may reorder the commits of some metadata operations in the process (although by now, it would be nice if everyone avoided that particular trick). In ZFS, not only do metadata changes commit regularly, but there is a strict time order to them such that they can never cross over each other that way.
spare-N spare vdevs in your pool are mirror vdevs
Here's something that comes up every so often in ZFS and is not as well publicized as perhaps it should be (I most recently saw it here). Suppose that you have a pool, there's been an issue with one of the drives, and you've had a spare activate. In some situations, you'll wind up with a pool configuration that may look like this:
[...]
  wwn-0x5000cca251b79b98      ONLINE  0  0  0
  spare-8                     ONLINE  0  0  0
    wwn-0x5000cca251c7b9d8    ONLINE  0  0  0
    wwn-0x5000cca2568314fc    ONLINE  0  0  0
  wwn-0x5000cca251ca10b0      ONLINE  0  0  0
[...]
What is this
spare-8 thing, beyond 'a sign that a spare activated
here'? This is sometimes called a 'spare vdev', and the answer is
that spare vdevs are mirror vdevs.
Yes, I know, ZFS says that you can't put one vdev inside another vdev and these spare-N vdevs are inside other vdevs. ZFS is not exactly wrong, since it doesn't let you and me do this, but ZFS itself can break its own rules and it's doing so here. These really are mirror vdevs under the surface and as you'd expect they're implemented with exactly the same code in the ZFS kernel code.
(If you're being sufficiently technical these are actually a slightly different type of mirror vdev, which you can see being defined in vdev_mirror.c. But while they have different nominal types they run the same code to do various operations. Admittedly, there are some other sections in the ZFS code that check to see whether they're operating on a real mirror vdev or a spare vdev.)
What this means is that these
spare-N vdevs behave like mirror
vdevs. Assuming that both sides are healthy, reads can be satisfied
from either side (and will be balanced back and forth as they are
for mirror vdevs), writes will go to both sides, and a scrub will
check both sides. As a result, if you scrub a pool with a spare-N
vdev and there are no problems reported for either component device,
then both old and new device are fine and contain a full and intact
copy of the data. You can keep either (or both).
As a side note, it's possible to manually create your own spare-N
vdevs even without a fault, because spares activation is actually
a user-level thing in ZFS. Although I haven't
tested this recently, you generally get a
spare-N vdev if you do
'zpool replace <POOL> <ACTIVE-DISK> <NEW-DISK>' and <NEW-DISK>
is configured as a spare in the pool. Abusing this to create long
term mirrors inside raidZ vdevs is left as an exercise to the reader.
(One possible reason to have a relatively long term mirror inside a raidZ vdev is if you don't entirely trust one disk but don't want to pull it immediately, and also have a handy spare disk. Here you're effectively pre-deploying a spare in case the first disk explodes on you. You could also do the same if you don't entirely trust the new disk and want to run it in parallel before pulling the old one.)
PS: As you might expect, the
replacing-N vdev that you get when
you replace a disk is also a mirror vdev, with the special behavior
that when the resilver finishes, the original device is normally
detached automatically.
An interaction of low ZFS
recordsize, compression, and advanced format disks
Suppose that you have something with a low ZFS recordsize; the
classical example is zvols, where people often use an 8 Kb
volblocksize. You have compression turned on, and you are using
a pool (or vdev) with
ashift=12 because it's on 'advanced format' drives or you're preparing for that
possibility. This seems especially likely on SSDs, some of which
are already claiming to be 4K physical sector drives.
In this situation, you will probably get much lower compression ratios than you expect, even with reasonably compressible data. There are two reasons for this, the obvious one and the inobvious one. The obvious one is that ZFS compresses each logical block separately, and your logical blocks are small. Generally the larger the things you compress at once, the better most compression algorithms do, up to a reasonable size; if you use a small size, you get worse results and less compression.
(The lz4 command line compression program doesn't even have an
option to compress in less than 64 Kb blocks (cf),
which shows you what people think of the idea. The lz4 algorithm
can be applied to smaller blocks, and ZFS does, but presumably the
results are not as good.)
The inobvious problem is how a small
recordsize interacts with a
large physical block size (ie, a large
ashift). In order to save
any space on disk, compression has to shrink the data enough so
that it uses fewer disk blocks. With 4 Kb disk blocks (an ashift
of 12), this means you need to compress things down by at least 4
Kb; when you're starting with 8 Kb logical blocks because of your
recordsize, this means you need at least 50% compression in
order to save any space at all. If your data is compressible but
not that compressible, you can't save any allocated space.
A larger recordsize gives you more room to at least save some
space. With a 128 Kb
recordsize, you need only compress a bit (to
120 Kb, about 7% compression) in order to save one 4 Kb disk block.
Further increases in compression can get you more savings, bit by
bit, because you have more disk blocks to shave away.
(An ashift=9 pool similarly gives you more room to get wins from
compression because you can save space in 512 byte increments,
instead of needing to come up with 4 Kb of space savings at a time.)
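Here's the allocation arithmetic as a small sketch (my own simplification, ignoring raidz parity and gang blocks): a compressed block is rounded up to the vdev's 2**ashift unit, so savings only appear when compression crosses a whole disk-block boundary.

```python
def allocated_size(logical, compressed, ashift):
    """Bytes actually allocated on disk for one logical block: the
    compressed size rounded up to the vdev's 2**ashift block size.
    A simplification that ignores raidz padding and gang blocks."""
    block = 1 << ashift
    return -(-compressed // block) * block  # ceiling division

# 8 Kb records on an ashift=12 pool: 40% compression saves nothing,
# because ~4.8 Kb still needs two 4 Kb disk blocks.  At 50% or better,
# a block is finally saved.  A 128 Kb record only needs ~7% compression
# (down to 120 Kb) to save its first 4 Kb disk block.
```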
(Writing this up as an entry was sparked by this ZFS lobste.rs discussion.)
PS: I believe that this implies that if your recordsize (or
volblocksize) is the same as the disk physical block size (or
ashift size), compression will never do anything for you. I'm not
sure if ZFS will even try to run the compression code or if it will
silently pretend that you have compression turned off.
ZFS's recordsize as an honest way of keeping checksum overhead down
One of the classical tradeoffs of using checksums to verify the integrity of something (as ZFS does) is the choice of how large a chunk of data to cover with a single checksum. A large chunk size keeps the checksum overhead down, but it means that you have to process a large amount of data at once in order to verify or create the checksum. A large size also limits how specific you can be about what piece of data is damaged, which is important if you want to be able to recover some of your data.
(Recovery has two aspects. One is simply giving you access to as much of the undamaged data as possible. The other is how much data you have to process in order to heal corrupted data using various redundancy schemes. If you checksum over 16 Kbyte chunks and you have a single corrupted byte in a 1 Mbyte file, you can read 1008 Kbytes immediately and you only have to process the span of 16 Kbytes of data to recover from the corruption. If you checksum over 1 Mbyte chunks and have the same corruption, the entire file is unreadable and you're processing the span of 1 Mbyte of data to recover.)
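The arithmetic in that example can be sketched as follows (my own toy model, assuming exactly one checksum per chunk and a single corrupted byte):

```python
def recovery_impact(file_size, chunk_size):
    """For a single corrupted byte in a file checksummed in chunk_size
    units, return (bytes still immediately readable, bytes that must be
    reprocessed to heal the corruption).  Toy model: one bad chunk."""
    chunks = -(-file_size // chunk_size)   # ceiling division
    readable = (chunks - 1) * chunk_size   # every chunk but the bad one
    return readable, chunk_size

KB = 1024
# A 1 Mbyte file with 16 Kbyte chunks: 1008 KB readable, 16 KB to heal.
# The same file with 1 Mbyte chunks: nothing readable, 1 MB to heal.
```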
If you're serious about checksums, you have to verify them on read and always create and update them on writes. This means that you have to operate on the entire checksum chunk size for these operations (even on partial chunk updates, depending on the checksum algorithm). Regardless of how the data is stored on disk, you have to have all of the chunk available in memory to compute and recompute the checksum. So if you want to have a relatively large checksum chunk size in order to keep overhead down, you might as well make this your filesystem block size, because you're forced to do a lot of IO in the checksum chunk size no matter what.
This is effectively what ZFS does for files that have grown to their
full recordsize. The checksum chunk size is
recordsize and so is
the filesystem (logical) block size; ZFS stores one checksum for
each recordsize chunk of the file (well, each chunk that actually exists).
This keeps the overhead of checksums down nicely, and setting the
logical filesystem block size to the checksum chunk size is honest
about what IO is actually happening (especially in a copy on write filesystem).
If the ZFS logical block size was always the recordsize, this could
be a serious problem for small files. Ignoring compression, they would allocate far more space
than they needed, creating huge amounts of inefficiency (you could
have a 4 Kbyte file that had to allocate 128 Kbytes of disk space).
So instead ZFS has what is in effect a variable checksum chunk size
for small files, and with it a variable logical block size, in order
to store such files reasonably efficiently. As we've seen, ZFS works fairly hard to only
store the minimum amount of data it has to for small files (which
it defines as files below the recordsize).
(This model of why ZFS
recordsize exists and operates the way it
does didn't occur to me until I wrote yesterday's entry, but now that it has, I think I may finally
have the whole thing sorted out in my head.)
Thinking about why ZFS only does IO in
recordsize blocks, even random IO
As I wound up experimentally verifying,
in ZFS all files are stored as a single block of varying size up
to the filesystem's
recordsize, or using multiple recordsize
blocks. As is perhaps less well known, a ZFS logical block is the minimum size of IO to a
file, both for reads and especially for writes. Since the default
recordsize is 128 Kb, this means that many files of interest are
stored as 128 Kb recordsize blocks and thus all IO to them is done in 128 Kb
units, even if you're only reading or writing a small amount of data.
On the one hand, this seems a little bit crazy. The time it takes to transfer 128 Kb over a SATA link is not always something that you can ignore, and on SSDs larger writes can have a real impact. On the other hand, I think that this choice is more or less forced by some decisions that ZFS has made. Specifically, the ZFS checksum covers the entire logical block, and ZFS's data structure for 'where you find things on disk' is also based on logical blocks.
I wrote before about the ZFS DVA, which
is ZFS's equivalent of a block number and tells you where to find
data. ZFS DVAs are embedded into 'block pointers', which you can
find described in spa.h.
One of the fields of the block pointer is the ZFS block checksum.
Since this is part of the block pointer, it is a checksum over all
of the (logical) data in the block, which is up to
recordsize. Once a file reaches
recordsize bytes long,
all blocks are the same size, the recordsize.
Since the ZFS checksum is over the entire logical block, ZFS has
to fetch the entire logical block in order to verify the checksum
on reads, even if you're only asking for 4 Kbytes out of it. For
writes, even if ZFS allowed you to have different sized logical
blocks in a file, you'd need to have the original block
available in order to split it and you'd have to write all of it
back out (both because ZFS never overwrites in place and because
the split creates new logical blocks, which need new checksums).
Since you need to add new logical blocks, you might have a ripple
effect in ZFS's equivalent of indirect blocks, where they must
expand and shuffle things around.
(If you're not splitting the logical block when you write to only a part of it, copy on write means that there's no good way to do this without rewriting the entire block.)
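As a sketch of the resulting IO amplification (my own simplification, ignoring ARC caching and compression), a partial overwrite has to read and rewrite every recordsize block it touches:

```python
def partial_write_io(offset, size, recordsize=128 * 1024):
    """Bytes ZFS must read and then write back for a partial overwrite
    of a region of a full-recordsize file: whole records only, because
    the checksum covers the entire logical block.  Returns (read bytes,
    written bytes); a toy model ignoring caching and compression."""
    first = (offset // recordsize) * recordsize          # round down
    last = -(-(offset + size) // recordsize) * recordsize  # round up
    touched = last - first
    return touched, touched

# A 4 Kb write within one 128 Kb record still costs a 128 Kb read and
# a 128 Kb write; straddling a record boundary doubles that.
```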
In fact, the more I think about this, the more it seems that having multiple (logical) block sizes in a single file would be the way to madness. There are so many things that get complicated if you allow variable block sizes. These issues can be tackled, but it's simpler not to. ZFS's innovation is not that it insists that files have a single block size, it is that it allows this block size to vary. Most filesystems simply set the block size to, say, 4 Kbytes, and live with how large files have huge indirect block tables and other issues.
(The one thing that might make ZFS nicer in the face of some access
patterns where this matters is the ability to set the recordsize
on a per-file basis instead of just a per-filesystem basis. But I'm
not sure how important this would be; the kind of environments where
it really matters are probably already doing things like putting
database tables on their own filesystems anyway.)
PS: This feels like an obvious thing once I've written this entry
all the way through, but the ZFS
recordsize issue has been one
of my awkward spots for years, where I didn't really understand why
it all made sense and had to be the way it was.
PPS: All of this implies that if ZFS did split logical blocks when you did a partial write, the only time you'd win would be if you then overwrote what was now a single logical block a second time. For example, if you created a big file, wrote 8 Kb to a spot in it (splitting a 128 Kb block into several new logical blocks, including an 8 Kb one for the write you just did), then later wrote exactly 8 Kb again to exactly that spot (overwriting only your new 8 Kb logical block). This is probably obvious too but I wanted to write it out explicitly, if only to convince myself of the logic.
Some exciting ZFS features that are in OmniOS CE's (near) future
I recently wrote about how much better ZFS pool recovery is coming, which reported on Pavel Zakharov's Turbocharging ZFS Data Recovery. In that, Zakharov said that the first OS to get it would likely be OmniOS CE, although he didn't have a timeline. Since I just did some research on this, let's run down some exciting ZFS features that are almost certainly in OmniOS CE's near future, and where they are.
There are two big ZFS features from Delphix that have recently landed in the main Illumos tree, and a third somewhat smaller one:
- This better ZFS pool recovery, which landed as a series of changes
culminating in issue 9075
in February or so. Although I can't be sure, I believe that a
recovered pool is fully compatible with older ZFS versions,
although for major damage you're going to be copying data out
of the pool to a new one.
- The long awaited feature of shrinking
ZFS pools by removing vdevs, which landed as issue 7614 in January. Using this will
add a permanent feature flag to your pool that makes it fully
incompatible with older ZFS versions.
- A feature for checkpointing the overall ZFS pool state before you do potentially dangerous operations, so that you can then rewind to the checkpoint, issue 9166, which landed just a few days ago. Since one of the purposes of better ZFS pool recovery is to provide (better) recovery over pool configuration changes, I suspect that this new pool checkpointing helps out with it. This makes your pool relatively incompatible with older ZFS versions while a checkpoint exists.
(Apparently Delphix was only able to push these upstream from their own code base to Illumos and OpenZFS relatively recently.)
All of these features have been pulled into the 'master' branch of the OmniOS CE repository from the main Illumos repo where they landed. Unless something unusual happens, I would expect them all to be included in the next release of OmniOS CE, which their release schedule says is to be r151026, expected some time this May. This will not be an LTS release; if you want to wait for an LTS release to have these features, you're waiting until next year. Given the likely magnitude of these changes and the relatively near future release of r151026, I wouldn't expect OmniOS CE to include these in a future update to the current r151024 or especially r151022 LTS.
Since OmniOS CE Bloody integrates kernel and user updates on a regular basis, I suspect that it already has many of these features and will pick up the most recent one very soon. If this is so, it gives OmniOS people a clear path if they need to recover a damaged pool; you can boot a Bloody install media or otherwise temporarily run it, repair or import the pool, possibly copying the data to another pool, and then probably revert back to running your current OmniOS with the recovered pool.
Sidebar: How to determine this sort of stuff
The most convenient way is to look at the git log for commits that involve, say, usr/src/uts/common/fs/zfs, the kernel ZFS code, in the OmniOS CE repo. In the Github interface, this is drilling down to that directory and then picking the 'History' option; a convenient link for this for the 'master' branch is here.
Each OmniOS CE release gets its own branch in the repo, named in the obvious way, and each branch thus has its own commit history for ZFS. Here is the version for r151024. Usefully, Github forms the URLs for these things in a very predictable way, making it very easy to hand-write your own URLs for specific things (eg, the same for r151022 LTS, which shows only a few recent ZFS changes).
There's no branch for OmniOS CE Bloody, so I believe that it's simply built from the 'master' branch. It is the bleeding edge version, after all.
Much better ZFS pool recovery is coming (in open source ZFS)
One of the long standing issues in ZFS has been that while it's usually very resilient, it can also be very fragile if the wrong things get damaged. Classically, ZFS has had two modes of operation; either it would repair any damage or it would completely explode. There was no middle ground of error recovery, and this isn't a great experience; as I wrote once, panicing the system is not an error recovery strategy. In early versions of ZFS there was no recovery at all (you restored from backups); in later versions, ZFS added a feature where you could attempt to recover from damaged metadata by rewinding time, which was better than nothing but not a complete fix.
The good news is that that's going to change, and probably not too long from now. What you want to read about this is Turbocharging ZFS Data Recovery, by Pavel Zakharov of Delphix, which covers a bunch of work that he's done to make ZFS more resilient and more capable of importing various sorts of damaged pools. Of particular interest is the ability to likely recover at least some data from a pool that's lost an entire vdev. You can't get everything back, obviously, but ZFS metadata is usually replicated on multiple vdevs so losing a single vdev will hopefully leave you with enough left to at least get the rest of the data out of the pool.
All of this is really great news. ZFS has long needed better options for recovery from various pool problems, as well as better diagnostics for failed pool imports, and I'm quite happy that the situation is finally going to be improving.
The article is also interesting for its discussion of the current low level issues involved in pool importing. For example, until I read it I had no idea about how potentially dangerous a ZFS pool vdev change was due to how pool configurations are handled during the import process. I'd love to read more details on how pool importing really works and what the issues are (it's a long standing interest of mine), but sadly I suspect that no one with that depth of ZFS expertise has the kind of time it would take to write such an article.
As far as the timing of these features being available in your ZFS-using OS of choice goes, his article says this:
As of March 2018, it has landed on OpenZFS and Illumos but not yet on FreeBSD and Linux, where I’d expect it to be upstreamed in the next few months. The first OS that will get this feature will probably be OmniOS Community Edition, although I do not have an exact timeline.
If you have a sufficiently important damaged pool, under some circumstances it may be good enough if there is some OS, any OS, that can bring up the pool to recover the data in it. For all that I've had my issues with OmniOS's hardware support, OmniOS CE does have fairly decent hardware support and you can probably get it going on most modern hardware in an emergency.
(And if OmniOS can't talk directly to your disk hardware, there's
always iSCSI, as we can testify. There are
probably also other options for remote disk access that OmniOS ZFS
and zdb can deal with.)
PS: If you're considering doing this in the future and your normal OS is something other than Illumos, you might want to pay attention to the ZFS feature flags you allow to be set on your pool, since this won't necessarily work if your pool uses features that OmniOS CE doesn't (yet) support. This is probably not going to be an issue for FreeBSD but might be an issue for ZFS on Linux. You probably want to compare the ZoL manpage on ZFS pool features with the Illumos version or even the OmniOS CE version.
Sidebar: Current ZFS pool feature differences
The latest Illumos tree has three new ZFS pool features from Delphix: device_removal, obsolete_counts (which enhances device removal), and zpool_checkpoint. These are all fairly recent additions; they appear to have landed in the Illumos tree this January and just recently, although the commits that implement them are dated from 2016.
ZFS on Linux has four new pool features: large_dnode, project_quota, userobj_accounting, and encryption. Both large dnodes and encryption have to be turned on explicitly, and the other two are read-only compatible, so in theory OmniOS can bring a pool up read-only even with them enabled (and you're going to want to have the pool read-only anyway).
Some things about ZFS block allocation and ZFS (file) record sizes
As I wound up experimentally verifying,
in ZFS all files are stored as a single block of varying size up
to the filesystem's
recordsize, or using multiple recordsize
blocks. For a file under the recordsize, the block size turns
out to be a multiple of 512 bytes, regardless
of the pool's
ashift or the physical sector size of the drives
the pool is using.
Well, sort of. While everything I've written is true, it also turns out to be dangerously imprecise (as I've seen before). There are actually three different sizes here and the difference between them matters once we start getting into the fine details.
To talk about these sizes, I'll start with some illustrative
output for a file data block, as before:
0 L0 DVA=<0:444bbc000:5000> [L0 ZFS plain file] [...] size=4200L/4200P [...]
The first size of the three is the logical block size, before
compression. This is the first
size= number ('4200L' here, in hex
and L for logical). This is what grows in 512-byte units up to the
recordsize and so on.
The second size is the physical size after compression, if any;
this is the second
size= number ('4200P' here, P for physical).
It's a bit weird. If the file can't be compressed, it is the same
as the logical size and because the logical size goes in 512-byte
units, so does this size, even on
ashift=12 pools. However, if
compression happens this size appears to go by the ashift, which
means it doesn't necessarily go in 512-byte units. On an ashift=9
pool you'll see it go in 512-byte units (so you can have a compressed
size of '400P', ie 1 KB), but the same data written in an ashift=12
pool winds up being in 4 Kb units (so you wind up with a compressed
size of '1000P', ie 4 Kb).
The third size is the actual allocated size on disk, as recorded
in the DVA's asize field (which
is the third subfield in the
DVA portion). This is always in
ashift-based units, even if the physical size is not. Thus you
can wind up with a 20 KB DVA but a 16.5
KB 'physical' size, as in our example (the DVA is '5000' while the
block physical size is '4200').
(I assume this happens because ZFS insures that the physical size is never larger than the logical size, although the DVA allocated size may be.)
For obvious reasons, it's the actual allocated size on disk (the DVA asize) that matters for things like rounding up raidz allocation to N+1 blocks, fragmentation, and whether you need to use a ZFS gang block. If you write a 128 KB (logical) block that compresses to a 16 KB physical block, it's 16 KB of (contiguous) space that ZFS needs to find on disk, not 128 KB.
On the one hand, how much this matters depends on how compressible your data is and much modern data isn't (because it's already been compressed in its user-level format). On the other hand, as I found out, 'sparse' space after the logical end of file is very compressible. A 160 KB file on a standard 128 KB recordsize filesystem takes up two 128 KB logical blocks, but the second logical block has 96 KB of nothingness at the end and that compresses down to almost nothing.
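As an illustration, here's a quick parser of my own for zdb block lines of the form shown above (the zdb output format isn't a stable interface, so treat this as a sketch):

```python
import re

def parse_zdb_block(line):
    """Pull the three sizes out of a zdb L0 block line like the example
    above: (logical, physical, allocated), all in bytes.  The DVA is
    <vdev:offset:asize> and all the sizes are hex, per the entry."""
    dva = re.search(r"DVA=<\d+:([0-9a-f]+):([0-9a-f]+)>", line)
    sizes = re.search(r"size=([0-9a-f]+)L/([0-9a-f]+)P", line)
    logical = int(sizes.group(1), 16)
    physical = int(sizes.group(2), 16)
    allocated = int(dva.group(2), 16)
    return logical, physical, allocated
```

Run against the example line, this gives a 16896-byte ('4200L') logical size, a 16896-byte ('4200P') physical size, and a 20480-byte ('5000') allocated size.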
PS: I don't know if it's possible to mix vdevs with different
ashifts in the same pool. If it is, I don't know how ZFS would
decide which ashift to use for the physical block size. The minimum
ashift in any vdev? The maximum ashift?
(This is the second ZFS entry in a row where I thought I knew what was going on and it was simple, and then discovered that I didn't and it isn't.)
A surprise in how ZFS grows a file's record size (at least for me)
As I wound up experimentally verifying,
in ZFS all files are stored as a single block of varying size up
to the filesystem's
recordsize, or using multiple recordsize
blocks. If a file has more than one block, all blocks are recordsize,
no more and no less. If a file is a single block, the size of this
block is based on how much data has been written to the file (or
technically the maximum offset that's been written to the file).
How the block size grows as you write data to the file turns out
to be somewhat surprising (which makes me very glad that I actually
did some experiments to verify what I thought I knew before I wrote
this entry, because I was very wrong).
Rather than involving the
ashift or growing in powers of two,
ZFS always grows the (logical) block size in 512-byte chunks
until it reaches the filesystem
recordsize. The actual physical
space allocated on disk is in
ashift sized units, as you'd expect,
but this is not directly related to the (logical) block size used
at the file level. For example, here is a 16896 byte file (of
incompressible data) on an ashift=12 pool:
 Object  lvl  iblk   dblk  dsize  dnsize  lsize   %full  type
4780566    1  128K  16.5K    20K     512  16.5K  100.00  ZFS plain file
[...]
     0 L0 DVA=<0:444bbc000:5000> [L0 ZFS plain file] [...] size=4200L/4200P [...]
The DVA records an 0x5000 byte allocation (20 Kb), but the logical and physical sizes are only 0x4200 bytes (16.5 Kb).
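The growth rule can be summarized in a few lines (my own sketch of the behavior described here, ignoring compression):

```python
def single_block_lsize(max_offset_written, recordsize=128 * 1024):
    """Logical block size of a single-block ZFS file: the highest byte
    offset written, rounded up in 512-byte units and capped at the
    filesystem recordsize.  The on-disk allocation is separately
    rounded up to the vdev's ashift unit."""
    size = -(-max_offset_written // 512) * 512  # grows in 512-byte chunks
    return min(size, recordsize)

# 16896 bytes written is exactly 33 * 512, so lsize is 16.5K; one more
# byte bumps it to the next 512-byte step; past 128K it stops growing.
```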
In thinking about it, this makes a certain amount of sense because
ashift is really a vdev property, not a pool property, and
can vary from vdev to vdev within a single pool. As a result, the
actual allocated size of a given block may vary from vdev to vdev
(and a block may be written to multiple vdevs if you have copies
set to more than 1 or it's metadata). The file's current block size
thus can't be based on the
ashift, because ZFS doesn't necessarily
have a single
ashift to base it on; instead ZFS bases it on 512-byte
sectors, even if this has to be materialized differently on different vdevs.
Looking back, I've already sort of seen this with ZFS compression. As you'd expect, a file's (logical) block size is based on its uncompressed size, or more exactly on the highest byte offset in the file. You can write something to disk that compresses extremely well, and it will still have a large logical block size. Here's an extreme case:
; dd if=/dev/zero of=testfile bs=128k count=1
[...]
# zdb -vv -bbbb -O ssddata/homes cks/tmp/testfile
Object  lvl  iblk  dblk  dsize  dnsize  lsize  %full  type
956361    1  128K  128K      0     512   128K   0.00  ZFS plain file
[...]
This turns out to have no data blocks allocated at all, because the 128 Kb of zeros can be recorded entirely in magic flags in the dnode. But it still has a 128 Kb logical block size. 128 Kb of the character 'a' does wind up requiring a DVA allocation, but the size difference is drastic:
Object  lvl  iblk  dblk  dsize  dnsize  lsize   %full  type
956029    1  128K  128K     1K     512   128K  100.00  ZFS plain file
[...]
     0 L0 DVA=<0:3bbd1c00:400> [L0 ZFS plain file] [...] size=20000L/400P [...]
We have a compressed size of 1 Kb (and a 1 Kb allocation on disk,
as this is an
ashift=9 vdev), but once again the file block size
is 128 Kb.
(If we wrote 127.5 Kb of 'a' instead, we'd wind up with a file block size of 127.5 Kb. I'll let interested parties do that experiment themselves.)
What this means is that ZFS has much less wasted space than I thought
it did for files that are under the
recordsize. Since such files
grow their logical block size in 512-byte chunks, even with no
compression they waste at most almost all of one physical block on
disk (if you have a file that is, say, 32 Kb plus one byte, you'll
have a physical block on disk with only one byte used). This has
some implications for other areas of ZFS, but those are for another entry.
(This is one of those entries that I'm really glad that I decided to write. I set out to write it as a prequel to another entry just to have how ZFS grew the block size of files written down explicitly, but wound up upending my understanding of the whole area. The other lesson for me is that verifying my understanding with experiments is a really good idea, because every so often my folk understanding is drastically wrong.)